pyBIA.optimization

Created on Wed Sep 11 12:04:23 2021

@author: daniel

Classes

objective_xgb

Optuna objective class for optimizing an XGBoost classifier using cross-validation.

objective_nn

Optuna objective class for optimizing an MLP classifier using cross-validation.

objective_rf

Optuna objective class for optimizing a RF classifier using cross-validation.

Functions

hyper_opt([data_x, data_y, clf, n_iter, opt_cv, ...])

Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.

borutashap_opt(data_x, data_y[, boruta_trials, model, ...])

Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.

standardize_data(data_x[, method, return_scaler])

Scale features with a chosen strategy for models sensitive to input range.

impute_missing_values(data[, imputer, strategy, k, ...])

Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.

Strawman_imputation(data)

Median (“strawman”) imputation for missing values.

Module Contents

class pyBIA.optimization.objective_xgb(data_x, data_y, limit_search=False, opt_cv=3, scoring_metric='f1', SEED_NO=1909)[source]

Bases: object

Optuna objective class for optimizing an XGBoost classifier using cross-validation.

This class defines the optimization logic for tuning XGBoost hyperparameters using the Optuna framework. It supports limited or broad search spaces depending on the limit_search flag, and returns the cross-validated performance metric for each trial.

Parameters:
  • data_x (ndarray) – Feature matrix of shape (n_samples, n_features).

  • data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).

  • limit_search (bool, optional) – If True, restricts the hyperparameter search space to a narrower range. Defaults to False (broad search).

  • opt_cv (int, optional) – Number of cross-validation folds. Must be >= 2. Default is 3.

  • scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.

  • SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.

Returns:

Cross-validated score (mean across folds) for the given trial configuration.

Return type:

float

data_x[source]
data_y[source]
opt_cv = 3[source]
SEED_NO = 1909[source]
n_classes[source]
__call__(trial)[source]

Run a single optimization trial by training the XGBoost model on cross-validation folds and returning the mean performance metric.

Parameters:

trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.

Returns:

Mean cross-validated score for the trial.

Return type:

float

class pyBIA.optimization.objective_nn(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)[source]

Bases: object

Optuna objective class for optimizing an MLP classifier using cross-validation.

This class defines the optimization logic for tuning the hyperparameters of a Scikit-learn MLP classifier using the Optuna framework, and returns the cross-validated performance metric for each trial.

Parameters:
  • data_x (ndarray) – Feature matrix of shape (n_samples, n_features).

  • data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).

  • opt_cv (int) – Number of cross-validation folds. Must be >= 2. Required; the constructor defines no default.

  • scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.

  • SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.

Returns:

Cross-validated score (mean across folds) for the given trial configuration.

Return type:

float

data_x[source]
data_y[source]
opt_cv[source]
SEED_NO = 1909[source]
__call__(trial)[source]

Run a single optimization trial by training the MLP model on cross-validation folds and returning the mean performance metric.

Parameters:

trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.

Returns:

Mean cross-validated score for the trial.

Return type:

float

class pyBIA.optimization.objective_rf(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)[source]

Bases: object

Optuna objective class for optimizing a RF classifier using cross-validation.

This class defines the optimization logic for tuning Random Forest hyperparameters using the Optuna framework, and returns the cross-validated performance metric for each trial.

Parameters:
  • data_x (ndarray) – Feature matrix of shape (n_samples, n_features).

  • data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).

  • opt_cv (int) – Number of cross-validation folds. Must be >= 2. Required; the constructor defines no default.

  • scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.

  • SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.

Returns:

Cross-validated score (mean across folds) for the given trial configuration.

Return type:

float

data_x[source]
data_y[source]
opt_cv[source]
SEED_NO = 1909[source]
__call__(trial)[source]

Run a single optimization trial by training the Random Forest model on cross-validation folds and returning the mean performance metric.

Parameters:

trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.

Returns:

Mean cross-validated score for the trial.

Return type:

float

pyBIA.optimization.hyper_opt(data_x=None, data_y=None, clf='xgb', n_iter=25, opt_cv=10, balance=True, scoring_metric='f1', limit_search=True, return_study=True, SEED_NO=1909)[source]

Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.

Parameters:
  • data_x (ndarray or None, optional) – 2D array with shape (n_samples, n_features) used to fit and evaluate the model; required for ‘rf’, ‘nn’, and ‘xgb’. Default is None.

  • data_y (array-like or None, optional) – 1D label array aligned with data_x; may be numeric or strings (strings are auto-mapped to integers for XGBoost). Default is None.

  • clf ({'rf','nn','xgb'}, optional) – Which classifier to tune: Random Forest (‘rf’), Scikit-learn MLP (‘nn’), or XGBoost (‘xgb’). Default is ‘xgb’.

  • n_iter (int, optional) – Number of Optuna trials; set to 0 to skip optimization and return the base (untuned) model. Default is 25.

  • opt_cv (int, optional) – Number of stratified cross-validation folds per trial. Default is 10.

  • balance (bool, optional) – If True, apply class weighting for binary tasks (RF: class_weight=‘balanced’; XGB: scale_pos_weight; MLP does not support weights). Default is True.

  • scoring_metric (str, optional) – Scikit-learn scoring name used for CV evaluation; for multiclass, maps to macro/OVR variants (e.g., ‘f1’ → ‘f1_macro’, ‘roc_auc’ → ‘roc_auc_ovr’). Default is ‘f1’.

  • limit_search (bool, optional) – If True, restrict the XGBoost search space to a compact, safe region to reduce runtime and memory risk. Default is True.

  • return_study (bool, optional) – If True, return the Optuna Study object as a third output for downstream analysis/visualization. Default is True.

  • SEED_NO (int, optional) – Random seed for CV splitters and the TPE sampler to ensure reproducibility. Default is 1909.

Returns:

  • model (estimator) – Fitted estimator configured with the best hyperparameters found (or the base model if n_iter is 0).

  • params (dict) – Dictionary of the best hyperparameters from the Optuna study.

  • study (optuna.study.Study) – Returned only when return_study is True; contains all trials and results.

Examples

Fit a tuned Random Forest (return_study defaults to True, so disable it to unpack two values):

>>> model, params = hyper_opt(data_x, data_y, clf='rf', n_iter=50, return_study=False)

Retrieve the Optuna study for visualization:

>>> model, params, study = hyper_opt(data_x, data_y, clf='xgb', n_iter=50, return_study=True)
>>> from optuna.visualization.matplotlib import plot_contour
>>> plot_contour(study)

Raises:

ValueError – If clf is not one of {‘rf’, ‘nn’, ‘xgb’}.
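
The multiclass metric mapping and the binary class weighting described for `scoring_metric` and `balance` can be sketched as follows. This is an illustration under assumptions, not pyBIA's code: the helper names are hypothetical, the `precision` and `recall` entries are assumed by analogy with the documented `f1` and `roc_auc` mappings, and the negative-to-positive ratio is the value XGBoost's documentation suggests for `scale_pos_weight`, which may differ from what `hyper_opt` computes.

```python
from collections import Counter

# 'f1' and 'roc_auc' follow the parameter description above;
# 'precision' and 'recall' are assumed by analogy.
MULTICLASS_SCORING = {
    "f1": "f1_macro",
    "precision": "precision_macro",
    "recall": "recall_macro",
    "roc_auc": "roc_auc_ovr",
}


def resolve_scoring(scoring_metric, labels):
    """Return the scikit-learn scorer name suited to the number of classes."""
    if len(set(labels)) > 2:
        # Metrics without a multiclass variant (e.g. 'accuracy') pass through.
        return MULTICLASS_SCORING.get(scoring_metric, scoring_metric)
    return scoring_metric


def binary_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive samples for a binary label set."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = sum(counts.values()) - n_pos
    return n_neg / n_pos


binary_labels = [0] * 90 + [1] * 10
print(resolve_scoring("f1", binary_labels))    # f1
print(resolve_scoring("f1", [0, 1, 2, 2]))     # f1_macro
print(binary_scale_pos_weight(binary_labels))  # 9.0
```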

pyBIA.optimization.borutashap_opt(data_x, data_y, boruta_trials=50, model='rf', importance_type='gain', SEED_NO=1909)[source]

Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.

Parameters:
  • data_x (ndarray) – Feature matrix of shape (n_samples, n_features) used to compute importances; must contain no NaNs.

  • data_y (array-like) – 1D array of labels aligned with data_x; categorical labels are internally mapped to integers.

  • boruta_trials (int, optional) – Number of BorutaSHAP iterations to stabilize the acceptance/rejection distributions; default is 50.

  • model ({'rf','xgb'}, optional) – Base estimator used to compute importances: Random Forest (‘rf’) or XGBoost (‘xgb’); default is ‘rf’.

  • importance_type ({'gain','weight','cover','total_gain','total_cover'}, optional) – XGBoost importance metric to use when model=‘xgb’; ignored for Random Forest; default is ‘gain’.

  • SEED_NO (int, optional) – Random seed for reproducibility of the estimator and BorutaSHAP sampling; default is 1909.

Returns:

  • index (ndarray) – Sorted array of selected feature indices (dtype=int) referring to columns in data_x.

  • feat_selector (BorutaSHAP) – Fitted BorutaSHAP selector object containing selection history and plotting utilities.

Raises:

ValueError – If model is not one of {‘rf’, ‘xgb’}, if data_x contains NaNs, or if BorutaSHAP fitting fails.
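
The idea behind Boruta-style selection is to compare each real feature against shuffled "shadow" copies over repeated trials. The sketch below shows that idea in a deliberately simplified form and is not the BorutaSHAP algorithm: absolute Pearson correlation stands in for SHAP importances, and a feature is kept only if it beats the best shadow in every trial, where Boruta proper applies a statistical test to the hit counts. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def shadow_feature_hits(data_x, data_y, n_trials=20, seed=1909):
    """Count, per feature, the trials in which it beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    target = data_y - data_y.mean()

    def importance(matrix):
        # Stand-in importance: |Pearson correlation| of each column with the labels.
        centered = matrix - matrix.mean(axis=0)
        norm = np.linalg.norm(centered, axis=0) * np.linalg.norm(target)
        return np.abs(centered.T @ target / norm)

    real_importance = importance(data_x)
    hits = np.zeros(data_x.shape[1], dtype=int)
    for _ in range(n_trials):
        # Shadow features: every column shuffled independently, which
        # destroys any relationship with the labels.
        shadows = rng.permuted(data_x, axis=0)
        hits += real_importance > importance(shadows).max()
    return hits


rng = np.random.default_rng(0)
data_y = np.repeat([0.0, 1.0], 200)
data_x = rng.normal(size=(400, 4))
data_x[:, 0] += 3.0 * data_y  # only column 0 carries signal

hits = shadow_feature_hits(data_x, data_y)
selected = np.flatnonzero(hits == 20)  # sorted indices of always-winning features
print(hits, selected)
```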

pyBIA.optimization.standardize_data(data_x, method='min-max', return_scaler=True)[source]

Scale features with a chosen strategy for models sensitive to input range.

Parameters:
  • data_x (ndarray) – Feature matrix of shape (n_samples, n_features) to be transformed.

  • method ({'min-max','robust','standard'}, optional) – Scaling strategy to apply: ‘min-max’ rescales each feature to [0, 1]; ‘robust’ centers by the median and scales by the IQR; ‘standard’ centers to mean 0 and scales to unit variance. Default is ‘min-max’.

  • return_scaler (bool, optional) – If True, return the fitted scaler object along with the transformed data; if False, return only the transformed data. Default is True.

Returns:

  • norm_data_x (ndarray) – Scaled feature matrix of shape (n_samples, n_features).

  • scaler (sklearn.base.TransformerMixin) – Fitted scaler instance (MinMaxScaler, RobustScaler, or StandardScaler); returned only when return_scaler is True.
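
The three strategies differ only in the statistics used for centering and scaling. The NumPy sketch below spells out the arithmetic; it is an illustration, not pyBIA's code. The function name `scale_features` is hypothetical, and unlike the scikit-learn scalers it does not guard against constant columns, which would cause a division by zero.

```python
import numpy as np


def scale_features(data_x, method="min-max"):
    """Column-wise scaling that mirrors the three documented strategies."""
    data_x = np.asarray(data_x, dtype=float)
    if method == "min-max":
        low, high = data_x.min(axis=0), data_x.max(axis=0)
        return (data_x - low) / (high - low)
    if method == "robust":
        # Center by the median, scale by the interquartile range (IQR).
        q1, median, q3 = np.percentile(data_x, [25, 50, 75], axis=0)
        return (data_x - median) / (q3 - q1)
    if method == "standard":
        # Population standard deviation (ddof=0), as StandardScaler uses.
        return (data_x - data_x.mean(axis=0)) / data_x.std(axis=0)
    raise ValueError(f"unknown method: {method!r}")


data_x = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 100]])
minmax = scale_features(data_x, "min-max")
robust = scale_features(data_x, "robust")
standard = scale_features(data_x, "standard")
print(robust[:, 1])  # the outlier 100 maps to (100 - 30) / 20 = 3.5
```

The robust variant leaves the outlier far from the bulk of the data, whereas min-max scaling would compress the other four values into the lower part of [0, 1].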

pyBIA.optimization.impute_missing_values(data, imputer=None, strategy='knn', k=3, constant_value=0)[source]

Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.

Parameters:
  • data (ndarray) – Array of shape (n_samples, n_features) containing NaNs to be imputed.

  • imputer (sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer | None, optional) – Pre-fitted imputer to apply; if None, a new imputer is created and fitted on data. Default is None.

  • strategy ({'knn','mean','median','mode','constant'}, optional) – Imputation strategy to use. Default is ‘knn’.

  • k (int, optional) – Number of neighbors for KNN imputation (used only when strategy=‘knn’). Default is 3.

  • constant_value (float or int, optional) – Fill value for constant imputation (used only when strategy=‘constant’). Default is 0.

Returns:

  • imputed_data (ndarray) – Array with missing values filled.

  • imputer (sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer) – Fitted imputer returned only when a new imputer is created (i.e., when input imputer is None).
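
The `imputer` argument exists so that statistics learned on one dataset can be reused on another, for example fitting on training data and applying the same imputer to test data. The sketch below shows that workflow with scikit-learn's `KNNImputer` directly rather than through `impute_missing_values`; the arrays are made-up illustrative data.

```python
import numpy as np
from sklearn.impute import KNNImputer

train = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 11.0, 12.0],
    [13.0, 14.0, 15.0],
])
test = np.array([
    [np.nan, 5.0, 6.0],
    [10.0, 11.0, np.nan],
])

# Fit once on the training data (k=3 neighbors, matching the documented default).
imputer = KNNImputer(n_neighbors=3)
train_filled = imputer.fit_transform(train)

# Reuse the fitted imputer: neighbors for the test rows come from the training set.
test_filled = imputer.transform(test)
print(test_filled)
```

Reusing the fitted object avoids leaking information from the test set into the imputation.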

pyBIA.optimization.Strawman_imputation(data)[source]

Median (“strawman”) imputation for missing values.

Parameters:

data (ndarray) – Input array of shape (n_samples, n_features) or (n_features,). Missing values are assumed to be encoded as NaN or ±inf. For 1D input, a single global median (over finite values) is used. For 2D input, medians are computed column-wise.

Returns:

imputed – Array with the same shape as data in which missing entries have been replaced by the corresponding median(s).

Return type:

ndarray
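
The behavior described above (NaN and ±inf treated as missing, a global median for 1D input, column-wise medians for 2D input) can be sketched in NumPy as follows. This is a minimal illustration written from the description, not the library's source; the function name `median_impute` is hypothetical.

```python
import numpy as np


def median_impute(data):
    """Replace NaN and +/-inf entries with the median of the finite values."""
    data = np.array(data, dtype=float)  # copy, so the input is left untouched
    missing = ~np.isfinite(data)
    # Hide +/-inf behind NaN so nanmedian ignores every missing entry.
    masked = np.where(missing, np.nan, data)
    if data.ndim == 1:
        data[missing] = np.nanmedian(masked)
    else:
        medians = np.nanmedian(masked, axis=0)  # one median per column
        rows, cols = np.nonzero(missing)
        data[rows, cols] = medians[cols]
    return data


filled_1d = median_impute([1.0, np.nan, 3.0, np.inf])
filled_2d = median_impute([[1.0, 10.0], [np.nan, 20.0], [3.0, -np.inf]])
print(filled_1d)  # [1. 2. 3. 2.]
print(filled_2d)
```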