pyBIA.optimization

Created on Wed Sep 11 12:04:23 2021

@author: daniel

Classes

`objective_xgb`	Optuna objective class for optimizing an XGBoost classifier using cross-validation.
`objective_nn`	Optuna objective class for optimizing an MLP classifier using cross-validation.
`objective_rf`	Optuna objective class for optimizing a RF classifier using cross-validation.

Functions

`hyper_opt`([data_x, data_y, clf, n_iter, opt_cv, ...])	Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.
`borutashap_opt`(data_x, data_y[, boruta_trials, model, ...])	Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.
`standardize_data`(data_x[, method, return_scaler])	Scale features with a chosen strategy for models sensitive to input range.
`impute_missing_values`(data[, imputer, strategy, k, ...])	Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.
`Strawman_imputation`(data)	Median (“strawman”) imputation for missing values.

Module Contents

class pyBIA.optimization.objective_xgb(data_x, data_y, limit_search=False, opt_cv=3, scoring_metric='f1', SEED_NO=1909)[source]

Bases: object

Optuna objective class for optimizing an XGBoost classifier using cross-validation.

This class defines the optimization logic for tuning XGBoost hyperparameters using the Optuna framework. It supports limited or broad search spaces depending on the limit_search flag, and returns the cross-validated performance metric for each trial.

Parameters:

data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).
limit_search (bool, optional) – If True, restricts the hyperparameter search space to a narrower range. Defaults to False (broad search).
opt_cv (int, optional) – Number of cross-validation folds. Must be >= 2. Default is 3.
scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.
SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.

Returns:

Cross-validated score (mean across folds) for the given trial configuration.

Return type:

float

data_x[source]

data_y[source]

limit_search = False[source]

opt_cv = 3[source]

SEED_NO = 1909[source]

n_classes[source]

__call__(trial)[source]

Run a single optimization trial by training the XGBoost model on cross-validation folds and returning the mean performance metric.

Parameters: trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.
Returns: Mean cross-validated score for the trial.
Return type: float

class pyBIA.optimization.objective_nn(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)[source]

Bases: object

Optuna objective class for optimizing an MLP classifier using cross-validation.

This class defines the optimization logic for tuning MLP hyperparameters using the Optuna framework, and returns the cross-validated performance metric for each trial.

Parameters:

data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).
opt_cv (int) – Number of cross-validation folds. Must be >= 2.
scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.
SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.

Returns:

Cross-validated score (mean across folds) for the given trial configuration.

Return type:

float

data_x[source]

data_y[source]

opt_cv[source]

SEED_NO = 1909[source]

__call__(trial)[source]

Run a single optimization trial by training the MLP model on cross-validation folds and returning the mean performance metric.

Parameters: trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.
Returns: Mean cross-validated score for the trial.
Return type: float

class pyBIA.optimization.objective_rf(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)[source]

Bases: object

Optuna objective class for optimizing a RF classifier using cross-validation.

This class defines the optimization logic for tuning Random Forest hyperparameters using the Optuna framework, and returns the cross-validated performance metric for each trial.

Parameters:

data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).
opt_cv (int) – Number of cross-validation folds. Must be >= 2.
scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.
SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.

Returns:

Cross-validated score (mean across folds) for the given trial configuration.

Return type:

float

data_x[source]

data_y[source]

opt_cv[source]

SEED_NO = 1909[source]

__call__(trial)[source]

Run a single optimization trial by training the Random Forest model on cross-validation folds and returning the mean performance metric.

Parameters: trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.
Returns: Mean cross-validated score for the trial.
Return type: float

pyBIA.optimization.hyper_opt(data_x=None, data_y=None, clf='xgb', n_iter=25, opt_cv=10, balance=True, scoring_metric='f1', limit_search=True, return_study=True, SEED_NO=1909)[source]

Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.

Parameters:

data_x (ndarray or None, optional) – 2D array with shape (n_samples, n_features) used to fit and evaluate the model; required for ‘rf’, ‘nn’, and ‘xgb’. Default is None.
data_y (array-like or None, optional) – 1D label array aligned with data_x; may be numeric or strings (strings are auto-mapped to integers for XGBoost). Default is None.
clf ({'rf','nn','xgb'}, optional) – Which classifier to tune: Random Forest (‘rf’), Scikit-learn MLP (‘nn’), or XGBoost (‘xgb’). Default is ‘xgb’.
n_iter (int, optional) – Number of Optuna trials; set to 0 to skip optimization and return the base (untuned) model. Default is 25.
opt_cv (int, optional) – Number of stratified cross-validation folds per trial. Default is 10.
balance (bool, optional) – If True, apply class weighting for binary tasks (RF: class_weight='balanced'; XGB: scale_pos_weight; MLP does not support weights). Default is True.
scoring_metric (str, optional) – Scikit-learn scoring name used for CV evaluation; for multiclass, maps to macro/OVR variants (e.g., 'f1' → 'f1_macro', 'roc_auc' → 'roc_auc_ovr'). Default is 'f1'.
limit_search (bool, optional) – If True, restrict the XGBoost search space to a compact, safe region to reduce runtime and memory risk. Default is True.
return_study (bool, optional) – If True, return the Optuna Study object as a third output for downstream analysis/visualization. Default is True.
SEED_NO (int, optional) – Random seed for CV splitters and the TPE sampler to ensure reproducibility. Default is 1909.

Returns:

model (estimator) – Fitted estimator configured with the best hyperparameters found (or the base model if n_iter is 0).
params (dict) – Dictionary of the best hyperparameters from the Optuna study.
study (optuna.study.Study) – Returned only when return_study is True; contains all trials and results.

Examples

Fit a tuned Random Forest (pass return_study=False to get two outputs, since the study is returned by default):

>>> model, params = hyper_opt(data_x, data_y, clf='rf', n_iter=50, return_study=False)

Retrieve the Optuna study for visualization:

>>> model, params, study = hyper_opt(data_x, data_y, clf='xgb', n_iter=50, return_study=True)
>>> from optuna.visualization.matplotlib import plot_contour
>>> plot_contour(study)

Raises: ValueError – If clf is not one of {'rf', 'nn', 'xgb'}.
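
The multiclass metric mapping and the binary class weighting described for hyper_opt can be sketched as below. Both helpers are hypothetical names, not pyBIA functions. The 'precision' and 'recall' entries are an assumption extrapolated from the two documented examples, and the negative-to-positive ratio is the conventional choice for XGBoost's scale_pos_weight rather than a confirmed detail of this module.

```python
import numpy as np


def resolve_scoring_sketch(scoring_metric, data_y):
    """Map a binary scoring name to a multiclass variant when needed (assumed mapping)."""
    n_classes = np.unique(data_y).size
    if n_classes <= 2:
        return scoring_metric
    multiclass = {
        "f1": "f1_macro",                # documented example
        "roc_auc": "roc_auc_ovr",        # documented example
        "precision": "precision_macro",  # assumption, by analogy
        "recall": "recall_macro",        # assumption, by analogy
    }
    return multiclass.get(scoring_metric, scoring_metric)


def pos_weight_sketch(data_y, positive_label=1):
    """Conventional scale_pos_weight value: negatives divided by positives."""
    y = np.asarray(data_y)
    n_pos = int((y == positive_label).sum())
    n_neg = y.size - n_pos
    return n_neg / n_pos


binary_y = np.array([0, 0, 0, 1])
multi_y = np.array([0, 1, 2, 2])

binary_metric = resolve_scoring_sketch("f1", binary_y)
multi_metric = resolve_scoring_sketch("f1", multi_y)
weight = pos_weight_sketch(binary_y)
```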

pyBIA.optimization.borutashap_opt(data_x, data_y, boruta_trials=50, model='rf', importance_type='gain', SEED_NO=1909)[source]

Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.

Parameters:

data_x (ndarray) – Feature matrix of shape (n_samples, n_features) used to compute importances; must contain no NaNs.
data_y (array-like) – 1D array of labels aligned with data_x; categorical labels are internally mapped to integers.
boruta_trials (int, optional) – Number of BorutaSHAP iterations to stabilize the acceptance/rejection distributions; default is 50.
model ({'rf','xgb'}, optional) – Base estimator used to compute importances: Random Forest (‘rf’) or XGBoost (‘xgb’); default is ‘rf’.
importance_type ({'gain','weight','cover','total_gain','total_cover'}, optional) – XGBoost importance metric to use when model='xgb'; ignored for Random Forest; default is 'gain'.
SEED_NO (int, optional) – Random seed for reproducibility of the estimator and BorutaSHAP sampling; default is 1909.

Returns:

index (ndarray) – Sorted array of selected feature indices (dtype=int) referring to columns in data_x.
feat_selector (BorutaSHAP) – Fitted BorutaSHAP selector object containing selection history and plotting utilities.

Raises:

ValueError – If model is not one of {'rf', 'xgb'}, if data_x contains NaNs, or if BorutaSHAP fitting fails.
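
The two input requirements above (no NaNs in data_x, labels mapped to integers) can be sketched with NumPy alone. `check_and_encode_sketch` is a hypothetical helper written for illustration; how pyBIA performs these steps internally is not shown in this reference. The final line assumes a made-up `index` array to show how the returned feature indices would typically be used.

```python
import numpy as np


def check_and_encode_sketch(data_x, data_y):
    """Reject NaNs in the features and map labels to integer codes."""
    data_x = np.asarray(data_x, dtype=float)
    if np.isnan(data_x).any():
        raise ValueError("data_x contains NaNs; impute before feature selection.")
    # np.unique returns the sorted classes and, for each label, its integer code.
    classes, y_int = np.unique(np.asarray(data_y), return_inverse=True)
    return data_x, y_int, classes


data_x = np.array([[0.1, 1.0, 7.0], [0.4, 2.0, 8.0], [0.9, 3.0, 9.0]])
data_y = np.array(["blob", "other", "blob"])

checked_x, y_int, classes = check_and_encode_sketch(data_x, data_y)

# Hypothetical selection result: keep the first and third columns.
index = np.array([0, 2])
selected_x = checked_x[:, index]
```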

pyBIA.optimization.standardize_data(data_x, method='min-max', return_scaler=True)[source]

Scale features with a chosen strategy for models sensitive to input range.

Parameters:

data_x (ndarray) – Feature matrix of shape (n_samples, n_features) to be transformed.
method ({'min-max','robust','standard'}, optional) – Scaling strategy to apply: ‘min-max’ rescales each feature to [0, 1]; ‘robust’ centers by the median and scales by the IQR; ‘standard’ centers to mean 0 and scales to unit variance. Default is ‘min-max’.
return_scaler (bool, optional) – If True, return the fitted scaler object along with the transformed data; if False, return only the transformed data. Default is True.

Returns:

norm_data_x (ndarray) – Scaled feature matrix of shape (n_samples, n_features).
scaler (sklearn.base.TransformerMixin) – Fitted scaler instance (MinMaxScaler, RobustScaler, or StandardScaler); returned only when return_scaler is True.
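
The three scaling strategies listed above reduce to simple column-wise formulas. The sketch below writes them out in NumPy for illustration only; `scale_sketch` is a hypothetical helper, and unlike scikit-learn's scalers it does not guard against constant columns, where the denominator would be zero.

```python
import numpy as np


def scale_sketch(data_x, method="min-max"):
    """Column-wise scaling written out explicitly (illustrative, not pyBIA code)."""
    x = np.asarray(data_x, dtype=float)
    if method == "min-max":
        lo, hi = x.min(axis=0), x.max(axis=0)
        return (x - lo) / (hi - lo)            # each feature lands in [0, 1]
    if method == "robust":
        q1, med, q3 = np.percentile(x, [25, 50, 75], axis=0)
        return (x - med) / (q3 - q1)           # center by median, scale by IQR
    if method == "standard":
        return (x - x.mean(axis=0)) / x.std(axis=0)  # mean 0, unit variance
    raise ValueError("method must be 'min-max', 'robust', or 'standard'")


data_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 100.0]])

minmax_x = scale_sketch(data_x, "min-max")
robust_x = scale_sketch(data_x, "robust")
standard_x = scale_sketch(data_x, "standard")
```

The outlier in the second column (100.0) compresses the other min-max values toward 0, which is the situation where the 'robust' option tends to be the better choice.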

pyBIA.optimization.impute_missing_values(data, imputer=None, strategy='knn', k=3, constant_value=0)[source]

Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.

Parameters:

data (ndarray) – Array of shape (n_samples, n_features) containing NaNs to be imputed.
imputer (sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer | None, optional) – Pre-fitted imputer to apply; if None, a new imputer is created and fitted on data. Default is None.
strategy ({'knn','mean','median','mode','constant'}, optional) – Imputation strategy to use. Default is ‘knn’.
k (int, optional) – Number of neighbors for KNN imputation (used only when strategy='knn'). Default is 3.
constant_value (float or int, optional) – Fill value for constant imputation (used only when strategy='constant'). Default is 0.

Returns:

imputed_data (ndarray) – Array with missing values filled.
imputer (sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer) – Fitted imputer returned only when a new imputer is created (i.e., when input imputer is None).
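
The reason for returning the fitted imputer is that new data should be filled using statistics learned from the training data. The sketch below shows that fit-once, reuse-later pattern with scikit-learn's `KNNImputer` directly; it assumes, but does not confirm, that impute_missing_values wraps this class when strategy is 'knn'. The arrays are made-up toy values.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy training data with one missing entry in two of the rows.
train = np.array(
    [
        [1.0, 2.0, 3.0],
        [np.nan, 4.0, 5.0],
        [5.0, 6.0, np.nan],
        [7.0, 8.0, 9.0],
    ]
)
# New sample that arrives later and also has a missing entry.
new = np.array([[np.nan, 3.0, 4.0]])

# Fit on the training data once, matching the documented default k=3.
imputer = KNNImputer(n_neighbors=3)
train_filled = imputer.fit_transform(train)

# Reuse the fitted imputer so the new sample is filled from training neighbors.
new_filled = imputer.transform(new)
```

Going by the Returns description above, the equivalent calls would presumably be `imputed, imputer = impute_missing_values(train)` followed by `impute_missing_values(new, imputer=imputer)`.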

pyBIA.optimization.Strawman_imputation(data)[source]

Median (“strawman”) imputation for missing values.

Parameters: data (ndarray) – Input array of shape (n_samples, n_features) or (n_features,). Missing values are assumed to be encoded as NaN or ±inf. For 1D input, a single global median (over finite values) is used. For 2D input, medians are computed column-wise.
Returns: imputed – Array with the same shape as data in which missing entries have been replaced by the corresponding median(s).
Return type: ndarray
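
The behavior described above can be sketched in NumPy as follows. `median_impute_sketch` is a hypothetical re-implementation written from the description, not the library's source, and it assumes every column has at least one finite value.

```python
import numpy as np


def median_impute_sketch(data):
    """Replace NaN and +/-inf with the median of the finite values."""
    arr = np.array(data, dtype=float)         # copy, so the input is untouched
    missing = ~np.isfinite(arr)               # True for NaN, +inf, and -inf
    masked = np.where(missing, np.nan, arr)   # hide missing entries from the median
    if arr.ndim == 1:
        arr[missing] = np.nanmedian(masked)   # one global median
    else:
        col_medians = np.nanmedian(masked, axis=0)  # one median per column
        rows, cols = np.nonzero(missing)
        arr[rows, cols] = col_medians[cols]
    return arr


data_2d = np.array([[1.0, np.nan], [3.0, 4.0], [np.inf, 8.0]])
imputed_2d = median_impute_sketch(data_2d)
# imputed_2d is [[1., 6.], [3., 4.], [2., 8.]]

data_1d = np.array([1.0, np.nan, 5.0, -np.inf])
imputed_1d = median_impute_sketch(data_1d)
# imputed_1d is [1., 3., 5., 3.]
```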