pyBIA.optimization
Created on Wed Sep 11 12:04:23 2021
@author: daniel
Classes
objective_xgb – Optuna objective class for optimizing an XGBoost classifier using cross-validation.
objective_nn – Optuna objective class for optimizing an MLP classifier using cross-validation.
objective_rf – Optuna objective class for optimizing a Random Forest (RF) classifier using cross-validation.
Functions
hyper_opt – Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.
borutashap_opt – Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.
standardize_data – Scale features with a chosen strategy for models sensitive to input range.
impute_missing_values – Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.
Strawman_imputation – Median (“strawman”) imputation for missing values.
Module Contents
- class pyBIA.optimization.objective_xgb(data_x, data_y, limit_search=False, opt_cv=3, scoring_metric='f1', SEED_NO=1909)
Bases: object
Optuna objective class for optimizing an XGBoost classifier using cross-validation.
This class defines the optimization logic for tuning XGBoost hyperparameters using the Optuna framework. It supports limited or broad search spaces depending on the limit_search flag, and returns the cross-validated performance metric for each trial.
- Parameters:
data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).
limit_search (bool, optional) – If True, restricts the hyperparameter search space to a narrower range. Defaults to False (broad search).
opt_cv (int, optional) – Number of cross-validation folds. Must be >= 2. Default is 3.
scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.
SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.
- Returns:
Cross-validated score (mean across folds) for the given trial configuration.
- Return type:
float
- __call__(trial)
Run a single optimization trial by training the XGBoost model on cross-validation folds and returning the mean performance metric.
- Parameters:
trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.
- Returns:
Mean cross-validated score for the trial.
- Return type:
float
- class pyBIA.optimization.objective_nn(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)
Bases: object
Optuna objective class for optimizing an MLP classifier using cross-validation.
This class defines the optimization logic for tuning MLP hyperparameters using the Optuna framework and returns the cross-validated performance metric for each trial.
- Parameters:
data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).
opt_cv (int) – Number of cross-validation folds. Must be >= 2.
scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.
SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.
- Returns:
Cross-validated score (mean across folds) for the given trial configuration.
- Return type:
float
- __call__(trial)
Run a single optimization trial by training the MLP model on cross-validation folds and returning the mean performance metric.
- Parameters:
trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.
- Returns:
Mean cross-validated score for the trial.
- Return type:
float
- class pyBIA.optimization.objective_rf(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)
Bases: object
Optuna objective class for optimizing a Random Forest (RF) classifier using cross-validation.
This class defines the optimization logic for tuning Random Forest hyperparameters using the Optuna framework and returns the cross-validated performance metric for each trial.
- Parameters:
data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (ndarray or array-like) – Corresponding class labels of shape (n_samples,).
opt_cv (int) – Number of cross-validation folds. Must be >= 2.
scoring_metric (str, optional) – Evaluation metric used during optimization. Options are: [‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’]. Default is ‘f1’.
SEED_NO (int, optional) – Random seed for reproducibility. Default is 1909.
- Returns:
Cross-validated score (mean across folds) for the given trial configuration.
- Return type:
float
- __call__(trial)
Run a single optimization trial by training the Random Forest model on cross-validation folds and returning the mean performance metric.
- Parameters:
trial (optuna.Trial) – A trial object provided by Optuna to suggest hyperparameters.
- Returns:
Mean cross-validated score for the trial.
- Return type:
float
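The three objective classes share one pattern: the constructor stores the data and cross-validation settings, and __call__ maps a trial to a mean cross-validated score. The block below is a minimal sketch of that pattern, assuming scikit-learn is installed. SketchObjectiveRF, MidpointTrial, and the search ranges are hypothetical names and values chosen for illustration; they are not pyBIA's implementation, and the stub trial exists only so the example runs without Optuna.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


class SketchObjectiveRF:
    """Hypothetical objective in the style described above (not pyBIA code)."""

    def __init__(self, data_x, data_y, opt_cv=3, scoring_metric="f1", seed=1909):
        self.data_x = data_x
        self.data_y = data_y
        self.opt_cv = opt_cv
        self.scoring_metric = scoring_metric
        self.seed = seed

    def __call__(self, trial):
        # Illustrative search ranges; the real ranges are not documented here.
        n_estimators = trial.suggest_int("n_estimators", 10, 50)
        max_depth = trial.suggest_int("max_depth", 2, 8)
        clf = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=self.seed
        )
        cv = StratifiedKFold(n_splits=self.opt_cv, shuffle=True, random_state=self.seed)
        scores = cross_val_score(
            clf, self.data_x, self.data_y, cv=cv, scoring=self.scoring_metric
        )
        # One number per trial: the mean score across folds.
        return float(np.mean(scores))


class MidpointTrial:
    """Stub that mimics trial.suggest_int by returning the range midpoint."""

    def suggest_int(self, name, low, high):
        return (low + high) // 2


# Small synthetic binary problem, for demonstration only.
rng = np.random.default_rng(0)
data_x = rng.normal(size=(60, 4))
data_y = (data_x[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)

score = SketchObjectiveRF(data_x, data_y)(MidpointTrial())
```

With Optuna installed, the same callable could be passed to the optimize method of a study created with direction="maximize", which would supply real trial objects in place of the stub.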
- pyBIA.optimization.hyper_opt(data_x=None, data_y=None, clf='xgb', n_iter=25, opt_cv=10, balance=True, scoring_metric='f1', limit_search=True, return_study=True, SEED_NO=1909)
Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.
- Parameters:
data_x (ndarray or None, optional) – 2D array with shape (n_samples, n_features) used to fit and evaluate the model; required for ‘rf’, ‘nn’, and ‘xgb’. Default is None.
data_y (array-like or None, optional) – 1D label array aligned with data_x; may be numeric or strings (strings are auto-mapped to integers for XGBoost). Default is None.
clf ({'rf','nn','xgb'}, optional) – Which classifier to tune: Random Forest (‘rf’), Scikit-learn MLP (‘nn’), or XGBoost (‘xgb’). Default is ‘xgb’.
n_iter (int, optional) – Number of Optuna trials; set to 0 to skip optimization and return the base (untuned) model. Default is 25.
opt_cv (int, optional) – Number of stratified cross-validation folds per trial. Default is 10.
balance (bool, optional) – If True, apply class weighting for binary tasks (RF: class_weight='balanced'; XGB: scale_pos_weight; MLP does not support weights). Default is True.
scoring_metric (str, optional) – Scikit-learn scoring name used for CV evaluation; for multiclass, maps to macro/OVR variants (e.g., 'f1'→'f1_macro', 'roc_auc'→'roc_auc_ovr'). Default is 'f1'.
limit_search (bool, optional) – If True, restrict the XGBoost search space to a compact, safe region to reduce runtime and memory risk. Default is True.
return_study (bool, optional) – If True, return the Optuna Study object as a third output for downstream analysis/visualization. Default is True.
SEED_NO (int, optional) – Random seed for CV splitters and the TPE sampler to ensure reproducibility. Default is 1909.
- Returns:
model (estimator) – Fitted estimator configured with the best hyperparameters found (or the base model if n_iter is 0).
params (dict) – Dictionary of the best hyperparameters from the Optuna study.
study (optuna.study.Study) – Returned only when return_study is True; contains all trials and results.
Examples
Fit a tuned Random Forest (return_study=False is needed to get two outputs, since return_study defaults to True):
>>> model, params = hyper_opt(data_x, data_y, clf='rf', n_iter=50, return_study=False)
Retrieve the Optuna study for visualization:
>>> model, params, study = hyper_opt(data_x, data_y, clf='xgb', n_iter=50, return_study=True)
>>> from optuna.visualization.matplotlib import plot_contour
>>> plot_contour(study)
- Raises:
ValueError – If clf is not one of {‘rf’, ‘nn’, ‘xgb’}.
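The scoring_metric and balance parameters both depend on the label array. The sketch below illustrates the two documented behaviors under stated assumptions: resolve_scoring_sketch and scale_pos_weight_sketch are hypothetical helpers, only the two metric mappings named above are included, and the negatives-to-positives ratio is a common convention for XGBoost's scale_pos_weight rather than a confirmed detail of hyper_opt.

```python
import numpy as np


def resolve_scoring_sketch(scoring_metric, data_y):
    """Map a binary scoring name to a multiclass variant when needed."""
    n_classes = len(np.unique(data_y))
    if n_classes <= 2:
        return scoring_metric
    # Only the mappings stated in the parameter description are listed here.
    multiclass_map = {"f1": "f1_macro", "roc_auc": "roc_auc_ovr"}
    return multiclass_map.get(scoring_metric, scoring_metric)


def scale_pos_weight_sketch(data_y):
    """Assumed convention: number of negatives divided by number of positives."""
    data_y = np.asarray(data_y)
    n_pos = np.sum(data_y == 1)
    n_neg = np.sum(data_y == 0)
    return float(n_neg) / float(n_pos)


binary_scoring = resolve_scoring_sketch("f1", [0, 1, 0, 1])
multiclass_scoring = resolve_scoring_sketch("f1", [0, 1, 2, 1])
pos_weight = scale_pos_weight_sketch([0, 0, 0, 1])
```

Here the imbalanced label array has three negatives per positive, so the assumed convention weights the positive class by 3.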
- pyBIA.optimization.borutashap_opt(data_x, data_y, boruta_trials=50, model='rf', importance_type='gain', SEED_NO=1909)
Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.
- Parameters:
data_x (ndarray) – Feature matrix of shape (n_samples, n_features) used to compute importances; must contain no NaNs.
data_y (array-like) – 1D array of labels aligned with data_x; categorical labels are internally mapped to integers.
boruta_trials (int, optional) – Number of BorutaSHAP iterations to stabilize the acceptance/rejection distributions; default is 50.
model ({'rf','xgb'}, optional) – Base estimator used to compute importances: Random Forest (‘rf’) or XGBoost (‘xgb’); default is ‘rf’.
importance_type ({'gain','weight','cover','total_gain','total_cover'}, optional) – XGBoost importance metric to use when model='xgb'; ignored for Random Forest; default is 'gain'.
SEED_NO (int, optional) – Random seed for reproducibility of the estimator and BorutaSHAP sampling; default is 1909.
- Returns:
index (ndarray) – Sorted array of selected feature indices (dtype=int) referring to columns in data_x.
feat_selector (BorutaSHAP) – Fitted BorutaSHAP selector object containing selection history and plotting utilities.
- Raises:
ValueError – If model is not one of {'rf','xgb'}, if data_x contains NaNs, or if BorutaSHAP fitting fails.
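Because data_x must contain no NaNs and the function returns column indices rather than a reduced matrix, callers typically validate the input first and then slice by the returned index. The sketch below shows both steps with NumPy only. check_no_nans_sketch is a hypothetical helper, and the index array is a made-up stand-in for the first return value of borutashap_opt.

```python
import numpy as np


def check_no_nans_sketch(data_x):
    """Raise ValueError if the feature matrix contains NaNs (hypothetical helper)."""
    data_x = np.asarray(data_x, dtype=float)
    if np.isnan(data_x).any():
        raise ValueError("data_x contains NaNs; impute before feature selection.")
    return data_x


data_x = check_no_nans_sketch([[1.0, 2.0, 3.0],
                               [4.0, 5.0, 6.0],
                               [7.0, 8.0, 9.0]])

# Stand-in for the selected indices; real values come from the selector.
index = np.array([0, 2])

# Keep only the selected columns for downstream training.
reduced_x = data_x[:, index]
```

The same index should be applied to any later data passed to a model trained on reduced_x, so that the columns stay aligned.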
- pyBIA.optimization.standardize_data(data_x, method='min-max', return_scaler=True)
Scale features with a chosen strategy for models sensitive to input range.
- Parameters:
data_x (ndarray) – Feature matrix of shape (n_samples, n_features) to be transformed.
method ({'min-max','robust','standard'}, optional) – Scaling strategy to apply: ‘min-max’ rescales each feature to [0, 1]; ‘robust’ centers by the median and scales by the IQR; ‘standard’ centers to mean 0 and scales to unit variance. Default is ‘min-max’.
return_scaler (bool, optional) – If True, return the fitted scaler object along with the transformed data; if False, return only the transformed data. Default is True.
- Returns:
norm_data_x (ndarray) – Scaled feature matrix of shape (n_samples, n_features).
scaler (sklearn.base.TransformerMixin) – Fitted scaler instance (MinMaxScaler, RobustScaler, or StandardScaler); returned only when return_scaler is True.
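The three strategies reduce to simple per-column formulas. The NumPy sketch below writes them out to make the differences concrete. scale_sketch is a hypothetical function, not the pyBIA implementation (which returns scikit-learn scaler objects), and unlike those scalers it does not guard against constant columns, where the denominator would be zero.

```python
import numpy as np


def scale_sketch(data_x, method="min-max"):
    """Per-column scaling written out by hand (illustrative only)."""
    data_x = np.asarray(data_x, dtype=float)
    if method == "min-max":
        lo = data_x.min(axis=0)
        hi = data_x.max(axis=0)
        return (data_x - lo) / (hi - lo)  # each feature lands in [0, 1]
    if method == "robust":
        q1, median, q3 = np.percentile(data_x, [25, 50, 75], axis=0)
        return (data_x - median) / (q3 - q1)  # center by median, scale by IQR
    if method == "standard":
        # Population standard deviation (ddof=0) gives unit variance per column.
        return (data_x - data_x.mean(axis=0)) / data_x.std(axis=0)
    raise ValueError("method must be 'min-max', 'robust', or 'standard'")


demo_x = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0],
                   [5.0, 50.0]])

minmax_x = scale_sketch(demo_x, "min-max")
robust_x = scale_sketch(demo_x, "robust")
standard_x = scale_sketch(demo_x, "standard")
```

Returning the fitted scaler matters in practice because new data should be transformed with the statistics learned from the training set, not with statistics recomputed on the new data.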
- pyBIA.optimization.impute_missing_values(data, imputer=None, strategy='knn', k=3, constant_value=0)
Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.
- Parameters:
data (ndarray) – Array of shape (n_samples, n_features) containing NaNs to be imputed.
imputer (sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer | None, optional) – Pre-fitted imputer to apply; if None, a new imputer is created and fitted on data. Default is None.
strategy ({'knn','mean','median','mode','constant'}, optional) – Imputation strategy to use. Default is ‘knn’.
k (int, optional) – Number of neighbors for KNN imputation (used only when strategy='knn'). Default is 3.
constant_value (float or int, optional) – Fill value for constant imputation (used only when strategy='constant'). Default is 0.
- Returns:
imputed_data (ndarray) – Array with missing values filled.
imputer (sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer) – Fitted imputer returned only when a new imputer is created (i.e., when input imputer is None).
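The imputer argument supports a fit-once, reuse-later workflow: fit on training data, then apply the same fitted imputer to new data. The sketch below illustrates that workflow directly with scikit-learn, assuming it is installed. make_imputer_sketch is a hypothetical helper, and mapping 'mode' to SimpleImputer's 'most_frequent' strategy is an assumption about how the strategy names correspond.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer


def make_imputer_sketch(strategy="knn", k=3, constant_value=0):
    """Build an unfitted scikit-learn imputer for a strategy name (illustrative)."""
    if strategy == "knn":
        return KNNImputer(n_neighbors=k)
    if strategy == "mode":
        # Assumed mapping: 'mode' corresponds to scikit-learn's 'most_frequent'.
        return SimpleImputer(strategy="most_frequent")
    if strategy == "constant":
        return SimpleImputer(strategy="constant", fill_value=constant_value)
    if strategy in ("mean", "median"):
        return SimpleImputer(strategy=strategy)
    raise ValueError("unknown strategy: %r" % (strategy,))


train = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan],
                  [5.0, 8.0]])
new = np.array([[np.nan, 6.0]])

# Fit on the training data, then reuse the fitted imputer on new data.
imputer = make_imputer_sketch("median")
train_filled = imputer.fit_transform(train)
new_filled = imputer.transform(new)
```

The missing entry in the new row is filled with the training-set median of its column, which is the behavior the pre-fitted imputer option is meant to preserve.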
- pyBIA.optimization.Strawman_imputation(data)
Median (“strawman”) imputation for missing values.
- Parameters:
data (ndarray) – Input array of shape (n_samples, n_features) or (n_features,). Missing values are assumed to be encoded as NaN or ±inf. For 1D input, a single global median (over finite values) is used. For 2D input, medians are computed column-wise.
- Returns:
imputed – Array with the same shape as data in which missing entries have been replaced by the corresponding median(s).
- Return type:
ndarray
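The behavior described for 1D and 2D input can be sketched in a few lines of NumPy. median_impute_sketch is a hypothetical reimplementation written from the description above, not the library's code. It treats NaN and ±inf as missing, and it does not handle the edge case of a column with no finite values.

```python
import numpy as np


def median_impute_sketch(data):
    """Replace NaN/±inf with the median of the finite values (illustrative)."""
    out = np.array(data, dtype=float)  # copy so the input is left untouched
    missing = ~np.isfinite(out)
    if out.ndim == 1:
        # 1D input: one global median over the finite values.
        out[missing] = np.median(out[~missing])
        return out
    # 2D input: column-wise medians computed over finite values only.
    masked = np.where(missing, np.nan, out)
    col_medians = np.nanmedian(masked, axis=0)
    rows, cols = np.nonzero(missing)
    out[rows, cols] = col_medians[cols]
    return out


filled_2d = median_impute_sketch([[1.0, np.nan],
                                  [3.0, 4.0],
                                  [np.inf, 6.0]])
filled_1d = median_impute_sketch([1.0, np.nan, 3.0, -np.inf])
```

In the 2D case the first column's finite values are 1 and 3, so its missing entry becomes 2; the second column's finite values are 4 and 6, so its missing entry becomes 5.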