pyBIA.optimization
==================

.. py:module:: pyBIA.optimization

.. autoapi-nested-parse::

   Created on Wed Sep 11 12:04:23 2021

   @author: daniel


Classes
-------

.. autoapisummary::

   pyBIA.optimization.objective_xgb
   pyBIA.optimization.objective_nn
   pyBIA.optimization.objective_rf


Functions
---------

.. autoapisummary::

   pyBIA.optimization.hyper_opt
   pyBIA.optimization.borutashap_opt
   pyBIA.optimization.standardize_data
   pyBIA.optimization.impute_missing_values
   pyBIA.optimization.Strawman_imputation


Module Contents
---------------

.. py:class:: objective_xgb(data_x, data_y, limit_search=False, opt_cv=3, scoring_metric='f1', SEED_NO=1909)

   Bases: :py:obj:`object`

   Optuna objective class for optimizing an XGBoost classifier using cross-validation.

   This class defines the optimization logic for tuning XGBoost hyperparameters using the Optuna framework.
   It supports limited or broad search spaces depending on the `limit_search` flag, and returns the
   cross-validated performance metric for each trial.

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: Corresponding class labels of shape (n_samples,).
   :type data_y: ndarray or array-like
   :param limit_search: If True, restricts the hyperparameter search space to a narrower range. Defaults to False (broad search).
   :type limit_search: bool, optional
   :param opt_cv: Number of cross-validation folds. Must be >= 2. Default is 3.
   :type opt_cv: int, optional
   :param scoring_metric: Evaluation metric used during optimization. Options are: ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']. Default is 'f1'.
   :type scoring_metric: str, optional
   :param SEED_NO: Random seed for reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: Cross-validated score (mean across folds) for the given trial configuration.
   :rtype: float

   .. py:attribute:: data_x

   .. py:attribute:: data_y

   .. py:attribute:: limit_search
      :value: False

   .. py:attribute:: opt_cv
      :value: 3

   .. py:attribute:: SEED_NO
      :value: 1909

   .. py:attribute:: n_classes

   .. py:method:: __call__(trial)

      Run a single optimization trial by training the XGBoost model on cross-validation folds
      and returning the mean performance metric.

      :param trial: A trial object provided by Optuna to suggest hyperparameters.
      :type trial: optuna.Trial

      :returns: Mean cross-validated score for the trial.
      :rtype: float


.. py:class:: objective_nn(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)

   Bases: :py:obj:`object`

   Optuna objective class for optimizing an MLP classifier using cross-validation.

   This class defines the optimization logic for tuning MLP hyperparameters using the Optuna framework
   and returns the cross-validated performance metric for each trial.

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: Corresponding class labels of shape (n_samples,).
   :type data_y: ndarray or array-like
   :param opt_cv: Number of cross-validation folds. Must be >= 2.
   :type opt_cv: int
   :param scoring_metric: Evaluation metric used during optimization. Options are: ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']. Default is 'f1'.
   :type scoring_metric: str, optional
   :param SEED_NO: Random seed for reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: Cross-validated score (mean across folds) for the given trial configuration.
   :rtype: float

   .. py:attribute:: data_x

   .. py:attribute:: data_y

   .. py:attribute:: opt_cv

   .. py:attribute:: SEED_NO
      :value: 1909

   .. py:method:: __call__(trial)

      Run a single optimization trial by training the MLP model on cross-validation folds
      and returning the mean performance metric.


      :param trial: A trial object provided by Optuna to suggest hyperparameters.
      :type trial: optuna.Trial

      :returns: Mean cross-validated score for the trial.
      :rtype: float


.. py:class:: objective_rf(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)

   Bases: :py:obj:`object`

   Optuna objective class for optimizing a Random Forest classifier using cross-validation.

   This class defines the optimization logic for tuning Random Forest hyperparameters using the Optuna framework
   and returns the cross-validated performance metric for each trial.

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: Corresponding class labels of shape (n_samples,).
   :type data_y: ndarray or array-like
   :param opt_cv: Number of cross-validation folds. Must be >= 2.
   :type opt_cv: int
   :param scoring_metric: Evaluation metric used during optimization. Options are: ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']. Default is 'f1'.
   :type scoring_metric: str, optional
   :param SEED_NO: Random seed for reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: Cross-validated score (mean across folds) for the given trial configuration.
   :rtype: float

   .. py:attribute:: data_x

   .. py:attribute:: data_y

   .. py:attribute:: opt_cv

   .. py:attribute:: SEED_NO
      :value: 1909

   .. py:method:: __call__(trial)

      Run a single optimization trial by training the Random Forest model on cross-validation folds
      and returning the mean performance metric.

      :param trial: A trial object provided by Optuna to suggest hyperparameters.
      :type trial: optuna.Trial

      :returns: Mean cross-validated score for the trial.
      :rtype: float


.. py:function:: hyper_opt(data_x=None, data_y=None, clf='xgb', n_iter=25, opt_cv=10, balance=True, scoring_metric='f1', limit_search=True, return_study=True, SEED_NO=1909)

   Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.

   :param data_x: 2D array with shape (n_samples, n_features) used to fit and evaluate the model; required for 'rf', 'nn', and 'xgb'. Default is None.
   :type data_x: ndarray or None, optional
   :param data_y: 1D label array aligned with `data_x`; may be numeric or strings (strings are auto-mapped to integers for XGBoost). Default is None.
   :type data_y: array-like or None, optional
   :param clf: Which classifier to tune: Random Forest ('rf'), scikit-learn MLP ('nn'), or XGBoost ('xgb'). Default is 'xgb'.
   :type clf: {'rf','nn','xgb'}, optional
   :param n_iter: Number of Optuna trials; set to 0 to skip optimization and return the base (untuned) model. Default is 25.
   :type n_iter: int, optional
   :param opt_cv: Number of stratified cross-validation folds per trial. Default is 10.
   :type opt_cv: int, optional
   :param balance: If True, apply class weighting for binary tasks (RF: `class_weight='balanced'`; XGB: `scale_pos_weight`; MLP does not support weights). Default is True.
   :type balance: bool, optional
   :param scoring_metric: Scikit-learn scoring name used for CV evaluation; for multiclass, maps to macro/OVR variants (e.g., 'f1'→'f1_macro', 'roc_auc'→'roc_auc_ovr'). Default is 'f1'.
   :type scoring_metric: str, optional
   :param limit_search: If True, restrict the XGBoost search space to a compact, safe region to reduce runtime and memory risk. Default is True.
   :type limit_search: bool, optional
   :param return_study: If True, return the Optuna `Study` object as a third output for downstream analysis/visualization. Default is True.
   :type return_study: bool, optional
   :param SEED_NO: Random seed for CV splitters and the TPE sampler to ensure reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: * **model** (*estimator*) -- Fitted estimator configured with the best hyperparameters found (or the base model if `n_iter` is 0).
             * **params** (*dict*) -- Dictionary of the best hyperparameters from the Optuna study.
             * **study** (*optuna.study.Study*) -- Returned only when `return_study` is True; contains all trials and results.

   .. rubric:: Examples

   Fit a tuned Random Forest (pass `return_study=False` to get only two outputs):

   >>> model, params = hyper_opt(data_x, data_y, clf='rf', n_iter=50, return_study=False)

   Retrieve the Optuna study for visualization:

   >>> model, params, study = hyper_opt(data_x, data_y, clf='xgb', n_iter=50, return_study=True)
   >>> from optuna.visualization.matplotlib import plot_contour
   >>> plot_contour(study)

   :raises ValueError: If `clf` is not one of {'rf', 'nn', 'xgb'}.


.. py:function:: borutashap_opt(data_x, data_y, boruta_trials=50, model='rf', importance_type='gain', SEED_NO=1909)

   Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.

   :param data_x: Feature matrix of shape (n_samples, n_features) used to compute importances; must contain no NaNs.
   :type data_x: ndarray
   :param data_y: 1D array of labels aligned with `data_x`; categorical labels are internally mapped to integers.
   :type data_y: array-like
   :param boruta_trials: Number of BorutaSHAP iterations to stabilize the acceptance/rejection distributions; default is 50.
   :type boruta_trials: int, optional
   :param model: Base estimator used to compute importances: Random Forest ('rf') or XGBoost ('xgb'); default is 'rf'.
   :type model: {'rf','xgb'}, optional
   :param importance_type: XGBoost importance metric to use when `model='xgb'`; ignored for Random Forest; default is 'gain'.
   :type importance_type: {'gain','weight','cover','total_gain','total_cover'}, optional
   :param SEED_NO: Random seed for reproducibility of the estimator and BorutaSHAP sampling; default is 1909.
   :type SEED_NO: int, optional

   :returns: * **index** (*ndarray*) -- Sorted array of selected feature indices (dtype=int) referring to columns in `data_x`.
             * **feat_selector** (*BorutaSHAP*) -- Fitted BorutaSHAP selector object containing selection history and plotting utilities.

   :raises ValueError: If `model` is not one of {'rf','xgb'}, if `data_x` contains NaNs, or if BorutaSHAP fitting fails.


.. py:function:: standardize_data(data_x, method='min-max', return_scaler=True)

   Scale features with a chosen strategy for models sensitive to input range.

   :param data_x: Feature matrix of shape (n_samples, n_features) to be transformed.
   :type data_x: ndarray
   :param method: Scaling strategy to apply: 'min-max' rescales each feature to [0, 1]; 'robust' centers by the median and scales by the IQR; 'standard' centers to mean 0 and scales to unit variance. Default is 'min-max'.
   :type method: {'min-max','robust','standard'}, optional
   :param return_scaler: If True, return the fitted scaler object along with the transformed data; if False, return only the transformed data. Default is True.
   :type return_scaler: bool, optional

   :returns: * **norm_data_x** (*ndarray*) -- Scaled feature matrix of shape (n_samples, n_features).
             * **scaler** (*sklearn.base.TransformerMixin*) -- Fitted scaler instance (MinMaxScaler, RobustScaler, or StandardScaler); returned only when `return_scaler` is True.


.. py:function:: impute_missing_values(data, imputer=None, strategy='knn', k=3, constant_value=0)

   Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.

   :param data: Array of shape (n_samples, n_features) containing NaNs to be imputed.
   :type data: ndarray
   :param imputer: Pre-fitted imputer to apply; if None, a new imputer is created and fitted on `data`. Default is None.
   :type imputer: sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer | None, optional
   :param strategy: Imputation strategy to use. Default is 'knn'.
   :type strategy: {'knn','mean','median','mode','constant'}, optional
   :param k: Number of neighbors for KNN imputation (used only when `strategy='knn'`). Default is 3.
   :type k: int, optional
   :param constant_value: Fill value for constant imputation (used only when `strategy='constant'`). Default is 0.
   :type constant_value: float or int, optional

   :returns: * **imputed_data** (*ndarray*) -- Array with missing values filled.
             * **imputer** (*sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer*) -- Fitted imputer returned only when a new imputer is created (i.e., when input `imputer` is None).


.. py:function:: Strawman_imputation(data)

   Median (“strawman”) imputation for missing values.

   :param data: Input array of shape (n_samples, n_features) or (n_features,). Missing values are assumed to be encoded as NaN or ±inf. For 1D input, a single global median (over finite values) is used. For 2D input, medians are computed column-wise.
   :type data: ndarray

   :returns: **imputed** -- Array with the same shape as `data` in which missing entries have been replaced by the corresponding median(s).
   :rtype: ndarray
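
The median rule described for ``Strawman_imputation`` can be illustrated with a short NumPy sketch. This is not pyBIA's implementation: ``median_fill`` is a hypothetical helper written only for illustration, and it assumes, as the description above states, that missing values are encoded as NaN or ±inf and that medians are taken over finite values only (globally for 1D input, column-wise for 2D input).

```python
import numpy as np


def median_fill(data):
    """Hypothetical stand-in for median ("strawman") imputation.

    Replaces NaN and +/-inf entries with the median of the finite
    values: one global median for 1D input, one per column for 2D.
    """
    arr = np.asarray(data, dtype=float)
    missing = ~np.isfinite(arr)
    # Mask every non-finite entry as NaN so nanmedian ignores it.
    medians = np.nanmedian(np.where(missing, np.nan, arr), axis=0)
    # Broadcasting applies the scalar (1D) or per-column (2D) medians.
    return np.where(missing, medians, arr)


x = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.inf],
    [5.0, 40.0],
])
filled = median_fill(x)
# Column medians over finite values are 3.0 and 20.0,
# so the NaN becomes 3.0 and the inf becomes 20.0.
print(filled)
```

A column with no finite values would come out as NaN in this sketch; how pyBIA treats that case is not stated in this reference.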