pyBIA.optimization
==================

.. py:module:: pyBIA.optimization

.. autoapi-nested-parse::

   Created on Wed Sep  11 12:04:23 2021

   @author: daniel


Classes
-------

.. autoapisummary::

   pyBIA.optimization.objective_xgb
   pyBIA.optimization.objective_nn
   pyBIA.optimization.objective_rf


Functions
---------

.. autoapisummary::

   pyBIA.optimization.hyper_opt
   pyBIA.optimization.borutashap_opt
   pyBIA.optimization.standardize_data
   pyBIA.optimization.impute_missing_values
   pyBIA.optimization.Strawman_imputation


Module Contents
---------------

.. py:class:: objective_xgb(data_x, data_y, limit_search=False, opt_cv=3, scoring_metric='f1', SEED_NO=1909)

   Bases: :py:obj:`object`


   Optuna objective class for optimizing an XGBoost classifier using cross-validation.

   This class defines the optimization logic for tuning XGBoost hyperparameters using
   the Optuna framework. It supports limited or broad search spaces depending on
   the `limit_search` flag, and returns the cross-validated performance metric for
   each trial.

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: Corresponding class labels of shape (n_samples,).
   :type data_y: ndarray or array-like
   :param limit_search: If True, restricts the hyperparameter search space to a narrower range.
                        Defaults to False (broad search).
   :type limit_search: bool, optional
   :param opt_cv: Number of cross-validation folds. Must be >= 2. Default is 3.
   :type opt_cv: int, optional
   :param scoring_metric: Evaluation metric used during optimization. Options are:
                          ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']. Default is 'f1'.
   :type scoring_metric: str, optional
   :param SEED_NO: Random seed for reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: Cross-validated score (mean across folds) for the given trial configuration.
   :rtype: float


   .. py:attribute:: data_x


   .. py:attribute:: data_y


   .. py:attribute:: limit_search
      :value: False


   .. py:attribute:: opt_cv
      :value: 3


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:attribute:: n_classes


   .. py:method:: __call__(trial)

      Run a single optimization trial by training the XGBoost model on cross-validation folds
      and returning the mean performance metric.

      :param trial: A trial object provided by Optuna to suggest hyperparameters.
      :type trial: optuna.Trial

      :returns: Mean cross-validated score for the trial.
      :rtype: float


.. py:class:: objective_nn(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)

   Bases: :py:obj:`object`


   Optuna objective class for optimizing an MLP classifier using cross-validation.

   This class defines the optimization logic for tuning the hyperparameters of an MLP
   classifier using the Optuna framework, and returns the cross-validated performance
   metric for each trial.

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: Corresponding class labels of shape (n_samples,).
   :type data_y: ndarray or array-like
   :param opt_cv: Number of cross-validation folds. Must be >= 2. Required; the signature defines no default.
   :type opt_cv: int
   :param scoring_metric: Evaluation metric used during optimization. Options are:
                          ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']. Default is 'f1'.
   :type scoring_metric: str, optional
   :param SEED_NO: Random seed for reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: Cross-validated score (mean across folds) for the given trial configuration.
   :rtype: float


   .. py:attribute:: data_x


   .. py:attribute:: data_y


   .. py:attribute:: opt_cv


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:method:: __call__(trial)

      Run a single optimization trial by training the MLP model on cross-validation folds
      and returning the mean performance metric.

      :param trial: A trial object provided by Optuna to suggest hyperparameters.
      :type trial: optuna.Trial

      :returns: Mean cross-validated score for the trial.
      :rtype: float


.. py:class:: objective_rf(data_x, data_y, opt_cv, scoring_metric='f1', SEED_NO=1909)

   Bases: :py:obj:`object`


   Optuna objective class for optimizing a Random Forest classifier using cross-validation.

   This class defines the optimization logic for tuning the hyperparameters of a Random
   Forest classifier using the Optuna framework, and returns the cross-validated
   performance metric for each trial.

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: Corresponding class labels of shape (n_samples,).
   :type data_y: ndarray or array-like
   :param opt_cv: Number of cross-validation folds. Must be >= 2. Required; the signature defines no default.
   :type opt_cv: int
   :param scoring_metric: Evaluation metric used during optimization. Options are:
                          ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']. Default is 'f1'.
   :type scoring_metric: str, optional
   :param SEED_NO: Random seed for reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: Cross-validated score (mean across folds) for the given trial configuration.
   :rtype: float


   .. py:attribute:: data_x


   .. py:attribute:: data_y


   .. py:attribute:: opt_cv


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:method:: __call__(trial)

      Run a single optimization trial by training the Random Forest model on cross-validation folds
      and returning the mean performance metric.

      :param trial: A trial object provided by Optuna to suggest hyperparameters.
      :type trial: optuna.Trial

      :returns: Mean cross-validated score for the trial.
      :rtype: float


.. py:function:: hyper_opt(data_x=None, data_y=None, clf='xgb', n_iter=25, opt_cv=10, balance=True, scoring_metric='f1', limit_search=True, return_study=True, SEED_NO=1909)

   Optimize model hyperparameters with Optuna using stratified k-fold cross-validation.

   :param data_x: 2D array with shape (n_samples, n_features) used to fit and evaluate the model; required for 'rf', 'nn', and 'xgb'. Default is None.
   :type data_x: ndarray or None, optional
   :param data_y: 1D label array aligned with `data_x`; may be numeric or strings (strings are auto-mapped to integers for XGBoost). Default is None.
   :type data_y: array-like or None, optional
   :param clf: Which classifier to tune: Random Forest ('rf'), Scikit-learn MLP ('nn'), or XGBoost ('xgb'). Default is 'xgb'.
   :type clf: {'rf','nn','xgb'}, optional
   :param n_iter: Number of Optuna trials; set to 0 to skip optimization and return the base (untuned) model. Default is 25.
   :type n_iter: int, optional
   :param opt_cv: Number of stratified cross-validation folds per trial. Default is 10.
   :type opt_cv: int, optional
   :param balance: If True, apply class weighting for binary tasks (RF: `class_weight='balanced'`; XGB: `scale_pos_weight`; MLP does not support weights). Default is True.
   :type balance: bool, optional
   :param scoring_metric: Scikit-learn scoring name used for CV evaluation; for multiclass, maps to macro/OVR variants (e.g., 'f1'→'f1_macro', 'roc_auc'→'roc_auc_ovr'). Default is 'f1'.
   :type scoring_metric: str, optional
   :param limit_search: If True, restrict the XGBoost search space to a compact, safe region to reduce runtime and memory risk. Default is True.
   :type limit_search: bool, optional
   :param return_study: If True, return the Optuna `Study` object as a third output for downstream analysis/visualization. Default is True.
   :type return_study: bool, optional
   :param SEED_NO: Random seed for CV splitters and the TPE sampler to ensure reproducibility. Default is 1909.
   :type SEED_NO: int, optional

   :returns: * **model** (*estimator*) -- Fitted estimator configured with the best hyperparameters found (or the base model if `n_iter` is 0).
             * **params** (*dict*) -- Dictionary of the best hyperparameters from the Optuna study.
             * **study** (*optuna.study.Study*) -- Returned only when `return_study` is True; contains all trials and results.

   .. rubric:: Examples

   Fit a tuned Random Forest without returning the study (``return_study`` defaults to True,
   so it must be disabled to unpack two values):

   >>> model, params = hyper_opt(data_x, data_y, clf='rf', n_iter=50, return_study=False)

   Retrieve the Optuna study for visualization:

   >>> model, params, study = hyper_opt(data_x, data_y, clf='xgb', n_iter=50, return_study=True)
   >>> from optuna.visualization.matplotlib import plot_contour
   >>> plot_contour(study)

   :raises ValueError: If `clf` is not one of {'rf', 'nn', 'xgb'}.
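
The multiclass metric mapping and the binary class weighting described for ``hyper_opt`` can be illustrated with a small standalone sketch. The helper names ``resolve_scoring`` and ``binary_scale_pos_weight`` are hypothetical and are not part of pyBIA. Only the 'f1' and 'roc_auc' mappings come from the parameter description above; the 'precision' and 'recall' entries are assumed by analogy, and the negative-to-positive ratio is a common choice for ``scale_pos_weight`` rather than a confirmed detail of pyBIA's implementation.

```python
from collections import Counter

# Hypothetical lookup table; not part of pyBIA's public API.
_MULTICLASS_SCORING = {
    "f1": "f1_macro",                # stated in the scoring_metric description
    "roc_auc": "roc_auc_ovr",        # stated in the scoring_metric description
    "precision": "precision_macro",  # assumption, by analogy with 'f1'
    "recall": "recall_macro",        # assumption, by analogy with 'f1'
}


def resolve_scoring(scoring_metric, labels):
    """Return a scikit-learn scoring name suited to the number of classes."""
    n_classes = len(set(labels))
    if n_classes > 2:
        # Metrics without a multiclass variant (e.g. 'accuracy') pass through.
        return _MULTICLASS_SCORING.get(scoring_metric, scoring_metric)
    return scoring_metric


def binary_scale_pos_weight(labels, positive_label=1):
    """Ratio of negative to positive samples (assumed weighting rule)."""
    counts = Counter(labels)
    n_pos = counts[positive_label]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples found")
    return n_neg / n_pos


binary_labels = [0, 0, 0, 1]
multiclass_labels = [0, 1, 2, 2]

print(resolve_scoring("f1", binary_labels))
print(resolve_scoring("f1", multiclass_labels))
print(binary_scale_pos_weight(binary_labels))
```

With three negatives and one positive, the assumed weighting rule gives the positive class three times the weight of the negative class.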


.. py:function:: borutashap_opt(data_x, data_y, boruta_trials=50, model='rf', importance_type='gain', SEED_NO=1909)

   Run BorutaSHAP feature selection (Boruta + SHAP) and return selected feature indices.

   :param data_x: Feature matrix of shape (n_samples, n_features) used to compute importances; must contain no NaNs.
   :type data_x: ndarray
   :param data_y: 1D array of labels aligned with `data_x`; categorical labels are internally mapped to integers.
   :type data_y: array-like
   :param boruta_trials: Number of BorutaSHAP iterations to stabilize the acceptance/rejection distributions; default is 50.
   :type boruta_trials: int, optional
   :param model: Base estimator used to compute importances: Random Forest ('rf') or XGBoost ('xgb'); default is 'rf'.
   :type model: {'rf','xgb'}, optional
   :param importance_type: XGBoost importance metric to use when `model='xgb'`; ignored for Random Forest; default is 'gain'.
   :type importance_type: {'gain','weight','cover','total_gain','total_cover'}, optional
   :param SEED_NO: Random seed for reproducibility of the estimator and BorutaSHAP sampling; default is 1909.
   :type SEED_NO: int, optional

   :returns: * **index** (*ndarray*) -- Sorted array of selected feature indices (dtype=int) referring to columns in `data_x`.
             * **feat_selector** (*BorutaSHAP*) -- Fitted BorutaSHAP selector object containing selection history and plotting utilities.

   :raises ValueError: If `model` is not one of {'rf','xgb'}, if `data_x` contains NaNs, or if BorutaSHAP fitting fails.


.. py:function:: standardize_data(data_x, method='min-max', return_scaler=True)

   Scale features with a chosen strategy for models sensitive to input range.

   :param data_x: Feature matrix of shape (n_samples, n_features) to be transformed.
   :type data_x: ndarray
   :param method: Scaling strategy to apply: 'min-max' rescales each feature to [0, 1];
                  'robust' centers by the median and scales by the IQR; 'standard' centers
                  to mean 0 and scales to unit variance. Default is 'min-max'.
   :type method: {'min-max','robust','standard'}, optional
   :param return_scaler: If True, return the fitted scaler object along with the transformed data;
                         if False, return only the transformed data. Default is True.
   :type return_scaler: bool, optional

   :returns: * **norm_data_x** (*ndarray*) -- Scaled feature matrix of shape (n_samples, n_features).
             * **scaler** (*sklearn.base.TransformerMixin*) -- Fitted scaler instance (MinMaxScaler, RobustScaler, or StandardScaler);
               returned only when `return_scaler` is True.
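
The three scaling strategies can be written out directly in NumPy to make the arithmetic explicit. This is a minimal sketch, not pyBIA's implementation: per the return description, ``standardize_data`` returns scikit-learn scaler objects, whereas the hypothetical ``scale_features`` below only reproduces the column-wise formulas (population standard deviation for 'standard', the 25th to 75th percentile range for 'robust').

```python
import numpy as np


def scale_features(data_x, method="min-max"):
    """Column-wise scaling sketch; returns a new float array."""
    data_x = np.asarray(data_x, dtype=float)
    if method == "min-max":
        center = data_x.min(axis=0)
        scale = data_x.max(axis=0) - center
    elif method == "robust":
        center = np.median(data_x, axis=0)
        q75, q25 = np.percentile(data_x, [75, 25], axis=0)
        scale = q75 - q25  # interquartile range per column
    elif method == "standard":
        center = data_x.mean(axis=0)
        scale = data_x.std(axis=0)  # population std (ddof=0)
    else:
        raise ValueError("method must be 'min-max', 'robust', or 'standard'")
    # Guard constant columns against division by zero.
    scale = np.where(scale == 0, 1.0, scale)
    return (data_x - center) / scale


features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
min_max_scaled = scale_features(features, "min-max")
robust_scaled = scale_features(features, "robust")
standard_scaled = scale_features(features, "standard")
print(min_max_scaled)
```

Unlike a fitted scaler object, this sketch keeps no state, so it cannot apply the training-set statistics to new data; that is the reason to keep the scaler returned when ``return_scaler`` is True.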


.. py:function:: impute_missing_values(data, imputer=None, strategy='knn', k=3, constant_value=0)

   Impute missing values using mean/median/mode, a constant, or k-nearest neighbors.

   :param data: Array of shape (n_samples, n_features) containing NaNs to be imputed.
   :type data: ndarray
   :param imputer: Pre-fitted imputer to apply; if None, a new imputer is created and fitted on `data`. Default is None.
   :type imputer: sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer | None, optional
   :param strategy: Imputation strategy to use. Default is 'knn'.
   :type strategy: {'knn','mean','median','mode','constant'}, optional
   :param k: Number of neighbors for KNN imputation (used only when `strategy='knn'`). Default is 3.
   :type k: int, optional
   :param constant_value: Fill value for constant imputation (used only when `strategy='constant'`). Default is 0.
   :type constant_value: float or int, optional

   :returns: * **imputed_data** (*ndarray*) -- Array with missing values filled.
             * **imputer** (*sklearn.impute.SimpleImputer | sklearn.impute.KNNImputer*) -- Fitted imputer returned only when a new imputer is created (i.e., when input `imputer` is None).
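
The reason for passing a pre-fitted imputer is that fill values should be learned from the training data and then reused unchanged on new data. The sketch below illustrates that fit-then-reuse pattern with plain NumPy mean imputation as a stand-in for the scikit-learn imputers; ``fit_mean_fill`` and ``apply_fill`` are hypothetical names, not pyBIA functions.

```python
import numpy as np


def fit_mean_fill(train):
    """Learn one fill value per column from the observed (non-NaN) entries."""
    return np.nanmean(np.asarray(train, dtype=float), axis=0)


def apply_fill(data, fill_values):
    """Replace NaNs with previously learned per-column fill values."""
    data = np.array(data, dtype=float)  # work on a copy
    rows, cols = np.nonzero(np.isnan(data))
    data[rows, cols] = fill_values[cols]
    return data


train = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
test = np.array([[np.nan, 5.0], [2.0, np.nan]])

fill_values = fit_mean_fill(train)            # statistics come from train only
train_filled = apply_fill(train, fill_values)
test_filled = apply_fill(test, fill_values)   # reused, not re-fitted
print(test_filled)
```

Re-fitting on the test set instead would leak its statistics into preprocessing, which is what reusing the fitted imputer avoids.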


.. py:function:: Strawman_imputation(data)

   Median (“strawman”) imputation for missing values.

   :param data: Input array of shape (n_samples, n_features) or (n_features,). Missing values
                are assumed to be encoded as NaN or ±inf. For 1D input, a single global median
                (over finite values) is used. For 2D input, medians are computed column-wise.
   :type data: ndarray

   :returns: **imputed** -- Array with the same shape as `data` in which missing entries have been
             replaced by the corresponding median(s).
   :rtype: ndarray