pyBIA.ensemble_model
====================

.. py:module:: pyBIA.ensemble_model

.. autoapi-nested-parse::

   Created on Wed Sep 8 10:04:23 2021

   @author: daniel


Classes
-------

.. autoapisummary::

   pyBIA.ensemble_model.Classifier


Functions
---------

.. autoapisummary::

   pyBIA.ensemble_model.format_labels
   pyBIA.ensemble_model.evaluate_model
   pyBIA.ensemble_model.generate_matrix
   pyBIA.ensemble_model.generate_plot
   pyBIA.ensemble_model._set_style_


Module Contents
---------------

.. py:class:: Classifier(data_x=None, data_y=None, clf='rf', optimize=False, opt_cv=10, scoring_metric='f1', limit_search=True, impute=True, imp_method='knn', n_iter=25, boruta_trials=50, boruta_model='rf', balance=True, csv_file=None, SEED_NO=1909)

   Creates a machine-learning classifier with optional imputation, BorutaSHAP
   feature selection, and Optuna hyperparameter optimization. Utilities are
   provided to save/load artifacts and to plot diagnostics (t-SNE, confusion
   matrix, ROC, optimization history, and importances).

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: 1D array of labels aligned to `data_x`.
   :type data_y: array-like
   :param clf: Estimator to build. One of {'rf','nn','xgb','histgb','adaboost','svc',
               'logreg','bdt','gaussian_nb','knn','extratrees','tree','ocsvm'}.
               Defaults to 'rf'.
   :type clf: str
   :param optimize: Run BorutaSHAP (when `boruta_trials` > 0) and Optuna search before fitting.
                    Defaults to False.
   :type optimize: bool
   :param opt_cv: Number of cross-validation folds used during optimization. Defaults to 10.
   :type opt_cv: int
   :param scoring_metric: Metric optimized by Optuna. One of {'accuracy','f1','precision','recall','roc_auc'}.
                          Defaults to 'f1'.
   :type scoring_metric: str
   :param limit_search: Constrain very wide hyperparameter ranges for practicality. Defaults to True.
   :type limit_search: bool
   :param impute: Impute missing values prior to fitting. Defaults to True.
   :type impute: bool
   :param imp_method: Imputation strategy. One of {'knn','mean','median','mode','constant'}.
                      Defaults to 'knn'.
   :type imp_method: str
   :param n_iter: Number of Optuna trials; use 0 to skip search. Defaults to 25.
   :type n_iter: int
   :param boruta_trials: Number of BorutaSHAP trials; use 0 to skip feature selection. Defaults to 50.
   :type boruta_trials: int
   :param boruta_model: Base estimator for BorutaSHAP, independent of `clf`. One of {'rf','xgb'}.
                        Defaults to 'rf'.
   :type boruta_model: str
   :param balance: Apply class weighting for imbalanced binary tasks where supported.
                   Defaults to True.
   :type balance: bool
   :param csv_file: Alternative to (`data_x`, `data_y`). Must include a 'label' column.
                    Defaults to None.
   :type csv_file: DataFrame, optional
   :param SEED_NO: Random seed used across components. Defaults to 1909.
   :type SEED_NO: int

   .. attribute:: data_x

      Possibly imputed/processed feature matrix.

      :type: ndarray or None

   .. attribute:: data_y

      Numeric labels used for fitting (may be encoded).

      :type: ndarray or None

   .. attribute:: data_y_

      Copy of original labels (pre-encoding) for plots.

      :type: ndarray or None

   .. attribute:: clf

      Name of the chosen estimator.

      :type: str

   .. attribute:: model

      Trained estimator instance.

      :type: estimator or None

   .. attribute:: imputer

      Fitted imputer used for transformations.

      :type: object or None

   .. attribute:: feats_to_use

      Indices of selected features (BorutaSHAP).

      :type: ndarray or None

   .. attribute:: feature_history

      BorutaSHAP selection history.

      :type: object or None

   .. attribute:: optimization_results

      Study from hyperparameter search.

      :type: optuna.study.Study or None

   .. attribute:: best_params

      Best hyperparameters from Optuna.

      :type: dict or None

   .. attribute:: path

      Directory used when saving artifacts.

      :type: str or None

   .. attribute:: SEED_NO

      Seed propagated to internal routines.

      :type: int


   .. py:attribute:: data_x
      :value: None


   .. py:attribute:: data_y
      :value: None


   .. py:attribute:: clf
      :value: 'rf'


   .. py:attribute:: optimize
      :value: False


   .. py:attribute:: opt_cv
      :value: 10


   .. py:attribute:: scoring_metric
      :value: 'f1'


   .. py:attribute:: limit_search
      :value: True


   .. py:attribute:: impute
      :value: True


   .. py:attribute:: imp_method
      :value: 'knn'


   .. py:attribute:: n_iter
      :value: 25


   .. py:attribute:: boruta_trials
      :value: 50


   .. py:attribute:: boruta_model
      :value: 'rf'


   .. py:attribute:: balance
      :value: True


   .. py:attribute:: csv_file
      :value: None


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:attribute:: model
      :value: None


   .. py:attribute:: imputer
      :value: None


   .. py:attribute:: feats_to_use
      :value: None


   .. py:attribute:: feature_history
      :value: None


   .. py:attribute:: optimization_results
      :value: None


   .. py:attribute:: best_params
      :value: None


   .. py:method:: create(overwrite_training=True)

      Builds the pipeline (optional feature selection and optimization), fits the
      estimator, and stores artifacts.

      :param overwrite_training: When True, replace `self.data_x` with the processed matrix used for
                                 fitting. Defaults to True.
      :type overwrite_training: bool

      :rtype: None


   .. py:method:: save(dirname=None, path=None, overwrite=False)

      Saves the trained model and auxiliary artifacts.

      .. rubric:: Notes

      Creates a `pyBIA_ensemble_model/` folder containing, when available:
      `Model`, `Imputer`, `Feats_Index`, `HyperOpt_Results`, `Best_Params`,
      and `FeatureOpt_Results`.

      :param dirname: Subdirectory name created under `path`. Defaults to None.
      :type dirname: str, optional
      :param path: Base directory for saving. The user home is used when not provided.
                   Defaults to None.
      :type path: str, optional
      :param overwrite: Remove any existing `pyBIA_ensemble_model` at the target before saving.
                        Defaults to False.
      :type overwrite: bool

      :rtype: None

      :raises ValueError: If no model has been created (run `create()` first), or if the target
          folder exists and `overwrite` is False.
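
      .. rubric:: Example

      The overwrite rule described above can be sketched with the standard
      library alone. The helper ``prepare_model_dir`` is hypothetical and is
      not part of pyBIA; it only illustrates the assumed behavior of refusing
      to replace an existing ``pyBIA_ensemble_model`` folder unless
      ``overwrite`` is True.

      .. code-block:: python

         import shutil
         import tempfile
         from pathlib import Path

         def prepare_model_dir(base, overwrite=False):
             """Hypothetical helper: create base/pyBIA_ensemble_model."""
             target = Path(base) / "pyBIA_ensemble_model"
             if target.exists():
                 if not overwrite:
                     raise ValueError(f"{target} exists; pass overwrite=True to replace it.")
                 # Remove the old folder so stale artifacts are not mixed in.
                 shutil.rmtree(target)
             target.mkdir(parents=True)
             return target

         with tempfile.TemporaryDirectory() as base:
             first = prepare_model_dir(base)
             try:
                 prepare_model_dir(base)      # second call without overwrite
                 refused = False
             except ValueError:
                 refused = True
             second = prepare_model_dir(base, overwrite=True)
             replaced = second.is_dir()

      In this sketch the second call is refused, while the third call removes
      the folder and recreates it empty.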


   .. py:method:: load(path=None)

      Loads model and auxiliary artifacts from a `pyBIA_ensemble_model/` folder.

      :param path: Base directory containing the folder. The user home is used when not
                   provided. Defaults to None.
      :type path: str, optional

      :rtype: None


   .. py:method:: predict(data)

      Predicts class labels and top-class probabilities for new samples.

      :param data: Feature matrix of shape (n_samples, n_features). If feature selection
                   was used, only the selected columns are required.
      :type data: ndarray

      :returns: Array of shape (n_samples, 2) with rows
                [predicted_label, probability_of_predicted_label].
      :rtype: ndarray
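
      .. rubric:: Example

      The return layout can be illustrated with a minimal sketch in plain
      Python. The helper ``label_and_confidence`` and the probability values
      are hypothetical and are not part of pyBIA; the sketch only assumes that
      each row of class probabilities is reduced to its most likely label and
      that label's probability.

      .. code-block:: python

         def label_and_confidence(probabilities, classes):
             """Reduce each row of class probabilities to [label, probability]."""
             rows = []
             for row in probabilities:
                 # Index of the most probable class in this row.
                 best = max(range(len(row)), key=row.__getitem__)
                 rows.append([classes[best], row[best]])
             return rows

         # Made-up probabilities for three samples and two classes.
         proba = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
         result = label_and_confidence(proba, classes=[0, 1])
         print(result)  # [[0, 0.9], [1, 0.8], [0, 0.55]]

      Each output row pairs a label with its own probability, so a low second
      entry flags an uncertain prediction regardless of which class was chosen.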


   .. py:method:: plot_tsne(data_y=None, special_class=None, norm=True, pca=False, return_data=False, xlim=None, ylim=None, legend_loc='upper center', title='Feature Parameter Space', savefig=False)

      Plots a 2D t-SNE embedding of the feature space.

      :param data_y: Labels for coloring. The classifier’s labels are used when not provided.
                     Defaults to None.
      :type data_y: array-like, optional
      :param special_class: Class label to highlight. Defaults to None.
      :type special_class: hashable, optional
      :param norm: Standardize features before t-SNE. Defaults to True.
      :type norm: bool
      :param pca: Apply PCA (all components) before t-SNE. Defaults to False.
      :type pca: bool
      :param return_data: Return the (x, y) coordinates instead of only plotting. Defaults to False.
      :type return_data: bool
      :param xlim: X-axis limits. Defaults to None.
      :type xlim: tuple, optional
      :param ylim: Y-axis limits. Defaults to None.
      :type ylim: tuple, optional
      :param legend_loc: Legend location. Defaults to 'upper center'.
      :type legend_loc: str
      :param title: Figure title. Defaults to 'Feature Parameter Space'.
      :type title: str
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :returns: When `return_data` is False, returns the plotted artist. When True,
                returns `(x, y)` coordinates.
      :rtype: AxesImage or tuple


   .. py:method:: plot_conf_matrix(data_y=None, norm=False, pca=False, k_fold=10, normalize=True, title='Confusion Matrix', savefig=False)

      Plots a confusion matrix under k-fold cross-validation.

      :param data_y: Human-readable labels aligned to the model’s internal labels. The
                     classifier’s labels are used when not provided. Defaults to None.
      :type data_y: array-like, optional
      :param norm: Min-max normalize features before evaluation. Defaults to False.
      :type norm: bool
      :param pca: Evaluate on PCA-projected features. Defaults to False.
      :type pca: bool
      :param k_fold: Number of cross-validation folds. Defaults to 10.
      :type k_fold: int
      :param normalize: Show rates (True) or counts (False). Defaults to True.
      :type normalize: bool
      :param title: Figure title. Defaults to 'Confusion Matrix'.
      :type title: str
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: plot_roc_curve(k_fold=10, pca=False, title='Receiver Operating Characteristic Curve', savefig=False)

      Plots the mean ROC curve with ±1σ band under k-fold cross-validation for
      binary classification.

      :param k_fold: Number of cross-validation folds. Defaults to 10.
      :type k_fold: int
      :param pca: Evaluate on PCA-projected features. Defaults to False.
      :type pca: bool
      :param title: Figure title. Defaults to 'Receiver Operating Characteristic Curve'.
      :type title: str
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage
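
      .. rubric:: Example

      One common way to build a mean ROC curve with a ±1σ band is to
      interpolate each fold's curve onto a shared false-positive-rate grid and
      then take the per-point mean and standard deviation. The sketch below
      follows that approach with made-up fold curves; it is an assumption about
      the general technique, not a copy of pyBIA's implementation, and it uses
      the population standard deviation.

      .. code-block:: python

         from statistics import mean, pstdev

         def interpolate(x, xs, ys):
             """Linearly interpolate y at x; xs must be strictly increasing."""
             for i in range(len(xs) - 1):
                 if xs[i] <= x <= xs[i + 1]:
                     fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
                     return ys[i] + fraction * (ys[i + 1] - ys[i])
             raise ValueError("x lies outside the range of xs")

         # Hypothetical (fpr, tpr) curves from two cross-validation folds.
         fold_curves = [
             ([0.0, 0.5, 1.0], [0.0, 0.8, 1.0]),
             ([0.0, 0.5, 1.0], [0.0, 0.6, 1.0]),
         ]
         grid = [0.0, 0.25, 0.5, 0.75, 1.0]

         # One interpolated TPR list per fold, all on the same FPR grid.
         tprs = [[interpolate(x, fpr, tpr) for x in grid] for fpr, tpr in fold_curves]

         # Per-grid-point statistics across folds.
         mean_tpr = [mean(column) for column in zip(*tprs)]
         std_tpr = [pstdev(column) for column in zip(*tprs)]
         upper = [min(m + s, 1.0) for m, s in zip(mean_tpr, std_tpr)]
         lower = [max(m - s, 0.0) for m, s in zip(mean_tpr, std_tpr)]

      The ``upper`` and ``lower`` lists bound the shaded band, clipped to the
      valid range [0, 1].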


   .. py:method:: plot_hyper_opt(baseline=None, xlim=None, ylim=None, xlog=True, ylog=False, ylabel=None, title=None, loc='upper left', ncol=1, savefig=False)

      Visualizes Optuna optimization history: trial values and running best.

      :param baseline: Horizontal baseline to compare against. Defaults to None.
      :type baseline: float, optional
      :param xlim: X-axis limits. Defaults to None.
      :type xlim: tuple, optional
      :param ylim: Y-axis limits. Defaults to None.
      :type ylim: tuple, optional
      :param xlog: Log-scale the x-axis. Defaults to True.
      :type xlog: bool
      :param ylog: Log-scale the y-axis. Defaults to False.
      :type ylog: bool
      :param ylabel: Custom y-axis label. Defaults to None.
      :type ylabel: str, optional
      :param title: Custom title; inferred from `clf` when not set. Defaults to None.
      :type title: str, optional
      :param loc: Legend location. Defaults to 'upper left'.
      :type loc: str
      :param ncol: Number of legend columns. Defaults to 1.
      :type ncol: int
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage
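
      .. rubric:: Example

      The "running best" series is the best objective value seen up to each
      trial. A minimal sketch, assuming the metric is maximized (as with
      accuracy or F1) and using made-up trial values:

      .. code-block:: python

         from itertools import accumulate

         # Hypothetical objective values in trial order.
         trial_values = [0.71, 0.68, 0.75, 0.74, 0.80, 0.79]

         # Cumulative maximum: the best value found so far at each trial.
         running_best = list(accumulate(trial_values, max))
         print(running_best)  # [0.71, 0.71, 0.75, 0.75, 0.8, 0.8]

      The running best never decreases, so a long flat stretch suggests that
      additional trials are no longer improving the score.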


   .. py:method:: plot_feature_opt(feat_names=None, top='all', include_other=True, include_shadow=True, include_rejected=False, flip_axes=True, title='Feature Importance', save_data=False, savefig=False)

      Displays BorutaSHAP z-score distributions per feature across trials.

      :param feat_names: Names for features in `data_x`. Defaults to None.
      :type feat_names: array-like, optional
      :param top: Number of accepted features to show; 'all' shows every accepted feature.
                  Defaults to 'all'.
      :type top: int or 'all'
      :param include_other: Aggregate remaining accepted features into an "Other Accepted" entry.
                            Defaults to True.
      :type include_other: bool
      :param include_shadow: Include the Max Shadow baseline. Defaults to True.
      :type include_shadow: bool
      :param include_rejected: Append averaged rejected features. Defaults to False.
      :type include_rejected: bool
      :param flip_axes: Plot horizontally (True) or vertically (False). Defaults to True.
      :type flip_axes: bool
      :param title: Figure title. Defaults to 'Feature Importance'.
      :type title: str
      :param save_data: Keep the temporary CSV written by BorutaSHAP for this plot. Defaults to False.
      :type save_data: bool
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: plot_hyper_param_importance(plot_time=True, savefig=False)

      Plots hyperparameter importance and, optionally, duration importance.

      :param plot_time: Include the impact on optimization duration. Defaults to True.
      :type plot_time: bool
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: save_hyper_importance()

      Computes and saves dictionaries of hyperparameter importance and duration
      importance for later plotting.

      .. rubric:: Notes

      Writes two files into the model directory: `Hyperparameter_Importance`
      and `Duration_Importance`. This step can be time-consuming.

      :rtype: None


.. py:function:: format_labels(labels: list) -> list

   Format hyperparameter/feature labels for display.

   Replaces underscores with spaces, title-cases words, and applies a few
   reader-friendly aliases.

   :param labels: Raw label strings to format.
   :type labels: list of str

   :returns: Reformatted labels, same length as the input.
   :rtype: list of str
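
   .. rubric:: Example

   The three steps above can be sketched as follows. The function
   ``format_labels_sketch`` and the alias shown are hypothetical; the aliases
   that pyBIA actually applies are not listed here.

   .. code-block:: python

      def format_labels_sketch(labels, aliases=None):
          """Replace underscores, title-case, then apply optional aliases."""
          aliases = aliases or {}
          formatted = []
          for label in labels:
              text = label.replace("_", " ").title()
              # Fall back to the title-cased text when no alias is defined.
              formatted.append(aliases.get(text, text))
          return formatted

      pretty = format_labels_sketch(
          ["n_estimators", "max_depth", "learning_rate"],
          aliases={"N Estimators": "Number of Estimators"},  # made-up alias
      )
      print(pretty)  # ['Number of Estimators', 'Max Depth', 'Learning Rate']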


.. py:function:: evaluate_model(classifier, data_x, data_y, normalize=True, k_fold=10, random_state=1909)

   Cross-validates a classifier and returns out-of-fold predictions together with the
   corresponding ground-truth labels.

   :param classifier: Any scikit-learn–compatible model implementing `fit` and `predict`.
   :type classifier: estimator
   :param data_x: Feature matrix.
   :type data_x: ndarray of shape (n_samples, n_features)
   :param data_y: Target labels.
   :type data_y: array-like of shape (n_samples,)
   :param normalize: Unused in this function; retained for API compatibility with plotting utilities.
                     Defaults to True.
   :type normalize: bool, optional
   :param k_fold: Number of K-fold splits. Defaults to 10.
   :type k_fold: int, optional
   :param random_state: Seed for shuffling within the cross-validation splitter. Defaults to 1909.
   :type random_state: int, optional

   :returns: * **predicted_targets** (*ndarray of shape (n_samples,)*) -- Out-of-fold predicted labels concatenated across folds.
             * **actual_targets** (*ndarray of shape (n_samples,)*) -- True labels ordered identically to `predicted_targets`.
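
   .. rubric:: Example

   Out-of-fold prediction means every sample is predicted exactly once, by a
   model that did not see it during fitting. The sketch below shows the idea
   in plain Python with a toy estimator; ``MajorityClassifier`` and
   ``out_of_fold_predictions`` are hypothetical names, and the shuffled split
   is an assumption rather than pyBIA's actual splitter.

   .. code-block:: python

      import random
      from collections import Counter

      class MajorityClassifier:
          """Toy estimator that always predicts the most common training label."""

          def fit(self, data_x, data_y):
              self.label_ = Counter(data_y).most_common(1)[0][0]
              return self

          def predict(self, data_x):
              return [self.label_] * len(data_x)

      def out_of_fold_predictions(classifier, data_x, data_y, k_fold=5, random_state=1909):
          """Return (predicted, actual) labels gathered from each held-out fold."""
          indices = list(range(len(data_x)))
          random.Random(random_state).shuffle(indices)
          # Deal the shuffled indices into k_fold disjoint test folds.
          folds = [indices[i::k_fold] for i in range(k_fold)]
          predicted, actual = [], []
          for test_idx in folds:
              held_out = set(test_idx)
              train_idx = [i for i in indices if i not in held_out]
              classifier.fit([data_x[i] for i in train_idx], [data_y[i] for i in train_idx])
              predicted.extend(classifier.predict([data_x[i] for i in test_idx]))
              actual.extend(data_y[i] for i in test_idx)
          return predicted, actual

      data_x = [[float(i)] for i in range(10)]
      data_y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
      predicted, actual = out_of_fold_predictions(MajorityClassifier(), data_x, data_y)

   Both returned lists follow the fold order, not the original sample order,
   so they should only be compared with each other.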


.. py:function:: generate_matrix(predicted_labels_list, actual_targets, classes, normalize=True, title='Confusion Matrix', savefig=False)

   Generate and render a confusion matrix from predicted and true labels.

   :param predicted_labels_list: Predicted class labels, typically the out-of-fold predictions returned by `evaluate_model()`.
   :type predicted_labels_list: array-like of shape (n_samples,)
   :param actual_targets: Ground-truth class labels in the same order as `predicted_labels_list`.
   :type actual_targets: array-like of shape (n_samples,)
   :param classes: Class names used to label the matrix axes. The order must match the label encoding in the inputs.
   :type classes: list of str
   :param normalize: If True, the confusion matrix is normalized row-wise before plotting. Defaults to True.
   :type normalize: bool, optional
   :param title: Figure title. Defaults to 'Confusion Matrix'.
   :type title: str, optional
   :param savefig: If True, the figure is saved to 'Ensemble_Confusion_Matrix.png' instead of being displayed. Defaults to False.
   :type savefig: bool, optional

   :returns: Nothing. The figure is displayed, or saved to disk when `savefig` is True.
   :rtype: None


.. py:function:: generate_plot(conf_matrix, classes, normalize=False, title='Confusion Matrix', include_cbar=False, savefig=False)

   Generate a confusion-matrix figure and axes without calling `plt.show()`.

   :param conf_matrix: Confusion matrix (counts) produced upstream (e.g., via `confusion_matrix`).
   :type conf_matrix: array-like of shape (n_classes, n_classes)
   :param classes: Class names used for tick labels. Order must match the matrix axes.
   :type classes: list of str
   :param normalize: If True, the matrix is normalized row-wise to proportions. Defaults to False.
   :type normalize: bool, optional
   :param title: Figure title. Defaults to 'Confusion Matrix'.
   :type title: str, optional
   :param include_cbar: If True, a colorbar is added to the figure. Defaults to False.
   :type include_cbar: bool, optional
   :param savefig: Included for API symmetry; saving is typically handled by the caller. Defaults to False.
   :type savefig: bool, optional

   :returns: * **fig** (*matplotlib.figure.Figure*) -- The created figure.
             * **ax** (*matplotlib.axes.Axes*) -- The axes containing the confusion matrix.
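
   .. rubric:: Example

   Row-wise normalization divides each row of the count matrix by its sum.
   The sketch below assumes the usual scikit-learn layout, in which rows are
   true classes and columns are predicted classes; ``normalize_rows`` is a
   hypothetical helper and the counts are made up.

   .. code-block:: python

      def normalize_rows(conf_matrix):
          """Divide each row by its sum so entries become per-class proportions."""
          normalized = []
          for row in conf_matrix:
              total = sum(row)
              # Guard against classes with no samples to avoid dividing by zero.
              normalized.append([value / total if total else 0.0 for value in row])
          return normalized

      counts = [[40, 10], [5, 45]]
      rates = normalize_rows(counts)
      print(rates)  # [[0.8, 0.2], [0.1, 0.9]]

   Under that layout the diagonal of the normalized matrix holds the
   per-class recall.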


.. py:function:: _set_style_()

   Configures the matplotlib.pyplot style. It is called before a figure is saved,
   after which the style is reset to the default.