pyBIA.ensemble_model
====================

.. py:module:: pyBIA.ensemble_model

.. autoapi-nested-parse::

   Created on Wed Sep 8 10:04:23 2021

   @author: daniel


Classes
-------

.. autoapisummary::

   pyBIA.ensemble_model.Classifier


Functions
---------

.. autoapisummary::

   pyBIA.ensemble_model.format_labels
   pyBIA.ensemble_model.evaluate_model
   pyBIA.ensemble_model.generate_matrix
   pyBIA.ensemble_model.generate_plot
   pyBIA.ensemble_model._set_style_


Module Contents
---------------

.. py:class:: Classifier(data_x=None, data_y=None, clf='rf', optimize=False, opt_cv=10, scoring_metric='f1', limit_search=True, impute=True, imp_method='knn', n_iter=25, boruta_trials=50, boruta_model='rf', balance=True, csv_file=None, SEED_NO=1909)

   Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection,
   and Optuna hyperparameter optimization. Utilities are provided to save/load artifacts and to
   plot diagnostics (t-SNE, confusion matrix, ROC, optimization history, and importances).

   :param data_x: Feature matrix of shape (n_samples, n_features).
   :type data_x: ndarray
   :param data_y: 1D array of labels aligned to `data_x`.
   :type data_y: array-like
   :param clf: Estimator to build. One of {'rf','nn','xgb','histgb','adaboost','svc',
      'logreg','bdt','gaussian_nb','knn','extratrees','tree','ocsvm'}. Defaults to 'rf'.
   :type clf: str
   :param optimize: Run BorutaSHAP (when `boruta_trials` > 0) and Optuna search before fitting.
      Defaults to False.
   :type optimize: bool
   :param opt_cv: Number of cross-validation folds used during optimization. Defaults to 10.
   :type opt_cv: int
   :param scoring_metric: Metric optimized by Optuna. One of
      {'accuracy','f1','precision','recall','roc_auc'}. Defaults to 'f1'.
   :type scoring_metric: str
   :param limit_search: Constrain very wide hyperparameter ranges for practicality.
      Defaults to True.
   :type limit_search: bool
   :param impute: Impute missing values prior to fitting. Defaults to True.
   :type impute: bool
   :param imp_method: Imputation strategy.
      One of {'knn','mean','median','mode','constant'}. Defaults to 'knn'.
   :type imp_method: str
   :param n_iter: Number of Optuna trials; use 0 to skip search. Defaults to 25.
   :type n_iter: int
   :param boruta_trials: Number of BorutaSHAP trials; use 0 to skip feature selection.
      Defaults to 50.
   :type boruta_trials: int
   :param boruta_model: Base estimator for BorutaSHAP, independent of `clf`. One of {'rf','xgb'}.
      Defaults to 'rf'.
   :type boruta_model: str
   :param balance: Apply class weighting for imbalanced binary tasks where supported.
      Defaults to True.
   :type balance: bool
   :param csv_file: Alternative to (`data_x`, `data_y`). Must include a 'label' column.
      Defaults to None.
   :type csv_file: DataFrame, optional
   :param SEED_NO: Random seed used across components. Defaults to 1909.
   :type SEED_NO: int

   .. attribute:: data_x

      Possibly imputed/processed feature matrix.

      :type: ndarray or None

   .. attribute:: data_y

      Numeric labels used for fitting (may be encoded).

      :type: ndarray or None

   .. attribute:: data_y_

      Copy of original labels (pre-encoding) for plots.

      :type: ndarray or None

   .. attribute:: clf

      Name of the chosen estimator.

      :type: str

   .. attribute:: model

      Trained estimator instance.

      :type: estimator or None

   .. attribute:: imputer

      Fitted imputer used for transformations.

      :type: object or None

   .. attribute:: feats_to_use

      Indices of selected features (BorutaSHAP).

      :type: ndarray or None

   .. attribute:: feature_history

      BorutaSHAP selection history.

      :type: object or None

   .. attribute:: optimization_results

      Study from hyperparameter search.

      :type: optuna.study.Study or None

   .. attribute:: best_params

      Best hyperparameters from Optuna.

      :type: dict or None

   .. attribute:: path

      Directory used when saving artifacts.

      :type: str or None

   .. attribute:: SEED_NO

      Seed propagated to internal routines.

      :type: int


   .. py:attribute:: data_x
      :value: None


   .. py:attribute:: data_y
      :value: None


   .. py:attribute:: clf
      :value: 'rf'


   .. py:attribute:: optimize
      :value: False


   .. py:attribute:: opt_cv
      :value: 10


   .. py:attribute:: scoring_metric
      :value: 'f1'


   .. py:attribute:: limit_search
      :value: True


   .. py:attribute:: impute
      :value: True


   .. py:attribute:: imp_method
      :value: 'knn'


   .. py:attribute:: n_iter
      :value: 25


   .. py:attribute:: boruta_trials
      :value: 50


   .. py:attribute:: boruta_model
      :value: 'rf'


   .. py:attribute:: balance
      :value: True


   .. py:attribute:: csv_file
      :value: None


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:attribute:: model
      :value: None


   .. py:attribute:: imputer
      :value: None


   .. py:attribute:: feats_to_use
      :value: None


   .. py:attribute:: feature_history
      :value: None


   .. py:attribute:: optimization_results
      :value: None


   .. py:attribute:: best_params
      :value: None


   .. py:method:: create(overwrite_training=True)

      Builds the pipeline (optional feature selection and optimization), fits the estimator,
      and stores artifacts.

      :param overwrite_training: When True, replace `self.data_x` with the processed matrix
         used for fitting. Defaults to True.
      :type overwrite_training: bool

      :rtype: None


   .. py:method:: save(dirname=None, path=None, overwrite=False)

      Saves the trained model and auxiliary artifacts.

      .. rubric:: Notes

      Creates a `pyBIA_ensemble_model/` folder containing, when available:
      `Model`, `Imputer`, `Feats_Index`, `HyperOpt_Results`, `Best_Params`, and
      `FeatureOpt_Results`.

      :param dirname: Subdirectory name created under `path`. Defaults to None.
      :type dirname: str, optional
      :param path: Base directory for saving. The user home is used when not provided.
         Defaults to None.
      :type path: str, optional
      :param overwrite: Remove any existing `pyBIA_ensemble_model` at the target before saving.
         Defaults to False.
      :type overwrite: bool

      :rtype: None

      :raises ValueError: If nothing has been created (run `.create()` first) or if the target
         exists and `overwrite` is False.


   .. py:method:: load(path=None)

      Loads model and auxiliary artifacts from a `pyBIA_ensemble_model/` folder.

      :param path: Base directory containing the folder. The user home is used when not provided.
         Defaults to None.
      :type path: str, optional

      :rtype: None


   .. py:method:: predict(data)

      Predicts class labels and top-class probabilities for new samples.

      :param data: Feature matrix of shape (n_samples, n_features). If feature selection was
         used, only the selected columns are required.
      :type data: ndarray

      :returns: Array of shape (n_samples, 2) with rows
                [predicted_label, probability_of_predicted_label].
      :rtype: ndarray


   .. py:method:: plot_tsne(data_y=None, special_class=None, norm=True, pca=False, return_data=False, xlim=None, ylim=None, legend_loc='upper center', title='Feature Parameter Space', savefig=False)

      Plots a 2D t-SNE embedding of the feature space.

      :param data_y: Labels for coloring. The classifier’s labels are used when not provided.
         Defaults to None.
      :type data_y: array-like, optional
      :param special_class: Class label to highlight. Defaults to None.
      :type special_class: hashable, optional
      :param norm: Standardize features before t-SNE. Defaults to True.
      :type norm: bool
      :param pca: Apply PCA (all components) before t-SNE. Defaults to False.
      :type pca: bool
      :param return_data: Return the (x, y) coordinates instead of only plotting.
         Defaults to False.
      :type return_data: bool
      :param xlim: X-axis limits. Defaults to None.
      :type xlim: tuple, optional
      :param ylim: Y-axis limits. Defaults to None.
      :type ylim: tuple, optional
      :param legend_loc: Legend location. Defaults to 'upper center'.
      :type legend_loc: str
      :param title: Figure title. Defaults to 'Feature Parameter Space'.
      :type title: str
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :returns: When `return_data` is False, returns the plotted artist.
                When True, returns `(x, y)` coordinates.
      :rtype: AxesImage or tuple


   .. py:method:: plot_conf_matrix(data_y=None, norm=False, pca=False, k_fold=10, normalize=True, title='Confusion Matrix', savefig=False)

      Plots a confusion matrix under k-fold cross-validation.

      :param data_y: Human-readable labels aligned to the model’s internal labels.
         The classifier’s labels are used when not provided. Defaults to None.
      :type data_y: array-like, optional
      :param norm: Min-max normalize features before evaluation. Defaults to False.
      :type norm: bool
      :param pca: Evaluate on PCA-projected features. Defaults to False.
      :type pca: bool
      :param k_fold: Number of cross-validation folds. Defaults to 10.
      :type k_fold: int
      :param normalize: Show rates (True) or counts (False). Defaults to True.
      :type normalize: bool
      :param title: Figure title. Defaults to 'Confusion Matrix'.
      :type title: str
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: plot_roc_curve(k_fold=10, pca=False, title='Receiver Operating Characteristic Curve', savefig=False)

      Plots the mean ROC curve with ±1σ band under k-fold cross-validation for binary
      classification.

      :param k_fold: Number of cross-validation folds. Defaults to 10.
      :type k_fold: int
      :param pca: Evaluate on PCA-projected features. Defaults to False.
      :type pca: bool
      :param title: Figure title. Defaults to "Receiver Operating Characteristic Curve".
      :type title: str
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: plot_hyper_opt(baseline=None, xlim=None, ylim=None, xlog=True, ylog=False, ylabel=None, title=None, loc='upper left', ncol=1, savefig=False)

      Visualizes Optuna optimization history: trial values and running best.

      :param baseline: Horizontal baseline to compare against. Defaults to None.
      :type baseline: float, optional
      :param xlim: X-axis limits. Defaults to None.
      :type xlim: tuple, optional
      :param ylim: Y-axis limits. Defaults to None.
      :type ylim: tuple, optional
      :param xlog: Log-scale the x-axis. Defaults to True.
      :type xlog: bool
      :param ylog: Log-scale the y-axis. Defaults to False.
      :type ylog: bool
      :param ylabel: Custom y-axis label. Defaults to None.
      :type ylabel: str, optional
      :param title: Custom title; inferred from `clf` when not set.
         Defaults to None.
      :type title: str, optional
      :param loc: Legend location. Defaults to 'upper left'.
      :type loc: str
      :param ncol: Number of legend columns. Defaults to 1.
      :type ncol: int
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: plot_feature_opt(feat_names=None, top='all', include_other=True, include_shadow=True, include_rejected=False, flip_axes=True, title='Feature Importance', save_data=False, savefig=False)

      Displays BorutaSHAP z-score distributions per feature across trials.

      :param feat_names: Names for features in `data_x`. Defaults to None.
      :type feat_names: array-like, optional
      :param top: Number of accepted features to show; 'all' shows every accepted feature.
         Defaults to 'all'.
      :type top: int or 'all'
      :param include_other: Aggregate remaining accepted features into an "Other Accepted" entry.
         Defaults to True.
      :type include_other: bool
      :param include_shadow: Include the Max Shadow baseline. Defaults to True.
      :type include_shadow: bool
      :param include_rejected: Append averaged rejected features. Defaults to False.
      :type include_rejected: bool
      :param flip_axes: Plot horizontally (True) or vertically (False). Defaults to True.
      :type flip_axes: bool
      :param title: Figure title. Defaults to 'Feature Importance'.
      :type title: str
      :param save_data: Keep the temporary CSV written by BorutaSHAP for this plot.
         Defaults to False.
      :type save_data: bool
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: plot_hyper_param_importance(plot_time=True, savefig=False)

      Plots hyperparameter importance and, optionally, duration importance.

      :param plot_time: Include the impact on optimization duration. Defaults to True.
      :type plot_time: bool
      :param savefig: Save a PNG instead of showing. Defaults to False.
      :type savefig: bool

      :rtype: AxesImage


   .. py:method:: save_hyper_importance()

      Computes and saves dictionaries of hyperparameter importance and duration importance
      for later plotting.

      .. rubric:: Notes

      Writes two files into the model directory: `Hyperparameter_Importance` and
      `Duration_Importance`. This step can be time-consuming.

      :rtype: None


.. py:function:: format_labels(labels: list) -> list

   Format hyperparameter/feature labels for display.

   Replaces underscores with spaces, title-cases words, and applies a few
   reader-friendly aliases.

   :param labels: Raw label strings to format.
   :type labels: list of str

   :returns: Reformatted labels, same length as the input.
   :rtype: list of str


.. py:function:: evaluate_model(classifier, data_x, data_y, normalize=True, k_fold=10, random_state=1909)

   Cross-validates a classifier and returns out-of-fold predictions together with the
   corresponding ground-truth labels.

   :param classifier: Any scikit-learn–compatible model implementing `fit` and `predict`.
   :type classifier: estimator
   :param data_x: Feature matrix.
   :type data_x: ndarray of shape (n_samples, n_features)
   :param data_y: Target labels.
   :type data_y: array-like of shape (n_samples,)
   :param normalize: Unused in this function; retained for API compatibility with plotting
      utilities. Defaults to True.
   :type normalize: bool, optional
   :param k_fold: Number of K-fold splits. Defaults to 10.
   :type k_fold: int, optional
   :param random_state: Seed for shuffling within the cross-validation splitter.
      Defaults to 1909.
   :type random_state: int, optional

   :returns: * **predicted_targets** (*ndarray of shape (n_samples,)*) -- Out-of-fold predicted labels concatenated across folds.
             * **actual_targets** (*ndarray of shape (n_samples,)*) -- True labels ordered identically to `predicted_targets`.


.. py:function:: generate_matrix(predicted_labels_list, actual_targets, classes, normalize=True, title='Confusion Matrix', savefig=False)

   Generate and render a confusion matrix from predicted and true labels.

   :param predicted_labels_list: Predicted class labels, typically the out-of-fold predictions
      returned by `evaluate_model()`.
   :type predicted_labels_list: array-like of shape (n_samples,)
   :param actual_targets: Ground-truth class labels in the same order as
      `predicted_labels_list`.
   :type actual_targets: array-like of shape (n_samples,)
   :param classes: Class names used to label the matrix axes. The order must match the label
      encoding in the inputs.
   :type classes: list of str
   :param normalize: If True the confusion matrix is normalized (row-wise) before plotting.
      Defaults to True.
   :type normalize: bool, optional
   :param title: Figure title. Defaults to 'Confusion Matrix'.
   :type title: str, optional
   :param savefig: If True the figure is saved to 'Ensemble_Confusion_Matrix.png' and not
      displayed. Defaults to False.
   :type savefig: bool, optional

   :returns: Displays the figure or saves it to disk.
   :rtype: None


.. py:function:: generate_plot(conf_matrix, classes, normalize=False, title='Confusion Matrix', include_cbar=False, savefig=False)

   Generate a confusion-matrix figure and axes without calling `plt.show()`.

   :param conf_matrix: Confusion matrix (counts) produced upstream (e.g., via
      `confusion_matrix`).
   :type conf_matrix: array-like of shape (n_classes, n_classes)
   :param classes: Class names used for tick labels. Order must match the matrix axes.
   :type classes: list of str
   :param normalize: If True the matrix is normalized row-wise to proportions.
      Defaults to False.
   :type normalize: bool, optional
   :param title: Figure title. Defaults to 'Confusion Matrix'.
   :type title: str, optional
   :param include_cbar: If True a colorbar is added to the figure. Defaults to False.
   :type include_cbar: bool, optional
   :param savefig: Included for API symmetry; saving is typically handled by the caller.
      Defaults to False.
   :type savefig: bool, optional

   :returns: * **fig** (*matplotlib.figure.Figure*) -- The created figure.
             * **ax** (*matplotlib.axes.Axes*) -- The axes containing the confusion matrix.


.. py:function:: _set_style_()

   Configures the matplotlib.pyplot style.

   This function is called before any images are saved, after which the style is
   reset to the default.
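
.. rubric:: Example

The row-wise normalization described for ``generate_matrix`` and ``generate_plot``
can be illustrated with a minimal, standard-library sketch. The helper name
``normalize_rows`` and the example counts are hypothetical and not part of pyBIA;
the library's own implementation may differ, for instance in how empty rows are
handled. Rows are assumed to index the true classes and columns the predicted
classes.

.. code-block:: python

   def normalize_rows(conf_matrix):
       """Return a copy of conf_matrix with each row scaled to sum to 1.

       Hypothetical helper for illustration only; not part of pyBIA.
       Rows whose total is zero are returned as all zeros (an assumption).
       """
       normalized = []
       for row in conf_matrix:
           total = sum(row)
           if total == 0:
               # No samples of this true class: avoid dividing by zero.
               normalized.append([0.0 for _ in row])
           else:
               normalized.append([count / total for count in row])
       return normalized


   # Made-up counts: rows are true classes, columns are predicted classes.
   counts = [[8, 2], [1, 9]]
   rates = normalize_rows(counts)

   for row in rates:
       print(row)

With this normalization each non-empty row reads as the fraction of a true class
assigned to each predicted class, so the diagonal gives the per-class recall.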