pyBIA.ensemble_model

Created on Wed Sep 8 10:04:23 2021

@author: daniel

Classes

Classifier

Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection, and Optuna hyperparameter optimization.

Functions

format_labels(labels) → list

Format hyperparameter/feature labels for display.

evaluate_model(classifier, data_x, data_y[, ...])

Cross-validates a classifier and returns out-of-fold predictions together with the corresponding ground-truth labels.

generate_matrix(predicted_labels_list, actual_targets, ...)

Generate and render a confusion matrix from predicted and true labels.

generate_plot(conf_matrix, classes[, normalize, ...])

Generate a confusion-matrix figure and axes without calling plt.show().

_set_style_()

Configures the matplotlib.pyplot style. This function is called before any image is saved, after which the style is reset to the default.

Module Contents

class pyBIA.ensemble_model.Classifier(data_x=None, data_y=None, clf='rf', optimize=False, opt_cv=10, scoring_metric='f1', limit_search=True, impute=True, imp_method='knn', n_iter=25, boruta_trials=50, boruta_model='rf', balance=True, csv_file=None, SEED_NO=1909)[source]

Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection, and Optuna hyperparameter optimization. Utilities are provided to save/load artifacts and to plot diagnostics (t-SNE, confusion matrix, ROC, optimization history, and importances).

Parameters:
  • data_x (ndarray) – Feature matrix of shape (n_samples, n_features).

  • data_y (array-like) – 1D array of labels aligned to data_x.

  • clf (str) – Estimator to build. One of {‘rf’, ‘nn’, ‘xgb’, ‘histgb’, ‘adaboost’, ‘svc’, ‘logreg’, ‘bdt’, ‘gaussian_nb’, ‘knn’, ‘extratrees’, ‘tree’, ‘ocsvm’}. Defaults to ‘rf’.

  • optimize (bool) – Run BorutaSHAP (when boruta_trials > 0) and Optuna search before fitting. Defaults to False.

  • opt_cv (int) – Number of cross-validation folds used during optimization. Defaults to 10.

  • scoring_metric (str) – Metric optimized by Optuna. One of {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’}. Defaults to ‘f1’.

  • limit_search (bool) – Constrain very wide hyperparameter ranges for practicality. Defaults to True.

  • impute (bool) – Impute missing values prior to fitting. Defaults to True.

  • imp_method (str) – Imputation strategy. One of {‘knn’, ‘mean’, ‘median’, ‘mode’, ‘constant’}. Defaults to ‘knn’.

  • n_iter (int) – Number of Optuna trials; use 0 to skip search. Defaults to 25.

  • boruta_trials (int) – Number of BorutaSHAP trials; use 0 to skip feature selection. Defaults to 50.

  • boruta_model (str) – Base estimator for BorutaSHAP, independent of clf. One of {‘rf’, ‘xgb’}. Defaults to ‘rf’.

  • balance (bool) – Apply class weighting for imbalanced binary tasks where supported. Defaults to True.

  • csv_file (DataFrame, optional) – Alternative to (data_x, data_y). Must include a ‘label’ column. Defaults to None.

  • SEED_NO (int) – Random seed used across components. Defaults to 1909.

data_x[source]

Possibly imputed/processed feature matrix.

Type:

ndarray or None

data_y[source]

Numeric labels used for fitting (may be encoded).

Type:

ndarray or None

data_y_

Copy of original labels (pre-encoding) for plots.

Type:

ndarray or None

clf[source]

Name of the chosen estimator.

Type:

str

model[source]

Trained estimator instance.

Type:

estimator or None

imputer[source]

Fitted imputer used for transformations.

Type:

object or None

feats_to_use[source]

Indices of selected features (BorutaSHAP).

Type:

ndarray or None

feature_history[source]

BorutaSHAP selection history.

Type:

object or None

optimization_results[source]

Study from hyperparameter search.

Type:

optuna.study.Study or None

best_params[source]

Best hyperparameters from Optuna.

Type:

dict or None

path

Directory used when saving artifacts.

Type:

str or None

SEED_NO[source]

Seed propagated to internal routines.

Type:

int

data_x = None[source]
data_y = None[source]
clf = 'rf'[source]
optimize = False[source]
opt_cv = 10[source]
scoring_metric = 'f1'[source]
impute = True[source]
imp_method = 'knn'[source]
n_iter = 25[source]
boruta_trials = 50[source]
boruta_model = 'rf'[source]
balance = True[source]
csv_file = None[source]
SEED_NO = 1909[source]
model = None[source]
imputer = None[source]
feats_to_use = None[source]
feature_history = None[source]
optimization_results = None[source]
best_params = None[source]
create(overwrite_training=True)[source]

Builds the pipeline (optional feature selection and optimization), fits the estimator, and stores artifacts.

Parameters:

overwrite_training (bool) – When True, replace self.data_x with the processed matrix used for fitting. Defaults to True.

Return type:

None
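
A minimal usage sketch of the constructor and create(), assuming pyBIA is installed and that data_x and data_y are already loaded. The keyword names and values come from the signature above; the helper name build_classifier and the particular non-default choices (clf='xgb', opt_cv=5) are illustrative only.

```python
# Keyword arguments taken from the documented Classifier signature.
# The non-default values chosen here are illustrative, not recommendations.
CLASSIFIER_KWARGS = dict(
    clf="xgb",            # one of the documented estimator names
    optimize=True,        # run BorutaSHAP and Optuna before fitting
    opt_cv=5,             # folds used during optimization
    scoring_metric="f1",
    impute=True,
    imp_method="knn",
    n_iter=25,            # Optuna trials
    boruta_trials=50,     # BorutaSHAP trials
    boruta_model="rf",
    balance=True,
    SEED_NO=1909,
)


def build_classifier(data_x, data_y):
    """Hypothetical helper: construct a Classifier and fit it with create()."""
    # Imported lazily so this sketch can be read without pyBIA installed.
    from pyBIA import ensemble_model

    model = ensemble_model.Classifier(data_x, data_y, **CLASSIFIER_KWARGS)
    model.create()  # returns None; the fitted estimator is stored on model.model
    return model
```

After create() returns, the attributes listed above (model, imputer, feats_to_use, best_params) hold whichever artifacts were produced.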

save(dirname=None, path=None, overwrite=False)[source]

Saves the trained model and auxiliary artifacts.

Notes

Creates a pyBIA_ensemble_model/ folder containing, when available: Model, Imputer, Feats_Index, HyperOpt_Results, Best_Params, and FeatureOpt_Results.

Parameters:
  • dirname (str, optional) – Subdirectory name created under path. Defaults to None.

  • path (str, optional) – Base directory for saving. The user home is used when not provided. Defaults to None.

  • overwrite (bool) – Remove any existing pyBIA_ensemble_model at the target before saving. Defaults to False.

Return type:

None

Raises:

ValueError – If nothing has been created (run .create() first) or if the target exists and overwrite is False.
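
The overwrite rule described above can be sketched with the standard library alone. This is not pyBIA's implementation: prepare_save_dir is a hypothetical helper, and the dirname argument is left out for brevity.

```python
import shutil
from pathlib import Path

FOLDER_NAME = "pyBIA_ensemble_model"  # folder name taken from the notes above


def prepare_save_dir(path=None, overwrite=False):
    """Hypothetical helper: return an empty target folder or raise ValueError."""
    # Fall back to the user home when no base directory is given.
    base = Path(path) if path is not None else Path.home()
    target = base / FOLDER_NAME
    if target.exists():
        if not overwrite:
            raise ValueError(f"{target} already exists; pass overwrite=True to replace it.")
        shutil.rmtree(target)  # discard the previous artifacts
    target.mkdir(parents=True)
    return target
```

Calling the helper twice on the same base directory raises ValueError the second time unless overwrite=True is passed.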

load(path=None)[source]

Loads model and auxiliary artifacts from a pyBIA_ensemble_model/ folder.

Parameters:

path (str, optional) – Base directory containing the folder. The user home is used when not provided. Defaults to None.

Return type:

None

predict(data)[source]

Predicts class labels and top-class probabilities for new samples.

Parameters:

data (ndarray) – Feature matrix of shape (n_samples, n_features). If feature selection was used, only the selected columns are required.

Returns:

Array of shape (n_samples, 2) with rows [predicted_label, probability_of_predicted_label].

Return type:

ndarray
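
To illustrate the return layout, the sketch below builds rows of [predicted_label, probability_of_predicted_label] from a class-probability matrix. It assumes NumPy and numeric class labels; top_class_rows is a hypothetical name, and the library may assemble its output differently.

```python
import numpy as np


def top_class_rows(proba, classes):
    """Hypothetical helper: pair each sample's top class with its probability."""
    proba = np.asarray(proba, dtype=float)      # shape (n_samples, n_classes)
    top_idx = proba.argmax(axis=1)              # column of the most likely class
    labels = np.asarray(classes)[top_idx]       # map column index to class label
    top_prob = proba[np.arange(len(proba)), top_idx]
    # Stacking numeric labels with floats yields a float array of shape (n_samples, 2).
    return np.column_stack([labels, top_prob])
```

Because both columns share one array, numeric labels come back as floats and may need casting before comparison with integer targets.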

plot_tsne(data_y=None, special_class=None, norm=True, pca=False, return_data=False, xlim=None, ylim=None, legend_loc='upper center', title='Feature Parameter Space', savefig=False)[source]

Plots a 2D t-SNE embedding of the feature space.

Parameters:
  • data_y (array-like, optional) – Labels for coloring. The classifier’s labels are used when not provided. Defaults to None.

  • special_class (hashable, optional) – Class label to highlight. Defaults to None.

  • norm (bool) – Standardize features before t-SNE. Defaults to True.

  • pca (bool) – Apply PCA (all components) before t-SNE. Defaults to False.

  • return_data (bool) – Return the (x, y) coordinates instead of only plotting. Defaults to False.

  • xlim (tuple, optional) – X-axis limits. Defaults to None.

  • ylim (tuple, optional) – Y-axis limits. Defaults to None.

  • legend_loc (str) – Legend location. Defaults to ‘upper center’.

  • title (str) – Figure title. Defaults to ‘Feature Parameter Space’.

  • savefig (bool) – Save a PNG instead of showing. Defaults to False.

Returns:

When return_data is False, returns the plotted artist. When True, returns (x, y) coordinates.

Return type:

AxesImage or tuple

plot_conf_matrix(data_y=None, norm=False, pca=False, k_fold=10, normalize=True, title='Confusion Matrix', savefig=False)[source]

Plots a confusion matrix under k-fold cross-validation.

Parameters:
  • data_y (array-like, optional) – Human-readable labels aligned to the model’s internal labels. The classifier’s labels are used when not provided. Defaults to None.

  • norm (bool) – Min-max normalize features before evaluation. Defaults to False.

  • pca (bool) – Evaluate on PCA-projected features. Defaults to False.

  • k_fold (int) – Number of cross-validation folds. Defaults to 10.

  • normalize (bool) – Show rates (True) or counts (False). Defaults to True.

  • title (str) – Figure title. Defaults to ‘Confusion Matrix’.

  • savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage

plot_roc_curve(k_fold=10, pca=False, title='Receiver Operating Characteristic Curve', savefig=False)[source]

Plots the mean ROC curve with ±1σ band under k-fold cross-validation for binary classification.

Parameters:
  • k_fold (int) – Number of cross-validation folds. Defaults to 10.

  • pca (bool) – Evaluate on PCA-projected features. Defaults to False.

  • title (str) – Figure title. Defaults to “Receiver Operating Characteristic Curve”.

  • savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage
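
A common way to obtain a mean ROC curve with a ±1σ band is to interpolate each fold's curve onto a shared false-positive-rate grid and then take the mean and standard deviation per grid point. The sketch below shows that technique with NumPy; mean_roc_band is a hypothetical helper and may not match how plot_roc_curve computes its band.

```python
import numpy as np


def mean_roc_band(fold_curves, n_points=100):
    """Hypothetical helper: average per-fold (fpr, tpr) curves on a common grid."""
    grid = np.linspace(0.0, 1.0, n_points)
    interpolated = []
    for fpr, tpr in fold_curves:
        tpr_on_grid = np.interp(grid, fpr, tpr)  # fpr must be increasing
        tpr_on_grid[0] = 0.0                     # every curve starts at the origin
        interpolated.append(tpr_on_grid)
    interpolated = np.array(interpolated)
    mean_tpr = interpolated.mean(axis=0)
    mean_tpr[-1] = 1.0                           # every curve ends at (1, 1)
    std_tpr = interpolated.std(axis=0)
    lower = np.maximum(mean_tpr - std_tpr, 0.0)  # clip the band to [0, 1]
    upper = np.minimum(mean_tpr + std_tpr, 1.0)
    return grid, mean_tpr, lower, upper
```

The returned arrays can be passed to a line plot and a filled band, for example matplotlib's plot and fill_between.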

plot_hyper_opt(baseline=None, xlim=None, ylim=None, xlog=True, ylog=False, ylabel=None, title=None, loc='upper left', ncol=1, savefig=False)[source]

Visualizes Optuna optimization history: trial values and running best.

Parameters:
  • baseline (float, optional) – Horizontal baseline to compare against. Defaults to None.

  • xlim (tuple, optional) – X-axis limits. Defaults to None.

  • ylim (tuple, optional) – Y-axis limits. Defaults to None.

  • xlog (bool) – Log-scale the x-axis. Defaults to True.

  • ylog (bool) – Log-scale the y-axis. Defaults to False.

  • ylabel (str, optional) – Custom y-axis label. Defaults to None.

  • title (str, optional) – Custom title; inferred from clf when not set. Defaults to None.

  • loc (str) – Legend location. Defaults to ‘upper left’.

  • ncol (int) – Number of legend columns. Defaults to 1.

  • savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage
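
The "running best" trace is the cumulative maximum of the trial values, assuming the scoring metric is maximized. A short standard-library sketch, with running_best as a hypothetical helper name:

```python
from itertools import accumulate


def running_best(trial_values):
    """Hypothetical helper: best score seen so far after each trial."""
    # accumulate with max keeps the largest value encountered up to each position.
    return list(accumulate(trial_values, max))
```

For example, running_best([0.5, 0.4, 0.7, 0.6]) gives [0.5, 0.5, 0.7, 0.7].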

plot_feature_opt(feat_names=None, top='all', include_other=True, include_shadow=True, include_rejected=False, flip_axes=True, title='Feature Importance', save_data=False, savefig=False)[source]

Displays BorutaSHAP z-score distributions per feature across trials.

Parameters:
  • feat_names (array-like, optional) – Names for features in data_x. Defaults to None.

  • top (int or 'all') – Number of accepted features to show; ‘all’ shows every accepted feature. Defaults to ‘all’.

  • include_other (bool) – Aggregate remaining accepted features into an “Other Accepted” entry. Defaults to True.

  • include_shadow (bool) – Include the Max Shadow baseline. Defaults to True.

  • include_rejected (bool) – Append averaged rejected features. Defaults to False.

  • flip_axes (bool) – Plot horizontally (True) or vertically (False). Defaults to True.

  • title (str) – Figure title. Defaults to ‘Feature Importance’.

  • save_data (bool) – Keep the temporary CSV written by BorutaSHAP for this plot. Defaults to False.

  • savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage

plot_hyper_param_importance(plot_time=True, savefig=False)[source]

Plots hyperparameter importance and, optionally, duration importance.

Parameters:
  • plot_time (bool) – Include the impact on optimization duration. Defaults to True.

  • savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage

save_hyper_importance()[source]

Computes and saves dictionaries of hyperparameter importance and duration importance for later plotting.

Notes

Writes two files into the model directory: Hyperparameter_Importance and Duration_Importance. This step can be time-consuming.

Return type:

None

pyBIA.ensemble_model.format_labels(labels: list) → list[source]

Format hyperparameter/feature labels for display.

Replaces underscores with spaces, title-cases words, and applies a few reader-friendly aliases.

Parameters:

labels (list of str) – Raw label strings to format.

Returns:

Reformatted labels, same length as the input.

Return type:

list of str
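
The three steps described above can be sketched as follows. The alias table is hypothetical, since this page does not list the aliases the library applies, and format_labels_sketch is an illustrative name rather than the library function.

```python
# Hypothetical alias table; the real aliases are not listed in this reference.
HYPOTHETICAL_ALIASES = {"N Estimators": "Number of Estimators"}


def format_labels_sketch(labels, aliases=None):
    """Illustrative re-implementation of the documented formatting steps."""
    aliases = HYPOTHETICAL_ALIASES if aliases is None else aliases
    formatted = []
    for label in labels:
        text = label.replace("_", " ").title()    # underscores to spaces, title-case
        formatted.append(aliases.get(text, text))  # swap in an alias when one exists
    return formatted
```

With the table above, format_labels_sketch(['max_depth', 'n_estimators']) returns ['Max Depth', 'Number of Estimators'].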

pyBIA.ensemble_model.evaluate_model(classifier, data_x, data_y, normalize=True, k_fold=10, random_state=1909)[source]

Cross-validates a classifier and returns out-of-fold predictions together with the corresponding ground-truth labels.

Parameters:
  • classifier (estimator) – Any scikit-learn–compatible model implementing fit and predict.

  • data_x (ndarray of shape (n_samples, n_features)) – Feature matrix.

  • data_y (array-like of shape (n_samples,)) – Target labels.

  • normalize (bool, optional) – Unused in this function; retained for API compatibility with plotting utilities. Defaults to True.

  • k_fold (int, optional) – Number of K-fold splits. Defaults to 10.

  • random_state (int, optional) – Seed for shuffling within the cross-validation splitter. Defaults to 1909.

Returns:

  • predicted_targets (ndarray of shape (n_samples,)) – Out-of-fold predicted labels concatenated across folds.

  • actual_targets (ndarray of shape (n_samples,)) – True labels ordered identically to predicted_targets.
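
Out-of-fold prediction means each sample is predicted by a model that never saw it during fitting. The NumPy sketch below illustrates the idea with a shuffled k-fold split. It is not the library's code: out_of_fold_predictions and MajorityClassifier are hypothetical names, and the real function may rely on scikit-learn's splitters instead.

```python
import numpy as np


class MajorityClassifier:
    """Toy estimator with fit/predict that always predicts the most common label."""

    def fit(self, data_x, data_y):
        values, counts = np.unique(data_y, return_counts=True)
        self.majority_ = values[counts.argmax()]
        return self

    def predict(self, data_x):
        return np.full(len(data_x), self.majority_)


def out_of_fold_predictions(classifier, data_x, data_y, k_fold=10, random_state=1909):
    """Hypothetical helper: collect predictions on each held-out fold."""
    data_x, data_y = np.asarray(data_x), np.asarray(data_y)
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(len(data_y)), k_fold)  # shuffled folds
    predicted, actual = [], []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        classifier.fit(data_x[train_idx], data_y[train_idx])  # refit on the other folds
        predicted.append(classifier.predict(data_x[test_idx]))
        actual.append(data_y[test_idx])
    # Both arrays follow the fold order, so they stay aligned with each other.
    return np.concatenate(predicted), np.concatenate(actual)
```

The two returned arrays are aligned with each other but not with the original row order, which is why the true labels are returned alongside the predictions.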

pyBIA.ensemble_model.generate_matrix(predicted_labels_list, actual_targets, classes, normalize=True, title='Confusion Matrix', savefig=False)[source]

Generate and render a confusion matrix from predicted and true labels.

Parameters:
  • predicted_labels_list (array-like of shape (n_samples,)) – Predicted class labels, typically the out-of-fold predictions returned by evaluate_model().

  • actual_targets (array-like of shape (n_samples,)) – Ground-truth class labels in the same order as predicted_labels_list.

  • classes (list of str) – Class names used to label the matrix axes. The order must match the label encoding in the inputs.

  • normalize (bool, optional) – If True the confusion matrix is normalized (row-wise) before plotting. Defaults to True.

  • title (str, optional) – Figure title. Defaults to ‘Confusion Matrix’.

  • savefig (bool, optional) – If True the figure is saved to ‘Ensemble_Confusion_Matrix.png’ and not displayed. Defaults to False.

Returns:

Displays the figure or saves it to disk.

Return type:

None
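
Row-wise normalization divides each row of the count matrix by its total, so every row of true-class counts becomes a set of rates that sums to 1. A small NumPy sketch, assuming integer-encoded labels 0..n_classes-1 and rows indexed by the true label (the scikit-learn convention); both helper names are hypothetical.

```python
import numpy as np


def confusion_counts(predicted, actual, n_classes):
    """Hypothetical helper: count (true, predicted) pairs; rows are true labels."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for true_label, pred_label in zip(actual, predicted):
        counts[int(true_label), int(pred_label)] += 1
    return counts


def normalize_rows(counts):
    """Hypothetical helper: convert counts to per-row rates; empty rows stay zero."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

The diagonal of the normalized matrix then reads as per-class recall.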

pyBIA.ensemble_model.generate_plot(conf_matrix, classes, normalize=False, title='Confusion Matrix', include_cbar=False, savefig=False)[source]

Generate a confusion-matrix figure and axes without calling plt.show().

Parameters:
  • conf_matrix (array-like of shape (n_classes, n_classes)) – Confusion matrix (counts) produced upstream (e.g., via confusion_matrix).

  • classes (list of str) – Class names used for tick labels. Order must match the matrix axes.

  • normalize (bool, optional) – If True the matrix is normalized row-wise to proportions. Defaults to False.

  • title (str, optional) – Figure title. Defaults to ‘Confusion Matrix’.

  • include_cbar (bool, optional) – If True a colorbar is added to the figure. Defaults to False.

  • savefig (bool, optional) – Included for API symmetry; saving is typically handled by the caller. Defaults to False.

Returns:

  • fig (matplotlib.figure.Figure) – The created figure.

  • ax (matplotlib.axes.Axes) – The axes containing the confusion matrix.

pyBIA.ensemble_model._set_style_()[source]

Configures the matplotlib.pyplot style. This function is called before any image is saved, after which the style is reset to the default.