pyBIA.ensemble_model

Created on Wed Sep 8 10:04:23 2021

@author: daniel

Classes

Classifier

Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection, and Optuna hyperparameter optimization.

Functions

`format_labels`(labels) → list	Format hyperparameter/feature labels for display.
`evaluate_model`(classifier, data_x, data_y[, ...])	Cross-validates a classifier and returns out-of-fold predictions together with the corresponding ground-truth labels.
`generate_matrix`(predicted_labels_list, actual_targets, ...)	Generate and render a confusion matrix from predicted and true labels.
`generate_plot`(conf_matrix, classes[, normalize, ...])	Generate a confusion-matrix figure and axes without calling plt.show().
`_set_style_`()	Configures the matplotlib.pyplot style. Called before any images are saved, after which the style is reset to the default.

Module Contents

class pyBIA.ensemble_model.Classifier(data_x=None, data_y=None, clf='rf', optimize=False, opt_cv=10, scoring_metric='f1', limit_search=True, impute=True, imp_method='knn', n_iter=25, boruta_trials=50, boruta_model='rf', balance=True, csv_file=None, SEED_NO=1909)

Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection, and Optuna hyperparameter optimization. Utilities are provided to save/load artifacts and to plot diagnostics (t-SNE, confusion matrix, ROC, optimization history, and importances).

Parameters:

data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (array-like) – 1D array of labels aligned to data_x.
clf (str) – Estimator to build. One of {‘rf’, ‘nn’, ‘xgb’, ‘histgb’, ‘adaboost’, ‘svc’, ‘logreg’, ‘bdt’, ‘gaussian_nb’, ‘knn’, ‘extratrees’, ‘tree’, ‘ocsvm’}. Defaults to ‘rf’.
optimize (bool) – Run BorutaSHAP (when boruta_trials > 0) and Optuna search before fitting. Defaults to False.
opt_cv (int) – Number of cross-validation folds used during optimization. Defaults to 10.
scoring_metric (str) – Metric optimized by Optuna. One of {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’}. Defaults to ‘f1’.
limit_search (bool) – Constrain very wide hyperparameter ranges for practicality. Defaults to True.
impute (bool) – Impute missing values prior to fitting. Defaults to True.
imp_method (str) – Imputation strategy. One of {‘knn’, ‘mean’, ‘median’, ‘mode’, ‘constant’}. Defaults to ‘knn’.
n_iter (int) – Number of Optuna trials; use 0 to skip search. Defaults to 25.
boruta_trials (int) – Number of BorutaSHAP trials; use 0 to skip feature selection. Defaults to 50.
boruta_model (str) – Base estimator for BorutaSHAP, independent of clf. One of {‘rf’, ‘xgb’}. Defaults to ‘rf’.
balance (bool) – Apply class weighting for imbalanced binary tasks where supported. Defaults to True.
csv_file (DataFrame, optional) – Alternative to (data_x, data_y). Must include a ‘label’ column. Defaults to None.
SEED_NO (int) – Random seed used across components. Defaults to 1909.

data_x

Possibly imputed/processed feature matrix.

Type: ndarray or None

data_y

Numeric labels used for fitting (may be encoded).

Type: ndarray or None

data_y_

Copy of original labels (pre-encoding) for plots.

Type: ndarray or None

clf

Name of the chosen estimator.

Type: str

model

Trained estimator instance.

Type: estimator or None

imputer

Fitted imputer used for transformations.

Type: object or None

feats_to_use

Indices of selected features (BorutaSHAP).

Type: ndarray or None

feature_history

BorutaSHAP selection history.

Type: object or None

optimization_results

Study from hyperparameter search.

Type: optuna.study.Study or None

best_params

Best hyperparameters from Optuna.

Type: dict or None

path

Directory used when saving artifacts.

Type: str or None

SEED_NO

Seed propagated to internal routines.

Type: int

data_x = None

data_y = None

clf = 'rf'

optimize = False

opt_cv = 10

scoring_metric = 'f1'

limit_search = True

impute = True

imp_method = 'knn'

n_iter = 25

boruta_trials = 50

boruta_model = 'rf'

balance = True

csv_file = None

SEED_NO = 1909

model = None

imputer = None

feats_to_use = None

feature_history = None

optimization_results = None

best_params = None

create(overwrite_training=True)

Builds the pipeline (optional feature selection and optimization), fits the estimator, and stores artifacts.

Parameters: overwrite_training (bool) – When True, replace self.data_x with the processed matrix used for fitting. Defaults to True.
Return type: None

save(dirname=None, path=None, overwrite=False)

Saves the trained model and auxiliary artifacts.

Notes

Creates a pyBIA_ensemble_model/ folder containing, when available: Model, Imputer, Feats_Index, HyperOpt_Results, Best_Params, and FeatureOpt_Results.

Parameters:

dirname (str, optional) – Subdirectory name created under path. Defaults to None.
path (str, optional) – Base directory for saving. The user home is used when not provided. Defaults to None.
overwrite (bool) – Remove any existing pyBIA_ensemble_model at the target before saving. Defaults to False.

Return type:

None

Raises:

ValueError – If nothing has been created (run .create() first) or if the target exists and overwrite is False.

load(path=None)

Loads model and auxiliary artifacts from a pyBIA_ensemble_model/ folder.

Parameters: path (str, optional) – Base directory containing the folder. The user home is used when not provided. Defaults to None.
Return type: None
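The create/save/load cycle documented above can be sketched as follows. This is a minimal illustration that assumes pyBIA is installed and importable as `pyBIA.ensemble_model`; the helper name `train_and_reload`, the `out_dir` argument, and the choice of `clf='rf'` are hypothetical, and only the constructor arguments and method names are taken from the signatures on this page. Whether a freshly constructed `Classifier` needs a matching `clf` before `load()` is not stated here, so the sketch passes it explicitly.

```python
def train_and_reload(data_x, data_y, out_dir):
    """Hypothetical helper: fit a Classifier, save it, and load it back."""
    # Imported lazily so that defining this helper does not require pyBIA;
    # calling it does (assumption: pyBIA is installed).
    from pyBIA import ensemble_model

    model = ensemble_model.Classifier(
        data_x, data_y, clf='rf', optimize=False, impute=True
    )
    model.create()

    # overwrite=True replaces an existing pyBIA_ensemble_model folder;
    # with the default overwrite=False, save() is documented to raise
    # ValueError when the target already exists.
    model.save(path=out_dir, overwrite=True)

    # Assumption: a new instance can restore the artifacts from the same path.
    restored = ensemble_model.Classifier(clf='rf')
    restored.load(path=out_dir)
    return restored
```

Leaving `optimize=False` skips BorutaSHAP and Optuna, which keeps a first run short; switch it on once the basic pipeline works.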

predict(data)

Predicts class labels and top-class probabilities for new samples.

Parameters: data (ndarray) – Feature matrix of shape (n_samples, n_features). If feature selection was used, only the selected columns are required.
Returns: Array of shape (n_samples, 2) with rows [predicted_label, probability_of_predicted_label].
Return type: ndarray
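To make the return layout concrete, the following dependency-free sketch builds rows of the form [predicted_label, probability_of_predicted_label] from a per-class probability table. The function name `top_class_rows` and the toy inputs are hypothetical; this illustrates the documented output shape and is not pyBIA's implementation.

```python
def top_class_rows(class_labels, probabilities):
    """Pair each sample's most probable label with that label's probability."""
    rows = []
    for sample_probs in probabilities:
        # Index of the largest probability for this sample.
        best = max(range(len(sample_probs)), key=sample_probs.__getitem__)
        rows.append([class_labels[best], sample_probs[best]])
    return rows


# Two samples, two classes: one row per sample, two columns per row.
rows = top_class_rows([0, 1], [[0.2, 0.8], [0.9, 0.1]])
print(rows)  # [[1, 0.8], [0, 0.9]]
```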

plot_tsne(data_y=None, special_class=None, norm=True, pca=False, return_data=False, xlim=None, ylim=None, legend_loc='upper center', title='Feature Parameter Space', savefig=False)

Plots a 2D t-SNE embedding of the feature space.

Parameters:

data_y (array-like, optional) – Labels for coloring. The classifier’s labels are used when not provided. Defaults to None.
special_class (hashable, optional) – Class label to highlight. Defaults to None.
norm (bool) – Standardize features before t-SNE. Defaults to True.
pca (bool) – Apply PCA (all components) before t-SNE. Defaults to False.
return_data (bool) – Return the (x, y) coordinates instead of only plotting. Defaults to False.
xlim (tuple, optional) – X-axis limits. Defaults to None.
ylim (tuple, optional) – Y-axis limits. Defaults to None.
legend_loc (str) – Legend location. Defaults to ‘upper center’.
title (str) – Figure title. Defaults to ‘Feature Parameter Space’.
savefig (bool) – Save a PNG instead of showing. Defaults to False.

Returns:

When return_data is False, returns the plotted artist. When True, returns (x, y) coordinates.

Return type:

AxesImage or tuple

plot_conf_matrix(data_y=None, norm=False, pca=False, k_fold=10, normalize=True, title='Confusion Matrix', savefig=False)

Plots a confusion matrix under k-fold cross-validation.

Parameters:

data_y (array-like, optional) – Human-readable labels aligned to the model’s internal labels. The classifier’s labels are used when not provided. Defaults to None.
norm (bool) – Min-max normalize features before evaluation. Defaults to False.
pca (bool) – Evaluate on PCA-projected features. Defaults to False.
k_fold (int) – Number of cross-validation folds. Defaults to 10.
normalize (bool) – Show rates (True) or counts (False). Defaults to True.
title (str) – Figure title. Defaults to ‘Confusion Matrix’.
savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage
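The `norm` option above refers to min-max normalization of the features. As a reminder of what that transformation does, here is a small pure-Python sketch that rescales each column to the range [0, 1]. The function name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an assumption made for this sketch; pyBIA may well delegate the step to a library scaler with different edge-case behavior.

```python
def min_max_scale(rows):
    """Rescale each column of a 2D list to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # Assumption: a constant column (span of zero) is mapped to 0.0.
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


scaled = min_max_scale([[1.0, 10.0], [3.0, 10.0], [5.0, 20.0]])
print(scaled)  # [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
```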

plot_roc_curve(k_fold=10, pca=False, title='Receiver Operating Characteristic Curve', savefig=False)

Plots the mean ROC curve with ±1σ band under k-fold cross-validation for binary classification.

Parameters:

k_fold (int) – Number of cross-validation folds. Defaults to 10.
pca (bool) – Evaluate on PCA-projected features. Defaults to False.
title (str) – Figure title. Defaults to “Receiver Operating Characteristic Curve”.
savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage

plot_hyper_opt(baseline=None, xlim=None, ylim=None, xlog=True, ylog=False, ylabel=None, title=None, loc='upper left', ncol=1, savefig=False)

Visualizes Optuna optimization history: trial values and running best.

Parameters:

baseline (float, optional) – Horizontal baseline to compare against. Defaults to None.
xlim (tuple, optional) – X-axis limits. Defaults to None.
ylim (tuple, optional) – Y-axis limits. Defaults to None.
xlog (bool) – Log-scale the x-axis. Defaults to True.
ylog (bool) – Log-scale the y-axis. Defaults to False.
ylabel (str, optional) – Custom y-axis label. Defaults to None.
title (str, optional) – Custom title; inferred from clf when not set. Defaults to None.
loc (str) – Legend location. Defaults to ‘upper left’.
ncol (int) – Number of legend columns. Defaults to 1.
savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage
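The "running best" curve mentioned in the description of plot_hyper_opt is the cumulative best objective value seen up to each trial. A minimal sketch of that computation, using only the standard library, is given below. The function name `running_best` and the `maximize` flag are hypothetical; the sketch illustrates the quantity being plotted, not how pyBIA or Optuna compute it.

```python
from itertools import accumulate


def running_best(trial_values, maximize=True):
    """Return the best value observed up to and including each trial."""
    # accumulate() folds max (or min) over the sequence, keeping every step.
    return list(accumulate(trial_values, max if maximize else min))


print(running_best([0.5, 0.7, 0.6, 0.9]))  # [0.5, 0.7, 0.7, 0.9]
```

Plotting the raw trial values as points and this sequence as a step line reproduces the two series the method describes.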

plot_feature_opt(feat_names=None, top='all', include_other=True, include_shadow=True, include_rejected=False, flip_axes=True, title='Feature Importance', save_data=False, savefig=False)

Displays BorutaSHAP z-score distributions per feature across trials.

Parameters:

feat_names (array-like, optional) – Names for features in data_x. Defaults to None.
top (int or 'all') – Number of accepted features to show; ‘all’ shows every accepted feature. Defaults to ‘all’.
include_other (bool) – Aggregate remaining accepted features into an “Other Accepted” entry. Defaults to True.
include_shadow (bool) – Include the Max Shadow baseline. Defaults to True.
include_rejected (bool) – Append averaged rejected features. Defaults to False.
flip_axes (bool) – Plot horizontally (True) or vertically (False). Defaults to True.
title (str) – Figure title. Defaults to ‘Feature Importance’.
save_data (bool) – Keep the temporary CSV written by BorutaSHAP for this plot. Defaults to False.
savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage

plot_hyper_param_importance(plot_time=True, savefig=False)

Plots hyperparameter importance and, optionally, duration importance.

Parameters:

plot_time (bool) – Include the impact on optimization duration. Defaults to True.
savefig (bool) – Save a PNG instead of showing. Defaults to False.

Return type:

AxesImage

save_hyper_importance()

Computes and saves dictionaries of hyperparameter importance and duration importance for later plotting.

Notes

Writes two files into the model directory: Hyperparameter_Importance and Duration_Importance. This step can be time-consuming.

Return type: None

pyBIA.ensemble_model.format_labels(labels: list) → list

Format hyperparameter/feature labels for display.

Replaces underscores with spaces, title-cases words, and applies a few reader-friendly aliases.

Parameters: labels (list of str) – Raw label strings to format.
Returns: Reformatted labels, same length as the input.
Return type: list of str
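The transformation described above can be approximated in a few lines. This sketch covers the underscore replacement and title-casing; the alias table that pyBIA applies is not listed on this page, so the `aliases` argument and the example alias below are hypothetical placeholders, as is the function name `format_labels_sketch`.

```python
def format_labels_sketch(labels, aliases=None):
    """Approximate label formatting: underscores to spaces, then title case."""
    aliases = aliases or {}  # hypothetical alias table, empty by default
    formatted = []
    for label in labels:
        text = label.replace('_', ' ').title()
        # Apply a reader-friendly alias when one is defined for this text.
        formatted.append(aliases.get(text, text))
    return formatted


print(format_labels_sketch(['n_estimators', 'max_depth']))
# ['N Estimators', 'Max Depth']
```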

pyBIA.ensemble_model.evaluate_model(classifier, data_x, data_y, normalize=True, k_fold=10, random_state=1909)

Cross-validates a classifier and returns out-of-fold predictions together with the corresponding ground-truth labels.

Parameters:

classifier (estimator) – Any scikit-learn–compatible model implementing fit and predict.
data_x (ndarray of shape (n_samples, n_features)) – Feature matrix.
data_y (array-like of shape (n_samples,)) – Target labels.
normalize (bool, optional) – Unused in this function; retained for API compatibility with plotting utilities. Defaults to True.
k_fold (int, optional) – Number of K-fold splits. Defaults to 10.
random_state (int, optional) – Seed for shuffling within the cross-validation splitter. Defaults to 1909.

Returns:

predicted_targets (ndarray of shape (n_samples,)) – Out-of-fold predicted labels concatenated across folds.
actual_targets (ndarray of shape (n_samples,)) – True labels ordered identically to predicted_targets.
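Out-of-fold prediction means that every sample is predicted exactly once, by a model that never saw it during training. The dependency-free sketch below illustrates the idea with a shuffled K-fold split. The names `MajorityClassifier` and `out_of_fold_predictions` are hypothetical, the toy estimator only stands in for any object with fit and predict, and the fold construction is an assumption of this sketch rather than pyBIA's implementation.

```python
import random


class MajorityClassifier:
    """Toy stand-in estimator: always predicts the most common training label."""

    def fit(self, data_x, data_y):
        self.label_ = max(set(data_y), key=list(data_y).count)
        return self

    def predict(self, data_x):
        return [self.label_ for _ in data_x]


def out_of_fold_predictions(classifier, data_x, data_y, k_fold=10, random_state=1909):
    """Return (predicted, actual) labels collected from the held-out folds."""
    indices = list(range(len(data_x)))
    random.Random(random_state).shuffle(indices)  # seeded shuffle before splitting
    folds = [indices[i::k_fold] for i in range(k_fold)]

    predicted, actual = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Refit on the training folds only, then predict the held-out fold.
        classifier.fit([data_x[i] for i in train_idx], [data_y[i] for i in train_idx])
        predicted.extend(classifier.predict([data_x[i] for i in test_idx]))
        actual.extend(data_y[i] for i in test_idx)
    return predicted, actual


features = [[float(i)] for i in range(20)]
labels = [0] * 10 + [1] * 10
predicted, actual = out_of_fold_predictions(MajorityClassifier(), features, labels, k_fold=5)
```

Because the folds are concatenated in split order, `actual` is a permutation of the input labels; it is returned alongside `predicted` so the two stay aligned for a confusion matrix.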

pyBIA.ensemble_model.generate_matrix(predicted_labels_list, actual_targets, classes, normalize=True, title='Confusion Matrix', savefig=False)

Generate and render a confusion matrix from predicted and true labels.

Parameters:

predicted_labels_list (array-like of shape (n_samples,)) – Predicted class labels, typically the out-of-fold predictions returned by evaluate_model().
actual_targets (array-like of shape (n_samples,)) – Ground-truth class labels in the same order as predicted_labels_list.
classes (list of str) – Class names used to label the matrix axes. The order must match the label encoding in the inputs.
normalize (bool, optional) – If True the confusion matrix is normalized (row-wise) before plotting. Defaults to True.
title (str, optional) – Figure title. Defaults to ‘Confusion Matrix’.
savefig (bool, optional) – If True the figure is saved to ‘Ensemble_Confusion_Matrix.png’ and not displayed. Defaults to False.

Returns:

Displays the figure or saves it to disk.

Return type:

None
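Row-wise normalization divides each row of the confusion matrix by the number of true samples in that class, so every row sums to 1 and the diagonal shows per-class recall. The sketch below shows the counting and the normalization in plain Python. The function names `confusion_counts` and `normalize_rows` are hypothetical, and the sketch assumes labels are already encoded as integers 0 to n_classes - 1; it illustrates the arithmetic only and does not reproduce pyBIA's plotting code.

```python
def confusion_counts(actual, predicted, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, guess in zip(actual, predicted):
        matrix[truth][guess] += 1  # rows: true class, columns: predicted class
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total; empty rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


counts = confusion_counts([0, 0, 1, 1, 1, 1], [0, 1, 1, 1, 0, 1], n_classes=2)
print(counts)                  # [[1, 1], [1, 3]]
print(normalize_rows(counts))  # [[0.5, 0.5], [0.25, 0.75]]
```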

pyBIA.ensemble_model.generate_plot(conf_matrix, classes, normalize=False, title='Confusion Matrix', include_cbar=False, savefig=False)

Generate a confusion-matrix figure and axes without calling plt.show().

Parameters:

conf_matrix (array-like of shape (n_classes, n_classes)) – Confusion matrix (counts) produced upstream (e.g., via confusion_matrix).
classes (list of str) – Class names used for tick labels. Order must match the matrix axes.
normalize (bool, optional) – If True the matrix is normalized row-wise to proportions. Defaults to False.
title (str, optional) – Figure title. Defaults to ‘Confusion Matrix’.
include_cbar (bool, optional) – If True a colorbar is added to the figure. Defaults to False.
savefig (bool, optional) – Included for API symmetry; saving is typically handled by the caller. Defaults to False.

Returns:

fig (matplotlib.figure.Figure) – The created figure.
ax (matplotlib.axes.Axes) – The axes containing the confusion matrix.

pyBIA.ensemble_model._set_style_()

Configures the matplotlib.pyplot style. This function is called before any images are saved, after which the style is reset to the default.