pyBIA.ensemble_model
Created on Wed Sep 8 10:04:23 2021
@author: daniel
Classes
Classifier | Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection, and Optuna hyperparameter optimization.
Functions
format_labels | Format hyperparameter/feature labels for display.
evaluate_model | Cross-validates a classifier and returns out-of-fold predictions together with the corresponding ground-truth labels.
generate_matrix | Generate and render a confusion matrix from predicted and true labels.
generate_plot | Generate a confusion-matrix figure and axes without calling plt.show().
Function to configure the matplotlib.pyplot style. This function is called before any images are saved, |
Module Contents
- class pyBIA.ensemble_model.Classifier(data_x=None, data_y=None, clf='rf', optimize=False, opt_cv=10, scoring_metric='f1', limit_search=True, impute=True, imp_method='knn', n_iter=25, boruta_trials=50, boruta_model='rf', balance=True, csv_file=None, SEED_NO=1909)
Creates a machine-learning classifier with optional imputation, BorutaSHAP feature selection, and Optuna hyperparameter optimization. Utilities are provided to save/load artifacts and to plot diagnostics (t-SNE, confusion matrix, ROC, optimization history, and importances).
- Parameters:
data_x (ndarray) – Feature matrix of shape (n_samples, n_features).
data_y (array-like) – 1D array of labels aligned to data_x.
clf (str) – Estimator to build. One of {‘rf’, ‘nn’, ‘xgb’, ‘histgb’, ‘adaboost’, ‘svc’, ‘logreg’, ‘bdt’, ‘gaussian_nb’, ‘knn’, ‘extratrees’, ‘tree’, ‘ocsvm’}. Defaults to ‘rf’.
optimize (bool) – Run BorutaSHAP (when boruta_trials > 0) and Optuna search before fitting. Defaults to False.
opt_cv (int) – Number of cross-validation folds used during optimization. Defaults to 10.
scoring_metric (str) – Metric optimized by Optuna. One of {‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’}. Defaults to ‘f1’.
limit_search (bool) – Constrain very wide hyperparameter ranges for practicality. Defaults to True.
impute (bool) – Impute missing values prior to fitting. Defaults to True.
imp_method (str) – Imputation strategy. One of {‘knn’, ‘mean’, ‘median’, ‘mode’, ‘constant’}. Defaults to ‘knn’.
n_iter (int) – Number of Optuna trials; use 0 to skip search. Defaults to 25.
boruta_trials (int) – Number of BorutaSHAP trials; use 0 to skip feature selection. Defaults to 50.
boruta_model (str) – Base estimator for BorutaSHAP, independent of clf. One of {‘rf’, ‘xgb’}. Defaults to ‘rf’.
balance (bool) – Apply class weighting for imbalanced binary tasks where supported. Defaults to True.
csv_file (DataFrame, optional) – Alternative to (data_x, data_y). Must include a ‘label’ column. Defaults to None.
SEED_NO (int) – Random seed used across components. Defaults to 1909.
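Several of these parameters are plain strings drawn from fixed sets, so a quick check before construction can catch typos early. The following is a minimal sketch built only from the option sets listed above; the helper name check_classifier_options is hypothetical and is not part of pyBIA.

```python
# Option sets copied from the parameter descriptions above.
DOCUMENTED_OPTIONS = {
    "clf": {"rf", "nn", "xgb", "histgb", "adaboost", "svc", "logreg", "bdt",
            "gaussian_nb", "knn", "extratrees", "tree", "ocsvm"},
    "scoring_metric": {"accuracy", "f1", "precision", "recall", "roc_auc"},
    "imp_method": {"knn", "mean", "median", "mode", "constant"},
    "boruta_model": {"rf", "xgb"},
}


def check_classifier_options(**kwargs):
    """Hypothetical helper: list every string option outside its documented set."""
    problems = []
    for name, value in kwargs.items():
        allowed = DOCUMENTED_OPTIONS.get(name)
        # Options without a documented set (e.g. opt_cv) are not checked here.
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


settings = {"clf": "xgb", "scoring_metric": "f1",
            "imp_method": "knn", "boruta_model": "rf"}
print(check_classifier_options(**settings))
```

An empty list means every checked option matches a documented value; the same settings dictionary could then be unpacked into the constructor.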
- data_y_
Copy of original labels (pre-encoding) for plots.
- Type:
ndarray or None
- create(overwrite_training=True)
Builds the pipeline (optional feature selection and optimization), fits the estimator, and stores artifacts.
- Parameters:
overwrite_training (bool) – When True, replace self.data_x with the processed matrix used for fitting. Defaults to True.
- Return type:
None
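When impute is True, missing values are filled before the estimator is fitted. As an illustration of what the simpler imp_method choices mean, here is a standard-library sketch of column-wise imputation. It is an assumption about the general technique, not pyBIA's implementation, and it does not cover the ‘knn’ strategy; the function name impute_columns and the fill_value argument are hypothetical.

```python
import math
from statistics import mean, median, mode

# Column statistics for the simple strategies; 'constant' is handled separately.
STRATEGIES = {"mean": mean, "median": median, "mode": mode}


def impute_columns(rows, method="mean", fill_value=0.0):
    """Replace NaN entries column by column using the chosen strategy."""
    if method != "constant" and method not in STRATEGIES:
        raise ValueError(f"unsupported method: {method!r}")
    fills = []
    for column in zip(*rows):
        observed = [v for v in column if not math.isnan(v)]
        if method == "constant" or not observed:
            # Fall back to the constant when a column has no observed values.
            fills.append(fill_value)
        else:
            fills.append(STRATEGIES[method](observed))
    return [[fills[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]


nan = float("nan")
features = [[1.0, nan], [3.0, 4.0], [nan, 6.0]]
imputed = impute_columns(features, method="mean")
```

The fill values are computed from the observed entries of each column only, so rows without gaps pass through unchanged.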
- save(dirname=None, path=None, overwrite=False)
Saves the trained model and auxiliary artifacts.
Notes
Creates a pyBIA_ensemble_model/ folder containing, when available: Model, Imputer, Feats_Index, HyperOpt_Results, Best_Params, and FeatureOpt_Results.
- Parameters:
- Return type:
None
- Raises:
ValueError – If nothing has been created (run .create() first) or if the target exists and overwrite is False.
- load(path=None)
Loads model and auxiliary artifacts from a pyBIA_ensemble_model/ folder.
- Parameters:
path (str, optional) – Base directory containing the folder. The user home is used when not provided. Defaults to None.
- Return type:
None
- predict(data)
Predicts class labels and top-class probabilities for new samples.
- Parameters:
data (ndarray) – Feature matrix of shape (n_samples, n_features). If feature selection was used, only the selected columns are required.
- Returns:
Array of shape (n_samples, 2) with rows [predicted_label, probability_of_predicted_label].
- Return type:
ndarray
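The two-column output can be pictured as taking, for each sample, the class with the highest probability together with that probability. A minimal sketch follows, assuming a table of per-class probabilities such as the one scikit-learn's predict_proba returns; it mirrors the documented output layout rather than pyBIA's internals, and top_class_rows is a hypothetical name.

```python
def top_class_rows(probabilities, class_labels):
    """Build rows of [predicted_label, probability_of_predicted_label]."""
    rows = []
    for probs in probabilities:
        # Index of the largest per-class probability for this sample.
        best = max(range(len(probs)), key=probs.__getitem__)
        rows.append([class_labels[best], probs[best]])
    return rows


# Made-up probabilities for three samples and two classes.
rows = top_class_rows([[0.2, 0.8], [0.9, 0.1], [0.45, 0.55]], class_labels=[0, 1])

# Keep only the predictions whose top-class probability is at least 0.75.
confident = [row for row in rows if row[1] >= 0.75]
```

Filtering on the second column, as in the last line, is one way to discard low-confidence predictions downstream.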
- plot_tsne(data_y=None, special_class=None, norm=True, pca=False, return_data=False, xlim=None, ylim=None, legend_loc='upper center', title='Feature Parameter Space', savefig=False)
Plots a 2D t-SNE embedding of the feature space.
- Parameters:
data_y (array-like, optional) – Labels for coloring. The classifier’s labels are used when not provided. Defaults to None.
special_class (hashable, optional) – Class label to highlight. Defaults to None.
norm (bool) – Standardize features before t-SNE. Defaults to True.
pca (bool) – Apply PCA (all components) before t-SNE. Defaults to False.
return_data (bool) – Return the (x, y) coordinates instead of only plotting. Defaults to False.
xlim (tuple, optional) – X-axis limits. Defaults to None.
ylim (tuple, optional) – Y-axis limits. Defaults to None.
legend_loc (str) – Legend location. Defaults to ‘upper center’.
title (str) – Figure title. Defaults to ‘Feature Parameter Space’.
savefig (bool) – Save a PNG instead of showing. Defaults to False.
- Returns:
When return_data is False, returns the plotted artist. When True, returns (x, y) coordinates.
- Return type:
AxesImage or tuple
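The norm option standardizes features before the embedding is computed, which keeps features with large numeric ranges from dominating the distances t-SNE works with. A standard-library sketch of column-wise standardization is shown below; the use of the population standard deviation and the handling of constant columns are assumptions, and standardize_columns is a hypothetical name.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Shift each column to zero mean and scale it to unit variance."""
    columns = list(zip(*rows))
    centers = [mean(c) for c in columns]
    # A constant column has zero spread; divide by 1.0 so it maps to zeros.
    scales = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, centers, scales)]
            for row in rows]


scaled = standardize_columns([[1.0, 10.0], [3.0, 10.0]])
```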
- plot_conf_matrix(data_y=None, norm=False, pca=False, k_fold=10, normalize=True, title='Confusion Matrix', savefig=False)
Plots a confusion matrix under k-fold cross-validation.
- Parameters:
data_y (array-like, optional) – Human-readable labels aligned to the model’s internal labels. The classifier’s labels are used when not provided. Defaults to None.
norm (bool) – Min-max normalize features before evaluation. Defaults to False.
pca (bool) – Evaluate on PCA-projected features. Defaults to False.
k_fold (int) – Number of cross-validation folds. Defaults to 10.
normalize (bool) – Show rates (True) or counts (False). Defaults to True.
title (str) – Figure title. Defaults to ‘Confusion Matrix’.
savefig (bool) – Save a PNG instead of showing. Defaults to False.
- Return type:
AxesImage
- plot_roc_curve(k_fold=10, pca=False, title='Receiver Operating Characteristic Curve', savefig=False)
Plots the mean ROC curve with ±1σ band under k-fold cross-validation for binary classification.
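- Parameters:
k_fold (int) – Number of cross-validation folds. Defaults to 10.
pca (bool) – Evaluate on PCA-projected features. Defaults to False.
title (str) – Figure title. Defaults to ‘Receiver Operating Characteristic Curve’.
savefig (bool) – Save a PNG instead of showing. Defaults to False.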
- Return type:
AxesImage
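A mean ROC curve with a ±1σ band is commonly obtained by interpolating each fold's true-positive rate onto a shared false-positive-rate grid and then taking the mean and standard deviation at each grid point. The sketch below shows that procedure with the standard library; it is an assumption about the general approach, not a copy of pyBIA's code, and the toy curves are made up and strictly increasing in FPR for simplicity.

```python
from statistics import mean, pstdev


def interpolate(x, xs, ys):
    """Linearly interpolate the curve (xs, ys) at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + fraction * (ys[i + 1] - ys[i])


def mean_roc(fold_curves, grid):
    """Average per-fold TPR values on a shared FPR grid.

    fold_curves is a list of (fpr, tpr) pairs, one per fold.
    Returns (mean_tpr, std_tpr), each aligned with grid.
    """
    per_fold = [[interpolate(x, fpr, tpr) for x in grid] for fpr, tpr in fold_curves]
    columns = list(zip(*per_fold))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


# Two toy folds (made-up numbers, not pyBIA output).
folds = [([0.0, 0.5, 1.0], [0.0, 0.8, 1.0]),
         ([0.0, 0.5, 1.0], [0.0, 0.6, 1.0])]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
mean_tpr, std_tpr = mean_roc(folds, grid)

# Band edges, clipped to the valid [0, 1] range of a rate.
upper = [min(m + s, 1.0) for m, s in zip(mean_tpr, std_tpr)]
lower = [max(m - s, 0.0) for m, s in zip(mean_tpr, std_tpr)]
```

The lists mean_tpr, lower, and upper can then be drawn against grid as the central curve and its shaded band.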
- plot_hyper_opt(baseline=None, xlim=None, ylim=None, xlog=True, ylog=False, ylabel=None, title=None, loc='upper left', ncol=1, savefig=False)
Visualizes Optuna optimization history: trial values and running best.
- Parameters:
baseline (float, optional) – Horizontal baseline to compare against. Defaults to None.
xlim (tuple, optional) – X-axis limits. Defaults to None.
ylim (tuple, optional) – Y-axis limits. Defaults to None.
xlog (bool) – Log-scale the x-axis. Defaults to True.
ylog (bool) – Log-scale the y-axis. Defaults to False.
ylabel (str, optional) – Custom y-axis label. Defaults to None.
title (str, optional) – Custom title; inferred from clf when not set. Defaults to None.
loc (str) – Legend location. Defaults to ‘upper left’.
ncol (int) – Number of legend columns. Defaults to 1.
savefig (bool) – Save a PNG instead of showing. Defaults to False.
- Return type:
AxesImage
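The "running best" series in an optimization history is the best score seen up to and including each trial. A minimal sketch, assuming a list of per-trial scores where higher is better (as with the scoring metrics listed for the class); running_best is a hypothetical helper name.

```python
from itertools import accumulate


def running_best(trial_values, maximize=True):
    """Return the best value observed up to and including each trial."""
    return list(accumulate(trial_values, max if maximize else min))


# Made-up trial scores for illustration.
scores = [0.60, 0.70, 0.65, 0.80]
best_so_far = running_best(scores)
```

Plotting scores as points and best_so_far as a step line reproduces the two series the method describes.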
- plot_feature_opt(feat_names=None, top='all', include_other=True, include_shadow=True, include_rejected=False, flip_axes=True, title='Feature Importance', save_data=False, savefig=False)
Displays BorutaSHAP z-score distributions per feature across trials.
- Parameters:
feat_names (array-like, optional) – Names for features in data_x. Defaults to None.
top (int or 'all') – Number of accepted features to show; ‘all’ shows every accepted feature. Defaults to ‘all’.
include_other (bool) – Aggregate remaining accepted features into an “Other Accepted” entry. Defaults to True.
include_shadow (bool) – Include the Max Shadow baseline. Defaults to True.
include_rejected (bool) – Append averaged rejected features. Defaults to False.
flip_axes (bool) – Plot horizontally (True) or vertically (False). Defaults to True.
title (str) – Figure title. Defaults to ‘Feature Importance’.
save_data (bool) – Keep the temporary CSV written by BorutaSHAP for this plot. Defaults to False.
savefig (bool) – Save a PNG instead of showing. Defaults to False.
- Return type:
AxesImage
- pyBIA.ensemble_model.format_labels(labels: list) -> list
Format hyperparameter/feature labels for display.
Replaces underscores with spaces, title-cases words, and applies a few reader-friendly aliases.
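The transformation can be sketched in a few lines. The version below is an approximation written from the description above; the actual alias table used by pyBIA is not given here, so the aliases argument and its example entry are hypothetical.

```python
def format_labels_sketch(labels, aliases=None):
    """Approximate the described formatting: underscores to spaces, then title case."""
    aliases = aliases or {}
    formatted = []
    for label in labels:
        text = str(label).replace("_", " ").title()
        # Apply an alias when one is defined for the formatted text.
        formatted.append(aliases.get(text, text))
    return formatted


plain = format_labels_sketch(["max_depth", "n_estimators"])

# Made-up alias, for illustration only.
aliased = format_labels_sketch(["n_estimators"], aliases={"N Estimators": "Number of Trees"})
```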
- pyBIA.ensemble_model.evaluate_model(classifier, data_x, data_y, normalize=True, k_fold=10, random_state=1909)
Cross-validates a classifier and returns out-of-fold predictions together with the corresponding ground-truth labels.
- Parameters:
classifier (estimator) – Any scikit-learn–compatible model implementing fit and predict.
data_x (ndarray of shape (n_samples, n_features)) – Feature matrix.
data_y (array-like of shape (n_samples,)) – Target labels.
normalize (bool, optional) – Unused in this function; retained for API compatibility with plotting utilities. Defaults to True.
k_fold (int, optional) – Number of K-fold splits. Defaults to 10.
random_state (int, optional) – Seed for shuffling within the cross-validation splitter. Defaults to 1909.
- Returns:
predicted_targets (ndarray of shape (n_samples,)) – Out-of-fold predicted labels concatenated across folds.
actual_targets (ndarray of shape (n_samples,)) – True labels ordered identically to predicted_targets.
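Out-of-fold prediction means every sample is predicted exactly once, by a model that never saw it during training. The sketch below illustrates the idea with the standard library only. It assumes a plain shuffled K-fold split (whether pyBIA stratifies the folds is not stated here), uses a toy MajorityClassifier as a stand-in for a scikit-learn-compatible estimator, and the name out_of_fold_predictions is hypothetical.

```python
import copy
import random
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for an estimator exposing fit and predict."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def out_of_fold_predictions(classifier, data_x, data_y, k_fold=10, random_state=1909):
    """Predict each sample with a model trained on the other folds."""
    indices = list(range(len(data_x)))
    random.Random(random_state).shuffle(indices)
    # Deal the shuffled indices into k_fold roughly equal folds.
    folds = [indices[i::k_fold] for i in range(k_fold)]

    predicted, actual = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Fit a fresh copy so no state leaks between folds.
        model = copy.deepcopy(classifier)
        model.fit([data_x[i] for i in train_idx], [data_y[i] for i in train_idx])
        predicted.extend(model.predict([data_x[i] for i in test_idx]))
        actual.extend(data_y[i] for i in test_idx)
    return predicted, actual


X = [[float(i)] for i in range(20)]
y = [0] * 15 + [1] * 5
predicted_targets, actual_targets = out_of_fold_predictions(MajorityClassifier(), X, y, k_fold=5)
```

Because the two returned lists are built in the same loop, they stay aligned element by element, which is what a confusion matrix computed from them requires.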
- pyBIA.ensemble_model.generate_matrix(predicted_labels_list, actual_targets, classes, normalize=True, title='Confusion Matrix', savefig=False)
Generate and render a confusion matrix from predicted and true labels.
- Parameters:
predicted_labels_list (array-like of shape (n_samples,)) – Predicted class labels, typically the out-of-fold predictions returned by evaluate_model().
actual_targets (array-like of shape (n_samples,)) – Ground-truth class labels in the same order as predicted_labels_list.
classes (list of str) – Class names used to label the matrix axes. The order must match the label encoding in the inputs.
normalize (bool, optional) – If True, the confusion matrix is normalized row-wise before plotting. Defaults to True.
title (str, optional) – Figure title. Defaults to ‘Confusion Matrix’.
savefig (bool, optional) – If True, the figure is saved to ‘Ensemble_Confusion_Matrix.png’ and not displayed. Defaults to False.
- Returns:
Displays the figure or saves it to disk.
- Return type:
None
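For reference, the count matrix behind such a figure can be tallied directly from the two label sequences. A minimal sketch, assuming integer-encoded labels 0..n_classes-1 and the common convention of true classes on rows and predicted classes on columns; confusion_counts is a hypothetical helper, not a pyBIA function.

```python
def confusion_counts(actual, predicted, n_classes):
    """Tally a confusion matrix: rows are true classes, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(actual, predicted):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for five samples.
counts = confusion_counts(actual=[0, 0, 1, 1, 1], predicted=[0, 1, 1, 1, 0], n_classes=2)
```

Entry counts[i][j] is the number of class-i samples predicted as class j, so the diagonal holds the correct predictions.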
- pyBIA.ensemble_model.generate_plot(conf_matrix, classes, normalize=False, title='Confusion Matrix', include_cbar=False, savefig=False)
Generate a confusion-matrix figure and axes without calling plt.show().
- Parameters:
conf_matrix (array-like of shape (n_classes, n_classes)) – Confusion matrix (counts) produced upstream (e.g., via confusion_matrix).
classes (list of str) – Class names used for tick labels. Order must match the matrix axes.
normalize (bool, optional) – If True, the matrix is normalized row-wise to proportions. Defaults to False.
title (str, optional) – Figure title. Defaults to ‘Confusion Matrix’.
include_cbar (bool, optional) – If True, a colorbar is added to the figure. Defaults to False.
savefig (bool, optional) – Included for API symmetry; saving is typically handled by the caller. Defaults to False.
- Returns:
fig (matplotlib.figure.Figure) – The created figure.
ax (matplotlib.axes.Axes) – The axes containing the confusion matrix.
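Row-wise normalization divides each row of the count matrix by its total, so every row shows the fraction of that true class assigned to each predicted class. A small sketch of the arithmetic; normalize_rows is a hypothetical name, and mapping empty rows to zeros is an assumption made to avoid division by zero.

```python
def normalize_rows(conf_matrix):
    """Convert counts to per-row proportions; rows with no samples become zeros."""
    normalized = []
    for row in conf_matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


rates = normalize_rows([[1, 1], [1, 3]])
```

After normalization the diagonal entries are the per-class recall values.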