pyBIA.outlier_detection

Created on Wed Aug 2 06:11:11 2023

@author: danielgodinez

Classes

Classifier

Build and apply an ensemble outlier-detection classifier on image cutouts.

Functions

`hog_feature_extraction`(images[, return_image, max_pool])	Extract Histogram of Oriented Gradients (HOG) features per channel.
`wavelet_energy_feature_extraction`(→ numpy.ndarray)	Compute per-subband wavelet energies per channel and concatenate.
`statistical_feature_extraction`(→ numpy.ndarray)	Compute global statistics and simple texture descriptors per channel.
`lbp_feature_extraction`(images[, P, R])	Extract Local Binary Pattern (LBP) histograms per channel and concatenate.
`fft_energy_feature_extraction`(images[, band_edges, ...])	Compute 2D FFT radial-band energies per channel and concatenate.

Module Contents

class pyBIA.outlier_detection.Classifier(data=None, normalize=False, min_pixel=0, max_pixel=10, img_num_channels=1, feat_set='hog', clf='iforest', impute=True, imp_method='knn', scale_features=False, scaler_type='robust', apply_pca=False, pca_components=None, SEED_NO=1909)

Build and apply an ensemble outlier-detection classifier on image cutouts.

The classifier workflow supports optional min–max normalization, feature extraction (HOG, LBP, FFT, wavelet, or simple statistics), optional imputation of missing values, and model fitting using an Isolation Forest (clf='iforest').

If multiple feature sets are provided, an independent pipeline (imputer, scaler, PCA, and model) is trained for each feature set. Predictions are made either by averaging the anomaly scores across all independent models or by selecting the most anomalous score across all models.

Parameters:

data (ndarray or None, optional) – Image tensor with shape (N, H, W, C).
normalize (bool, optional) – If True, min–max normalize each image/channel before feature extraction.
min_pixel (float, optional) – Lower bound for min–max normalization.
max_pixel (float, optional) – Upper bound for min–max normalization.
img_num_channels (int, optional) – Number of channels in the input tensor.
feat_set (str or list/tuple of str, optional) – Feature family (or families) to compute for training. Options: ‘hog’, ‘lbp’, ‘fft’, ‘wavelet’, ‘stats’.
clf ({'iforest'}, optional) – Classifier to train. Currently only Isolation Forest is supported.
impute (bool, optional) – If True, impute missing feature values before fitting/predicting.
imp_method ({'knn','mean','median','mode','constant'}, optional) – Imputation strategy used by impute_missing_values.
scale_features (bool, optional) – If True, scales extracted features.
scaler_type ({'robust', 'standard'}, optional) – Type of scaler to use. Default is ‘robust’.
apply_pca (bool, optional) – If True, performs Principal Component Analysis on the extracted features.
pca_components (int or float, optional) – Number of components to keep.
SEED_NO (int, optional) – Random seed used for model initialization. Default is 1909.
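
The normalize, min_pixel, and max_pixel options describe a min-max scaling step applied before feature extraction. A minimal NumPy sketch of one plausible reading follows. The helper name min_max_normalize and the clipping step are assumptions made for illustration and are not taken from the pyBIA source.

```python
import numpy as np

def min_max_normalize(images, min_pixel=0.0, max_pixel=10.0):
    """Clip to [min_pixel, max_pixel] and rescale to [0, 1] (hypothetical helper)."""
    images = np.asarray(images, dtype=np.float64)
    clipped = np.clip(images, min_pixel, max_pixel)  # assumption: out-of-range pixels are clipped
    return (clipped - min_pixel) / (max_pixel - min_pixel)

rng = np.random.default_rng(1909)
cutouts = rng.normal(loc=5.0, scale=4.0, size=(3, 16, 16, 1))  # (N, H, W, C)
scaled = min_max_normalize(cutouts, min_pixel=0, max_pixel=10)
```

With fixed bounds, every cutout is mapped onto the same scale, so features extracted from different images remain comparable.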

models

Dictionary of trained models keyed by feature name.

Type: dict

imputers

Dictionary of fitted imputers keyed by feature name.

Type: dict

scalers

Dictionary of fitted scalers keyed by feature name.

Type: dict

pcas

Dictionary of fitted PCA models keyed by feature name.

Type: dict

data = None

normalize = False

min_pixel = 0

max_pixel = 10

img_num_channels = 1

clf = 'iforest'

impute = True

imp_method = 'knn'

scale_features = False

scaler_type = 'robust'

apply_pca = False

pca_components = None

SEED_NO = 1909

_extract_single_feature(data, feat: str) → numpy.ndarray

Extract a single feature matrix.

Parameters:

data (ndarray) – Input image tensor of shape (N, H, W, C).
feat ({'hog', 'lbp', 'fft', 'wavelet', 'stats'}) – Feature extraction method to apply:
    - ‘hog’ : Histogram of Oriented Gradients
    - ‘lbp’ : Local Binary Patterns
    - ‘fft’ : Fourier-based energy features
    - ‘wavelet’ : Wavelet energy features
    - ‘stats’ : Simple statistical features

Returns:

f_data – Extracted feature matrix of shape (N, D), where D depends on the selected feature type.

Return type:

ndarray

create(n_estimators=100, max_samples='auto', contamination='auto', max_features=1.0)

Initialize, featurize, optionally impute, and fit the classifier.

This method instantiates the model, optionally normalizes the data using the min-max bounds, extracts the features, replaces infinite values with NaN, and optionally imputes missing values. The model is then fitted on the resulting feature matrix. The optional arguments are Isolation Forest hyperparameters; their defaults match the scikit-learn defaults.

Parameters:

n_estimators (int) – Number of trees to fit. Defaults to 100.
max_samples ('auto' or int) – Number of training samples drawn to fit each tree. Defaults to ‘auto’.
contamination ('auto' or float) – Expected proportion of outliers in the training data. Determines the anomaly score threshold. Defaults to ‘auto’.
max_features (int or float) – Number (or proportion, if float) of features drawn from the feature matrix when training the model. Defaults to 1.0.

Return type:

None

Raises:

ValueError – If an unsupported clf is requested.
ValueError – If impute=False and the feature matrix contains NaNs or infs.
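
The cleaning step described for create (infinite values become NaN, then missing values are either imputed or rejected) can be sketched in NumPy. This is an illustration under stated assumptions: clean_features is a hypothetical helper, and it uses a per-column median fill (the 'median' option) rather than the default 'knn' strategy.

```python
import numpy as np

def clean_features(features, impute=True):
    """Replace +/-inf with NaN, then median-impute each column (hypothetical helper)."""
    X = np.array(features, dtype=np.float64)  # copy so the caller's array is untouched
    X[~np.isfinite(X)] = np.nan
    if not impute:
        if np.isnan(X).any():
            raise ValueError("Feature matrix contains NaN/inf; set impute=True.")
        return X
    medians = np.nanmedian(X, axis=0)   # per-feature median, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]       # fill each gap with its column median
    return X

raw = np.array([[1.0, np.inf], [3.0, 4.0], [np.nan, 6.0]])
cleaned = clean_features(raw)
```

Converting infinities to NaN first lets a single imputation path handle both kinds of bad values.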

save(dirname=None, path=None, overwrite=False)

Save the trained model (and imputer/scaler/pca if present) to disk.

Creates a directory pyBIA_outlier_model under path[/dirname]/ and writes the IsolationForest model and the fitted imputer/scaler/pca, if applicable.

Parameters:

dirname (str or None, optional) – Optional subdirectory to create inside path. Must not already exist.
path (str or None, optional) – Base directory where the model folder will be saved. If None, uses the user’s home directory.
overwrite (bool, optional) – If True and pyBIA_outlier_model exists, delete its contents and recreate it. If False, raise if the folder exists. Default is False.

Return type:

None

Raises:

ValueError – If no artifacts are available to save (e.g., model not created).
ValueError – If attempting to create an existing directory without overwrite=True.

load(path=None)

Load a saved model (and imputer/scaler/pca if present) from disk.

Looks for a folder named pyBIA_outlier_model under path (or the user’s home directory if path is None) and attempts to load Model and Imputer artifacts into self.model and self.imputer.

Parameters:

path (str or None, optional) – Base directory containing pyBIA_outlier_model/. If None, uses the user’s home directory.

Return type:

None

predict(data, ensemble_method='strict')

Predict outlier/inlier labels via ensemble aggregation.

Parameters:

data (ndarray) – Image tensor with shape (N, H, W, C).
ensemble_method ({'average', 'strict'}, optional) – How to combine the scores from the independent models. ‘average’ computes the mean of the scores. ‘strict’ takes the minimum score, so a sample is marked as an anomaly if any single model flags it. Default is ‘strict’.
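
The two aggregation modes can be illustrated with a small NumPy sketch. It assumes scikit-learn-style decision scores (negative means anomalous), a threshold at 0, and labels of -1 for outliers and 1 for inliers. None of these conventions is stated in this reference, and aggregate_scores is a hypothetical helper.

```python
import numpy as np

def aggregate_scores(scores_by_feature, ensemble_method="strict"):
    """Combine per-model anomaly scores (lower = more anomalous) into labels."""
    stacked = np.vstack(list(scores_by_feature.values()))  # shape (n_models, N)
    if ensemble_method == "average":
        combined = stacked.mean(axis=0)
    elif ensemble_method == "strict":
        combined = stacked.min(axis=0)  # the most anomalous score wins
    else:
        raise ValueError("ensemble_method must be 'average' or 'strict'")
    labels = np.where(combined < 0, -1, 1)  # assumed convention: -1 outlier, 1 inlier
    return combined, labels

scores = {
    "hog": np.array([0.10, -0.05, 0.20]),
    "fft": np.array([0.05, 0.15, -0.30]),
}
_, strict_labels = aggregate_scores(scores, "strict")
_, average_labels = aggregate_scores(scores, "average")
```

In this toy input the second sample is flagged only by the 'hog' model, so 'strict' marks it as an outlier while 'average' does not.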

pyBIA.outlier_detection.hog_feature_extraction(images, return_image=False, max_pool=False)

Extract Histogram of Oriented Gradients (HOG) features per channel.

Parameters:

images (ndarray) – Input tensor of shape (N, H, W, C), where N is the number of images, H×W are spatial dimensions, and C is the number of channels.
return_image (bool, optional) – If True, also return the HOG visualization images (per channel), stacked along the last axis. Default is False.
max_pool (bool, optional) – If True, apply global max pooling to each per-channel HOG feature vector (i.e., keep only its maximum value). The resulting feature for each image has shape (C,). If False, per-channel feature vectors are concatenated. Default is False.

Returns:

hog_features (ndarray) – If max_pool=False: array of shape (N, D), where D is the sum of HOG feature lengths across channels (concatenated). If max_pool=True: array of shape (N, C), one scalar per channel.
hog_images (ndarray, optional) – Returned only if return_image=True. Array of shape (N, H, W, C), containing per-channel HOG visualizations rescaled to display range.

Raises:

ValueError – If images does not have 4 dimensions (N, H, W, C).

Notes

Each channel is treated as a grayscale image (channel_axis=None).
HOG parameters are the scikit-image defaults (orientations, pixels per cell, cells per block, block normalization).

pyBIA.outlier_detection.wavelet_energy_feature_extraction(images: List[numpy.ndarray], wavelet: str = 'db4', level: int | None = None, mode: str = 'symmetric', stat: str = 'sum', log_scale: bool = True, normalize: bool = False, eps: float = 1e-10) → numpy.ndarray

Compute per-subband wavelet energies per channel and concatenate.

Parameters:

images (sequence of ndarray or ndarray) – Iterable of images with shape (H, W, C) or an array with shape (N, H, W, C). Iteration is over the first dimension.
wavelet (str, optional) – Wavelet name for PyWavelets. Default is ‘db4’.
level (int or None, optional) – Decomposition level L. If None, uses the maximum level allowed by the image size and wavelet filter length. Default is None.
mode (str, optional) – Boundary extension mode passed to pywt.wavedec2. Default is ‘symmetric’.
stat ({'sum','mean'}, optional) – Aggregation for each subband. Default is ‘sum’.
    - ‘sum’ : sum of squares (energy)
    - ‘mean’ : energy per coefficient (area-normalized)
log_scale (bool, optional) – If True, apply log(energy + eps) to each subband value. Default is True.
normalize (bool, optional) – If True, divide all subband values in a channel by that channel’s total (after stat), for relative energies. Default is False.
eps (float, optional) – Small constant used in log/normalization to avoid division by zero and log(0). Default is 1e-10.

Returns:

feats – Wavelet-energy feature matrix. For each image (N) and channel (C), the feature length is 1 + 3L (one approximation band + three detail bands per level), concatenated across channels.

Return type:

ndarray, shape (N, C * (1 + 3L))
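
The 1 + 3L feature layout can be reproduced with a dependency-free sketch. To avoid requiring PyWavelets, it swaps in an orthonormal Haar wavelet written in NumPy instead of ‘db4’ with pywt.wavedec2, so the values differ from the library's output; only the subband bookkeeping is the same. The function names are hypothetical, and H and W must be divisible by 2**level.

```python
import numpy as np

def haar_step(a):
    """One level of the orthonormal 2D Haar transform (even H and W required)."""
    # Pairwise sums/differences along rows, then along columns.
    lo = (a[0::2, :] + a[1::2, :]) / np.sqrt(2.0)
    hi = (a[0::2, :] - a[1::2, :]) / np.sqrt(2.0)
    approx = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2.0)
    detail_1 = (lo[:, 0::2] - lo[:, 1::2]) / np.sqrt(2.0)
    detail_2 = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2.0)
    detail_3 = (hi[:, 0::2] - hi[:, 1::2]) / np.sqrt(2.0)
    return approx, (detail_1, detail_2, detail_3)

def haar_energy_features(images, level=2, log_scale=True, eps=1e-10):
    """Per-subband energies: 1 approximation + 3 detail bands per level, per channel."""
    images = np.asarray(images, dtype=np.float64)
    n, h, w, c = images.shape
    feats = np.empty((n, c * (1 + 3 * level)))
    for i in range(n):
        row = []
        for ch in range(c):
            approx = images[i, :, :, ch]
            detail_energies = []
            for _ in range(level):
                approx, details = haar_step(approx)
                # Prepend so the coarsest level comes first.
                detail_energies = [np.sum(d ** 2) for d in details] + detail_energies
            row.extend([np.sum(approx ** 2)] + detail_energies)
        feats[i] = row
    return np.log(feats + eps) if log_scale else feats

rng = np.random.default_rng(1909)
demo_images = rng.normal(size=(2, 16, 16, 3))                 # (N, H, W, C)
wavelet_feats = haar_energy_features(demo_images, level=2)    # shape (2, 3 * 7)
```

Because the transform is orthonormal, the subband energies of a channel (with log_scale=False) add up to the channel's total sum of squares, which is a convenient sanity check.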

pyBIA.outlier_detection.statistical_feature_extraction(images: numpy.ndarray) → numpy.ndarray

Compute global statistics and simple texture descriptors per channel.

For each image channel, the following 10 features are computed over finite pixels only and concatenated across channels:

mean

std (population, ddof=0)

median

median absolute deviation (MAD)

1st percentile (p01)

99th percentile (p99)

min

max

skewness

kurtosis

Parameters:

images (ndarray, shape (N, H, W, C)) – Image tensor (floats). Non-finite values (NaN/±inf) are ignored when computing per-channel statistics.

Returns:

feats – Per-image feature matrix, dtype float64. Features are ordered as listed above for channel 0, then channel 1, etc.

Return type:

ndarray, shape (N, C * 10)

Raises:

ValueError – If images does not have 4 dimensions (N, H, W, C).
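
The ten per-channel statistics can be sketched in plain NumPy as follows. The exact skewness and kurtosis definitions are not given in this reference; the sketch assumes the standardized third moment and excess (Fisher) kurtosis, and both function names are hypothetical.

```python
import numpy as np

def channel_stats(pixels):
    """Ten summary statistics over the finite values of one channel."""
    x = pixels[np.isfinite(pixels)].astype(np.float64)
    if x.size == 0:
        return np.full(10, np.nan)
    mean, std, median = x.mean(), x.std(ddof=0), np.median(x)
    mad = np.median(np.abs(x - median))
    p01, p99 = np.percentile(x, [1, 99])
    if std > 0:
        z = (x - mean) / std
        # Assumed definitions: standardized third moment and excess kurtosis.
        skew, kurt = np.mean(z ** 3), np.mean(z ** 4) - 3.0
    else:
        skew = kurt = 0.0  # assumption: constant channels get zero shape statistics
    return np.array([mean, std, median, mad, p01, p99, x.min(), x.max(), skew, kurt])

def stats_features(images):
    """Concatenate channel_stats over channels for each image: shape (N, C * 10)."""
    images = np.asarray(images)
    if images.ndim != 4:
        raise ValueError("images must have shape (N, H, W, C)")
    n, _, _, c = images.shape
    return np.array([
        np.concatenate([channel_stats(images[i, :, :, ch]) for ch in range(c)])
        for i in range(n)
    ])

rng = np.random.default_rng(1909)
imgs = rng.normal(size=(4, 8, 8, 2))
imgs[0, 0, 0, 0] = np.nan          # a non-finite pixel is simply skipped
stat_feats = stats_features(imgs)  # shape (4, 20)
```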

pyBIA.outlier_detection.lbp_feature_extraction(images, P: int = 8, R: int = 1)

Extract Local Binary Pattern (LBP) histograms per channel and concatenate.

Parameters:

images (ndarray, shape (N, H, W, C)) – Input image tensor. Each channel is treated independently.
P (int, optional) – Number of sampling points on the LBP circle. Default is 8.
R (int, optional) – Radius (in pixels) of the LBP circle. Default is 1.

Returns:

feats – Concatenated per-channel LBP histograms, dtype float64. For each image, the feature for channel 0 is first, then channel 1, etc.

Return type:

ndarray, shape (N, C * 2**P)

Raises:

ValueError – If images does not have 4 dimensions (N, H, W, C).
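
To show where the C * 2**P output length comes from, here is a simplified NumPy sketch of the basic LBP code for P=8, R=1. It samples the 8 neighbours on the pixel grid instead of interpolating points on a circle, ignores border pixels, and normalizes each histogram to sum to 1. All three choices, along with the function names, are assumptions for illustration rather than the library's behavior.

```python
import numpy as np

# Offsets of the 8 grid neighbours (a square-ring approximation of the R=1 circle).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_histogram(channel):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes (interior pixels)."""
    h, w = channel.shape
    centre = channel[1:h - 1, 1:w - 1]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbour = channel[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes += (neighbour >= centre).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

def lbp_features(images):
    """Concatenate per-channel LBP histograms: shape (N, C * 256)."""
    images = np.asarray(images, dtype=np.float64)
    if images.ndim != 4:
        raise ValueError("images must have shape (N, H, W, C)")
    n, _, _, c = images.shape
    return np.array([
        np.concatenate([lbp_histogram(images[i, :, :, ch]) for ch in range(c)])
        for i in range(n)
    ])

rng = np.random.default_rng(1909)
lbp_feats = lbp_features(rng.normal(size=(3, 12, 12, 2)))  # shape (3, 512)
```

Each of the 8 comparisons contributes one bit, giving 2**8 = 256 possible codes per channel.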

pyBIA.outlier_detection.fft_energy_feature_extraction(images, band_edges=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0), per_band_norm=True, window=True, stat='sum', remove_dc=True, fft_norm=None)

Compute 2D FFT radial-band energies per channel and concatenate.

Parameters:

images (ndarray, shape (N, H, W, C)) – Input cutouts. Channels are processed independently and concatenated.
band_edges (sequence of float, optional) – Strictly increasing edges within [0, 1], defining bands [edges[i], edges[i+1]), with the last band including its upper edge. Default is (0.0, 0.10, 0.25, 0.50, 0.75, 1.0).
per_band_norm (bool, optional) – If True, divide each channel’s band energies by their sum so that the per-channel features sum to 1. Default is True.
window (bool, optional) – If True, apply a separable Hann window prior to FFT. Default is True.
stat ({'sum','mean'}, optional) – Aggregation within each annulus. Default is ‘sum’.
    - ‘sum’ : sum of power (energy)
    - ‘mean’ : average power per coefficient (area-normalized)
remove_dc (bool, optional) – If True, zero the DC coefficient so features emphasize texture rather than mean flux. Default is True.
fft_norm ({None, 'ortho'}, optional) – Normalization passed to numpy.fft.fft2. Default is None.

Returns:

feats – Concatenated per-channel band features (float64). If per_band_norm=True, each channel’s bands sum to 1 for a given image.

Return type:

ndarray, shape (N, C * (len(band_edges) - 1))

Raises:

ValueError – If images does not have 4 dimensions (N, H, W, C).
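
The radial-band computation can be sketched with NumPy alone. The sketch covers only the stat='sum' path with the default fft_norm, and assumes the radial frequency is scaled so that the largest radius in the grid equals 1 and that band_edges spans [0, 1]. How the library normalizes the radius is not stated here, and the function names are hypothetical.

```python
import numpy as np

def fft_band_energies(channel, band_edges=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0),
                      per_band_norm=True, window=True, remove_dc=True):
    """Radial-band power of one 2D channel (illustrative sketch, stat='sum' only)."""
    x = np.asarray(channel, dtype=np.float64)
    h, w = x.shape
    if window:
        x = x * np.outer(np.hanning(h), np.hanning(w))  # separable Hann window
    power = np.abs(np.fft.fft2(x)) ** 2
    if remove_dc:
        power[0, 0] = 0.0  # drop the mean-flux term
    # Radial frequency per coefficient, scaled so the largest radius is 1 (assumed).
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    radius = np.hypot(fy, fx)
    radius = radius / radius.max()
    edges = np.asarray(band_edges, dtype=np.float64)
    # Bands are [edges[i], edges[i+1]); the last band also includes its upper edge.
    idx = np.clip(np.searchsorted(edges, radius, side="right") - 1, 0, len(edges) - 2)
    energies = np.bincount(idx.ravel(), weights=power.ravel(), minlength=len(edges) - 1)
    if per_band_norm:
        total = energies.sum()
        energies = energies / total if total > 0 else energies
    return energies

def fft_features(images, **kwargs):
    """Concatenate per-channel band energies: shape (N, C * (len(band_edges) - 1))."""
    images = np.asarray(images)
    n, _, _, c = images.shape
    return np.array([
        np.concatenate([fft_band_energies(images[i, :, :, ch], **kwargs) for ch in range(c)])
        for i in range(n)
    ])

rng = np.random.default_rng(1909)
fft_feats = fft_features(rng.normal(size=(3, 16, 16, 2)))  # shape (3, 10)
```

With per_band_norm=True each channel's five values sum to 1, so the features describe how power is distributed across spatial scales rather than how bright the cutout is.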