pyBIA.outlier_detection
=======================

.. py:module:: pyBIA.outlier_detection

.. autoapi-nested-parse::

   Created on Wed Aug 2 06:11:11 2023

   @author: danielgodinez


Classes
-------

.. autoapisummary::

   pyBIA.outlier_detection.Classifier


Functions
---------

.. autoapisummary::

   pyBIA.outlier_detection.hog_feature_extraction
   pyBIA.outlier_detection.wavelet_energy_feature_extraction
   pyBIA.outlier_detection.statistical_feature_extraction
   pyBIA.outlier_detection.lbp_feature_extraction
   pyBIA.outlier_detection.fft_energy_feature_extraction


Module Contents
---------------

.. py:class:: Classifier(data=None, normalize=False, min_pixel=0, max_pixel=10, img_num_channels=1, feat_set='hog', clf='iforest', impute=True, imp_method='knn', scale_features=False, scaler_type='robust', apply_pca=False, pca_components=None, SEED_NO=1909)

   Build and apply an ensemble outlier-detection classifier on image cutouts.

   The classifier workflow supports optional min–max normalization, feature
   extraction (HOG, LBP, FFT, Wavelet, or simple statistics), optional imputation
   of missing values, and model fitting using an Isolation Forest (`clf='iforest'`).

   If multiple feature sets are provided, an independent pipeline (imputer, scaler,
   PCA, and model) is trained for each feature set. Predictions are made either by
   averaging the anomaly scores across all independent models or by selecting only
   the most anomalous score across all models.

   :param data: Image tensor with shape (N, H, W, C).
   :type data: ndarray or None, optional
   :param normalize: If True, min–max normalize each image/channel before feature extraction.
   :type normalize: bool, optional
   :param min_pixel: Lower bound for min–max normalization.
   :type min_pixel: float, optional
   :param max_pixel: Upper bound for min–max normalization.
   :type max_pixel: float, optional
   :param img_num_channels: Number of channels in the input tensor.
   :type img_num_channels: int, optional
   :param feat_set: Feature family (or families) to compute for training.
                    Options: 'hog','lbp','fft','wavelet','stats'.
   :type feat_set: str or list/tuple of str, optional
   :param clf: Classifier to train. Currently only Isolation Forest is supported.
   :type clf: {'iforest'}, optional
   :param impute: If True, impute missing feature values before fitting/predicting.
   :type impute: bool, optional
   :param imp_method: Imputation strategy used by `impute_missing_values`.
   :type imp_method: {'knn','mean','median','mode','constant'}, optional
   :param scale_features: If True, scales extracted features.
   :type scale_features: bool, optional
   :param scaler_type: Type of scaler to use. Default is 'robust'.
   :type scaler_type: {'robust', 'standard'}, optional
   :param apply_pca: If True, performs Principal Component Analysis on the extracted features.
   :type apply_pca: bool, optional
   :param pca_components: Number of components to keep.
   :type pca_components: int or float, optional
   :param SEED_NO: Random seed used for model initialization. Default is 1909.
   :type SEED_NO: int, optional

   .. attribute:: models

      Dictionary of trained models keyed by feature name.

      :type: dict

   .. attribute:: imputers

      Dictionary of fitted imputers keyed by feature name.

      :type: dict

   .. attribute:: scalers

      Dictionary of fitted scalers keyed by feature name.

      :type: dict

   .. attribute:: pcas

      Dictionary of fitted PCA models keyed by feature name.

      :type: dict


   .. py:attribute:: data
      :value: None


   .. py:attribute:: normalize
      :value: False


   .. py:attribute:: min_pixel
      :value: 0


   .. py:attribute:: max_pixel
      :value: 10


   .. py:attribute:: img_num_channels
      :value: 1


   .. py:attribute:: clf
      :value: 'iforest'


   .. py:attribute:: impute
      :value: True


   .. py:attribute:: imp_method
      :value: 'knn'


   .. py:attribute:: scale_features
      :value: False


   .. py:attribute:: scaler_type
      :value: 'robust'


   .. py:attribute:: apply_pca
      :value: False


   .. py:attribute:: pca_components
      :value: None


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:attribute:: models


   .. py:attribute:: imputers


   .. py:attribute:: scalers


   .. py:attribute:: pcas


   .. py:method:: _extract_single_feature(data, feat: str) -> numpy.ndarray

      Extract a single feature matrix.

      :param data: Input image tensor of shape (N, H, W, C).
      :type data: ndarray
      :param feat: Feature extraction method to apply:

                   - 'hog' : Histogram of Oriented Gradients
                   - 'lbp' : Local Binary Patterns
                   - 'fft' : Fourier-based energy features
                   - 'wavelet' : Wavelet energy features
                   - 'stats' : Simple statistical features
      :type feat: {'hog', 'lbp', 'fft', 'wavelet', 'stats'}

      :returns: **f_data** -- Extracted feature matrix of shape (N, D), where D depends
                on the selected feature type.
      :rtype: ndarray


   .. py:method:: create(n_estimators=100, max_samples='auto', contamination='auto', max_features=1.0)

      Initialize, featurize, optionally impute, and fit the classifier.

      This method instantiates the model, optionally normalizes the data using the
      min–max bounds, extracts the features, replaces infs with NaNs, and then
      optionally imputes missing values. The model is then fitted on the resulting
      feature matrix. The optional arguments are iForest hyperparameters, which by
      default are the scikit-learn defaults.

      :param n_estimators: Number of trees to fit. Defaults to 100.
      :type n_estimators: int
      :param max_samples: The number of training instances drawn to train each tree.
                          Defaults to 'auto'.
      :type max_samples: 'auto' or int
      :param contamination: The expected proportion of outliers in the training data,
                            which sets the anomaly score threshold. Defaults to 'auto'.
      :type contamination: 'auto' or float
      :param max_features: The number (or proportion if float) of features to draw from
                           the feature matrix when training each tree. Defaults to 1.0.
      :type max_features: int or float

      :rtype: None

      :raises ValueError: If an unsupported `clf` is requested.
      :raises ValueError: If `impute=False` and the feature matrix contains NaNs or infs.


   .. py:method:: save(dirname=None, path=None, overwrite=False)

      Save the trained model (and imputer/scaler/pca if present) to disk.

      Creates a directory `pyBIA_outlier_model` under `path[/dirname]/` and writes
      the IsolationForest model and the fitted imputer/scaler/pca, if applicable.

      :param dirname: Optional subdirectory to create inside `path`. Must not already exist.
      :type dirname: str or None, optional
      :param path: Base directory where the model folder will be saved. If None, uses
                   the user's home directory.
      :type path: str or None, optional
      :param overwrite: If True and `pyBIA_outlier_model` exists, delete its contents
                        and recreate it. If False, raise if the folder exists.
                        Default is False.
      :type overwrite: bool, optional

      :rtype: None

      :raises ValueError: If no artifacts are available to save (e.g., model not created).
      :raises ValueError: If attempting to create an existing directory without `overwrite=True`.


   .. py:method:: load(path=None)

      Load a saved model (and imputer/scaler/pca if present) from disk.

      Looks for a folder named `pyBIA_outlier_model` under `path` (or the user’s
      home directory if `path` is None) and attempts to load `Model` and `Imputer`
      artifacts into `self.model` and `self.imputer`.

      :param path: Base directory containing `pyBIA_outlier_model/`. If None, uses
                   the user's home directory.
      :type path: str or None, optional

      :rtype: None


   .. py:method:: predict(data, ensemble_method='strict')

      Predict outlier/inlier labels via ensemble aggregation.

      :param data: Image tensor with shape (N, H, W, C).
      :type data: ndarray
      :param ensemble_method: How to combine the scores from the independent models.
                              'average' computes the mean of the scores.
                              'strict' takes the minimum score, therefore if any model
                              flags the sample as an anomaly, it will be marked as an
                              anomaly. Default is 'strict'.
      :type ensemble_method: {'average', 'strict'}, optional


.. py:function:: hog_feature_extraction(images, return_image=False, max_pool=False)

   Extract Histogram of Oriented Gradients (HOG) features per channel.

   :param images: Input tensor of shape (N, H, W, C), where N is the number of images,
                  H×W are spatial dimensions, and C is the number of channels.
   :type images: ndarray
   :param return_image: If True, also return the HOG visualization images (per channel),
                        stacked along the last axis. Default is False.
   :type return_image: bool, optional
   :param max_pool: If True, apply global max pooling to each per-channel HOG feature
                    vector (i.e., keep only its maximum value). The resulting feature
                    for each image has shape (C,). If False, per-channel feature
                    vectors are concatenated. Default is False.
   :type max_pool: bool, optional

   :returns: * **hog_features** (*ndarray*) -- If `max_pool=False`: array of shape (N, D),
               where D is the sum of HOG feature lengths across channels (concatenated).
               If `max_pool=True`: array of shape (N, C), one scalar per channel.
             * **hog_images** (*ndarray, optional*) -- Returned only if `return_image=True`.
               Array of shape (N, H, W, C), containing per-channel HOG visualizations
               rescaled to display range.

   :raises ValueError: If `images` does not have 4 dimensions (N, H, W, C).

   .. rubric:: Notes

   - Each channel is treated as a grayscale image (`channel_axis=None`).
   - HOG parameters are the scikit-image defaults (orientations, pixels per cell,
     cells per block, block normalization).


.. py:function:: wavelet_energy_feature_extraction(images: List[numpy.ndarray], wavelet: str = 'db4', level: Optional[int] = None, mode: str = 'symmetric', stat: str = 'sum', log_scale: bool = True, normalize: bool = False, eps: float = 1e-10) -> numpy.ndarray

   Compute per-subband wavelet energies per channel and concatenate.

   :param images: Iterable of images with shape `(H, W, C)` or an array with shape
                  `(N, H, W, C)`. Iteration is over the first dimension.
   :type images: sequence of ndarray or ndarray
   :param wavelet: Wavelet name for PyWavelets. Default is 'db4'.
   :type wavelet: str, optional
   :param level: Decomposition level `L`. If None, uses the maximum level allowed by
                 the image size and wavelet filter length. Default is None.
   :type level: int or None, optional
   :param mode: Boundary extension mode passed to `pywt.wavedec2`. Default is 'symmetric'.
   :type mode: str, optional
   :param stat: Aggregation for each subband:

                - 'sum' : sum of squares (energy)
                - 'mean' : energy per coefficient (area-normalized)

                Default is 'sum'.
   :type stat: {'sum','mean'}, optional
   :param log_scale: If True, apply `log(energy + eps)` to each subband value.
                     Default is True.
   :type log_scale: bool, optional
   :param normalize: If True, divide all subband values in a channel by that channel's
                     total (after `stat`), for relative energies. Default is False.
   :type normalize: bool, optional
   :param eps: Small constant used in log/normalization to avoid division by zero
               and `log(0)`. Default is 1e-10.
   :type eps: float, optional

   :returns: **feats** -- Wavelet-energy feature matrix. For each image (N) and channel (C),
             the feature length is `1 + 3L` (one approximation band + three detail
             bands per level), concatenated across channels.
   :rtype: ndarray, shape (N, C * (1 + 3L))


.. py:function:: statistical_feature_extraction(images: numpy.ndarray) -> numpy.ndarray

   Compute global statistics and simple texture descriptors per channel.

   For each image channel, the following 10 features are computed over finite
   pixels only and concatenated across channels:

   1) mean
   2) std (population, ddof=0)
   3) median
   4) median absolute deviation (MAD)
   5) 1st percentile (p01)
   6) 99th percentile (p99)
   7) min
   8) max
   9) skewness
   10) kurtosis

   :param images: Image tensor (floats). Non-finite values (NaN/±inf) are ignored
                  when computing per-channel statistics.
   :type images: ndarray, shape (N, H, W, C)

   :returns: **feats** -- Per-image feature matrix, dtype float64. Features are ordered
             as listed above for channel 0, then channel 1, etc.
   :rtype: ndarray, shape (N, C * 10)

   :raises ValueError: If `images` does not have 4 dimensions (N, H, W, C).


.. py:function:: lbp_feature_extraction(images, P: int = 8, R: int = 1)

   Extract Local Binary Pattern (LBP) histograms per channel and concatenate.

   :param images: Input image tensor. Each channel is treated independently.
   :type images: ndarray, shape (N, H, W, C)
   :param P: Number of sampling points on the LBP circle. Default is 8.
   :type P: int, optional
   :param R: Radius (in pixels) of the LBP circle. Default is 1.
   :type R: int, optional

   :returns: **feats** -- Concatenated per-channel LBP histograms, dtype float64. For each
             image, the feature for channel 0 is first, then channel 1, etc.
   :rtype: ndarray, shape (N, C * 2**P)

   :raises ValueError: If `images` does not have 4 dimensions (N, H, W, C).


.. py:function:: fft_energy_feature_extraction(images, band_edges=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0), per_band_norm=True, window=True, stat='sum', remove_dc=True, fft_norm=None)

   Compute 2D FFT radial-band energies per channel and concatenate.

   :param images: Input cutouts. Channels are processed independently and concatenated.
   :type images: ndarray, shape (N, H, W, C)
   :param band_edges: Strictly increasing edges within [0, 1], defining bands
                      `[edges[i], edges[i+1])`, with the last band including its upper
                      edge. Default is (0.0, 0.10, 0.25, 0.50, 0.75, 1.0).
   :type band_edges: sequence of float, optional
   :param per_band_norm: If True, divide each channel’s band energies by their sum so
                         that the per-channel features sum to 1. Default is True.
   :type per_band_norm: bool, optional
   :param window: If True, apply a separable Hann window prior to FFT. Default is True.
   :type window: bool, optional
   :param stat: Aggregation within each annulus:

                - 'sum' : sum of power (energy)
                - 'mean' : average power per coefficient (area-normalized)

                Default is 'sum'.
   :type stat: {'sum','mean'}, optional
   :param remove_dc: If True, zero the DC coefficient so features emphasize texture
                     rather than mean flux. Default is True.
   :type remove_dc: bool, optional
   :param fft_norm: Normalization passed to `numpy.fft.fft2`. Default is None.
   :type fft_norm: {None, 'ortho'}, optional

   :returns: **feats** -- Concatenated per-channel band features (float64). If
             `per_band_norm=True`, each channel’s bands sum to 1 for a given image.
   :rtype: ndarray, shape (N, C * (len(band_edges) - 1))

   :raises ValueError: If `images` is not (N, H, W, C).
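
As a rough illustration of the two `ensemble_method` options described for `Classifier.predict`, the sketch below combines per-model anomaly scores in plain Python. It is not pyBIA's implementation. The helper names `combine_scores` and `to_labels`, the example scores, and the labelling rule (scores below 0 become -1 for outliers and 1 otherwise, mirroring the scikit-learn `IsolationForest` convention) are all assumptions made for illustration.

```python
from statistics import mean


def combine_scores(scores_by_feature, ensemble_method="strict"):
    """Combine per-model anomaly scores into one score per sample.

    scores_by_feature maps a feature name (e.g. 'hog') to a list of scores,
    one per sample. Lower scores are assumed to mean "more anomalous".
    """
    if not scores_by_feature:
        raise ValueError("scores from at least one model are required")
    # Regroup so that each tuple holds every model's score for one sample.
    per_sample = list(zip(*scores_by_feature.values()))
    if ensemble_method == "average":
        return [mean(s) for s in per_sample]
    if ensemble_method == "strict":
        return [min(s) for s in per_sample]
    raise ValueError(f"unknown ensemble_method: {ensemble_method!r}")


def to_labels(scores, threshold=0.0):
    """Label scores below the threshold as outliers (-1), others as inliers (1)."""
    return [-1 if s < threshold else 1 for s in scores]


# Hypothetical scores for three samples from two independently trained models.
scores = {
    "hog": [0.5, -0.25, 0.25],
    "fft": [0.25, 0.75, -0.75],
}

strict_labels = to_labels(combine_scores(scores, "strict"))
average_labels = to_labels(combine_scores(scores, "average"))
print(strict_labels)   # [1, -1, -1]
print(average_labels)  # [1, 1, -1]
```

With 'strict', the second sample is flagged because a single model scores it below the threshold. With 'average', the other model's higher score outweighs it and the sample is kept as an inlier.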