pyBIA.outlier_detection
=======================

.. py:module:: pyBIA.outlier_detection

.. autoapi-nested-parse::

   Created on Wed Aug 2 06:11:11 2023

   @author: danielgodinez


Classes
-------

.. autoapisummary::

   pyBIA.outlier_detection.Classifier


Functions
---------

.. autoapisummary::

   pyBIA.outlier_detection.hog_feature_extraction
   pyBIA.outlier_detection.wavelet_energy_feature_extraction
   pyBIA.outlier_detection.statistical_feature_extraction
   pyBIA.outlier_detection.lbp_feature_extraction
   pyBIA.outlier_detection.fft_energy_feature_extraction


Module Contents
---------------

.. py:class:: Classifier(data=None, normalize=False, min_pixel=0, max_pixel=10, img_num_channels=1, feat_set='hog', clf='iforest', impute=True, imp_method='knn', scale_features=False, scaler_type='robust', apply_pca=False, pca_components=None, SEED_NO=1909)

   Build and apply an ensemble outlier-detection classifier on image cutouts.

   The classifier workflow supports optional min–max normalization, feature
   extraction (HOG, LBP, FFT, Wavelet, or simple statistics), optional
   imputation of missing values, and model fitting using an Isolation Forest
   (`clf='iforest'`).

   If multiple feature sets are provided, an independent pipeline (imputer,
   scaler, PCA, and model) is trained for each feature set. Predictions are
   made either by averaging the anomaly scores across all independent models
   or by selecting the most anomalous score across all models.

   :param data: Image tensor with shape (N, H, W, C).
   :type data: ndarray or None, optional
   :param normalize: If True, min–max normalize each image/channel before feature extraction.
   :type normalize: bool, optional
   :param min_pixel: Lower bound for min–max normalization.
   :type min_pixel: float, optional
   :param max_pixel: Upper bound for min–max normalization.
   :type max_pixel: float, optional
   :param img_num_channels: Number of channels in the input tensor.
   :type img_num_channels: int, optional
   :param feat_set: Feature family (or families) to compute for training.
                    Options: 'hog','lbp','fft','wavelet','stats'.
   :type feat_set: str or list/tuple of str, optional
   :param clf: Classifier to train. Currently only Isolation Forest is supported.
   :type clf: {'iforest'}, optional
   :param impute: If True, impute missing feature values before fitting/predicting.
   :type impute: bool, optional
   :param imp_method: Imputation strategy used by `impute_missing_values`.
   :type imp_method: {'knn','mean','median','mode','constant'}, optional
   :param scale_features: If True, scales extracted features.
   :type scale_features: bool, optional
   :param scaler_type: Type of scaler to use. Default is 'robust'.
   :type scaler_type: {'robust', 'standard'}, optional
   :param apply_pca: If True, performs Principal Component Analysis on the extracted features.
   :type apply_pca: bool, optional
   :param pca_components: Number of components to keep.
   :type pca_components: int or float, optional
   :param SEED_NO: Random seed used for model initialization. Default is 1909.
   :type SEED_NO: int, optional

   .. attribute:: models

      Dictionary of trained models keyed by feature name.

      :type: dict

   .. attribute:: imputers

      Dictionary of fitted imputers keyed by feature name.

      :type: dict

   .. attribute:: scalers

      Dictionary of fitted scalers keyed by feature name.

      :type: dict

   .. attribute:: pcas

      Dictionary of fitted PCA models keyed by feature name.

      :type: dict


   .. py:attribute:: data
      :value: None


   .. py:attribute:: normalize
      :value: False


   .. py:attribute:: min_pixel
      :value: 0


   .. py:attribute:: max_pixel
      :value: 10


   .. py:attribute:: img_num_channels
      :value: 1


   .. py:attribute:: clf
      :value: 'iforest'


   .. py:attribute:: impute
      :value: True


   .. py:attribute:: imp_method
      :value: 'knn'


   .. py:attribute:: scale_features
      :value: False


   .. py:attribute:: scaler_type
      :value: 'robust'


   .. py:attribute:: apply_pca
      :value: False


   .. py:attribute:: pca_components
      :value: None


   .. py:attribute:: SEED_NO
      :value: 1909


   .. py:attribute:: models


   .. py:attribute:: imputers


   .. py:attribute:: scalers


   .. py:attribute:: pcas


   .. py:method:: _extract_single_feature(data, feat: str) -> numpy.ndarray

      Extract a single feature matrix.

      :param data: Input image tensor of shape (N, H, W, C).
      :type data: ndarray
      :param feat: Feature extraction method to apply:

                   - 'hog' : Histogram of Oriented Gradients
                   - 'lbp' : Local Binary Patterns
                   - 'fft' : Fourier-based energy features
                   - 'wavelet' : Wavelet energy features
                   - 'stats' : Simple statistical features
      :type feat: {'hog', 'lbp', 'fft', 'wavelet', 'stats'}

      :returns: **f_data** -- Extracted feature matrix of shape (N, D), where D depends on the
                selected feature type.
      :rtype: ndarray


   .. py:method:: create(n_estimators=100, max_samples='auto', contamination='auto', max_features=1.0)

      Initialize, featurize, optionally impute, and fit the classifier.

      This method instantiates the model, optionally normalizes the data using the min–max bounds,
      extracts the features, replaces infinite values with NaN, and then optionally imputes missing values.
      The model is then fitted on the resulting feature matrix. The optional arguments are
      Isolation Forest hyperparameters, whose defaults match the scikit-learn defaults.

      :param n_estimators: Number of trees to fit. Defaults to 100.
      :type n_estimators: int
      :param max_samples: Number of training instances drawn to fit each tree. Defaults to 'auto'.
      :type max_samples: 'auto' or int
      :param contamination: Expected proportion of outliers in the training data, which sets the
                            anomaly-score threshold. Defaults to 'auto'.
      :type contamination: 'auto' or float
      :param max_features: Number (or proportion, if float) of features drawn from the feature matrix when training the model.
                           Defaults to 1.0.
      :type max_features: int or float

      :rtype: None

      :raises ValueError: If an unsupported `clf` is requested.
      :raises ValueError: If `impute=False` and the feature matrix contains NaNs or infs.
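
      The inf-to-NaN replacement and the `impute=False` check described above can be
      sketched with NumPy alone. This is a minimal illustration, not the library's
      implementation: `sanitize_features` is a hypothetical helper, and column-mean
      imputation is only a stand-in for whichever `imp_method` is configured.

      .. code-block:: python

         import numpy as np


         def sanitize_features(features, impute=True):
             """Replace +/-inf with NaN, then either impute or reject missing values.

             Hypothetical helper: column-mean imputation is used here only as a
             simple stand-in for the configurable imputation strategy.
             """
             feats = np.array(features, dtype=np.float64)  # copy; input stays untouched
             feats[np.isinf(feats)] = np.nan

             missing = np.isnan(feats)
             if not missing.any():
                 return feats
             if not impute:
                 raise ValueError("Feature matrix contains NaN/inf values and impute=False.")

             # Column means are computed over the finite entries only.
             col_means = np.nanmean(feats, axis=0)
             rows, cols = np.nonzero(missing)
             feats[rows, cols] = col_means[cols]
             return feats


         raw = np.array([[1.0, np.inf], [3.0, 4.0], [np.nan, 8.0]])
         clean = sanitize_features(raw)

      Here the `inf` in the second column and the `NaN` in the first column are both
      replaced by the mean of the remaining finite values in their column.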


   .. py:method:: save(dirname=None, path=None, overwrite=False)

      Save the trained model (and imputer/scaler/pca if present) to disk.

      Creates a directory `pyBIA_outlier_model` under `path[/dirname]/` and
      writes the IsolationForest model and the fitted imputer/scaler/pca, if applicable.

      :param dirname: Optional subdirectory to create inside `path`. Must not already exist.
      :type dirname: str or None, optional
      :param path: Base directory where the model folder will be saved. If None, uses the
                   user's home directory.
      :type path: str or None, optional
      :param overwrite: If True and `pyBIA_outlier_model` exists, delete its contents and
                        recreate it. If False, raise if the folder exists. Default is False.
      :type overwrite: bool, optional

      :rtype: None

      :raises ValueError: If no artifacts are available to save (e.g., model not created).
      :raises ValueError: If attempting to create an existing directory without `overwrite=True`.


   .. py:method:: load(path=None)

      Load a saved model (and imputer/scaler/pca if present) from disk.

      Looks for a folder named `pyBIA_outlier_model` under `path` (or the user’s
      home directory if `path` is None) and attempts to load the saved model and
      any imputer, scaler, or PCA artifacts it contains.

      :param path: Base directory containing `pyBIA_outlier_model/`. If None, uses the user's home directory.
      :type path: str or None, optional

      :rtype: None


   .. py:method:: predict(data, ensemble_method='strict')

      Predict outlier/inlier labels via ensemble aggregation.

      :param data: Image tensor with shape (N, H, W, C).
      :type data: ndarray
      :param ensemble_method: How to combine the scores from the independent models.
                              'average' computes the mean of the scores.
                              'strict' takes the minimum score, so a sample is marked
                              as an anomaly if any single model flags it.
                              Default is 'strict'.
      :type ensemble_method: {'average', 'strict'}, optional
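
      The two aggregation rules can be sketched as follows. This is an illustrative
      sketch rather than the method's actual code: `combine_scores` is a hypothetical
      helper, and it assumes the scikit-learn convention in which lower scores are
      more anomalous, negative scores mark outliers, and labels are -1 (outlier) or
      1 (inlier).

      .. code-block:: python

         import numpy as np


         def combine_scores(scores, ensemble_method="strict"):
             """Combine per-model anomaly scores of shape (n_models, N) into labels.

             Assumes lower scores are more anomalous and that negative scores mark
             outliers, as in scikit-learn's IsolationForest.decision_function.
             """
             scores = np.asarray(scores, dtype=np.float64)
             if ensemble_method == "average":
                 combined = scores.mean(axis=0)
             elif ensemble_method == "strict":
                 combined = scores.min(axis=0)
             else:
                 raise ValueError("ensemble_method must be 'average' or 'strict'.")
             labels = np.where(combined < 0, -1, 1)  # -1 = outlier, 1 = inlier
             return labels, combined


         # Two models score three samples; only the second model flags sample 1.
         scores = [[0.10, 0.05, -0.20],
                   [0.20, -0.01, -0.30]]
         strict_labels, _ = combine_scores(scores, "strict")
         average_labels, _ = combine_scores(scores, "average")

      With these scores, 'strict' marks samples 1 and 2 as outliers, while 'average'
      marks only sample 2, because the mean score for sample 1 stays positive.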


.. py:function:: hog_feature_extraction(images, return_image=False, max_pool=False)

   Extract Histogram of Oriented Gradients (HOG) features per channel.

   :param images: Input tensor of shape (N, H, W, C), where N is the number of images,
                  H×W are spatial dimensions, and C is the number of channels.
   :type images: ndarray
   :param return_image: If True, also return the HOG visualization images (per channel), stacked
                        along the last axis. Default is False.
   :type return_image: bool, optional
   :param max_pool: If True, apply global max pooling to each per-channel HOG feature vector
                    (i.e., keep only its maximum value). The resulting feature for each image
                    has shape (C,). If False, per-channel feature vectors are concatenated.
                    Default is False.
   :type max_pool: bool, optional

   :returns: * **hog_features** (*ndarray*) -- If `max_pool=False`: array of shape (N, D), where D is the sum of HOG
               feature lengths across channels (concatenated).
               If `max_pool=True`: array of shape (N, C), one scalar per channel.
             * **hog_images** (*ndarray, optional*) -- Returned only if `return_image=True`. Array of shape (N, H, W, C),
               containing per-channel HOG visualizations rescaled to display range.

   :raises ValueError: If `images` does not have 4 dimensions (N, H, W, C).

   .. rubric:: Notes

   - Each channel is treated as a grayscale image (`channel_axis=None`).
   - HOG parameters are the scikit-image defaults (orientations, pixels per cell, cells per block, block normalization).
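
   The difference between the two output layouts can be illustrated with plain
   NumPy. This is a sketch only: `combine_channel_features` is a hypothetical
   helper, and the random vectors stand in for real per-channel HOG descriptors.

   .. code-block:: python

      import numpy as np


      def combine_channel_features(per_channel, max_pool=False):
          """Merge one image's per-channel descriptor vectors into one feature row."""
          if max_pool:
              # Global max pooling: keep a single scalar per channel -> shape (C,)
              return np.array([np.max(vec) for vec in per_channel], dtype=np.float64)
          # Otherwise concatenate the full vectors -> shape (D,), D = sum of lengths
          return np.concatenate([np.asarray(vec, dtype=np.float64) for vec in per_channel])


      rng = np.random.default_rng(0)
      per_channel = [rng.random(36), rng.random(36), rng.random(36)]  # C = 3
      pooled = combine_channel_features(per_channel, max_pool=True)
      stacked = combine_channel_features(per_channel)

   For three channels with 36-element descriptors, `pooled` has shape (3,) and
   `stacked` has shape (108,).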


.. py:function:: wavelet_energy_feature_extraction(images: List[numpy.ndarray], wavelet: str = 'db4', level: Optional[int] = None, mode: str = 'symmetric', stat: str = 'sum', log_scale: bool = True, normalize: bool = False, eps: float = 1e-10) -> numpy.ndarray

   Compute per-subband wavelet energies per channel and concatenate.

   :param images: Iterable of images with shape `(H, W, C)` or an array with shape
                  `(N, H, W, C)`. Iteration is over the first dimension.
   :type images: sequence of ndarray or ndarray
   :param wavelet: Wavelet name for PyWavelets. Default is 'db4'.
   :type wavelet: str, optional
   :param level: Decomposition level `L`. If None, uses the maximum level allowed by the
                 image size and wavelet filter length. Default is None.
   :type level: int or None, optional
   :param mode: Boundary extension mode passed to `pywt.wavedec2`. Default is 'symmetric'.
   :type mode: str, optional
   :param stat: Aggregation for each subband:

                - 'sum' : sum of squares (energy)
                - 'mean' : energy per coefficient (area-normalized)

                Default is 'sum'.
   :type stat: {'sum','mean'}, optional
   :param log_scale: If True, apply `log(energy + eps)` to each subband value. Default is True.
   :type log_scale: bool, optional
   :param normalize: If True, divide all subband values in a channel by that channel's total
                     (after `stat`), for relative energies. Default is False.
   :type normalize: bool, optional
   :param eps: Small constant used in log/normalization to avoid division by zero and
               `log(0)`. Default is 1e-10.
   :type eps: float, optional

   :returns: **feats** -- Wavelet-energy feature matrix. For each image (N) and channel (C), the
             feature length is `1 + 3L` (one approximation band + three detail bands
             per level), concatenated across channels.
   :rtype: ndarray, shape (N, C * (1 + 3L))
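
   The `1 + 3L` feature length follows from the layout of a 2D multilevel
   decomposition: one approximation band plus three detail bands per level. The
   sketch below computes subband energies for one channel from coefficients in
   the `[cA, (cH, cV, cD), ...]` layout that `pywt.wavedec2` returns. It is an
   illustration under that assumption, not the function's actual code;
   `subband_energies` is a hypothetical helper, and synthetic arrays stand in for
   real coefficients so that only NumPy is required.

   .. code-block:: python

      import numpy as np


      def subband_energies(coeffs, stat="sum", log_scale=True, eps=1e-10):
          """Energies for coefficients laid out as [cA, (cH, cV, cD), ...]."""
          bands = [coeffs[0]]  # approximation band
          for detail_level in coeffs[1:]:
              bands.extend(detail_level)  # three detail bands per level

          aggregate = np.sum if stat == "sum" else np.mean
          energies = np.array(
              [aggregate(np.square(np.asarray(band, dtype=np.float64))) for band in bands]
          )
          return np.log(energies + eps) if log_scale else energies


      # Synthetic two-level decomposition (L = 2) of a single channel.
      rng = np.random.default_rng(0)
      coeffs = [rng.normal(size=(4, 4)),
                tuple(rng.normal(size=(4, 4)) for _ in range(3)),
                tuple(rng.normal(size=(8, 8)) for _ in range(3))]
      feats = subband_energies(coeffs)

   With `L = 2` the result has `1 + 3 * 2 = 7` entries for the channel.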


.. py:function:: statistical_feature_extraction(images: numpy.ndarray) -> numpy.ndarray

   Compute global statistics and simple texture descriptors per channel.

   For each image channel, the following 10 features are computed over finite
   pixels only and concatenated across channels:

   1) mean
   2) std (population, ddof=0)
   3) median
   4) median absolute deviation (MAD)
   5) 1st percentile (p01)
   6) 99th percentile (p99)
   7) min
   8) max
   9) skewness
   10) kurtosis

   :param images: Image tensor (floats). Non-finite values (NaN/±inf) are ignored when
                  computing per-channel statistics.
   :type images: ndarray, shape (N, H, W, C)

   :returns: **feats** -- Per-image feature matrix, dtype float64. Features are ordered as listed
             above for channel 0, then channel 1, etc.
   :rtype: ndarray, shape (N, C * 10)

   :raises ValueError: If `images` does not have 4 dimensions (N, H, W, C).
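
   The per-channel computation can be sketched with NumPy as shown below. Several
   details are assumptions that the description above does not confirm: the MAD
   is taken as the unscaled median of absolute deviations from the median,
   skewness and kurtosis are the standardized third and fourth moments, and
   kurtosis is reported as excess kurtosis. `channel_statistics` is a
   hypothetical helper, not part of the module.

   .. code-block:: python

      import numpy as np


      def channel_statistics(channel):
          """Ten summary statistics over the finite pixels of one 2D channel.

          Assumptions: unscaled MAD, moment-based skewness, and excess kurtosis
          (a normal distribution gives 0).
          """
          values = np.asarray(channel, dtype=np.float64).ravel()
          values = values[np.isfinite(values)]  # drop NaN and +/-inf

          mean = values.mean()
          std = values.std()  # population standard deviation (ddof=0)
          median = np.median(values)
          mad = np.median(np.abs(values - median))
          p01, p99 = np.percentile(values, [1, 99])

          # Guard against constant channels, where the moments are undefined.
          z = (values - mean) / std if std > 0 else np.zeros_like(values)
          skewness = np.mean(z ** 3)
          kurtosis = np.mean(z ** 4) - 3.0 if std > 0 else 0.0

          return np.array([mean, std, median, mad, p01, p99,
                           values.min(), values.max(), skewness, kurtosis])


      channel = np.array([[1.0, 2.0, np.nan], [3.0, 4.0, np.inf]])
      stats = channel_statistics(channel)

   Only the four finite pixels (1, 2, 3, 4) contribute, so the mean and median
   are both 2.5 and the MAD is 1.0.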


.. py:function:: lbp_feature_extraction(images, P: int = 8, R: int = 1)

   Extract Local Binary Pattern (LBP) histograms per channel and concatenate.

   :param images: Input image tensor. Each channel is treated independently.
   :type images: ndarray, shape (N, H, W, C)
   :param P: Number of sampling points on the LBP circle. Default is 8.
   :type P: int, optional
   :param R: Radius (in pixels) of the LBP circle. Default is 1.
   :type R: int, optional

   :returns: **feats** -- Concatenated per-channel LBP histograms, dtype float64.
             For each image, the feature for channel 0 is first, then channel 1, etc.
   :rtype: ndarray, shape (N, C * 2**P)

   :raises ValueError: If `images` does not have 4 dimensions (N, H, W, C).


.. py:function:: fft_energy_feature_extraction(images, band_edges=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0), per_band_norm=True, window=True, stat='sum', remove_dc=True, fft_norm=None)

   Compute 2D FFT radial-band energies per channel and concatenate.

   :param images: Input cutouts. Channels are processed independently and concatenated.
   :type images: ndarray, shape (N, H, W, C)
   :param band_edges: Strictly increasing edges within [0, 1], defining bands
                      `[edges[i], edges[i+1])`, with the last band including its upper edge.
                      Default is (0.0, 0.10, 0.25, 0.50, 0.75, 1.0).
   :type band_edges: sequence of float, optional
   :param per_band_norm: If True, divide each channel’s band energies by their sum so that the
                         per-channel features sum to 1. Default is True.
   :type per_band_norm: bool, optional
   :param window: If True, apply a separable Hann window prior to FFT. Default is True.
   :type window: bool, optional
   :param stat: Aggregation within each annulus:

                - 'sum'  : sum of power (energy)
                - 'mean' : average power per coefficient (area-normalized)

                Default is 'sum'.
   :type stat: {'sum','mean'}, optional
   :param remove_dc: If True, zero the DC coefficient so features emphasize texture rather
                     than mean flux. Default is True.
   :type remove_dc: bool, optional
   :param fft_norm: Normalization passed to `numpy.fft.fft2`. Default is None.
   :type fft_norm: {None, 'ortho'}, optional

   :returns: **feats** -- Concatenated per-channel band features (float64). If `per_band_norm=True`,
             each channel’s bands sum to 1 for a given image.
   :rtype: ndarray, shape (N, C * (len(band_edges) - 1))

   :raises ValueError: If `images` is not (N, H, W, C).
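
   A minimal NumPy sketch of the radial-band computation for a single channel is
   given below. It is illustrative only and not the function's actual code:
   `radial_band_energies` is a hypothetical helper, it implements only the 'sum'
   statistic, and it assumes that radii are divided by the largest radius in the
   frequency grid so that they span [0, 1] (the library may normalize
   differently).

   .. code-block:: python

      import numpy as np


      def radial_band_energies(channel, band_edges=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0),
                               per_band_norm=True, window=True, remove_dc=True):
          """Sum FFT power inside normalized radial-frequency bands for one channel.

          Assumes band_edges spans [0, 1] and that radii are normalized by the
          largest radius in the frequency grid.
          """
          img = np.asarray(channel, dtype=np.float64)
          h, w = img.shape
          if window:
              img = img * np.outer(np.hanning(h), np.hanning(w))  # separable Hann

          power = np.abs(np.fft.fft2(img)) ** 2
          if remove_dc:
              power[0, 0] = 0.0  # the DC term sits at index (0, 0) before any shift

          fy = np.fft.fftfreq(h)[:, None]
          fx = np.fft.fftfreq(w)[None, :]
          radius = np.hypot(fy, fx)
          radius = radius / radius.max()

          edges = np.asarray(band_edges, dtype=np.float64)
          # Assign every coefficient to a band; the last band keeps its upper edge.
          band_index = np.clip(np.searchsorted(edges, radius, side="right") - 1,
                               0, len(edges) - 2)
          energies = np.bincount(band_index.ravel(), weights=power.ravel(),
                                 minlength=len(edges) - 1)
          if per_band_norm:
              total = energies.sum()
              energies = energies / total if total > 0 else energies
          return energies


      rng = np.random.default_rng(0)
      feats = radial_band_energies(rng.normal(size=(32, 32)))

   With the default edges the result has five entries per channel, and with
   `per_band_norm=True` they sum to 1.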