pyBIA.feature_selection
=======================

.. py:module:: pyBIA.feature_selection


Classes
-------

.. autoapisummary::

   pyBIA.feature_selection.BorutaSHAP


Module Contents
---------------

.. py:class:: BorutaSHAP(model=None, importance_measure='Shap', classification=True, percentile=100, pvalue=0.05)

   Feature selection wrapper combining Boruta and SHAP-based importance metrics.

   The `BorutaSHAP` class extends the Boruta feature selection methodology with model-specific
   importance metrics (SHAP, Gini, or permutation-based), making it compatible with both classification
   and regression tasks. It adds shuffled "shadow" copies of the features, iteratively compares each
   real feature's importance against the shadow importances, and statistically tests whether the
   feature is relevant. The class supports several importance measures and sample-based selection
   using Isolation Forest, and it follows scikit-learn estimator conventions through `fit`,
   `get_params`, and `set_params`.

   :param model: A scikit-learn-compatible model with `fit` and `predict` methods. If not provided, a default
                 RandomForestClassifier or RandomForestRegressor is used depending on the task type.
   :type model: object, optional
   :param importance_measure: Metric used to compute feature importances. One of ['Shap', 'gini', 'perm'].
   :type importance_measure: str, default='Shap'
   :param classification: Whether the task is classification (`True`) or regression (`False`).
   :type classification: bool, default=True
   :param percentile: Percentile of shadow feature importances to use as a threshold for significance testing.
                      Lower values make the algorithm more lenient.
   :type percentile: int, default=100
   :param pvalue: Significance level for hypothesis testing. Lower values result in stricter feature rejection,
                  potentially increasing runtime.
   :type pvalue: float, default=0.05

   .. attribute:: accepted

      Final list of features accepted as important.

      :type: list

   .. attribute:: rejected

      Final list of features rejected as unimportant.

      :type: list

   .. attribute:: tentative

      List of features whose importance is undetermined.

      :type: list

   .. attribute:: history_x

      DataFrame recording feature importances over each iteration.

      :type: pd.DataFrame

   .. attribute:: history_shadow

      Historical shadow feature importances.

      :type: np.ndarray

   .. attribute:: hits

      Array tracking the number of times each feature beats the shadow threshold.

      :type: np.ndarray

   .. attribute:: all_columns

      Array of original column names in `X`.

      :type: np.ndarray

   .. attribute:: X_boruta

      DataFrame containing both original and shadow features.

      :type: pd.DataFrame

   .. method:: fit(X, y, ...)

      Runs the BorutaSHAP feature selection algorithm on the provided dataset.


   .. method:: results_to_csv(filename='feature_importance')

      Saves the importance scores and feature decisions to CSV.


   .. method:: create_mapping_of_features_to_attribute(maps)

      Creates a dictionary mapping each feature to a visual label (e.g., color or decision).


   .. rubric:: Notes

   SHAP-based explanations are intended for tree-based models, such as XGBoost, LightGBM, or CatBoost.

   Permutation importance is computed using scikit-learn’s `permutation_importance` function.

   Gini importance requires the model to expose a `feature_importances_` attribute, such as in RandomForest models.

   Missing values are supported only for models that can handle them internally. Otherwise, a ValueError will be raised.

   When sample-based SHAP explanations are enabled, the method uses an Isolation Forest to select a representative subset
   of the data that preserves the original anomaly score distribution.

   .. rubric:: References

   - Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package.
   - Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions (SHAP).
   - https://github.com/Ekeany/Boruta-Shap


   .. py:attribute:: importance_measure
      :value: 'Shap'


   .. py:attribute:: percentile
      :value: 100


   .. py:attribute:: pvalue
      :value: 0.05


   .. py:attribute:: classification
      :value: True


   .. py:attribute:: model
      :value: None


   .. py:method:: _get_param_names()
      :classmethod:


      Retrieve the parameter names from the class constructor (`__init__`).

      This utility method introspects the constructor signature to extract all explicitly
      defined parameter names, excluding `self` and variable keyword arguments (`**kwargs`).
      It ensures compatibility with scikit-learn-style estimators by enforcing that no
      variable positional arguments (`*args`) are used.

      :returns: A sorted list of parameter names defined in the constructor.
      :rtype: list of str

      :raises RuntimeError: If the class defines variable positional arguments (`*args`), which violates
          scikit-learn's estimator API convention.
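
      As an illustration only, the introspection described above can be sketched with the
      standard `inspect` module. The helper `get_param_names` and the class `ToyEstimator`
      are hypothetical names chosen for this example; they are not part of pyBIA, and the
      actual implementation may differ.

      .. code-block:: python

         import inspect


         class ToyEstimator:
             """Hypothetical estimator used only to demonstrate the introspection."""

             def __init__(self, model=None, pvalue=0.05, percentile=100, **kwargs):
                 self.model = model
                 self.pvalue = pvalue
                 self.percentile = percentile


         def get_param_names(cls):
             """Return the sorted constructor parameter names of ``cls``."""
             names = []
             for name, param in inspect.signature(cls.__init__).parameters.items():
                 if name == "self" or param.kind is inspect.Parameter.VAR_KEYWORD:
                     continue  # skip 'self' and **kwargs
                 if param.kind is inspect.Parameter.VAR_POSITIONAL:
                     raise RuntimeError("estimators should not use *args in __init__")
                 names.append(name)
             return sorted(names)


         print(get_param_names(ToyEstimator))

      Sorting the names keeps the result deterministic regardless of the order in which the
      constructor declares its parameters.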


   .. py:method:: get_params(deep=True)

      Get the parameters of this estimator.

      This method returns a dictionary of all parameters in the estimator. If `deep=True`,
      it will recursively retrieve parameters of sub-estimators (i.e., model attributes
      that implement `get_params` themselves), using a double-underscore naming convention.

      :param deep: If True, include parameters from nested objects (such as the wrapped model).
                   If False, only return parameters directly set on this estimator.
      :type deep: bool, default=True

      :returns: **params** -- Dictionary mapping parameter names to their current values.
                Nested parameters are flattened using the format: 'component__param'.
      :rtype: dict


   .. py:method:: set_params(**params)

      Set the parameters of this estimator.

      This method updates the estimator’s parameters using the provided dictionary.
      It supports both top-level parameters and nested parameters using the
      scikit-learn convention of double underscores (e.g., 'model__n_estimators')
      for sub-estimators.

      :param \*\*params: Dictionary of parameter names mapped to their new values. Nested parameters
                         can be updated using double-underscore notation.
      :type \*\*params: dict

      :returns: **self** -- The updated estimator instance.
      :rtype: object

      :raises ValueError: If a parameter name is invalid or does not match any parameter in the estimator.
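
      The double-underscore convention used by `get_params` and `set_params` can be sketched
      as follows. This is a simplified illustration, not the pyBIA implementation:
      `ToyModel` and `ToySelector` are hypothetical classes, and only one level of nesting
      is handled.

      .. code-block:: python

         class ToyModel:
             """Hypothetical sub-estimator with a single parameter."""

             def __init__(self, n_estimators=100):
                 self.n_estimators = n_estimators

             def get_params(self, deep=True):
                 return {"n_estimators": self.n_estimators}

             def set_params(self, **params):
                 for key, value in params.items():
                     setattr(self, key, value)
                 return self


         class ToySelector:
             """Hypothetical wrapper exposing nested parameters as 'model__<name>'."""

             _param_names = ("model", "pvalue")

             def __init__(self, model=None, pvalue=0.05):
                 self.model = model
                 self.pvalue = pvalue

             def get_params(self, deep=True):
                 params = {name: getattr(self, name) for name in self._param_names}
                 if deep and hasattr(self.model, "get_params"):
                     # Flatten the sub-estimator's parameters under a 'model__' prefix.
                     for key, value in self.model.get_params().items():
                         params[f"model__{key}"] = value
                 return params

             def set_params(self, **params):
                 nested = {}
                 for key, value in params.items():
                     name, delimiter, sub_key = key.partition("__")
                     if name not in self._param_names:
                         raise ValueError(f"Invalid parameter {name!r}")
                     if delimiter:
                         nested.setdefault(name, {})[sub_key] = value
                     else:
                         setattr(self, name, value)
                 # Forward the nested parameters to the matching sub-estimator.
                 for name, sub_params in nested.items():
                     getattr(self, name).set_params(**sub_params)
                 return self


         selector = ToySelector(model=ToyModel())
         selector.set_params(pvalue=0.01, model__n_estimators=250)
         print(selector.get_params())

      Returning `self` from `set_params` allows calls to be chained, which is the behaviour
      documented above.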


   .. py:method:: check_model()

      Validate and initialize the model used for feature importance evaluation.

      If no model was provided at initialization, this method assigns a default
      RandomForestClassifier (for classification) or RandomForestRegressor (for regression).
      It also checks that the model has the required methods and attributes based
      on the selected importance measure.

      :rtype: None

      :raises AttributeError: If the provided model does not implement both `fit` and `predict`, or if the
          `gini` importance measure is selected but the model lacks the `feature_importances_` attribute.


   .. py:method:: check_X()

      Verify that the input feature data `X` is a pandas DataFrame.

      This method ensures that the feature matrix provided to the BorutaSHAP
      instance is of the correct type before proceeding with feature selection.

      :rtype: None

      :raises AttributeError: If `X` is not a pandas DataFrame.


   .. py:method:: missing_values_y()

      Check for missing values in the target variable `y`.

      Supports pandas Series, DataFrame, or NumPy array inputs. Returns True if
      any missing values are found.

      :returns: True if `y` contains missing values, False otherwise.
      :rtype: bool

      :raises AttributeError: If `y` is not a pandas Series, DataFrame, or NumPy array.


   .. py:method:: check_missing_values()

      Check for missing values in the feature matrix `X` and target variable `y`.

      This method verifies that no missing values are present in the input data.
      If missing values are found, a warning is issued for models that support them
      (e.g., XGBoost, LightGBM, CatBoost). Otherwise, a ValueError is raised.

      :rtype: None

      :raises ValueError: If missing values are detected and the model does not support them.

      .. rubric:: Notes

      Models known to support missing values include: XGBoost, CatBoost, and LightGBM.


   .. py:method:: Check_if_chose_train_or_test_and_train_model()

      Split the data and train the model based on the `train_or_test` strategy.

      If `train_or_test='test'`, the method splits the Boruta-augmented dataset into training
      and testing sets (70/30 split) using the specified `random_state` and optional stratification.
      The model is trained on the training portion.

      If `train_or_test='train'`, the model is trained on the full dataset without splitting.

      :rtype: None

      :raises ValueError: If stratification is requested for a regression task, or if `train_or_test` is not
          one of the accepted values ("train" or "test").

      .. rubric:: Notes

      For a detailed discussion on training vs. testing data when computing feature importance,
      see: https://slds-lmu.github.io/iml_methods_limitations/pfi-data.html


   .. py:method:: Train_model(X, y, sample_weight=None)

      Fit the model to the provided data.

      This method trains the model using the given features `X` and targets `y`.
      It handles special cases for certain models like CatBoost, which require
      categorical feature specifications. It also gracefully handles models that do
      not accept the `verbose` parameter.

      :param X: DataFrame containing the feature matrix.
      :type X: pandas.DataFrame
      :param y: Array or Series containing the target variable.
      :type y: pandas.Series or numpy.ndarray
      :param sample_weight: Sample weights to apply during model training.
      :type sample_weight: pandas.Series or numpy.ndarray, optional

      :rtype: None


   .. py:method:: fit(X, y, sample_weight=None, n_trials=20, random_state=0, sample=False, train_or_test='test', normalize=True, verbose=True, stratify=None)

      Run the BorutaSHAP feature selection process.

      This is the core method that performs iterative feature selection by comparing
      real features against shadow (randomized) features using the chosen importance
      measure (SHAP, Gini, or permutation). Features are repeatedly tested against the
      maximum shadow importance, and classified as accepted, rejected, or tentative.

      The algorithm proceeds as follows:

      1. Extend the dataset by adding shuffled copies of original features (shadow features).
      2. Train the model and compute feature importances.
      3. Identify features that outperform the maximum shadow importance threshold.
      4. Track hit counts and statistically test features against the null hypothesis.
      5. Accept, reject, or defer decision on each feature.
      6. Repeat until a decision is made for all features or the trial limit is reached.
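
      The first four steps can be sketched in plain Python. This toy version is an
      assumption-laden illustration rather than the pyBIA code: it uses absolute Pearson
      correlation as a stand-in for the model-based importance, stores columns in a
      dictionary instead of a DataFrame, and omits the statistical tests and feature
      removal. The names `abs_correlation` and `run_trials` are hypothetical.

      .. code-block:: python

         import random


         def abs_correlation(xs, ys):
             """Absolute Pearson correlation, a stand-in importance measure."""
             n = len(xs)
             mean_x, mean_y = sum(xs) / n, sum(ys) / n
             cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             var_x = sum((x - mean_x) ** 2 for x in xs)
             var_y = sum((y - mean_y) ** 2 for y in ys)
             return abs(cov) / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


         def run_trials(columns, target, n_trials=20, seed=0):
             """Count how often each feature beats the best shadow feature."""
             rng = random.Random(seed)
             hits = {name: 0 for name in columns}
             for _ in range(n_trials):
                 # Step 1: shadow features are shuffled copies of the originals.
                 shadows = [rng.sample(values, len(values)) for values in columns.values()]
                 # Step 2: importance of the real and the shadow features.
                 importances = {n: abs_correlation(v, target) for n, v in columns.items()}
                 shadow_max = max(abs_correlation(s, target) for s in shadows)
                 # Steps 3-4: a feature scores a hit when it beats the best shadow.
                 for name, importance in importances.items():
                     if importance > shadow_max:
                         hits[name] += 1
             return hits


         rng = random.Random(1)
         target = [rng.gauss(0, 1) for _ in range(200)]
         columns = {
             "signal": [2 * t + rng.gauss(0, 0.1) for t in target],  # informative
             "noise": [rng.gauss(0, 1) for _ in range(200)],         # uninformative
         }
         hits = run_trials(columns, target)
         print(hits)

      An informative feature such as `signal` should collect a hit in every trial, whereas
      an uninformative one only beats the shadows by chance. The hit counts are what the
      later statistical tests operate on.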

      :param X: Feature matrix.
      :type X: pandas.DataFrame
      :param y: Target variable.
      :type y: pandas.Series or numpy.ndarray
      :param sample_weight: Observation-level weights used during model training.
      :type sample_weight: pandas.Series or numpy.ndarray, optional
      :param n_trials: Maximum number of iterations to run the feature selection process.
      :type n_trials: int, default=20
      :param random_state: Random seed for reproducibility.
      :type random_state: int, default=0
      :param sample: If True, a representative sample of the data will be selected using
                     Isolation Forest for SHAP value estimation.
      :type sample: bool, default=False
      :param train_or_test: Specifies whether feature importances should be computed on training data
                            or a held-out test split.
      :type train_or_test: {'train', 'test'}, default='test'
      :param normalize: Whether to normalize feature importances using z-score transformation.
      :type normalize: bool, default=True
      :param verbose: If True, prints the final list of accepted, rejected, and tentative features.
      :type verbose: bool, default=True
      :param stratify: Class labels used for stratified splitting during train-test division.
      :type stratify: array-like, optional

      :returns: **self** -- Returns the fitted BorutaSHAP instance.
      :rtype: object

      .. rubric:: Notes

      For a detailed discussion on the implications of computing feature importances
      on training vs. test data, see:
      https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html


   .. py:method:: calculate_rejected_accepted_tentative(verbose)

      Finalize feature decisions: accepted, rejected, or tentative.

      This method processes the accumulated hit statistics across all trials to determine
      which features are:

      - Accepted: consistently more important than shadow features.
      - Rejected: consistently less important than shadow features.
      - Tentative: not confidently accepted or rejected.

      :param verbose: If True, prints the number and names of accepted, rejected, and tentative features.
      :type verbose: bool

      :rtype: None


   .. py:method:: create_importance_history()

      Initialize arrays to store historical feature importance scores.

      This method sets up internal storage for tracking the shadow importances,
      original feature importances, and cumulative hit counts across all iterations.

      :rtype: None


   .. py:method:: update_importance_history()

      Update historical records of feature importances for the current iteration.

      This method appends the current shadow and actual feature importances to their
      respective history arrays, ensuring they remain aligned with the original
      column order using a mapping index.

      :rtype: None


   .. py:method:: store_feature_importance()

      Finalize and store historical feature importance statistics.

      This method reshapes the accumulated feature importance history into a pandas DataFrame
      and appends summary statistics for the shadow features, including maximum, minimum,
      mean, and median importance values.

      :rtype: None


   .. py:method:: results_to_csv(filename='feature_importance')

      Save feature importance summary statistics and decisions to a CSV file.

      This method compiles the average and standard deviation of each feature's
      importance across all iterations, appends its final classification
      (Accepted, Rejected, Tentative, or Shadow), and exports the result to disk.

      :param filename: The base name for the output CSV file. The file will be saved as
                       '<filename>.csv' in the current working directory.
      :type filename: str, default='feature_importance'

      :rtype: None


   .. py:method:: remove_features_if_rejected()

      Remove rejected features from the dataset.

      This method drops features from `self.X` that have been marked for removal
      based on the outcome of statistical tests in the current iteration.

      :rtype: None


   .. py:method:: flatten_list(array)
      :staticmethod:


      Flatten a list of lists into a single list.

      :param array: A nested list to be flattened.
      :type array: list of lists

      :returns: A single flattened list containing all elements from the sublists.
      :rtype: list


   .. py:method:: create_mapping_between_cols_and_indices()

      Create a mapping from feature names to their column indices.

      This mapping preserves the original order of columns in `self.X` and is used
      to align importance values across iterations.

      :returns: Dictionary mapping column names to their corresponding integer indices.
      :rtype: dict


   .. py:method:: calculate_hits()

      Compute hit counts for each feature based on shadow feature comparison.

      A feature is assigned a "hit" if its importance exceeds the specified percentile
      threshold of the shadow feature importances. Hits are padded and aligned to the
      full column index order.

      :returns: Array of length `ncols` containing the updated hit counts for each feature.
      :rtype: numpy.ndarray
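
      A minimal sketch of the percentile comparison, in plain Python. The helper names
      `percentile` and `hit_flags` are hypothetical, the linear-interpolation percentile is
      an assumption, and the padding and column alignment performed by the real method are
      left out.

      .. code-block:: python

         def percentile(values, q):
             """Percentile with linear interpolation between neighbouring ranks."""
             ordered = sorted(values)
             rank = (len(ordered) - 1) * q / 100
             lower = int(rank)
             upper = min(lower + 1, len(ordered) - 1)
             return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


         def hit_flags(feature_importances, shadow_importances, q=100):
             """Return 1 for each feature that exceeds the shadow threshold, else 0."""
             threshold = percentile(shadow_importances, q)
             return [1 if value > threshold else 0 for value in feature_importances]


         features = [0.9, 0.2, 0.5]
         shadows = [0.1, 0.3, 0.4]
         print(hit_flags(features, shadows, q=100))  # threshold is the shadow maximum
         print(hit_flags(features, shadows, q=0))    # threshold is the shadow minimum

      Lowering `q` lowers the threshold, so more features register a hit. This is why a
      lower `percentile` makes the selection more lenient.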


   .. py:method:: create_shadow_features()

      Generate shadow features by shuffling each original feature column.

      Shadow features are created by independently permuting each column of the input data `X`.
      These are used as a baseline for comparing feature importances. The resulting shadow
      features are renamed with a 'shadow_' prefix and concatenated with the original data
      to form the extended dataset used during model training.

      This method also identifies categorical columns for models (e.g., CatBoost) that
      require explicit specification of categorical features.

      :rtype: None


   .. py:method:: calculate_Zscore(array)
      :staticmethod:


      Compute the z-score normalization of a numeric array.

      Each element is standardized by subtracting the mean and dividing by the standard deviation.

      :param array: Input array of numeric values.
      :type array: array-like

      :returns: Z-score normalized values of the input array.
      :rtype: list of float
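
      For illustration, the standardization can be written with the standard library. The
      function name `z_scores` is hypothetical, and the use of the population standard
      deviation is an assumption of this sketch.

      .. code-block:: python

         from statistics import fmean, pstdev


         def z_scores(values):
             """Standardize values to zero mean and unit (population) variance."""
             mean = fmean(values)
             std = pstdev(values)  # assumption: population standard deviation
             return [(value - mean) / std for value in values]


         print(z_scores([1.0, 2.0, 3.0]))

      A constant input has zero standard deviation, so this sketch would raise
      `ZeroDivisionError` in that case.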


   .. py:method:: feature_importance(normalize)

      Compute feature importance scores for both original and shadow features.

      This method calculates importance values based on the specified `importance_measure`:

      - 'shap': Uses SHAP values computed via `shap.TreeExplainer`
      - 'perm': Uses permutation importance via `sklearn.inspection.permutation_importance`
      - 'gini': Uses built-in `feature_importances_` from the model (e.g., RandomForest)

      Importance values can optionally be normalized using z-score transformation.

      :param normalize: If True, importance scores are z-score normalized.
      :type normalize: bool

      :returns: * **X_feature_import** (*array-like*) -- Importance scores for the original features.
                * **Shadow_feature_import** (*array-like*) -- Importance scores for the shadow (randomized) features.

      :raises ValueError: If `importance_measure` is not one of {'shap', 'perm', 'gini'}.


   .. py:method:: isolation_forest(X, sample_weight)
      :staticmethod:


      Fit an Isolation Forest to the dataset and compute anomaly scores.

      This method trains an Isolation Forest on the input feature matrix `X` and
      returns anomaly scores for each sample. Higher scores indicate more typical
      (less anomalous) samples.

      :param X: Input feature matrix.
      :type X: pandas.DataFrame or numpy.ndarray
      :param sample_weight: Sample weights to apply during model fitting.
      :type sample_weight: array-like

      :returns: Anomaly scores for each sample, as returned by `IsolationForest.score_samples`.
      :rtype: numpy.ndarray


   .. py:method:: get_5_percent(num)
      :staticmethod:


      Compute 5 percent of a given number.

      :param num: Input number.
      :type num: int or float

      :returns: Value corresponding to 5% of the input, rounded to the nearest integer.
      :rtype: int


   .. py:method:: get_5_percent_splits(length)

      Generate index positions at 5% intervals of a given length.

      This method returns an array of indices that split a dataset into
      successive 5% chunks, based on the total number of samples.

      :param length: Total number of samples in the dataset.
      :type length: int

      :returns: Array of index positions at 5% intervals.
      :rtype: numpy.ndarray
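
      A rough sketch of the 5% helpers, using plain lists instead of NumPy arrays. The
      function names are hypothetical, and whether the real method includes the final
      index or starts at zero is not stated above, so the exact bounds here are an
      assumption.

      .. code-block:: python

         def five_percent(num):
             """Return 5% of ``num``, rounded to the nearest integer."""
             return round(0.05 * num)


         def five_percent_splits(length):
             """Return index positions at successive 5% steps of ``length``."""
             step = five_percent(length)
             return list(range(step, length, step))


         print(five_percent_splits(200))

      For very small datasets the 5% step rounds to zero, which makes `range` raise a
      `ValueError`; a production implementation would need to guard against that.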


   .. py:method:: find_sample()

      Select a representative sample of the dataset using KS-test on anomaly scores.

      This method iteratively draws random samples of increasing size and compares the
      distribution of their anomaly scores (from Isolation Forest) to the original
      distribution using the Kolmogorov-Smirnov (KS) test. It starts at 5% of the dataset
      and increases in 5% increments until a sample is found with a KS p-value > 0.95,
      indicating statistical similarity.

      :returns: A representative sample of `self.X_boruta` with similar anomaly score distribution
                to the full dataset.
      :rtype: pandas.DataFrame


   .. py:method:: explain()

      Compute SHAP values for the model using TreeExplainer.

      This method uses the SHAP package to calculate feature importances based on
      Shapley values. It selects `TreeExplainer` with path-dependent perturbations
      for efficiency on tree-based models. If `self.sample` is True, a representative
      subset of the data (selected via `find_sample`) is used; otherwise, the full
      `self.X_boruta` dataset is used.

      For classification tasks, SHAP values across classes are aggregated to compute
      a single importance value per feature. For regression, absolute SHAP values are
      averaged directly.

      :returns: The computed SHAP values are stored internally in `self.shap_values`.
      :rtype: None

      :raises ValueError: If the model is not compatible with SHAP's TreeExplainer (though in practice,
          the method currently assumes tree-based models only).


   .. py:method:: binomial_H0_test(array, n, p, alternative)
      :staticmethod:


      Perform a binomial test for each element in an array.

      This method tests the null hypothesis that the probability of success is `p`
      in a Bernoulli trial, using a binomial test. Each element in the input array
      is treated as the number of observed successes out of `n` trials.

      :param array: Array of observed success counts (can be float; will be rounded).
      :type array: array-like
      :param n: Number of trials per test.
      :type n: int
      :param p: Null hypothesis probability of success.
      :type p: float
      :param alternative: Defines the alternative hypothesis.
      :type alternative: {'two-sided', 'greater', 'less'}

      :returns: List of p-values from the binomial tests for each element in the input array.
      :rtype: list of float
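
      The one-sided variants of this test can be sketched without external dependencies.
      The function `binomial_p_values` is a hypothetical stand-in that computes exact tail
      probabilities with `math.comb`; the two-sided alternative is omitted, and the real
      method may delegate to a statistics library instead.

      .. code-block:: python

         from math import comb


         def binomial_p_values(array, n, p=0.5, alternative="greater"):
             """Exact one-sided binomial p-values for each success count in ``array``."""
             pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
             p_values = []
             for successes in array:
                 k = round(successes)  # counts may arrive as floats
                 if alternative == "greater":
                     p_values.append(sum(pmf[k:]))       # P(X >= k)
                 elif alternative == "less":
                     p_values.append(sum(pmf[: k + 1]))  # P(X <= k)
                 else:
                     raise ValueError("this sketch supports 'greater' and 'less' only")
             return p_values


         hits = [20, 10, 0]
         print(binomial_p_values(hits, n=20, alternative="greater"))
         print(binomial_p_values(hits, n=20, alternative="less"))

      A feature that scores a hit in all 20 trials gets a very small right-tailed p-value,
      while a feature with no hits gets a very small left-tailed p-value.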


   .. py:method:: find_index_of_true_in_array(array)
      :staticmethod:


      Return the indices of elements that are True in a boolean array.

      :param array: Boolean array indicating which elements to select.
      :type array: array-like of bool

      :returns: Indices where the array has True values.
      :rtype: list of int


   .. py:method:: bonferoni_corrections(pvals, alpha=0.05, n_tests=None)
      :staticmethod:


      Apply a Bonferroni correction to a set of p-values.

      This method counteracts the multiple-comparisons problem that arises when many
      features are tested at once. Each p-value is judged against the corrected
      significance level `alpha / n_tests` rather than against `alpha` itself.

      :param pvals: P-values to correct, one per test.
      :type pvals: array-like of float
      :param alpha: Significance level before correction.
      :type alpha: float, default=0.05
      :param n_tests: Number of tests to correct for. If None, the number of p-values in `pvals` is used.
      :type n_tests: int, optional

      :returns: The rejection decision for each test and the corrected p-values.
      :rtype: tuple
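
      As a hedged illustration of a Bonferroni correction matching the signature above, the
      plain-Python function below divides the significance level by the number of tests.
      The name `bonferroni`, the list-based return values, and the cap of corrected
      p-values at 1.0 are assumptions of this sketch.

      .. code-block:: python

         def bonferroni(pvals, alpha=0.05, n_tests=None):
             """Return rejection flags and Bonferroni-corrected p-values."""
             if n_tests is None:
                 n_tests = len(pvals)
             threshold = alpha / n_tests  # stricter per-test significance level
             reject = [p <= threshold for p in pvals]
             corrected = [min(p * n_tests, 1.0) for p in pvals]
             return reject, corrected


         reject, corrected = bonferroni([0.001, 0.02, 0.5])
         print(reject)
         print(corrected)

      With three tests the per-test level becomes `0.05 / 3`, so a p-value of 0.02 that
      would pass an uncorrected test is no longer significant.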


   .. py:method:: test_features(iteration)

      Perform statistical tests to accept or reject features based on accumulated hit counts.

      For each feature, this method performs two binomial hypothesis tests:

      - A right-tailed test to check if the feature is significantly better than random (acceptance).
      - A left-tailed test to check if the feature is significantly worse than random (rejection).

      The tests use the number of times each feature outperformed the shadow feature (stored in `self.hits`)
      over `iteration` trials. Bonferroni correction is applied to control for multiple comparisons.

      :param iteration: The current iteration count, used as the number of trials in the binomial test.
      :type iteration: int

      :returns: Updates the following internal attributes:

                - `self.accepted_columns`: list of accepted feature names for this iteration.
                - `self.rejected_columns`: list of rejected feature names for this iteration.
                - `self.features_to_remove`: list of features to remove from `self.X` in the next iteration.
      :rtype: None


   .. py:method:: create_list(array, color)
      :staticmethod:


      Create a list of repeated color labels for visualization or mapping.

      :param array: List of elements (used only to determine length).
      :type array: array-like
      :param color: The color label or string to repeat.
      :type color: str

      :returns: A list of the same `color` repeated to match the length of `array`.
      :rtype: list of str


   .. py:method:: create_mapping_of_features_to_attribute(maps=[])

      Create a dictionary mapping features to attribute labels (e.g., for color or status tagging).

      This method maps each feature—tentative, rejected, accepted, and shadow summary features—
      to a corresponding label or value provided in the `maps` list. It is typically used for
      visualization or export purposes.

      :param maps: A list of four strings corresponding to labels for:

                   - [0] Tentative features
                   - [1] Rejected features
                   - [2] Accepted features
                   - [3] Shadow features (e.g., Max_Shadow, Min_Shadow, etc.)
      :type maps: list of str

      :returns: Dictionary mapping each feature name to its corresponding label.
      :rtype: dict
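
      The mapping can be pictured with a small stand-alone function. Unlike the method,
      which reads the feature groups from the fitted instance, the hypothetical
      `map_features_to_labels` below takes them as explicit arguments, and the feature
      names and colors are made up for the example.

      .. code-block:: python

         def map_features_to_labels(tentative, rejected, accepted, shadow, maps):
             """Map every feature name to the label of the group it belongs to."""
             groups = [tentative, rejected, accepted, shadow]  # same order as `maps`
             mapping = {}
             for names, label in zip(groups, maps):
                 for name in names:
                     mapping[name] = label
             return mapping


         mapping = map_features_to_labels(
             tentative=["f3"],
             rejected=["f2"],
             accepted=["f1"],
             shadow=["Max_Shadow", "Min_Shadow"],
             maps=["yellow", "red", "green", "blue"],
         )
         print(mapping)

      The resulting dictionary can be passed to a plotting routine so that each feature is
      drawn in the color of its decision.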


   .. py:method:: to_dictionary(list_one, list_two)
      :staticmethod:


      Create a dictionary by zipping two lists together.

      :param list_one: List of keys.
      :type list_one: list
      :param list_two: List of values.
      :type list_two: list

      :returns: Dictionary mapping each element in `list_one` to the corresponding element in `list_two`.
      :rtype: dict