gators.feature_selection package

Submodules

gators.feature_selection.information_value module

gators.feature_selection.information_value.compute_iv(X, y, regularization=0.01)

Compute the Information Value (IV) for each categorical feature in the dataset.

Continuous features should be binned first, for example with the binning module, so that IV is computed on the resulting categories.

Parameters:
  • X (pl.DataFrame) – The input features.

  • y (pl.Series) – The target variable (binary).

  • regularization (float, default=0.01) – Regularization parameter to avoid division by zero in WOE/IV calculation.

Returns:

A DataFrame with one row per feature, holding the feature name (feature) and its IV (iv).

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> from gators.feature_selection import compute_iv
>>> X = pl.DataFrame({
...     "feature1": ["a", "a", "b", "c"],
...     "feature2": ["x", "x", "x", "y"],
...     "target": [1, 0, 1, 0]
... })
>>> iv = compute_iv(X.drop("target"), X["target"])
>>> print(iv)
shape: (2, 2)
┌──────────┬────────────┐
│ feature  │ iv         │
│ ---      │ ---        │
│ str      │ f64        │
╞══════════╪════════════╡
│ feature1 │ 0.693147   │
│ feature2 │ 0.287682   │
└──────────┴────────────┘
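
The role of the regularization parameter can be illustrated with a minimal pure-Python sketch of a WOE/IV calculation for a single categorical feature. This is not the gators implementation: the helper name `information_value` is hypothetical, and adding the regularization term to each per-category share is an assumption, so the values it produces may differ from those returned by compute_iv.

```python
import math
from collections import Counter


def information_value(categories, target, regularization=0.01):
    """Sketch: IV of one categorical feature against a binary (0/1) target."""
    # Count events (target == 1) and non-events (target == 0) per category.
    events = Counter(c for c, t in zip(categories, target) if t == 1)
    non_events = Counter(c for c, t in zip(categories, target) if t == 0)
    n_events = sum(events.values())
    n_non_events = sum(non_events.values())
    if n_events == 0 or n_non_events == 0:
        raise ValueError("target must contain both classes")

    iv = 0.0
    for cat in set(categories):
        # Assumption: the regularization term is added to each share so that
        # neither the ratio nor the logarithm sees a zero.
        p_event = events[cat] / n_events + regularization
        p_non_event = non_events[cat] / n_non_events + regularization
        woe = math.log(p_event / p_non_event)  # Weight of Evidence
        iv += (p_event - p_non_event) * woe
    return iv


# A category that never co-occurs with an event ("y") still yields a finite IV.
iv_feature2 = information_value(["x", "x", "x", "y"], [1, 0, 1, 0])
```

Without the regularization term, category "y" above would have an event share of zero and the logarithm would be undefined.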

gators.feature_selection.feature_stability_index module

gators.feature_selection.feature_stability_index.feature_stability_index(estimator, skf, X, y, importance_threshold=0.0)

Compute the Feature Stability Index (FSI) from an estimator's feature importances across repeated fits.

Measures how consistently a feature is selected across different training folds. Higher FSI indicates more stable/reliable feature importance.

Parameters:
  • estimator (estimator object) – Any estimator with a feature_importances_ attribute (e.g., XGBoost, RandomForest).

  • skf (sklearn fold splitter object) – Splitter used to generate the training folds (e.g., StratifiedKFold, KFold).

  • X (DataFrame) – Feature DataFrame with shape (n_samples, n_features).

  • y (Series) – Target series for training.

  • importance_threshold (float, default=0.0) – Minimum importance value for a feature to be considered “selected” in a run. Must be between 0.0 and 1.0.

Returns:

DataFrame with columns:

  • feature: Feature name

  • fsi: Feature Stability Index (0 to 1, higher is more stable)

  • importance: Average importance across all runs

Rows are sorted by fsi and importance in descending order, and only features with fsi > 0 are returned.

Return type:

pl.DataFrame
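
The selection-frequency idea behind FSI can be sketched in plain Python. This is an illustration under stated assumptions, not the gators implementation: the helper name `stability_index` is hypothetical, the input is assumed to be one list of importances per fold (as would be read from `feature_importances_` after fitting on each training split), and a feature is assumed to count as selected when its importance is strictly greater than the threshold.

```python
def stability_index(importances_per_fold, feature_names, importance_threshold=0.0):
    """Sketch: share of folds in which each feature is 'selected'."""
    n_folds = len(importances_per_fold)
    rows = []
    for j, name in enumerate(feature_names):
        values = [fold[j] for fold in importances_per_fold]
        # Assumption: "selected" means importance strictly above the threshold.
        fsi = sum(v > importance_threshold for v in values) / n_folds
        importance = sum(values) / n_folds  # average importance across folds
        if fsi > 0:  # drop features that were never selected
            rows.append({"feature": name, "fsi": fsi, "importance": importance})
    # Most stable first, ties broken by average importance.
    rows.sort(key=lambda r: (r["fsi"], r["importance"]), reverse=True)
    return rows


# Hypothetical importances from three folds for features a, b and c.
folds = [
    [0.6, 0.4, 0.0],
    [0.7, 0.3, 0.0],
    [1.0, 0.0, 0.0],
]
result = stability_index(folds, ["a", "b", "c"])
```

Here feature a is selected in every fold, b in two of three, and c never, so c is filtered out of the result.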

Module contents

gators.feature_selection.compute_iv(X, y, regularization=0.01)

Compute the Information Value (IV) for each categorical feature in the dataset.

Continuous features should be binned first, for example with the binning module, so that IV is computed on the resulting categories.

Parameters:
  • X (pl.DataFrame) – The input features.

  • y (pl.Series) – The target variable (binary).

  • regularization (float, default=0.01) – Regularization parameter to avoid division by zero in WOE/IV calculation.

Returns:

A DataFrame with one row per feature, holding the feature name (feature) and its IV (iv).

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> from gators.feature_selection import compute_iv
>>> X = pl.DataFrame({
...     "feature1": ["a", "a", "b", "c"],
...     "feature2": ["x", "x", "x", "y"],
...     "target": [1, 0, 1, 0]
... })
>>> iv = compute_iv(X.drop("target"), X["target"])
>>> print(iv)
shape: (2, 2)
┌──────────┬────────────┐
│ feature  │ iv         │
│ ---      │ ---        │
│ str      │ f64        │
╞══════════╪════════════╡
│ feature1 │ 0.693147   │
│ feature2 │ 0.287682   │
└──────────┴────────────┘

gators.feature_selection.feature_stability_index(estimator, skf, X, y, importance_threshold=0.0)

Compute the Feature Stability Index (FSI) from an estimator's feature importances across repeated fits.

Measures how consistently a feature is selected across different training folds. Higher FSI indicates more stable/reliable feature importance.

Parameters:
  • estimator (estimator object) – Any estimator with a feature_importances_ attribute (e.g., XGBoost, RandomForest).

  • skf (sklearn fold splitter object) – Splitter used to generate the training folds (e.g., StratifiedKFold, KFold).

  • X (DataFrame) – Feature DataFrame with shape (n_samples, n_features).

  • y (Series) – Target series for training.

  • importance_threshold (float, default=0.0) – Minimum importance value for a feature to be considered “selected” in a run. Must be between 0.0 and 1.0.

Returns:

DataFrame with columns:

  • feature: Feature name

  • fsi: Feature Stability Index (0 to 1, higher is more stable)

  • importance: Average importance across all runs

Rows are sorted by fsi and importance in descending order, and only features with fsi > 0 are returned.

Return type:

pl.DataFrame