gators.feature_selection package

Submodules

gators.feature_selection.information_value module

gators.feature_selection.information_value.compute_iv(X, y, regularization=0.01)

Compute the Information Value (IV) for each categorical feature in the dataset.

To convert continuous features to categorical, consider using the binning module to create bins before computing IV.

Parameters:
  • X (pl.DataFrame) – The input features.

  • y (pl.Series) – The target variable (binary).

  • regularization (float, default=0.01) – Regularization parameter to avoid division by zero in WOE/IV calculation.

Returns:

A DataFrame containing the IV values for each feature.

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> from gators.feature_selection import compute_iv
>>> X = pl.DataFrame({
...     "feature1": ["a", "a", "b", "c"],
...     "feature2": ["x", "x", "x", "y"],
...     "target": [1, 0, 1, 0]
... })
>>> iv = compute_iv(X.drop("target"), X["target"])
>>> print(iv)
shape: (2, 2)
┌──────────┬────────────┐
│ feature  │ iv         │
│ ---      │ ---        │
│ str      │ f64        │
╞══════════╪════════════╡
│ feature1 │ 0.693147   │
│ feature2 │ 0.287682   │
└──────────┴────────────┘
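As a rough illustration of what an IV computation involves, the sketch below implements the textbook definition for a single categorical feature in plain Python: for each category, the weight of evidence (WOE) is the log ratio of the event share to the non-event share, and IV sums (event share - non-event share) * WOE over the categories. Where the regularization term enters is an assumption made here (it is added to each share); compute_iv may apply it differently, so this sketch need not reproduce the numbers shown above. The helper name information_value is hypothetical.

```python
import math
from collections import Counter


def information_value(categories, target, regularization=0.01):
    """IV of one categorical feature against a binary (0/1) target."""
    events = Counter(c for c, t in zip(categories, target) if t == 1)
    non_events = Counter(c for c, t in zip(categories, target) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    if total_events == 0 or total_non_events == 0:
        raise ValueError("target must contain both classes")

    iv = 0.0
    for category in set(categories):
        # Assumption: the regularization term is added to each share so that
        # neither the ratio nor the logarithm can hit zero.
        event_share = events[category] / total_events + regularization
        non_event_share = non_events[category] / total_non_events + regularization
        woe = math.log(event_share / non_event_share)
        iv += (event_share - non_event_share) * woe
    return iv


feature = ["a", "a", "b", "b"]
iv_separating = information_value(feature, [1, 1, 0, 0])  # categories split the classes
iv_uninformative = information_value(feature, [1, 0, 1, 0])  # same mix in every category
```

A feature whose categories have the same class mix contributes a WOE of zero everywhere, so its IV is zero; the more the categories separate the two classes, the larger the IV.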

gators.feature_selection.feature_stability_index module

gators.feature_selection.feature_stability_index.feature_stability_index(estimator, skf, X: polars.DataFrame, y: polars.Series, importance_threshold: Annotated[float, Field(ge=0.0, le=1.0)] = 0.0)

Compute the Feature Stability Index (FSI) from the feature importances of an estimator refit on each training fold.

Measures how consistently a feature is selected across different training folds. Higher FSI indicates more stable/reliable feature importance.

Parameters:
  • estimator (estimator object) – Any estimator with a feature_importances_ attribute (e.g., XGBoost, RandomForest).

  • skf (sklearn fold splitter object) – Any sklearn fold splitter object (e.g., StratifiedKFold, KFold) for splitting the data.

  • X (pl.DataFrame) – Feature DataFrame with shape (n_samples, n_features).

  • y (pl.Series) – Target series for training.

  • importance_threshold (Annotated[float, Field(ge=0.0, le=1.0)], default=0.0) – Minimum importance value for a feature to be considered “selected” in a run. Must be between 0.0 and 1.0.

Returns:

DataFrame with columns:

  • feature: Feature name

  • fsi: Feature Stability Index (0 to 1, higher is more stable)

  • importance: Average importance across all runs

Sorted by FSI and importance in descending order, filtered to fsi > 0.

Return type:

pl.DataFrame
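The returned table can be pictured with a small plain-Python sketch that starts from per-fold importances that have already been computed. It is only an illustration under stated assumptions: the helper name stability_table is hypothetical, the importances are made-up numbers, and "selected" is taken to mean an importance strictly above importance_threshold, which the description above does not spell out.

```python
def stability_table(fold_importances, importance_threshold=0.0):
    """fold_importances: one {feature: importance} dict per fold."""
    n_folds = len(fold_importances)
    features = sorted({name for fold in fold_importances for name in fold})
    rows = []
    for name in features:
        values = [fold.get(name, 0.0) for fold in fold_importances]
        # Assumption: a feature counts as selected in a fold when its
        # importance is strictly above the threshold.
        fsi = sum(v > importance_threshold for v in values) / n_folds
        importance = sum(values) / n_folds
        if fsi > 0:  # features never selected are filtered out
            rows.append({"feature": name, "fsi": fsi, "importance": importance})
    # Sort by FSI, then by mean importance, both descending.
    rows.sort(key=lambda r: (r["fsi"], r["importance"]), reverse=True)
    return rows


# Made-up importances for two folds.
folds = [
    {"a": 0.75, "b": 0.25, "c": 0.0},
    {"a": 1.0, "b": 0.0, "c": 0.0},
]
table = stability_table(folds)
```

Here "a" is selected in both folds (FSI 1.0), "b" in one of two (FSI 0.5), and "c" in none, so "c" does not appear in the result.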

Module contents

gators.feature_selection.compute_iv(X, y, regularization=0.01)

Package-level export of gators.feature_selection.information_value.compute_iv. See that entry above for the parameters, return value, and example.

gators.feature_selection.feature_stability_index(estimator, skf, X: polars.DataFrame, y: polars.Series, importance_threshold: Annotated[float, Field(ge=0.0, le=1.0)] = 0.0)

Package-level export of gators.feature_selection.feature_stability_index.feature_stability_index. See that entry above for the parameters and return value.

class gators.feature_selection.FeatureStabilitySelector

Bases: gators.transformer._base_transformer._BaseTransformer

Drop columns whose Feature Stability Index falls below a threshold.

Wraps feature_stability_index() into a fit/transform interface. The FSI measures how consistently a feature is selected across cross-validation folds; features with low FSI are considered unstable and are dropped.

Note

Sklearn estimators require NumPy arrays, so the fit method converts the input with .to_numpy() internally.

Parameters:
  • estimator (estimator object) – Any estimator that exposes a feature_importances_ attribute once fitted (e.g., RandomForestClassifier, XGBClassifier). It does not need to be fitted beforehand.

  • skf (sklearn splitter object) – Any sklearn cross-validation splitter (e.g., StratifiedKFold).

  • threshold (float, default=0.5) – Minimum FSI required to keep a column. Columns with FSI strictly below this value are dropped.

  • importance_threshold (float, default=0.0) – Minimum per-fold importance for a feature to count as “selected” in that fold.

selected_features_

Column names that survive the FSI threshold.

Type:

list[str]

columns_to_drop_

Column names dropped due to low FSI.

Type:

list[str]

fsi_scores_

Full FSI DataFrame (feature, fsi, importance) computed during fit.

Type:

pl.DataFrame

Examples

>>> import polars as pl
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import StratifiedKFold
>>> from gators.feature_selection import FeatureStabilitySelector
>>> X = pl.DataFrame({
...     "stable":   [i % 2 for i in range(100)],
...     "unstable": [i % 7 for i in range(100)],
... })
>>> y = pl.Series("target", [i % 2 for i in range(100)])
>>> estimator = RandomForestClassifier(n_estimators=10, random_state=0)
>>> skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
>>> selector = FeatureStabilitySelector(estimator=estimator, skf=skf, threshold=0.5)
>>> selector.fit(X, y)
>>> X_transformed = selector.transform(X)

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.feature_stability_selector.FeatureStabilitySelector

Compute FSI for each column and record which to drop.

Parameters:
  • X (pl.DataFrame) – Input DataFrame.

  • y (pl.Series) – Target series for training the estimator.

Returns:

The fitted transformer instance.

Return type:

FeatureStabilitySelector

transform(X: polars.DataFrame) → polars.DataFrame

Drop unstable columns from the DataFrame.

Parameters:

X (pl.DataFrame) – Input DataFrame to transform.

Returns:

DataFrame with unstable columns removed.

Return type:

pl.DataFrame

class gators.feature_selection.InformationValueSelector

Bases: gators.transformer._base_transformer._BaseTransformer

Drop columns whose Information Value falls below a threshold.

Computes the IV for each categorical (String/Categorical/Enum) column and drops those whose IV is below threshold. Numeric columns are always kept untouched; they are not considered for IV computation.
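The selection rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the class's implementation: the helper name split_by_iv and the IV scores are hypothetical, and the scores are assumed to have been computed beforehand.

```python
def split_by_iv(columns, categorical, iv_scores, threshold=0.02):
    """Return (selected, dropped) column names, preserving input order."""
    # Only categorical columns are candidates; a column is dropped when its
    # IV is strictly below the threshold.
    dropped = [c for c in columns if c in categorical and iv_scores[c] < threshold]
    selected = [c for c in columns if c not in dropped]
    return selected, dropped


columns = ["cat_strong", "cat_weak", "numeric"]
categorical = {"cat_strong", "cat_weak"}
iv_scores = {"cat_strong": 0.5, "cat_weak": 0.01}  # made-up values for illustration
selected, dropped = split_by_iv(columns, categorical, iv_scores)
```

The numeric column never enters the comparison, so it is always part of the selected list, and a categorical column whose IV equals the threshold exactly is kept.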

Parameters:
  • threshold (float, default=0.02) – Minimum IV required to keep a column. Columns with IV strictly below this value are dropped.

  • regularization (float, default=0.01) – Regularization applied to WOE/IV calculation to avoid division by zero.

selected_features_

All column names that survive the threshold (set after fit).

Type:

list[str]

columns_to_drop_

Categorical column names dropped because their IV was too low.

Type:

list[str]

Examples

>>> import polars as pl
>>> from gators.feature_selection import InformationValueSelector
>>> X = pl.DataFrame({
...     "cat_strong": ["a", "b", "a", "b", "a", "b"],
...     "cat_weak":   ["x", "x", "x", "x", "y", "y"],
...     "numeric":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
... })
>>> y = pl.Series("target", [1, 0, 1, 0, 1, 0])
>>> selector = InformationValueSelector(threshold=0.02)
>>> selector.fit(X, y)
>>> X_transformed = selector.transform(X)

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.information_value_selector.InformationValueSelector

Compute IV for each categorical column and record which to drop.

Parameters:
  • X (pl.DataFrame) – Input DataFrame.

  • y (pl.Series) – Binary target series.

Returns:

The fitted transformer instance.

Return type:

InformationValueSelector

transform(X: polars.DataFrame) → polars.DataFrame

Drop low-IV columns from the DataFrame.

Parameters:

X (pl.DataFrame) – Input DataFrame to transform.

Returns:

DataFrame with low-IV categorical columns removed.

Return type:

pl.DataFrame

class gators.feature_selection.PermutationImportanceSelector

Bases: gators.transformer._base_transformer._BaseTransformer

Drop columns whose permutation importance falls below a threshold.

Fits estimator on the training data, then measures how much the model score degrades when each feature is randomly shuffled (permuted). Features whose mean importance across n_repeats permutations is below threshold are dropped.
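The procedure can be illustrated without any estimator library. The sketch below is a minimal plain-Python version, assuming accuracy as the score and a model represented by a simple predict function; the helper names accuracy and permutation_importances are hypothetical and this is not the code the class runs.

```python
import random


def accuracy(predict, rows, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(rows, y)) / len(y)


def permutation_importances(predict, rows, y, n_repeats=5, seed=0):
    """rows: one dict per sample; returns {feature: mean score drop}."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, y)
    importances = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving all other columns untouched.
            shuffled = [row[name] for row in rows]
            rng.shuffle(shuffled)
            permuted = [{**row, name: value} for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, y))
        importances[name] = sum(drops) / n_repeats
    return importances


rows = [{"informative": i % 2, "noise": 0} for i in range(100)]
y = [i % 2 for i in range(100)]
importances = permutation_importances(lambda row: row["informative"], rows, y)
```

Shuffling the constant "noise" column changes nothing, so its importance is exactly zero, while shuffling "informative" breaks the link to the target and the score drops.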

Note

Sklearn estimators require NumPy arrays, so the fit method converts the input with .to_numpy() internally.

Parameters:
  • estimator (estimator object) – A fitted or unfitted sklearn-compatible estimator. Must implement fit and score.

  • n_repeats (int, default=5) – Number of times to permute each feature.

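  • threshold (float, default=0.0) – Minimum mean importance (mean score drop under permutation) required to keep a feature. Features with mean permutation importance strictly below this value are dropped. With the default of 0.0, a feature is dropped only when its mean importance is negative, that is, when shuffling it improves the score on average; features with exactly zero importance are kept.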

selected_features_

Column names that survive the importance threshold.

Type:

list[str]

columns_to_drop_

Column names dropped due to low permutation importance.

Type:

list[str]

importances_

Mean permutation importance for each input feature.

Type:

dict[str, float]

Examples

>>> import polars as pl
>>> from sklearn.ensemble import RandomForestClassifier
>>> from gators.feature_selection import PermutationImportanceSelector
>>> X = pl.DataFrame({
...     "informative": [i % 2 for i in range(100)],
...     "noise":       [0] * 100,
... })
>>> y = pl.Series("target", [i % 2 for i in range(100)])
>>> estimator = RandomForestClassifier(n_estimators=10, random_state=0)
>>> selector = PermutationImportanceSelector(
...     estimator=estimator, n_repeats=5, threshold=0.0
... )
>>> selector.fit(X, y)
>>> X_transformed = selector.transform(X)

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.permutation_importance_selector.PermutationImportanceSelector

Fit estimator and compute permutation importances.

Parameters:
  • X (pl.DataFrame) – Input DataFrame.

  • y (pl.Series) – Target series.

Returns:

The fitted transformer instance.

Return type:

PermutationImportanceSelector

transform(X: polars.DataFrame) → polars.DataFrame

Drop low-importance columns from the DataFrame.

Parameters:

X (pl.DataFrame) – Input DataFrame to transform.

Returns:

DataFrame with low-importance columns removed.

Return type:

pl.DataFrame

class gators.feature_selection.PSIFilter

Bases: gators.transformer._base_transformer._BaseTransformer

Drop columns whose Population Stability Index exceeds a threshold.

PSI quantifies how much a feature’s distribution has shifted between a reference dataset (typically training data) and the current dataset. High PSI signals distributional drift; such features are unreliable at inference time and are dropped.

PSI interpretation:

  • PSI < 0.10 — stable, no significant change

  • 0.10 ≤ PSI < 0.25 — moderate shift, investigate

  • PSI ≥ 0.25 — significant shift, feature is unstable

Only numeric (Float64, Float32, Int64, Int32) columns are evaluated for PSI. Non-numeric columns are always kept.
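For a single numeric column, the standard PSI formula sums (actual share - expected share) * ln(actual share / expected share) over the bins. The plain-Python sketch below illustrates this with quantile bins taken from the reference sample. The helper name psi, the way bin edges are picked, and the eps floor for empty bins are assumptions made for the illustration; PSIFilter may handle these details differently.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-4):
    """PSI of one numeric column, using quantile bins of the reference sample."""
    ordered = sorted(reference)
    # Interior bin edges at the reference quantiles (duplicate edges collapsed).
    edges = sorted({ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)})

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(v >= edge for edge in edges)] += 1
        # Assumption: eps floors empty bins so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(reference), shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


stable = [float(i % 10) for i in range(100)]
psi_stable = psi(stable, stable)  # identical distributions
psi_drifted = psi(
    [float(i) for i in range(100)],
    [float(i + 200) for i in range(100)],  # every value lands in the top bin
)
```

Identical samples give a PSI of zero, while the shifted sample concentrates in the last reference bin and scores far above the 0.25 level listed above.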

Parameters:
  • reference_df (pl.DataFrame) – Reference DataFrame whose distributions define the baseline.

  • threshold (float, default=0.2) – Maximum PSI allowed. Columns with PSI strictly above this value are dropped.

  • n_bins (int, default=10) – Number of quantile-based bins used when computing PSI.

  • subset (list[str] or None, default=None) – Numeric columns to evaluate. If None, all numeric columns shared between reference_df and the DataFrame passed to fit are used.

psi_scores_

PSI score for each evaluated column (set after fit).

Type:

dict[str, float]

columns_to_drop_

Columns dropped because their PSI exceeded the threshold.

Type:

list[str]

selected_features_

Columns kept after filtering.

Type:

list[str]

Examples

>>> import polars as pl
>>> from gators.feature_selection import PSIFilter
>>> reference = pl.DataFrame({
...     "stable": [float(i % 10) for i in range(100)],
...     "drifted": [float(i) for i in range(100)],
... })
>>> current = pl.DataFrame({
...     "stable":  [float(i % 10) for i in range(100)],
...     "drifted": [float(i + 200) for i in range(100)],
... })
>>> selector = PSIFilter(reference_df=reference, threshold=0.2)
>>> selector.fit(current)
>>> X_transformed = selector.transform(current)

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.psi_filter.PSIFilter

Compute PSI for each numeric column against the reference DataFrame.

Parameters:
  • X (pl.DataFrame) – Current DataFrame to compare against reference_df.

  • y (pl.Series, default=None) – Not used; present for sklearn compatibility.

Returns:

The fitted transformer instance.

Return type:

PSIFilter

transform(X: polars.DataFrame) → polars.DataFrame

Drop high-PSI columns from the DataFrame.

Parameters:

X (pl.DataFrame) – Input DataFrame to transform.

Returns:

DataFrame with high-PSI columns removed.

Return type:

pl.DataFrame