gators.feature_selection package
Submodules
gators.feature_selection.information_value module
- gators.feature_selection.information_value.compute_iv(X, y, regularization=0.01)
Compute the Information Value (IV) for each categorical feature in the dataset.
To convert continuous features to categorical, consider using the binning module to create bins before computing IV.
- Parameters:
X (pl.DataFrame) – The input features.
y (pl.Series) – The target variable (binary).
regularization (float, default=0.01) – Regularization parameter to avoid division by zero in WOE/IV calculation.
- Returns:
A DataFrame containing the IV values for each feature.
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from gators.feature_selection import compute_iv
>>> X = pl.DataFrame({
...     "feature1": ["a", "a", "b", "c"],
...     "feature2": ["x", "x", "x", "y"],
...     "target": [1, 0, 1, 0]
... })
>>> iv = compute_iv(X.drop("target"), X["target"])
>>> print(iv)
shape: (2, 2)
┌──────────┬────────────┐
│ feature  │ iv         │
│ ---      │ ---        │
│ str      │ f64        │
╞══════════╪════════════╡
│ feature1 │ 0.693147   │
│ feature2 │ 0.287682   │
└──────────┴────────────┘
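The WOE/IV calculation that compute_iv refers to can be sketched in plain Python for a single categorical feature. This is a minimal sketch, not the gators implementation: the helper name information_value is hypothetical, and adding regularization to each class share is only one assumed way of avoiding division by zero, so it will not necessarily reproduce the values printed above.

```python
import math
from collections import Counter


def information_value(values, target, regularization=0.01):
    """Sketch: IV of one categorical feature against a binary (0/1) target.

    Assumes both classes occur at least once in `target`.
    """
    events = Counter(v for v, t in zip(values, target) if t == 1)
    non_events = Counter(v for v, t in zip(values, target) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    iv = 0.0
    for category in sorted(set(values)):
        # Assumed smoothing: shift both shares by `regularization` so that
        # neither the ratio nor the logarithm sees a zero.
        p_event = events[category] / total_events + regularization
        p_non_event = non_events[category] / total_non_events + regularization
        woe = math.log(p_event / p_non_event)  # weight of evidence
        iv += (p_event - p_non_event) * woe
    return iv


# A category split that mirrors the target carries no information.
uninformative = information_value(["a", "a", "b", "b"], [1, 0, 1, 0])
# A category split that separates the classes carries a lot.
informative = information_value(["a", "a", "b", "b"], [1, 1, 0, 0])
```

Each term (p_event - p_non_event) * woe is non-negative, so in this sketch the IV grows only when a category's share of events differs from its share of non-events.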
gators.feature_selection.feature_stability_index module
- gators.feature_selection.feature_stability_index.feature_stability_index(estimator, skf, X: polars.DataFrame, y: polars.Series, importance_threshold: Annotated[float, Field(ge=0.0, le=1.0)] = 0.0)
Compute the Feature Stability Index (FSI) from the feature importances of an estimator trained on each fold.
The FSI measures how consistently a feature is selected across the training folds. A higher FSI indicates more stable, more reliable feature importance.
- Parameters:
estimator (estimator object) – Any estimator with a feature_importances_ attribute (e.g., XGBoost, RandomForest).
skf (sklearn fold splitter object) – Any sklearn fold splitter object (e.g., StratifiedKFold, KFold) for splitting the data.
X (pl.DataFrame) – Feature DataFrame with shape (n_samples, n_features).
y (pl.Series) – Target series for training.
importance_threshold (Annotated[float, Field(ge=0.0, le=1.0)], default=0.0) – Minimum importance value for a feature to be considered “selected” in a run. Must be between 0.0 and 1.0.
- Returns:
DataFrame with columns:
feature: Feature name
fsi: Feature Stability Index (0 to 1, higher is more stable)
importance: Average importance across all runs
Sorted by FSI and importance in descending order, filtered to fsi > 0.
- Return type:
pl.DataFrame
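The aggregation behind the returned columns can be sketched without any estimator, starting from per-fold importances that are already computed. This is a sketch under stated assumptions rather than the library's code: the helper name stability_from_importances is hypothetical, and the FSI is assumed to be the fraction of folds in which a feature's importance strictly exceeds importance_threshold.

```python
def stability_from_importances(fold_importances, importance_threshold=0.0):
    """Sketch: turn per-fold importance dicts into (feature, fsi, importance) rows."""
    n_folds = len(fold_importances)
    rows = []
    for name in fold_importances[0]:
        values = [fold[name] for fold in fold_importances]
        # Assumption: a feature counts as "selected" in a fold when its
        # importance is strictly above the threshold.
        n_selected = sum(v > importance_threshold for v in values)
        rows.append({
            "feature": name,
            "fsi": n_selected / n_folds,
            "importance": sum(values) / n_folds,
        })
    # Mirror the documented output: keep fsi > 0, sort by FSI then importance.
    rows = [row for row in rows if row["fsi"] > 0]
    rows.sort(key=lambda row: (row["fsi"], row["importance"]), reverse=True)
    return rows


# Hypothetical importances from three folds.
folds = [
    {"a": 0.7, "b": 0.3, "c": 0.0},
    {"a": 0.6, "b": 0.0, "c": 0.0},
    {"a": 0.9, "b": 0.1, "c": 0.0},
]
table = stability_from_importances(folds)
```

Here "a" is selected in every fold, "b" in two of three, and "c" never, so "c" is filtered out of the result.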
Module contents
- gators.feature_selection.compute_iv(X, y, regularization=0.01)
Alias of gators.feature_selection.information_value.compute_iv; its parameters, return value, and example are documented under that entry above.
- gators.feature_selection.feature_stability_index(estimator, skf, X: polars.DataFrame, y: polars.Series, importance_threshold: Annotated[float, Field(ge=0.0, le=1.0)] = 0.0)
Alias of gators.feature_selection.feature_stability_index.feature_stability_index; its parameters and return value are documented under that entry above.
- class gators.feature_selection.FeatureStabilitySelector
Bases: gators.transformer._base_transformer._BaseTransformer
Drop columns whose Feature Stability Index falls below a threshold.
Wraps feature_stability_index() into a fit/transform interface. The FSI measures how consistently a feature is selected across cross-validation folds; features with low FSI are considered unstable and are dropped.
Note
Sklearn estimators require NumPy arrays, so the fit method calls .to_numpy() internally. This conversion is required and intentional.
- Parameters:
estimator (estimator object) – Any estimator that exposes a feature_importances_ attribute after fitting (e.g., RandomForestClassifier, XGBClassifier).
skf (sklearn splitter object) – Any sklearn cross-validation splitter (e.g., StratifiedKFold).
threshold (float, default=0.5) – Minimum FSI required to keep a column. Columns with FSI strictly below this value are dropped.
importance_threshold (float, default=0.0) – Minimum per-fold importance for a feature to count as “selected” in that fold.
- selected_features_
Column names that survive the FSI threshold.
- Type:
list[str]
- columns_to_drop_
Column names dropped due to low FSI.
- Type:
list[str]
- fsi_scores_
Full FSI DataFrame (feature, fsi, importance) computed during fit.
- Type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import StratifiedKFold
>>> from gators.feature_selection import FeatureStabilitySelector
>>> X = pl.DataFrame({
...     "stable": [i % 2 for i in range(100)],
...     "unstable": [i % 7 for i in range(100)],
... })
>>> y = pl.Series("target", [i % 2 for i in range(100)])
>>> estimator = RandomForestClassifier(n_estimators=10, random_state=0)
>>> skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
>>> selector = FeatureStabilitySelector(estimator=estimator, skf=skf, threshold=0.5)
>>> selector.fit(X, y)
>>> X_transformed = selector.transform(X)
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.feature_stability_selector.FeatureStabilitySelector
Compute FSI for each column and record which to drop.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series) – Target series for training the estimator.
- Returns:
The fitted transformer instance.
- Return type:
FeatureStabilitySelector
- transform(X: polars.DataFrame) → polars.DataFrame
Drop unstable columns from the DataFrame.
- Parameters:
X (pl.DataFrame) – Input DataFrame to transform.
- Returns:
DataFrame with unstable columns removed.
- Return type:
pl.DataFrame
- class gators.feature_selection.InformationValueSelector
Bases: gators.transformer._base_transformer._BaseTransformer
Drop columns whose Information Value falls below a threshold.
Computes the IV for each categorical (String/Categorical/Enum) column and drops those whose IV is below threshold. Numeric columns are not considered for IV computation and are always kept.
- Parameters:
threshold (float, default=0.02) – Minimum IV required to keep a column. Columns with IV strictly below this value are dropped.
regularization (float, default=0.01) – Regularization applied to WOE/IV calculation to avoid division by zero.
- selected_features_
All column names that survive the threshold (set after fit).
- Type:
list[str]
- columns_to_drop_
Categorical column names dropped because their IV was too low.
- Type:
list[str]
Examples
>>> import polars as pl
>>> from gators.feature_selection import InformationValueSelector
>>> X = pl.DataFrame({
...     "cat_strong": ["a", "b", "a", "b", "a", "b"],
...     "cat_weak": ["x", "x", "x", "x", "y", "y"],
...     "numeric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
... })
>>> y = pl.Series("target", [1, 0, 1, 0, 1, 0])
>>> selector = InformationValueSelector(threshold=0.02)
>>> selector.fit(X, y)
>>> X_transformed = selector.transform(X)
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.information_value_selector.InformationValueSelector
Compute IV for each categorical column and record which to drop.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series) – Binary target series.
- Returns:
The fitted transformer instance.
- Return type:
InformationValueSelector
- transform(X: polars.DataFrame) → polars.DataFrame
Drop low-IV columns from the DataFrame.
- Parameters:
X (pl.DataFrame) – Input DataFrame to transform.
- Returns:
DataFrame with low-IV categorical columns removed.
- Return type:
pl.DataFrame
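The selection rule described for InformationValueSelector (only categorical columns are scored, numeric columns always survive, and a column is dropped when its IV is strictly below threshold) can be sketched as a small planning function. The name plan_iv_selection, the dtype-name strings, and the IV scores are hypothetical illustrations, not part of gators.

```python
CATEGORICAL_DTYPES = {"String", "Categorical", "Enum"}


def plan_iv_selection(dtypes, iv_scores, threshold=0.02):
    """Sketch: decide which columns to keep and which to drop.

    dtypes    -- mapping of column name to dtype name
    iv_scores -- mapping of categorical column name to its IV
    """
    columns_to_drop = [
        name
        for name, dtype in dtypes.items()
        # Only categorical columns are candidates; "strictly below" means a
        # column whose IV equals the threshold is kept.
        if dtype in CATEGORICAL_DTYPES and iv_scores[name] < threshold
    ]
    selected_features = [name for name in dtypes if name not in columns_to_drop]
    return selected_features, columns_to_drop


# Hypothetical dtypes and IV scores for three columns.
dtypes = {"cat_strong": "String", "cat_weak": "String", "numeric": "Float64"}
iv_scores = {"cat_strong": 0.9, "cat_weak": 0.01}
selected, dropped = plan_iv_selection(dtypes, iv_scores)
```

With these made-up scores, "cat_weak" is dropped, while "numeric" is kept without ever being scored.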
- class gators.feature_selection.PermutationImportanceSelector
Bases: gators.transformer._base_transformer._BaseTransformer
Drop columns whose permutation importance falls below a threshold.
Fits estimator on the training data, then measures how much the model score degrades when each feature is randomly shuffled (permuted). Features whose mean importance across n_repeats permutations is below threshold are dropped.
Note
Sklearn estimators require NumPy arrays, so the fit method calls .to_numpy() internally. This conversion is required and intentional.
- Parameters:
estimator (estimator object) – A fitted or unfitted sklearn-compatible estimator. Must implement fit and score.
n_repeats (int, default=5) – Number of times to permute each feature.
threshold (float, default=0.0) – Minimum mean importance drop required to keep a feature. Features with mean permutation importance strictly below this value are dropped. A value of 0.0 keeps all features that contribute at least marginally.
- selected_features_
Column names that survive the importance threshold.
- Type:
list[str]
- columns_to_drop_
Column names dropped due to low permutation importance.
- Type:
list[str]
- importances_
Mean permutation importance for each input feature.
- Type:
dict[str, float]
Examples
>>> import polars as pl
>>> from sklearn.ensemble import RandomForestClassifier
>>> from gators.feature_selection import PermutationImportanceSelector
>>> X = pl.DataFrame({
...     "informative": [i % 2 for i in range(100)],
...     "noise": [0] * 100,
... })
>>> y = pl.Series("target", [i % 2 for i in range(100)])
>>> estimator = RandomForestClassifier(n_estimators=10, random_state=0)
>>> selector = PermutationImportanceSelector(
...     estimator=estimator, n_repeats=5, threshold=0.0
... )
>>> selector.fit(X, y)
>>> X_transformed = selector.transform(X)
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.permutation_importance_selector.PermutationImportanceSelector
Fit estimator and compute permutation importances.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series) – Target series.
- Returns:
The fitted transformer instance.
- Return type:
PermutationImportanceSelector
- transform(X: polars.DataFrame) → polars.DataFrame
Drop low-importance columns from the DataFrame.
- Parameters:
X (pl.DataFrame) – Input DataFrame to transform.
- Returns:
DataFrame with low-importance columns removed.
- Return type:
pl.DataFrame
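The permutation procedure described for PermutationImportanceSelector (score the model, shuffle one feature, score again, and average the drop over n_repeats) can be sketched with the standard library alone. This is an illustrative sketch, not the library's code: the helper names, the accuracy score, and the hand-written predict rule standing in for a fitted estimator are all assumptions.

```python
import random


def accuracy(predict, rows, y):
    """Share of rows for which the prediction matches the target."""
    return sum(predict(row) == t for row, t in zip(rows, y)) / len(y)


def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Sketch: mean score drop per feature when that feature is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, y)
    importances = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[name] for row in rows]
            rng.shuffle(shuffled)
            # Replace only the column under test; other columns are untouched.
            permuted = [{**row, name: v} for row, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, y))
        importances[name] = sum(drops) / n_repeats
    return importances


# Same toy data as the example above; the "model" reads one column directly.
rows = [{"informative": i % 2, "noise": 0} for i in range(100)]
y = [i % 2 for i in range(100)]
importances = permutation_importance(lambda row: row["informative"], rows, y)
```

In this sketch the constant "noise" column gets an importance of exactly 0.0, because shuffling it changes nothing. Under the strictly-below rule documented above, such a column would be kept when threshold=0.0.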
- class gators.feature_selection.PSIFilter
Bases: gators.transformer._base_transformer._BaseTransformer
Drop columns whose Population Stability Index exceeds a threshold.
PSI quantifies how much a feature’s distribution has shifted between a reference dataset (typically training data) and the current dataset. High PSI signals distributional drift; such features are unreliable at inference time and are dropped.
PSI interpretation:
PSI < 0.10 — stable, no significant change
0.10 ≤ PSI < 0.25 — moderate shift, investigate
PSI ≥ 0.25 — significant shift, feature is unstable
Only numeric (Float64, Float32, Int64, Int32) columns are evaluated for PSI. Non-numeric columns are always kept.
- Parameters:
reference_df (pl.DataFrame) – Reference DataFrame whose distributions define the baseline.
threshold (float, default=0.2) – Maximum PSI allowed. Columns with PSI strictly above this value are dropped.
n_bins (int, default=10) – Number of quantile-based bins used when computing PSI.
subset (list[str] or None, default=None) – Numeric columns to evaluate. If None, all numeric columns shared between reference_df and the DataFrame passed to fit are used.
- psi_scores_
PSI score for each evaluated column (set after fit).
- Type:
dict[str, float]
- columns_to_drop_
Columns dropped because their PSI exceeded the threshold.
- Type:
list[str]
- selected_features_
Columns kept after filtering.
- Type:
list[str]
Examples
>>> import polars as pl
>>> from gators.feature_selection import PSIFilter
>>> reference = pl.DataFrame({
...     "stable": [float(i % 10) for i in range(100)],
...     "drifted": [float(i) for i in range(100)],
... })
>>> current = pl.DataFrame({
...     "stable": [float(i % 10) for i in range(100)],
...     "drifted": [float(i + 200) for i in range(100)],
... })
>>> selector = PSIFilter(reference_df=reference, threshold=0.2)
>>> selector.fit(current)
>>> X_transformed = selector.transform(current)
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.feature_selection.psi_filter.PSIFilter
Compute PSI for each numeric column against the reference DataFrame.
- Parameters:
X (pl.DataFrame) – Current DataFrame to compare against reference_df.
y (pl.Series, default=None) – Not used; present for sklearn compatibility.
- Returns:
The fitted transformer instance.
- Return type:
PSIFilter
- transform(X: polars.DataFrame) → polars.DataFrame
Drop high-PSI columns from the DataFrame.
- Parameters:
X (pl.DataFrame) – Input DataFrame to transform.
- Returns:
DataFrame with high-PSI columns removed.
- Return type:
pl.DataFrame
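The PSI that PSIFilter compares against threshold can be sketched for one numeric column using quantile bins taken from the reference data. This is a sketch under assumptions, not the gators implementation: the helper name psi, the way bin edges are picked, and the eps floor that keeps empty bins out of the logarithm are all illustrative choices.

```python
import math
from bisect import bisect_right


def psi(reference, current, n_bins=10, eps=1e-6):
    """Sketch: Population Stability Index of `current` against `reference`."""
    ref_sorted = sorted(reference)
    # Interior bin edges at the reference quantiles (duplicates collapsed).
    edges = sorted({
        ref_sorted[int(len(ref_sorted) * k / n_bins)] for k in range(1, n_bins)
    })

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Assumed safeguard: floor each share at eps so log() never sees zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Same toy columns as the example above.
stable_ref = [float(i % 10) for i in range(100)]
drifted_ref = [float(i) for i in range(100)]
drifted_cur = [float(i + 200) for i in range(100)]

stable_psi = psi(stable_ref, stable_ref)     # identical distributions give 0.0
drifted_psi = psi(drifted_ref, drifted_cur)  # every value lands in the last bin
```

In this sketch the shifted column scores far above the 0.25 "significant shift" level listed in the interpretation table, while the unchanged column scores zero.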