gators.feature_selection package
Submodules
gators.feature_selection.information_value module
- gators.feature_selection.information_value.compute_iv(X, y, regularization=0.01)
Compute the Information Value (IV) for each categorical feature in the dataset.
To convert continuous features to categorical, consider using the binning module to create bins before computing IV.
- Parameters:
X (pl.DataFrame) – The input features.
y (pl.Series) – The target variable (binary).
regularization (float, default=0.01) – Regularization parameter to avoid division by zero in WOE/IV calculation.
- Returns:
A DataFrame containing the IV values for each feature.
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from gators.feature_selection import compute_iv
>>> X = pl.DataFrame({
...     "feature1": ["a", "a", "b", "c"],
...     "feature2": ["x", "x", "x", "y"],
...     "target": [1, 0, 1, 0]
... })
>>> iv = compute_iv(X.drop("target"), X["target"])
>>> print(iv)
shape: (2, 2)
┌──────────┬────────────┐
│ feature  │ iv         │
│ ---      │ ---        │
│ str      │ f64        │
╞══════════╪════════════╡
│ feature1 │ 0.693147   │
│ feature2 │ 0.287682   │
└──────────┴────────────┘
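For readers unfamiliar with the metric, the following is a minimal sketch of the textbook WOE/IV calculation for a single categorical feature, written in plain Python. It is not the gators implementation: the helper name `information_value` is hypothetical, and the way `regularization` is applied here (added to both class shares so that neither is zero) is an assumption, so the values it returns may differ from those of `compute_iv`.

```python
import math
from collections import Counter


def information_value(values, target, regularization=0.01):
    """Textbook IV of one categorical feature against a binary target.

    Hypothetical helper for illustration only; not part of gators.
    """
    # Count events (target == 1) and non-events (target == 0) per category.
    events = Counter(v for v, t in zip(values, target) if t == 1)
    non_events = Counter(v for v, t in zip(values, target) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    if total_events == 0 or total_non_events == 0:
        raise ValueError("target must contain both classes")

    iv = 0.0
    for category in set(values):
        # Assumption: the regularization term is added to each share,
        # which keeps the logarithm and the ratio well defined.
        p_event = events[category] / total_events + regularization
        p_non_event = non_events[category] / total_non_events + regularization
        woe = math.log(p_event / p_non_event)  # weight of evidence
        iv += (p_event - p_non_event) * woe    # each term is >= 0
    return iv


# A feature whose categories split the classes evenly carries no information.
uninformative = information_value(["a", "a", "b", "b"], [1, 0, 1, 0])
# A feature with a category seen only among non-events carries some.
informative = information_value(["x", "x", "x", "y"], [1, 0, 1, 0])
```

Because every term has the form (p - q) * ln(p / q), the sum is never negative, and it is zero only when the event and non-event distributions coincide across categories.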
gators.feature_selection.feature_stability_index module
- gators.feature_selection.feature_stability_index.feature_stability_index(estimator, skf, X, y, importance_threshold=0.0)
Compute Feature Stability Index (FSI) using repeated estimator feature importance.
Measures how consistently a feature is selected across different training folds. Higher FSI indicates more stable/reliable feature importance.
- Parameters:
estimator (estimator object) – Any estimator with a feature_importances_ attribute (e.g., XGBoost, RandomForest).
skf (sklearn fold splitter object) – Any sklearn fold splitter object (e.g., StratifiedKFold, KFold) for splitting the data.
X (DataFrame) – Feature DataFrame with shape (n_samples, n_features).
y (Series) – Target series for training.
importance_threshold (float) – Minimum importance value for a feature to be considered “selected” in a run. Must be between 0.0 and 1.0.
- Returns:
DataFrame with columns:
feature: Feature name
fsi: Feature Stability Index (0 to 1, higher is more stable)
importance: Average importance across all runs
Sorted by FSI and importance in descending order, filtered to fsi > 0.
- Return type:
pl.DataFrame
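The aggregation step described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the gators implementation: the helper `stability_index` is hypothetical, it takes per-fold importances that have already been computed (fitting the estimator on each fold is omitted), it treats a feature as "selected" when its importance is strictly greater than `importance_threshold`, and it takes FSI to be the fraction of folds in which the feature is selected.

```python
def stability_index(importances_per_fold, importance_threshold=0.0):
    """Aggregate per-fold importances into feature / fsi / importance rows.

    Hypothetical helper for illustration only; not part of gators.
    importances_per_fold: list of dicts mapping feature name -> importance,
    one dict per training fold.
    """
    n_folds = len(importances_per_fold)
    rows = []
    for feature in importances_per_fold[0]:
        scores = [fold[feature] for fold in importances_per_fold]
        # Assumption: "selected" means strictly above the threshold.
        n_selected = sum(score > importance_threshold for score in scores)
        rows.append({
            "feature": feature,
            "fsi": n_selected / n_folds,          # share of folds selected
            "importance": sum(scores) / n_folds,  # mean importance
        })
    # Keep features selected at least once, most stable first.
    rows = [row for row in rows if row["fsi"] > 0]
    rows.sort(key=lambda row: (row["fsi"], row["importance"]), reverse=True)
    return rows


# Made-up importances for three folds, purely for illustration.
folds = [
    {"age": 0.6, "income": 0.4, "noise": 0.0},
    {"age": 0.7, "income": 0.3, "noise": 0.0},
    {"age": 1.0, "income": 0.0, "noise": 0.0},
]
result = stability_index(folds)
```

In this toy input, `age` is selected in every fold, `income` in two of three, and `noise` never, so `noise` is dropped by the `fsi > 0` filter.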
Module contents
- gators.feature_selection.compute_iv(X, y, regularization=0.01)
Compute the Information Value (IV) for each categorical feature in the dataset.
To convert continuous features to categorical, consider using the binning module to create bins before computing IV.
- Parameters:
X (pl.DataFrame) – The input features.
y (pl.Series) – The target variable (binary).
regularization (float, default=0.01) – Regularization parameter to avoid division by zero in WOE/IV calculation.
- Returns:
A DataFrame containing the IV values for each feature.
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from gators.feature_selection import compute_iv
>>> X = pl.DataFrame({
...     "feature1": ["a", "a", "b", "c"],
...     "feature2": ["x", "x", "x", "y"],
...     "target": [1, 0, 1, 0]
... })
>>> iv = compute_iv(X.drop("target"), X["target"])
>>> print(iv)
shape: (2, 2)
┌──────────┬────────────┐
│ feature  │ iv         │
│ ---      │ ---        │
│ str      │ f64        │
╞══════════╪════════════╡
│ feature1 │ 0.693147   │
│ feature2 │ 0.287682   │
└──────────┴────────────┘
- gators.feature_selection.feature_stability_index(estimator, skf, X, y, importance_threshold=0.0)
Compute Feature Stability Index (FSI) using repeated estimator feature importance.
Measures how consistently a feature is selected across different training folds. Higher FSI indicates more stable/reliable feature importance.
- Parameters:
estimator (estimator object) – Any estimator with a feature_importances_ attribute (e.g., XGBoost, RandomForest).
skf (sklearn fold splitter object) – Any sklearn fold splitter object (e.g., StratifiedKFold, KFold) for splitting the data.
X (DataFrame) – Feature DataFrame with shape (n_samples, n_features).
y (Series) – Target series for training.
importance_threshold (float) – Minimum importance value for a feature to be considered “selected” in a run. Must be between 0.0 and 1.0.
- Returns:
DataFrame with columns:
feature: Feature name
fsi: Feature Stability Index (0 to 1, higher is more stable)
importance: Average importance across all runs
Sorted by FSI and importance in descending order, filtered to fsi > 0.
- Return type:
pl.DataFrame