gators.feature_generation package#
Module contents#
- class gators.feature_generation.IsNull[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Creates boolean features indicating whether values are null for specified columns.
- Parameters:
subset (Optional[List[str]], default=None) – List of column names to check for null values. If None, all columns in the DataFrame are used.
Examples
>>> from gators.feature_generation import IsNull >>> import polars as pl
>>> X ={'A': [1, None, 3, 4], ... 'B': [4, 3, None, 1], ... 'C': [1, 2, 1, 2]} >>> X = pl.DataFrame(X)
>>> transformer = IsNull(subset=['A', 'B']) >>> transformer.fit(X) IsNull(subset=['A', 'B']) >>> result = transformer.transform(X) >>> result shape: (4, 5) ┌──────┬──────┬─────┬──────────────┬──────────────┐ │ A │ B │ C │ A__is_null │ B__is_null │ │ i64 │ i64 │ i64 │ bool │ bool │ ├──────┼──────┼─────┼──────────────┼──────────────┤ │ 1 │ 4 │ 1 │ false │ false │ │ null │ 3 │ 2 │ true │ false │ │ 3 │ null │ 1 │ false │ true │ │ 4 │ 1 │ 2 │ false │ false │ └──────┴──────┴─────┴──────────────┴──────────────┘
- class gators.feature_generation.PolynomialFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates polynomial and interaction features.
- Parameters:
subset (Optional[List[str]], default=None) – Subset of columns to transform. If None, all columns except strings and booleans.
degree (int, default=2) – The degree of the polynomial features.
interaction_only (bool, default=False) – If True, only interaction features are produced.
include_bias (bool, default=True) – If True, include a bias column (column of ones).
Examples
Example 1: Degree 2 polynomial with bias term
>>> from gators.feature_generation import PolynomialFeatures >>> import polars as pl >>> X = pl.DataFrame({'A': [1, 2], 'B': [3, 4]}) >>> transformer = PolynomialFeatures(degree=2, include_bias=True) >>> transformer.fit(X) >>> transformer.transform(X) shape: (2, 6) ┌─────┬─────┬─────┬─────┬─────┬──────┐ │ A │ B │ A__A│ A__B│ B__B│ bias │ ├─────┼─────┼─────┼─────┼─────┼──────┤ │ 1 │ 3 │ 1 │ 3 │ 9 │ 1 │ │ 2 │ 4 │ 4 │ 8 │ 16 │ 1 │ └─────┴─────┴─────┴─────┴─────┴──────┘
Example 2: Polynomial on subset of columns
>>> transformer = PolynomialFeatures(subset=['A'], degree=2) >>> transformer.fit(X) >>> transformer.transform(X) shape: (2, 3) ┌─────┬─────┬─────┐ │ A │ B │ A__A│ ├─────┼─────┼─────┤ │ 1 │ 3 │ 1 │ │ 2 │ 4 │ 4 │ └─────┴─────┴─────┘
Example 3: Interaction features only
>>> transformer = PolynomialFeatures(degree=2, interaction_only=True) >>> transformer.fit(X) >>> transformer.transform(X) shape: (2, 3) ┌─────┬─────┬─────┐ │ A │ B │ A__B│ ├─────┼─────┼─────┤ │ 1 │ 3 │ 3 │ │ 2 │ 4 │ 8 │ └─────┴─────┴─────┘
- class gators.feature_generation.PlanRotationFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Creates new columns based on the plan rotation mapping.
The data should be composed of numerical columns only. Use gators.encoders to replace the categorical columns with numerical ones before using PlanRotationFeatures.
- Parameters:
subset (List[List[str]]) – List of column pairs defining the planes to rotate.
angles (List[float]) – List of rotation angles, in degrees.
Examples
Basic usage with plan rotation
Imports and initialization:
>>> from gators.feature_generation import PlanRotationFeatures >>> obj = PlanRotationFeatures( ... subset=[['X', 'Y'], ['X', 'Z']], angles=[45.0, 60.0])
The fit, transform, and fit_transform methods accept polars dataframes:
>>> import polars as pl >>> X = pl.DataFrame( ... {'X': [200.0, 210.0], 'Y': [140.0, 160.0], 'Z': [100.0, 125.0]})
The result is a transformed polars dataframe.
>>> obj.fit_transform(X) shape: (2, 9) ┌───────┬───────┬───────┬────────────┬───┬────────────┬────────────┬────────────┐ │ X ┆ Y ┆ Z ┆ XY_x_45.0… ┆ … ┆ XZ_y_45.0… ┆ XZ_x_60.0… ┆ XZ_y_60.0… │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 │ ╞═══════╪═══════╪═══════╪════════════╪═══╪════════════╪════════════╪════════════╡ │ 200.0 ┆ 140.0 ┆ 100.0 ┆ 42.426407 ┆ … ┆ 212.132034 ┆ 13.397460 ┆ 223.205081 │ │ 210.0 ┆ 160.0 ┆ 125.0 ┆ 35.355339 ┆ … ┆ 236.880772 ┆ -3.253175 ┆ 244.365335 │ └───────┴───────┴───────┴────────────┴───┴────────────┴────────────┴────────────┘
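The rotated coordinates in the output above are consistent with the standard 2-D rotation x' = x·cosθ - y·sinθ, y' = x·sinθ + y·cosθ (an inference from the example values, not a documented formula), which can be checked by hand:

```python
import math

def rotate(x, y, degrees):
    """Rotate the point (x, y) counter-clockwise by `degrees` about the origin."""
    theta = math.radians(degrees)
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# First row of the example: the (X, Z) = (200.0, 100.0) pair rotated by 60 degrees
# should reproduce XZ_x_60.0 and XZ_y_60.0 from the table above.
x60, y60 = rotate(200.0, 100.0, 60.0)
```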
- class gators.feature_generation.MathFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates new features by applying mathematical operations to groups of columns.
- Parameters:
groups (List[List[str]]) – List of groups of column names to apply operations on.
operations (List[str]) –
List of operations to apply to each group of columns. Available operations:
‘sum’: Sum of all columns
‘mean’: Mean of all columns
‘minus’: Subtraction (reduces columns left to right)
‘mul’: Product of all columns
‘div’: Division (reduces columns left to right)
‘min’: Minimum value across columns
‘max’: Maximum value across columns
‘std’: Standard deviation across columns
‘var’: Variance across columns
‘median’: Median across columns
‘range’: Range (max - min)
‘abs_diff’: Absolute difference (reduces columns left to right)
‘count_null’: Count of null values
‘count_zero’: Count of zero values
‘count_nonzero’: Count of non-zero values
Note: For division operations, consider using RatioFeatures instead, which provides safer division with automatic handling of division by zero and null values.
drop_columns (bool, optional) – Whether to drop the original columns after creating the new features, by default False.
new_column_names (Optional[List[str]], optional) – List of new column names for the created features, by default None.
Examples
>>> from gators.feature_generation import MathFeatures >>> import polars as pl
>>> X ={'A': [1, 2, 3, 4], ... 'B': [4, 3, 2, 1], ... 'C': [1, 2, 1, 2]} >>> X = pl.DataFrame(X)
Example 1: drop_columns=False
>>> transformer = MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum', 'mean']) >>> transformer.fit(X) MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum', 'mean']) >>> result = transformer.transform(X) >>> result shape: (4, 6) ┌─────┬─────┬─────┬────────┬─────────┬────────┐ │ A │ B │ C │ A_B_sum│ A_B_mean│ B_C_sum│ │ i64 │ i64 │ i64 │ f64 │ f64 │ f64 │ ├─────┼─────┼─────┼────────┼─────────┼────────┤ │ 1 │ 4 │ 1 │ 5.0 │ 2.5 │ 5.0 │ │ 2 │ 3 │ 2 │ 5.0 │ 2.5 │ 5.0 │ │ 3 │ 2 │ 1 │ 5.0 │ 2.5 │ 3.0 │ │ 4 │ 1 │ 2 │ 5.0 │ 2.5 │ 3.0 │ └─────┴─────┴─────┴────────┴─────────┴────────┘
Example 2: drop_columns=True
>>> transformer = MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum'], drop_columns=True) >>> transformer.fit(X) MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum'], drop_columns=True) >>> result = transformer.transform(X) >>> result shape: (4, 2) ┌────────┬────────┐ │ A_B_sum│ B_C_sum│ │ f64 │ f64 │ ├────────┼────────┤ │ 5.0 │ 5.0 │ │ 5.0 │ 5.0 │ │ 5.0 │ 3.0 │ │ 5.0 │ 3.0 │ └────────┴────────┘
- class gators.feature_generation.RatioFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates ratio features by dividing numerator columns by denominator columns.
This transformer creates ratio features in a 1-to-1 pairing between numerator and denominator columns. Division by zero is handled by replacing the result with null values.
- Parameters:
numerator_columns (List[str]) – List of column names to use as numerators.
denominator_columns (List[str]) – List of column names to use as denominators. Must have the same length as numerator_columns.
new_column_names (Optional[List[str]], optional) – List of custom names for the ratio features. If None, names will be automatically generated as ‘{numerator}__div__{denominator}’, by default None.
drop_columns (bool, optional) – Whether to drop the original numerator and denominator columns after creating ratios, by default False.
Examples
>>> from gators.feature_generation import RatioFeatures >>> import polars as pl
>>> X = pl.DataFrame({ ... 'revenue': [100, 200, 300, 400], ... 'cost': [80, 100, 150, 0], ... 'clicks': [1000, 2000, 3000, 4000], ... 'impressions': [10000, 20000, 30000, 40000] ... })
Example 1: Basic ratio features
>>> transformer = RatioFeatures( ... numerator_columns=['revenue', 'clicks'], ... denominator_columns=['cost', 'impressions'] ... ) >>> transformer.fit(X) RatioFeatures(numerator_columns=['revenue', 'clicks'], denominator_columns=['cost', 'impressions']) >>> result = transformer.transform(X) >>> result shape: (4, 6) ┌─────────┬──────┬────────┬─────────────┬────────────────────┬─────────────────────────┐ │ revenue │ cost │ clicks │ impressions │ revenue__div__cost │ clicks__div__impressions│ │ i64 │ i64 │ i64 │ i64 │ f64 │ f64 │ ├─────────┼──────┼────────┼─────────────┼────────────────────┼─────────────────────────┤ │ 100 │ 80 │ 1000 │ 10000 │ 1.25 │ 0.1 │ │ 200 │ 100 │ 2000 │ 20000 │ 2.0 │ 0.1 │ │ 300 │ 150 │ 3000 │ 30000 │ 2.0 │ 0.1 │ │ 400 │ 0 │ 4000 │ 40000 │ null │ 0.1 │ └─────────┴──────┴────────┴─────────────┴────────────────────┴─────────────────────────┘
Example 2: Custom column names
>>> transformer = RatioFeatures( ... numerator_columns=['revenue'], ... denominator_columns=['cost'], ... new_column_names=['profit_margin'] ... ) >>> result = transformer.fit_transform(X) >>> result shape: (4, 5) ┌─────────┬──────┬────────┬─────────────┬───────────────┐ │ revenue │ cost │ clicks │ impressions │ profit_margin │ │ i64 │ i64 │ i64 │ i64 │ f64 │ ├─────────┼──────┼────────┼─────────────┼───────────────┤ │ 100 │ 80 │ 1000 │ 10000 │ 1.25 │ │ 200 │ 100 │ 2000 │ 20000 │ 2.0 │ │ 300 │ 150 │ 3000 │ 30000 │ 2.0 │ │ 400 │ 0 │ 4000 │ 40000 │ null │ └─────────┴──────┴────────┴─────────────┴───────────────┘
Example 3: With drop_columns=True
>>> transformer = RatioFeatures( ... numerator_columns=['revenue'], ... denominator_columns=['cost'], ... drop_columns=True ... ) >>> result = transformer.fit_transform(X) >>> result shape: (4, 3) ┌────────┬─────────────┬────────────────────┐ │ clicks │ impressions │ revenue__div__cost │ │ i64 │ i64 │ f64 │ ├────────┼─────────────┼────────────────────┤ │ 1000 │ 10000 │ 1.25 │ │ 2000 │ 20000 │ 2.0 │ │ 3000 │ 30000 │ 2.0 │ │ 4000 │ 40000 │ null │ └────────┴─────────────┴────────────────────┘
Example 4: Handling null values
>>> X_with_nulls = pl.DataFrame({ ... 'A': [10, None, 30, 40], ... 'B': [2, 5, None, 0] ... }) >>> transformer = RatioFeatures( ... numerator_columns=['A'], ... denominator_columns=['B'] ... ) >>> result = transformer.fit_transform(X_with_nulls) >>> result shape: (4, 3) ┌──────┬──────┬──────────────┐ │ A │ B │ A__div__B │ │ i64 │ i64 │ f64 │ ├──────┼──────┼──────────────┤ │ 10 │ 2 │ 5.0 │ │ null │ 5 │ null │ │ 30 │ null │ null │ │ 40 │ 0 │ null │ └──────┴──────┴──────────────┘
- class gators.feature_generation.GroupScalingFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates group-based scaling features for numerical columns.
This transformer creates features like:
value / group_mean (most common: relative position vs average)
value / group_median (robust to outliers)
(value - group_mean) / group_std (z-score: standardized deviation)
(value - group_min) / (group_max - group_min) (min-max: 0-1 normalization)
Importance for Fraud Detection#
Group scaling features are particularly valuable in fraud detection because they capture relative deviations from group-level behavior patterns. Fraudulent transactions often exhibit unusual characteristics compared to the typical behavior within their segments.
mean/median ratios: Show multiplicative deviation (e.g., 10x the group average)
zscore: Quantifies how many standard deviations away from group mean (e.g., 3σ anomaly)
minmax: Shows relative position within observed range (0=min, 1=max, handles negatives)
These features are especially powerful when combined with various grouping dimensions (e.g., by merchant, customer segment, time of day, or geographic location) to capture different aspects of abnormal behavior.
- Parameters:
subset (List[str]) – List of numerical column names to transform.
by (List[str]) – List of column names to use for groupby operations. Each column will be used for a separate groupby operation (e.g., [‘cat1’, ‘cat2’] creates features grouped by cat1 and separate features grouped by cat2).
func (List[str]) – List of scaling functions to apply. Available options: - ‘mean’: value / group_mean (relative position vs average) - ‘median’: value / group_median (robust to outliers) - ‘zscore’: (value - group_mean) / group_std (standardized deviation) - ‘minmax’: (value - group_min) / (group_max - group_min) (0-1 normalization)
fill_value (float, default=0.0) – Value to use when the denominator is zero or null (safe division/scaling).
drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating scaled features.
new_column_names (Optional[List[str]], default=None) – List of custom names for the scaled feature columns. If None, uses default naming pattern ‘{num_col}__{func}_{groupby_col}’. Must have same length as the total number of features created (subset × by × func).
Examples
>>> from gators.feature_generation import GroupScalingFeatures >>> import polars as pl
>>> X ={ ... 'amount': [100, 200, 150, 300, 250], ... 'cat1': ['A', 'A', 'B', 'B', 'A'], ... 'cat2': ['X', 'Y', 'X', 'X', 'X'] ... } >>> X = pl.DataFrame(X)
Example 1: Single groupby column with multiple scaling functions
>>> transformer = GroupScalingFeatures( ... subset=['amount'], ... by=['cat1'], ... func=['mean', 'zscore'] ... ) >>> transformer.fit(X) GroupScalingFeatures(subset=['amount'], by=['cat1'], func=['mean', 'zscore']) >>> result = transformer.transform(X) >>> result shape: (5, 5) ┌────────┬──────┬──────┬──────────────────┬────────────────────┐ │ amount ┆ cat1 ┆ cat2 ┆ amount__mean_cat1 ┆ amount__zscore_cat1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str ┆ f64 ┆ f64 │ ╞════════╪══════╪══════╪══════════════════╪════════════════════╡ │ 100 ┆ A ┆ X ┆ 0.545455 ┆ -1.069045 │ │ 200 ┆ A ┆ Y ┆ 1.090909 ┆ 0.267261 │ │ 150 ┆ B ┆ X ┆ 0.666667 ┆ -0.707107 │ │ 300 ┆ B ┆ X ┆ 1.333333 ┆ 0.707107 │ │ 250 ┆ A ┆ X ┆ 1.363636 ┆ 0.801784 │ └────────┴──────┴──────┴──────────────────┴────────────────────┘
Example 2: Multiple groupby columns
>>> X ={ ... 'amount': [100, 200, 150, 300], ... 'value': [50, 100, 75, 150], ... 'cat1': ['A', 'A', 'B', 'B'], ... 'cat2': ['X', 'Y', 'X', 'Y'] ... } >>> X = pl.DataFrame(X) >>> transformer = GroupScalingFeatures( ... subset=['amount'], ... by=['cat1', 'cat2'], ... func=['mean'] ... ) >>> result = transformer.fit_transform(X) >>> result.columns ['amount', 'value', 'cat1', 'cat2', 'amount__mean_cat1', 'amount__mean_cat2'] # Creates separate features grouped by cat1 and grouped by cat2
Example 3: Min-max scaling
>>> X ={ ... 'amount': [100, 200, 150, 300], ... 'cat1': ['A', 'A', 'B', 'B'] ... } >>> X = pl.DataFrame(X) >>> transformer = GroupScalingFeatures( ... subset=['amount'], ... by=['cat1'], ... func=['minmax'] ... ) >>> result = transformer.fit_transform(X) >>> result shape: (4, 3) ┌────────┬──────┬─────────────────────┐ │ amount ┆ cat1 ┆ amount__minmax_cat1 │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 │ ╞════════╪══════╪═════════════════════╡ │ 100 ┆ A ┆ 0.0 │ │ 200 ┆ A ┆ 1.0 │ │ 150 ┆ B ┆ 0.0 │ │ 300 ┆ B ┆ 1.0 │ └────────┴──────┴─────────────────────┘
- class gators.feature_generation.GroupStatisticsFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates statistical aggregation features based on group-level computations.
Unlike GroupScalingFeatures, which divides values by group statistics, this transformer adds the group statistics directly as new columns.
- Parameters:
subset (List[str]) – List of numerical column names to aggregate.
by (List[str]) – List of column names to use for groupby operations. Each column will be used for a separate groupby operation (e.g., [‘cat1’, ‘cat2’] creates features grouped by cat1 and separate features grouped by cat2).
func (List[str]) – List of aggregation functions to apply. Available options: - ‘mean’: Group mean - ‘std’: Group standard deviation - ‘median’: Group median - ‘min’: Group minimum - ‘max’: Group maximum - ‘sum’: Group sum - ‘count’: Group count - ‘range’: Group range (max - min)
drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating statistics.
new_column_names (Optional[List[str]], default=None) – List of custom names for the statistic columns. If None, uses default naming pattern ‘{agg}_{num_col}__per_{groupby_col}’. Must have same length as the total number of features created (subset × by × func).
Examples
>>> from gators.feature_generation import GroupStatisticsFeatures >>> import polars as pl
>>> X ={ ... 'amount': [100, 200, 150, 300, 250], ... 'cat1': ['A', 'A', 'B', 'B', 'A'], ... 'cat2': ['X', 'Y', 'X', 'X', 'X'] ... } >>> X = pl.DataFrame(X)
Example 1: Basic group statistics
>>> transformer = GroupStatisticsFeatures( ... subset=['amount'], ... by=['cat1'], ... func=['mean', 'count'] ... ) >>> result = transformer.fit_transform(X) >>> result shape: (5, 5) ┌────────┬───────┬───────┬───────────────────────┬────────────────────────┐ │ amount ┆ cat1 ┆ cat2 ┆ mean_amount__per_cat1 ┆ count_amount__per_cat1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str ┆ f64 ┆ u32 │ ╞════════╪═══════╪═══════╪═══════════════════════╪════════════════════════╡ │ 100 ┆ A ┆ X ┆ 183.333333 ┆ 3 │ │ 200 ┆ A ┆ Y ┆ 183.333333 ┆ 3 │ │ 150 ┆ B ┆ X ┆ 225.0 ┆ 2 │ │ 300 ┆ B ┆ X ┆ 225.0 ┆ 2 │ │ 250 ┆ A ┆ X ┆ 183.333333 ┆ 3 │ └────────┴───────┴───────┴───────────────────────┴────────────────────────┘
Example 2: Multiple groupby columns
>>> X ={ ... 'amount': [100, 200, 150, 300], ... 'cat1': ['A', 'A', 'B', 'B'], ... 'cat2': ['X', 'Y', 'X', 'Y'] ... } >>> X = pl.DataFrame(X) >>> transformer = GroupStatisticsFeatures( ... subset=['amount'], ... by=['cat1', 'cat2'], ... func=['mean'] ... ) >>> result = transformer.fit_transform(X) >>> result.columns ['amount', 'cat1', 'cat2', 'mean_amount__per_cat1', 'mean_amount__per_cat2'] # Creates separate features grouped by cat1 and grouped by cat2
Example 3: Multiple func
>>> transformer = GroupStatisticsFeatures( ... subset=['amount'], ... by=['cat1'], ... func=['mean', 'std', 'min', 'max'] ... ) >>> result = transformer.fit_transform(X) >>> result.columns ['amount', 'cat1', 'cat2', 'mean_amount__per_cat1', 'std_amount__per_cat1', 'min_amount__per_cat1', 'max_amount__per_cat1']
- class gators.feature_generation.GroupLagFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates lag (previous values) and lead (next values) features within groups.
This transformer creates features like:
Previous transaction amount for this card
Next transaction amount for this card
Value N periods ago within group
Useful for time-series analysis and detecting changes in behavior patterns.
- Parameters:
subset (List[str]) – List of numerical column names to create lag/lead features for.
by (List[str]) – List of columns to group by. Lags/leads are computed within each group.
lags (List[int]) – List of lag periods. Positive integers create lag features (previous values). Example: [1, 2, 3] creates lag_1, lag_2, lag_3
leads (List[int], default=[]) – List of lead periods. Positive integers create lead features (next values). Example: [1, 2] creates lead_1, lead_2
fill_value (Optional[float], default=None) – Value to use for missing lag/lead values. If None, uses null.
drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating lag features.
new_column_names (Optional[List[str]], default=None) – List of custom names for the lag/lead columns. If None, uses default naming pattern ‘{num_col}_lag{n}_{groupby_cols}’ or ‘{num_col}_lead{n}_{groupby_cols}’. Must have same length as the total number of features created.
Examples
>>> from gators.feature_generation import GroupLagFeatures >>> import polars as pl
>>> X ={ ... 'amount': [100, 200, 150, 300, 250, 180], ... 'cat1': ['A', 'A', 'B', 'B', 'A', 'B'], ... 'time': [1, 2, 1, 2, 3, 3] ... } >>> X = pl.DataFrame(X).sort(['cat1', 'time'])
Example 1: Basic lag features
>>> transformer = GroupLagFeatures( ... subset=['amount'], ... by=['cat1'], ... lags=[1, 2] ... ) >>> result = transformer.fit_transform(X) >>> result shape: (6, 5) ┌────────┬───────┬──────┬─────────────────────┬─────────────────────┐ │ amount ┆ cat1 ┆ time ┆ amount_lag1_cat1 ┆ amount_lag2_cat1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪═══════╪══════╪═════════════════════╪═════════════════════╡ │ 100 ┆ A ┆ 1 ┆ null ┆ null │ │ 200 ┆ A ┆ 2 ┆ 100 ┆ null │ │ 250 ┆ A ┆ 3 ┆ 200 ┆ 100 │ │ 150 ┆ B ┆ 1 ┆ null ┆ null │ │ 300 ┆ B ┆ 2 ┆ 150 ┆ null │ │ 180 ┆ B ┆ 3 ┆ 300 ┆ 150 │ └────────┴───────┴──────┴─────────────────────┴─────────────────────┘
Example 2: Lag and lead features
>>> transformer = GroupLagFeatures( ... subset=['amount'], ... by=['cat1'], ... lags=[1], ... leads=[1] ... ) >>> result = transformer.fit_transform(X) >>> result shape: (6, 5) ┌────────┬───────┬──────┬───────────────────┬────────────────────┐ │ amount ┆ cat1 ┆ time ┆ amount_lag1_cat1 ┆ amount_lead1_cat1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪═══════╪══════╪═══════════════════╪════════════════════╡ │ 100 ┆ A ┆ 1 ┆ null ┆ 200 │ │ 200 ┆ A ┆ 2 ┆ 100 ┆ 250 │ │ 250 ┆ A ┆ 3 ┆ 200 ┆ null │ │ 150 ┆ B ┆ 1 ┆ null ┆ 300 │ │ 300 ┆ B ┆ 2 ┆ 150 ┆ 180 │ │ 180 ┆ B ┆ 3 ┆ 300 ┆ null │ └────────┴───────┴──────┴───────────────────┴────────────────────┘
Example 3: With fill_value
>>> transformer = GroupLagFeatures( ... subset=['amount'], ... by=['cat1'], ... lags=[1], ... fill_value=0.0 ... ) >>> result = transformer.fit_transform(X) >>> result['amount_lag1_cat1'][0] # First row, no previous value 0.0
Notes
Data should be sorted by the by columns and time before transformation
Lag features look backwards: lag_1 is the previous row within the group
Lead features look forwards: lead_1 is the next row within the group
First rows in each group will have null (or fill_value) for lag features
Last rows in each group will have null (or fill_value) for lead features
- class gators.feature_generation.ComparisonFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates binary comparison features between pairs of columns, or unary null checks.
- Parameters:
subset_a (List[str]) – List of column names for the left side of comparisons (or the only column for unary operators).
subset_b (List[str]) – List of column names for the right side of comparisons. For unary operators (‘is_null’, ‘is_not_null’), these values are ignored.
operators (List[Literal[">", "<", ">=", "<=", "==", "!=", "is_null", "is_not_null"]]) – List of comparison operators to apply. Must match length of columns. Unary operators: ‘is_null’, ‘is_not_null’ (only use subset_a) Binary operators: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’ (use both subset_a and subset_b)
drop_columns (bool, default=False) – Whether to drop the original columns after creating comparisons.
Examples
>>> from gators.feature_generation import ComparisonFeatures >>> import polars as pl
>>> X ={'A': [10, 20, 30, 40], ... 'B': [15, 10, 30, 35], ... 'C': [5, 25, 20, 50]} >>> X = pl.DataFrame(X)
Example 1: Single comparison
>>> transformer = ComparisonFeatures( ... subset_a=['A'], ... subset_b=['B'], ... operators=['>'] ... ) >>> transformer.fit(X) ComparisonFeatures(subset_a=['A'], subset_b=['B'], operators=['>']) >>> result = transformer.transform(X) >>> result shape: (4, 4) ┌──────┬──────┬──────┬─────────┐ │ A │ B │ C │ A_gt_B │ │ i64 │ i64 │ i64 │ bool │ ├──────┼──────┼──────┼─────────┤ │ 10 │ 15 │ 5 │ false │ │ 20 │ 10 │ 25 │ true │ │ 30 │ 30 │ 20 │ false │ │ 40 │ 35 │ 50 │ true │ └──────┴──────┴──────┴─────────┘
Example 2: Multiple comparisons with different operators
>>> transformer = ComparisonFeatures( ... subset_a=['A', 'B', 'A'], ... subset_b=['B', 'C', 'C'], ... operators=['>', '<', '>='] ... ) >>> result = transformer.fit_transform(X) >>> result shape: (4, 6) ┌──────┬──────┬──────┬─────────┬─────────┬─────────┐ │ A │ B │ C │ A_gt_B │ B_lt_C │ A_gte_C │ │ i64 │ i64 │ i64 │ bool │ bool │ bool │ ├──────┼──────┼──────┼─────────┼─────────┼─────────┤ │ 10 │ 15 │ 5 │ false │ false │ true │ │ 20 │ 10 │ 25 │ true │ true │ false │ │ 30 │ 30 │ 20 │ false │ false │ true │ │ 40 │ 35 │ 50 │ true │ true │ false │ └──────┴──────┴──────┴─────────┴─────────┴─────────┘
Example 3: Null checks (unary operators)
>>> data_with_nulls = pl.DataFrame({ ... 'A': [10, None, 30, None], ... 'B': [15, 10, None, 35] ... }) >>> transformer = ComparisonFeatures( ... subset_a=['A', 'B'], ... subset_b=['', ''], # Ignored for unary operators ... operators=['is_null', 'is_not_null'] ... ) >>> result = transformer.fit_transform(data_with_nulls) >>> result shape: (4, 4) ┌──────┬──────┬────────────┬────────────────┐ │ A │ B │ A__is_null │ B__is_not_null │ │ i64 │ i64 │ bool │ bool │ ├──────┼──────┼────────────┼────────────────┤ │ 10 │ 15 │ false │ true │ │ null │ 10 │ true │ true │ │ 30 │ null │ false │ false │ │ null │ 35 │ true │ true │ └──────┴──────┴────────────┴────────────────┘
Example 4: With drop_columns=True
>>> transformer = ComparisonFeatures( ... subset_a=['A'], ... subset_b=['B'], ... operators=['>'], ... drop_columns=True ... ) >>> result = transformer.fit_transform(X) >>> result shape: (4, 2) ┌──────┬─────────┐ │ C │ A_gt_B │ │ i64 │ bool │ ├──────┼─────────┤ │ 5 │ false │ │ 25 │ true │ │ 20 │ false │ │ 50 │ true │ └──────┴─────────┘
- class gators.feature_generation.ConditionFeatures[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Creates multiple independent boolean features, one for each condition.
This transformer is designed for creating simple boolean flags without combination logic. Each condition produces exactly one boolean output column. For combining multiple conditions with AND/OR logic, use RuleFeatures instead.
Use Cases:
Create simple boolean flags (is_adult, is_weekend, is_premium, etc.)
Materialize threshold-based features (is_high_value, is_frequent_user)
Feature engineering: Generate independent indicator variables
Fraud detection: Create simple risk flags before combining them
When to Use:
Need multiple independent boolean columns
Each condition stands alone (no AND/OR combination needed)
Want cleaner API than RuleFeatures for simple cases
Building feature sets for downstream transformers
When NOT to Use:
Need to combine conditions with AND/OR (use RuleFeatures)
One-off exploratory analysis (use Polars native expressions)
Very simple cases with 1-2 conditions (just use .with_columns())
- Parameters:
conditions (List[Dict[str, Any]]) –
List of condition dictionaries. Each condition creates one boolean output column.
Each condition dictionary must contain:
’column’: str - Name of the column to evaluate
’op’: str - Comparison operator. Supported:
Binary: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’ (require ‘value’ or ‘other_column’)
Unary: ‘is_null’, ‘is_not_null’ (no ‘value’ or ‘other_column’ needed)
’value’: Any (optional) - Scalar value to compare the column against
’other_column’: str (optional) - Name of another column to compare against
For binary operators: Either ‘value’ or ‘other_column’ must be specified, but not both. For unary operators: Neither ‘value’ nor ‘other_column’ should be specified.
Examples:
# Simple conditions:
[{'column': 'age', 'op': '>=', 'value': 18}, {'column': 'amount', 'op': '>', 'value': 1000}]
# Column comparison:
[{'column': 'velocity_24h', 'op': '>', 'other_column': 'velocity_7d'}]
# Null checks:
[{'column': 'age', 'op': 'is_null'}, {'column': 'email', 'op': 'is_not_null'}]
new_column_names (Optional[List[str]], default=None) – Names for the resulting boolean feature columns. If provided, must have the same length as conditions. If None, column names are auto-generated in the format:
Scalar comparison: {column}_{op_name}_{value} (e.g., ‘age_gte_18’)
Column comparison: {column}_{op_name}_{other_column} (e.g., ‘velocity_24h_gt_velocity_7d’)
Unary operation: {column}__{op_name} (e.g., ‘age__is_null’)
Operator name mapping:
‘>’ -> ‘gt’
‘<’ -> ‘lt’
‘>=’ -> ‘gte’
‘<=’ -> ‘lte’
‘==’ -> ‘eq’
‘!=’ -> ‘ne’
‘is_null’ -> ‘is_null’
‘is_not_null’ -> ‘is_not_null’
Examples
>>> import polars as pl >>> from gators.feature_generation import ConditionFeatures
>>> X ={ ... 'age': [15, 25, 30, 17, 45], ... 'amount': [100, 1500, 500, 200, 2000], ... 'family_size': [1, 3, 1, 4, 2], ... 'fare': [50, 75, 30, 100, 80] ... } >>> X = pl.DataFrame(X)
Example 1: Create simple boolean flags
>>> transformer = ConditionFeatures( ... conditions=[ ... {'column': 'age', 'op': '>=', 'value': 18}, ... {'column': 'amount', 'op': '>', 'value': 1000}, ... {'column': 'family_size', 'op': '==', 'value': 1} ... ], ... new_column_names=['is_adult', 'is_high_amount', 'is_alone'] ... ) >>> result = transformer.fit_transform(X) >>> result.select(['age', 'amount', 'family_size', 'is_adult', 'is_high_amount', 'is_alone']) shape: (5, 6) ┌─────┬────────┬─────────────┬──────────┬─────────────────┬──────────┐ │ age ┆ amount ┆ family_size ┆ is_adult ┆ is_high_amount ┆ is_alone │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ bool ┆ bool ┆ bool │ ╞═════╪════════╪═════════════╪══════════╪═════════════════╪══════════╡ │ 15 ┆ 100 ┆ 1 ┆ false ┆ false ┆ true │ │ 25 ┆ 1500 ┆ 3 ┆ true ┆ true ┆ false │ │ 30 ┆ 500 ┆ 1 ┆ true ┆ false ┆ true │ │ 17 ┆ 200 ┆ 4 ┆ false ┆ false ┆ false │ │ 45 ┆ 2000 ┆ 2 ┆ true ┆ true ┆ false │ └─────┴────────┴─────────────┴──────────┴─────────────────┴──────────┘
Example 2: Column-to-column comparison
>>> fare_X = { ... 'fare': [50.0, 100.0, 30.0, 200.0, 80.0], ... 'fare_per_person': [50.0, 33.3, 30.0, 50.0, 40.0] ... } >>> fare_X = pl.DataFrame(fare_X) >>> fare_transformer = ConditionFeatures( ... conditions=[ ... {'column': 'fare', 'op': '>', 'value': 100}, ... {'column': 'fare_per_person', 'op': '>', 'other_column': 'fare'} ... ], ... new_column_names=['is_expensive', 'paid_more_per_person'] ... ) >>> result = fare_transformer.fit_transform(fare_X) >>> result shape: (5, 4) ┌───────┬──────────────────┬──────────────┬──────────────────────┐ │ fare ┆ fare_per_person ┆ is_expensive ┆ paid_more_per_person │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ bool ┆ bool │ ╞═══════╪══════════════════╪══════════════╪══════════════════════╡ │ 50.0 ┆ 50.0 ┆ false ┆ false │ │ 100.0 ┆ 33.3 ┆ false ┆ false │ │ 30.0 ┆ 30.0 ┆ false ┆ false │ │ 200.0 ┆ 50.0 ┆ true ┆ false │ │ 80.0 ┆ 40.0 ┆ false ┆ false │ └───────┴──────────────────┴──────────────┴──────────────────────┘
Example 3: Titanic-style feature engineering
>>> titanic_X = { ... 'Age': [22.0, 38.0, 26.0, 35.0, 12.0], ... 'Pclass': [3, 1, 3, 1, 3], ... 'SibSp': [1, 1, 0, 1, 0], ... 'Parch': [0, 0, 0, 0, 1] ... } >>> titanic_X = pl.DataFrame(titanic_X) >>> # First add family_size >>> titanic_X = titanic_X.with_columns( ... (pl.col('SibSp') + pl.col('Parch')).alias('family_size') ... ) >>> titanic_transformer = ConditionFeatures( ... conditions=[ ... {'column': 'Age', 'op': '<', 'value': 18}, ... {'column': 'Pclass', 'op': '==', 'value': 1}, ... {'column': 'family_size', 'op': '==', 'value': 0} ... ], ... new_column_names=['is_child', 'is_first_class', 'is_alone'] ... ) >>> result = titanic_transformer.fit_transform(titanic_X) >>> result.select(['Age', 'Pclass', 'family_size', 'is_child', 'is_first_class', 'is_alone']) shape: (5, 6) ┌──────┬────────┬─────────────┬──────────┬────────────────┬──────────┐ │ Age ┆ Pclass ┆ family_size ┆ is_child ┆ is_first_class ┆ is_alone │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ i64 ┆ i64 ┆ bool ┆ bool ┆ bool │ ╞══════╪════════╪═════════════╪══════════╪════════════════╪══════════╡ │ 22.0 ┆ 3 ┆ 1 ┆ false ┆ false ┆ false │ │ 38.0 ┆ 1 ┆ 1 ┆ false ┆ true ┆ false │ │ 26.0 ┆ 3 ┆ 0 ┆ false ┆ false ┆ true │ │ 35.0 ┆ 1 ┆ 1 ┆ false ┆ true ┆ false │ │ 12.0 ┆ 3 ┆ 1 ┆ true ┆ false ┆ false │ └──────┴────────┴─────────────┴──────────┴────────────────┴──────────┘
Example 4: Auto-generated column names
>>> auto_transformer = ConditionFeatures( ... conditions=[ ... {'column': 'age', 'op': '>=', 'value': 18}, ... {'column': 'amount', 'op': '>', 'value': 1000}, ... {'column': 'family_size', 'op': '==', 'value': 1} ... ] ... # new_column_names not specified - will be auto-generated ... ) >>> result = auto_transformer.fit_transform(X) >>> result.select(['age', 'amount', 'family_size', 'age_gte_18', 'amount_gt_1000', 'family_size_eq_1']) shape: (5, 6) ┌─────┬────────┬─────────────┬────────────┬────────────────┬──────────────────┐ │ age ┆ amount ┆ family_size ┆ age_gte_18 ┆ amount_gt_1000 ┆ family_size_eq_1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ bool ┆ bool ┆ bool │ ╞═════╪════════╪═════════════╪════════════╪════════════════╪══════════════════╡ │ 15 ┆ 100 ┆ 1 ┆ false ┆ false ┆ true │ │ 25 ┆ 1500 ┆ 3 ┆ true ┆ true ┆ false │ │ 30 ┆ 500 ┆ 1 ┆ true ┆ false ┆ true │ │ 17 ┆ 200 ┆ 4 ┆ false ┆ false ┆ false │ │ 45 ┆ 2000 ┆ 2 ┆ true ┆ true ┆ false │ └─────┴────────┴─────────────┴────────────┴────────────────┴──────────────────┘
Example 5: Null checks (unary operators)
>>> data_with_nulls = {
...     'age': [25, None, 30, 17, None],
...     'email': ['a@test.com', 'b@test.com', None, 'd@test.com', None],
...     'amount': [100, 1500, 500, 200, 2000]
... }
>>> X_nulls = pl.DataFrame(data_with_nulls)
>>> null_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': 'is_null'},
...         {'column': 'email', 'op': 'is_not_null'},
...         {'column': 'amount', 'op': '>', 'value': 1000}
...     ],
...     new_column_names=['age_missing', 'has_email', 'is_high_amount']
... )
>>> result = null_transformer.fit_transform(X_nulls)
>>> result
shape: (5, 6)
┌──────┬────────────┬────────┬─────────────┬───────────┬────────────────┐
│ age  ┆ email      ┆ amount ┆ age_missing ┆ has_email ┆ is_high_amount │
│ ---  ┆ ---        ┆ ---    ┆ ---         ┆ ---       ┆ ---            │
│ i64  ┆ str        ┆ i64    ┆ bool        ┆ bool      ┆ bool           │
╞══════╪════════════╪════════╪═════════════╪═══════════╪════════════════╡
│ 25   ┆ a@test.com ┆ 100    ┆ false       ┆ true      ┆ false          │
│ null ┆ b@test.com ┆ 1500   ┆ true        ┆ true      ┆ true           │
│ 30   ┆ null       ┆ 500    ┆ false       ┆ false     ┆ false          │
│ 17   ┆ d@test.com ┆ 200    ┆ false       ┆ true      ┆ false          │
│ null ┆ null       ┆ 2000   ┆ true        ┆ false     ┆ true           │
└──────┴────────────┴────────┴─────────────┴───────────┴────────────────┘
Notes
Each condition produces exactly one independent boolean column
Auto-naming: If new_column_names is None, names are auto-generated as:
Scalar: {column}_{op_name}_{value} (e.g., ‘age_gte_18’)
Column-to-column: {column}_{op_name}_{other_column} (e.g., ‘velocity_24h_gt_velocity_7d’)
Unary: {column}__{op_name} (e.g., ‘age__is_null’)
No combination logic - use RuleFeatures if you need AND/OR
Simpler API than RuleFeatures for common use cases
Missing values (null) in comparisons typically result in null/false
Unary operators ‘is_null’ and ‘is_not_null’ explicitly check for null values
Can be used as preprocessing step before RuleFeatures for complex logic
See also
RuleFeatures – For combining multiple conditions with AND/OR logic
- class gators.feature_generation.DistanceFeatures[source]#
Bases:
BaseModel, BaseEstimator, TransformerMixin

Calculates distances between geographic coordinate pairs.
This transformer computes distances between consecutive pairs of latitude/longitude coordinates using different distance metrics (euclidean, manhattan, haversine) and units (km, miles, meters, feet).
For fraud detection, distance features are valuable for:
Detecting location anomalies (billing vs shipping address distance)
Identifying suspicious IP geolocation patterns
Flagging transactions far from customer’s typical location
Calculating travel feasibility (transaction velocity checks)
- Parameters:
lats (List[str]) – List of latitude column names. Must have at least 2 elements. Coordinates are paired sequentially: (lats[0], longs[0]) to (lats[1], longs[1]), etc.
longs (List[str]) – List of longitude column names. Must have same length as lats.
unit (Literal["km", "miles", "meters", "feet"], default="km") – Unit for distance output.
method (Literal["euclidean", "manhattan", "haversine"], default="haversine") – Distance calculation method:
‘haversine’: Great-circle distance on a sphere (recommended for lat/long)
‘euclidean’: Straight-line distance
‘manhattan’: Sum of absolute differences (taxicab distance)
drop_columns (bool, default=True) – Whether to drop the original coordinate columns.
new_column_names (Optional[List[str]], default=None) – Custom names for distance columns. If None, uses pattern: ‘distance__{lat1}_to_{lat2}__{method}_{unit}’
Examples
>>> from gators.feature_generation import DistanceFeatures
>>> import polars as pl
Example 1: Haversine distance between two locations
>>> X = pl.DataFrame({
...     'billing_lat': [40.7128, 34.0522, 41.8781],
...     'billing_long': [-74.0060, -118.2437, -87.6298],
...     'shipping_lat': [40.7580, 34.0522, 42.3601],
...     'shipping_long': [-73.9855, -118.2437, -71.0589]
... })
>>> transformer = DistanceFeatures(
...     lats=['billing_lat', 'shipping_lat'],
...     longs=['billing_long', 'shipping_long'],
...     method='haversine',
...     unit='km'
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['distance__billing_lat_to_shipping_lat__haversine_km']
>>> result['distance__billing_lat_to_shipping_lat__haversine_km'][0]
5.376...
Example 2: Multiple distance pairs
>>> X = pl.DataFrame({
...     'home_lat': [40.7128, 34.0522],
...     'home_long': [-74.0060, -118.2437],
...     'work_lat': [40.7580, 34.0700],
...     'work_long': [-73.9855, -118.3000],
...     'shop_lat': [40.7489, 34.0800],
...     'shop_long': [-73.9680, -118.3500]
... })
>>> transformer = DistanceFeatures(
...     lats=['home_lat', 'work_lat', 'shop_lat'],
...     longs=['home_long', 'work_long', 'shop_long'],
...     method='haversine',
...     unit='miles',
...     drop_columns=False
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['home_lat', 'home_long', 'work_lat', 'work_long', 'shop_lat', 'shop_long', 'distance__home_lat_to_work_lat__haversine_miles', 'distance__work_lat_to_shop_lat__haversine_miles']
Example 3: Euclidean distance
>>> X = pl.DataFrame({
...     'x1': [0.0, 1.0, 2.0],
...     'y1': [0.0, 1.0, 2.0],
...     'x2': [3.0, 4.0, 5.0],
...     'y2': [4.0, 5.0, 6.0]
... })
>>> transformer = DistanceFeatures(
...     lats=['x1', 'x2'],
...     longs=['y1', 'y2'],
...     method='euclidean',
...     unit='meters'
... )
>>> result = transformer.fit_transform(X)
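As a sanity check on the haversine examples above, the great-circle formula can be written in a few lines of plain Python (a sketch using the common 6371 km mean Earth radius; the transformer's exact constants may differ slightly, so expect a small deviation from the 5.376 shown in Example 1):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Billing vs shipping coordinates from Example 1, first row:
d = haversine_km(40.7128, -74.0060, 40.7580, -73.9855)
print(round(d, 3))  # ~5.3 km with this radius
```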
- class gators.feature_generation.ScalarMathFeatures[source]#
Bases:
BaseModel, BaseEstimator, TransformerMixin

Generates new features by applying mathematical operations between columns and scalar values.
This transformer performs element-wise operations between a column and a scalar constant. Each operation creates one new feature column. For operations between multiple columns, use MathFeatures instead.
Use Cases:
Unit conversions (days to years, meters to feet, Celsius to Fahrenheit)
Normalization (divide by constant, multiply by scaling factor)
Feature scaling (percentage calculation, ratio computation)
Offset adjustments (add/subtract baseline values)
When to Use:
Need to apply arithmetic operations with fixed scalar values
Creating interpretable transformations (e.g., Age/365 for age_in_years)
Scaling features by known constants
Building feature sets for downstream models
When NOT to Use:
Operations between multiple columns (use MathFeatures)
Need learned scaling (use StandardScaler, MinMaxScaler)
Complex mathematical functions (use DataFrame.with_columns directly)
- Parameters:
operations (List[Dict[str, Any]]) –
List of operation dictionaries. Each operation creates one new feature column.
Each operation dictionary must contain:
’column’: str - Name of the column to operate on
’op’: str - Arithmetic operator. Supported: ‘+’, ‘-’, ‘*’, ‘/’, ‘**’, ‘//’, ‘%’
’scalar’: int or float - Scalar value to apply to the column
new_column_names (Optional[List[str]], default=None) – Names for the resulting feature columns. If provided, must have the same length as operations. If None, column names are auto-generated in the format: {column}_{op_name}_{scalar} (e.g., ‘Age_div_365’, ‘Price_mul_1.1’). Operator name mapping: ‘+’ -> ‘plus’, ‘-’ -> ‘minus’, ‘*’ -> ‘mul’, ‘/’ -> ‘div’, ‘**’ -> ‘pow’, ‘//’ -> ‘floordiv’, ‘%’ -> ‘mod’
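The auto-naming scheme can be sketched as a small pure-Python helper (hypothetical code illustrating the rule above, not the library's implementation):

```python
# Operator-to-name mapping as documented above.
OP_NAMES = {'+': 'plus', '-': 'minus', '*': 'mul', '/': 'div',
            '**': 'pow', '//': 'floordiv', '%': 'mod'}

def auto_name(column: str, op: str, scalar) -> str:
    """Build '{column}_{op_name}_{scalar}', e.g. 'Age_div_365'."""
    return f"{column}_{OP_NAMES[op]}_{scalar}"

print(auto_name('Age', '/', 365))    # Age_div_365
print(auto_name('Price', '*', 1.1))  # Price_mul_1.1
```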
Examples
>>> import polars as pl
>>> from gators.feature_generation import ScalarMathFeatures
>>> X = {
...     'Age': [25, 30, 45, 12, 65],
...     'Price': [100.0, 150.0, 200.0, 75.0, 300.0],
...     'Temperature': [20.0, 25.0, 15.0, 30.0, 22.0]
... }
>>> X = pl.DataFrame(X)
Example 1: Unit conversions with custom names
>>> transformer = ScalarMathFeatures(
...     operations=[
...         {'column': 'Age', 'op': '/', 'scalar': 365},
...         {'column': 'Temperature', 'op': '+', 'scalar': 273.15}
...     ],
...     new_column_names=['Age_years', 'Temperature_kelvin']
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['Age', 'Age_years', 'Temperature', 'Temperature_kelvin'])
shape: (5, 4)
┌─────┬───────────┬─────────────┬────────────────────┐
│ Age ┆ Age_years ┆ Temperature ┆ Temperature_kelvin │
│ --- ┆ ---       ┆ ---         ┆ ---                │
│ i64 ┆ f64       ┆ f64         ┆ f64                │
╞═════╪═══════════╪═════════════╪════════════════════╡
│ 25  ┆ 0.068493  ┆ 20.0        ┆ 293.15             │
│ 30  ┆ 0.082192  ┆ 25.0        ┆ 298.15             │
│ 45  ┆ 0.123288  ┆ 15.0        ┆ 288.15             │
│ 12  ┆ 0.032877  ┆ 30.0        ┆ 303.15             │
│ 65  ┆ 0.178082  ┆ 22.0        ┆ 295.15             │
└─────┴───────────┴─────────────┴────────────────────┘
Example 2: Auto-generated column names
>>> auto_transformer = ScalarMathFeatures(
...     operations=[
...         {'column': 'Price', 'op': '*', 'scalar': 1.1},
...         {'column': 'Price', 'op': '/', 'scalar': 100}
...     ]
...     # new_column_names not specified - will be auto-generated
... )
>>> result = auto_transformer.fit_transform(X)
>>> result.select(['Price', 'Price_mul_1.1', 'Price_div_100'])
shape: (5, 3)
┌───────┬───────────────┬───────────────┐
│ Price ┆ Price_mul_1.1 ┆ Price_div_100 │
│ ---   ┆ ---           ┆ ---           │
│ f64   ┆ f64           ┆ f64           │
╞═══════╪═══════════════╪═══════════════╡
│ 100.0 ┆ 110.0         ┆ 1.0           │
│ 150.0 ┆ 165.0         ┆ 1.5           │
│ 200.0 ┆ 220.0         ┆ 2.0           │
│ 75.0  ┆ 82.5          ┆ 0.75          │
│ 300.0 ┆ 330.0         ┆ 3.0           │
└───────┴───────────────┴───────────────┘
Example 3: Multiple operations (scaling, percentage, tax)
>>> multi_ops = ScalarMathFeatures(
...     operations=[
...         {'column': 'Price', 'op': '*', 'scalar': 1.2},  # 20% markup
...         {'column': 'Price', 'op': '/', 'scalar': 100},  # as percentage of 100
...         {'column': 'Age', 'op': '%', 'scalar': 10}      # age modulo 10
...     ],
...     new_column_names=['Price_with_tax', 'Price_pct', 'Age_decade_offset']
... )
>>> result = multi_ops.fit_transform(X)
>>> result.select(['Price', 'Price_with_tax', 'Price_pct', 'Age', 'Age_decade_offset'])
shape: (5, 5)
┌───────┬────────────────┬───────────┬─────┬───────────────────┐
│ Price ┆ Price_with_tax ┆ Price_pct ┆ Age ┆ Age_decade_offset │
│ ---   ┆ ---            ┆ ---       ┆ --- ┆ ---               │
│ f64   ┆ f64            ┆ f64       ┆ i64 ┆ i64               │
╞═══════╪════════════════╪═══════════╪═════╪═══════════════════╡
│ 100.0 ┆ 120.0          ┆ 1.0       ┆ 25  ┆ 5                 │
│ 150.0 ┆ 180.0          ┆ 1.5       ┆ 30  ┆ 0                 │
│ 200.0 ┆ 240.0          ┆ 2.0       ┆ 45  ┆ 5                 │
│ 75.0  ┆ 90.0           ┆ 0.75      ┆ 12  ┆ 2                 │
│ 300.0 ┆ 360.0          ┆ 3.0       ┆ 65  ┆ 5                 │
└───────┴────────────────┴───────────┴─────┴───────────────────┘
Example 4: Power and floor division
>>> power_ops = ScalarMathFeatures(
...     operations=[
...         {'column': 'Age', 'op': '**', 'scalar': 2},
...         {'column': 'Age', 'op': '//', 'scalar': 10}
...     ],
...     new_column_names=['Age_squared', 'Age_decade']
... )
>>> result = power_ops.fit_transform(X)
>>> result.select(['Age', 'Age_squared', 'Age_decade'])
shape: (5, 3)
┌─────┬─────────────┬────────────┐
│ Age ┆ Age_squared ┆ Age_decade │
│ --- ┆ ---         ┆ ---        │
│ i64 ┆ i64         ┆ i64        │
╞═════╪═════════════╪════════════╡
│ 25  ┆ 625         ┆ 2          │
│ 30  ┆ 900         ┆ 3          │
│ 45  ┆ 2025        ┆ 4          │
│ 12  ┆ 144         ┆ 1          │
│ 65  ┆ 4225        ┆ 6          │
└─────┴─────────────┴────────────┘
Notes
Each operation produces exactly one new feature column
Auto-naming: If new_column_names is None, names are auto-generated as: {column}_{op_name}_{scalar} (e.g., ‘Age_div_365’)
Operations are applied element-wise to each row
Division by zero will result in inf or null values (Polars default behavior)
Can chain multiple ScalarMathFeatures transformers in a pipeline
For learned transformations, consider sklearn scalers instead
See also
MathFeatures – For operations between multiple columns
ConditionFeatures – For creating boolean features from conditions
- class gators.feature_generation.RuleFeatures[source]#
Bases:
BaseModel, BaseEstimator, TransformerMixin

Creates multiple boolean features, each from a group of conditions combined with logical operators.
This transformer is useful for creating multiple rule-based features simultaneously, where each rule represents a distinct business logic or fraud detection pattern. Each rule group produces its own boolean output column.
Use Cases:
Fraud detection: Create multiple risk indicators (velocity spike, amount anomaly, etc.)
Business rules: Generate several eligibility/qualification flags at once
Feature engineering: Build a family of related boolean features
Production pipelines: Encapsulate multiple rule definitions in one transformer
When to Use:
Building production ML pipelines that need serialization
Creating reusable feature engineering templates
Working with sklearn-based systems that expect transformers
Need version control of feature logic (can serialize to JSON/YAML)
Want to create multiple related boolean features efficiently
When NOT to Use:
One-off exploratory analysis (use Polars native expressions)
Very complex nested logic within a single rule (consider Polars native)
Performance-critical scenarios where every microsecond counts
- Parameters:
rules (List[List[Dict[str, Any]]]) –
List of rule groups. Each rule group contains condition dictionaries that will be combined to create one boolean output column.
Each condition dictionary must contain:
’column’: str - Name of the column to evaluate
’op’: str - Comparison operator. Supported: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’
’value’: Any (optional) - Scalar value to compare the column against
’other_column’: str (optional) - Name of another column to compare against
Either ‘value’ or ‘other_column’ must be specified, but not both.
Examples:

# Two rules:
[
    [{'column': 'age', 'op': '>', 'value': 18}],
    [{'column': 'amount', 'op': '>', 'value': 1000}]
]

# Rule with multiple conditions:
[
    [{'column': 'age', 'op': '>', 'value': 18},
     {'column': 'amount', 'op': '>', 'value': 1000}]
]
rule_logic (Literal['and', 'or'], default='and') –
How to combine conditions within each rule group:
’and’: All conditions in a group must be True
’or’: At least one condition in a group must be True
new_column_names (List[str]) – Names for the resulting boolean feature columns. Must have the same length as rules. Each rule group will produce a column with the corresponding name.
drop_conditions (bool, default=False) – Whether to drop intermediate condition columns after combining. Recommended: True for cleaner output.
Examples
>>> import polars as pl
>>> from gators.feature_generation import RuleFeatures
>>> X = {
...     'amount': [100, 500, 1200, 50, 2000],
...     'velocity_24h': [1, 3, 5, 0, 10],
...     'velocity_7d': [5, 8, 10, 2, 15],
...     'is_new_user': [True, False, False, True, False]
... }
>>> X = pl.DataFrame(X)
Example 1: Create two risk indicators in one pass
>>> multi_risk_transformer = RuleFeatures(
...     rules=[
...         # Rule 1: Activity spike (24h > 0 AND 7d == 24h)
...         [
...             {'column': 'velocity_24h', 'op': '>', 'value': 0},
...             {'column': 'velocity_7d', 'op': '==', 'other_column': 'velocity_24h'}
...         ],
...         # Rule 2: High amount (amount > 1000)
...         [
...             {'column': 'amount', 'op': '>', 'value': 1000}
...         ]
...     ],
...     rule_logic='and',
...     new_column_names=['is_activity_spike', 'is_high_amount'],
...     drop_conditions=True
... )
>>> result = multi_risk_transformer.fit_transform(X)
>>> result.select(['velocity_24h', 'velocity_7d', 'amount',
...                'is_activity_spike', 'is_high_amount'])
shape: (5, 5)
┌──────────────┬─────────────┬────────┬───────────────────┬────────────────┐
│ velocity_24h ┆ velocity_7d ┆ amount ┆ is_activity_spike ┆ is_high_amount │
│ ---          ┆ ---         ┆ ---    ┆ ---               ┆ ---            │
│ i64          ┆ i64         ┆ i64    ┆ bool              ┆ bool           │
╞══════════════╪═════════════╪════════╪═══════════════════╪════════════════╡
│ 1            ┆ 5           ┆ 100    ┆ false             ┆ false          │
│ 3            ┆ 8           ┆ 500    ┆ false             ┆ false          │
│ 5            ┆ 10          ┆ 1200   ┆ false             ┆ true           │
│ 0            ┆ 2           ┆ 50     ┆ false             ┆ false          │
│ 10           ┆ 15          ┆ 2000   ┆ false             ┆ true           │
└──────────────┴─────────────┴────────┴───────────────────┴────────────────┘
Example 2: OR logic within a rule (high amount OR high velocity)
>>> or_transformer = RuleFeatures(
...     rules=[
...         [
...             {'column': 'amount', 'op': '>', 'value': 1000},
...             {'column': 'velocity_24h', 'op': '>=', 'value': 5}
...         ]
...     ],
...     rule_logic='or',
...     new_column_names=['is_high_risk'],
...     drop_conditions=True
... )
>>> result = or_transformer.fit_transform(X)
>>> result.select(['amount', 'velocity_24h', 'is_high_risk'])
shape: (5, 3)
┌────────┬──────────────┬──────────────┐
│ amount ┆ velocity_24h ┆ is_high_risk │
│ ---    ┆ ---          ┆ ---          │
│ i64    ┆ i64          ┆ bool         │
╞════════╪══════════════╪══════════════╡
│ 100    ┆ 1            ┆ false        │
│ 500    ┆ 3            ┆ false        │
│ 1200   ┆ 5            ┆ true         │
│ 50     ┆ 0            ┆ false        │
│ 2000   ┆ 10           ┆ true         │
└────────┴──────────────┴──────────────┘
Example 3: Multiple rules with different logic patterns
>>> complex_transformer = RuleFeatures(
...     rules=[
...         # New user AND high amount AND high velocity
...         [
...             {'column': 'is_new_user', 'op': '==', 'value': True},
...             {'column': 'amount', 'op': '>', 'value': 1000},
...             {'column': 'velocity_24h', 'op': '>', 'value': 3}
...         ],
...         # Very high velocity (simple rule)
...         [
...             {'column': 'velocity_24h', 'op': '>=', 'value': 10}
...         ]
...     ],
...     rule_logic='and',
...     new_column_names=['is_suspicious_new_user', 'is_extreme_velocity']
... )
>>> result = complex_transformer.fit_transform(X)
>>> result.select(['is_new_user', 'amount', 'velocity_24h',
...                'is_suspicious_new_user', 'is_extreme_velocity'])
shape: (5, 5)
┌─────────────┬────────┬──────────────┬────────────────────────┬─────────────────────┐
│ is_new_user ┆ amount ┆ velocity_24h ┆ is_suspicious_new_user ┆ is_extreme_velocity │
│ ---         ┆ ---    ┆ ---          ┆ ---                    ┆ ---                 │
│ bool        ┆ i64    ┆ i64          ┆ bool                   ┆ bool                │
╞═════════════╪════════╪══════════════╪════════════════════════╪═════════════════════╡
│ true        ┆ 100    ┆ 1            ┆ false                  ┆ false               │
│ false       ┆ 500    ┆ 3            ┆ false                  ┆ false               │
│ false       ┆ 1200   ┆ 5            ┆ false                  ┆ false               │
│ true        ┆ 50     ┆ 0            ┆ false                  ┆ false               │
│ false       ┆ 2000   ┆ 10           ┆ false                  ┆ true                │
└─────────────┴────────┴──────────────┴────────────────────────┴─────────────────────┘
Notes
Each rule group produces one boolean output column
All conditions within a rule are evaluated independently before combining
Missing values (null) in comparisons typically result in null/false
Creates intermediate boolean columns, so use drop_conditions=True for cleaner output
To create a single column from multiple rules with complex logic (AND of ORs), use this transformer to create intermediate columns, then combine them manually
- class gators.feature_generation.RowStatisticsFeatures[source]#
Bases:
BaseModel, BaseEstimator, TransformerMixin

Generates row-level aggregation features across groups of columns.
This transformer computes statistics (min, max, mean, median, std, range, sum) horizontally across specified column groups for each row. Unlike GroupRatioFeatures, which aggregates vertically (across rows within groups), this computes statistics across columns within each row.
Importance for Fraud Detection#
Row-level aggregation features are valuable in fraud detection because they capture relationships and patterns across related features within individual transactions. For example:
Computing statistics across multiple transaction amounts can reveal unusual patterns (e.g., all amounts being identical might indicate scripted fraud)
Aggregating across card verification fields can identify inconsistencies
Statistics across temporal features can detect velocity anomalies
Range calculations can flag suspiciously uniform or extreme value spreads
These features help models identify transactions where the distribution of values across related fields deviates from normal patterns, which is often indicative of fraudulent behavior.
- Parameters:
column_groups (Dict[str, List[str]]) – Dictionary mapping group names to lists of column names. Each group defines a set of columns over which to compute row-level statistics. Example: {‘card_fields’: [‘card1’, ‘card2’, ‘card3’]}
func (List[str]) – List of aggregation functions to apply. Available options:
‘min’: Row-wise minimum value
‘max’: Row-wise maximum value
‘mean’: Row-wise mean (average)
‘median’: Row-wise median
‘std’: Row-wise standard deviation
‘range’: Row-wise range (max - min)
‘sum’: Row-wise sum
drop_columns (bool, default=False) – Whether to drop the original columns after creating aggregation features.
new_column_names (Optional[List[str]], default=None) – List of custom names for the aggregation columns. If None, uses the default naming pattern ‘{group_name}__{func}’. Must have the same length as the total number of features created (len(column_groups) × len(func)).
Examples
>>> from gators.feature_generation import RowStatisticsFeatures
>>> import polars as pl
Example 1: Single group with multiple aggregations
>>> X = pl.DataFrame({
...     'A': [9, 9, 7],
...     'B': [3, 4, 5],
...     'C': [6, 7, 8]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={'cluster_1': ['A', 'B']},
...     func=['mean', 'std']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 5)
┌─────┬─────┬─────┬─────────────────┬────────────────┐
│ A   ┆ B   ┆ C   ┆ cluster_1__mean ┆ cluster_1__std │
│ --- ┆ --- ┆ --- ┆ ---             ┆ ---            │
│ i64 ┆ i64 ┆ i64 ┆ f64             ┆ f64            │
╞═════╪═════╪═════╪═════════════════╪════════════════╡
│ 9   ┆ 3   ┆ 6   ┆ 6.0             ┆ 4.242641       │
│ 9   ┆ 4   ┆ 7   ┆ 6.5             ┆ 3.535534       │
│ 7   ┆ 5   ┆ 8   ┆ 6.0             ┆ 1.414214       │
└─────┴─────┴─────┴─────────────────┴────────────────┘
Example 2: Multiple groups with different columns
>>> X = pl.DataFrame({
...     'A': [9, 9, 7],
...     'B': [3, 4, 5],
...     'C': [6, 7, 8],
...     'D': [1, 2, 3]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={
...         'cluster_1': ['A', 'B'],
...         'cluster_2': ['C', 'D']
...     },
...     func=['min', 'max', 'range']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 10)
┌─────┬─────┬─────┬─────┬──────────────┬──────────────┬─────────────────┬──────────────┬──────────────┬─────────────────┐
│ A   ┆ B   ┆ C   ┆ D   ┆ cluster_1__… ┆ cluster_1__… ┆ cluster_1__ran… ┆ cluster_2__… ┆ cluster_2__… ┆ cluster_2__ran… │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---          ┆ ---          ┆ ---             ┆ ---          ┆ ---          ┆ ---             │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64          ┆ i64          ┆ i64             ┆ i64          ┆ i64          ┆ i64             │
╞═════╪═════╪═════╪═════╪══════════════╪══════════════╪═════════════════╪══════════════╪══════════════╪═════════════════╡
│ 9   ┆ 3   ┆ 6   ┆ 1   ┆ 3            ┆ 9            ┆ 6               ┆ 1            ┆ 6            ┆ 5               │
│ 9   ┆ 4   ┆ 7   ┆ 2   ┆ 4            ┆ 9            ┆ 5               ┆ 2            ┆ 7            ┆ 5               │
│ 7   ┆ 5   ┆ 8   ┆ 3   ┆ 5            ┆ 7            ┆ 2               ┆ 3            ┆ 8            ┆ 5               │
└─────┴─────┴─────┴─────┴──────────────┴──────────────┴─────────────────┴──────────────┴──────────────┴─────────────────┘
Example 3: Using custom column names
>>> X = pl.DataFrame({
...     'amount1': [100, 200, 150],
...     'amount2': [50, 100, 75],
...     'amount3': [25, 50, 30]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={'amounts': ['amount1', 'amount2', 'amount3']},
...     func=['mean', 'std'],
...     new_column_names=['avg_amount', 'std_amount']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 5)
┌─────────┬─────────┬─────────┬────────────┬────────────┐
│ amount1 ┆ amount2 ┆ amount3 ┆ avg_amount ┆ std_amount │
│ ---     ┆ ---     ┆ ---     ┆ ---        ┆ ---        │
│ i64     ┆ i64     ┆ i64     ┆ f64        ┆ f64        │
╞═════════╪═════════╪═════════╪════════════╪════════════╡
│ 100     ┆ 50      ┆ 25      ┆ 58.333333  ┆ 38.188...  │
│ 200     ┆ 100     ┆ 50      ┆ 116.666... ┆ 76.376...  │
│ 150     ┆ 75      ┆ 30      ┆ 85.0       ┆ 60.621...  │
└─────────┴─────────┴─────────┴────────────┴────────────┘
Example 4: Fraud detection use case - card verification fields
>>> X = pl.DataFrame({
...     'card_cvv_match': [1, 0, 1, 1],
...     'card_addr_match': [1, 1, 0, 1],
...     'card_zip_match': [1, 1, 1, 0],
...     'is_fraud': [0, 1, 1, 1]
... })
>>> # Aggregate verification fields to detect inconsistencies
>>> transformer = RowStatisticsFeatures(
...     column_groups={'verification': ['card_cvv_match', 'card_addr_match', 'card_zip_match']},
...     func=['mean', 'std'],
...     drop_columns=False
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['verification__mean', 'verification__std', 'is_fraud'])
shape: (4, 3)
┌────────────────────┬───────────────────┬──────────┐
│ verification__mean ┆ verification__std ┆ is_fraud │
│ ---                ┆ ---               ┆ ---      │
│ f64                ┆ f64               ┆ i64      │
╞════════════════════╪═══════════════════╪══════════╡
│ 1.0                ┆ 0.0               ┆ 0        │
│ 0.666667           ┆ 0.57735           ┆ 1        │
│ 0.666667           ┆ 0.57735           ┆ 1        │
│ 0.666667           ┆ 0.57735           ┆ 1        │
└────────────────────┴───────────────────┴──────────┘
# Notice: the legitimate transaction has perfect verification (mean=1, std=0)
# Fraudulent transactions show inconsistent verification patterns
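The row-wise statistics in Example 1 can be cross-checked with the standard library; `statistics.stdev` uses the same sample (ddof=1) definition of standard deviation as the cluster_1 values shown there:

```python
import statistics

# Values of columns A and B from Example 1, one tuple per row:
rows = [(9.0, 3.0), (9.0, 4.0), (7.0, 5.0)]

means = [statistics.mean(r) for r in rows]
stds = [round(statistics.stdev(r), 6) for r in rows]
print(means)  # [6.0, 6.5, 6.0]
print(stds)   # [4.242641, 3.535534, 1.414214]
```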