gators.feature_generation package#

Module contents#

class gators.feature_generation.IsNull[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Creates boolean features indicating whether values are null for specified columns.

Parameters:

subset (Optional[List[str]], default=None) – List of column names to check for null values. If None, all columns in the DataFrame are used.

Examples

>>> from gators.feature_generation import IsNull
>>> import polars as pl
>>> X = {'A': [1, None, 3, 4],
...      'B': [4, 3, None, 1],
...      'C': [1, 2, 1, 2]}
>>> X = pl.DataFrame(X)
>>> transformer = IsNull(subset=['A', 'B'])
>>> transformer.fit(X)
IsNull(subset=['A', 'B'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 5)
┌──────┬──────┬─────┬──────────────┬──────────────┐
│  A   │  B   │  C  │ A__is_null   │ B__is_null   │
│ i64  │ i64  │ i64 │ bool         │ bool         │
├──────┼──────┼─────┼──────────────┼──────────────┤
│  1   │  4   │  1  │ false        │ false        │
│ null │  3   │  2  │ true         │ false        │
│  3   │ null │  1  │ false        │ true         │
│  4   │  1   │  2  │ false        │ false        │
└──────┴──────┴─────┴──────────────┴──────────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

IsNull

transform(X)[source]#

Transform the input DataFrame by adding is_null indicator columns.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with additional is_null columns.

Return type:

DataFrame

class gators.feature_generation.PolynomialFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates polynomial and interaction features.

Parameters:
  • subset (Optional[List[str]], default=None) – Subset of columns to transform. If None, all columns except strings and booleans.

  • degree (int, default=2) – The degree of the polynomial features.

  • interaction_only (bool, default=False) – If True, only interaction features are produced.

  • include_bias (bool, default=True) – If True, include a bias column (column of ones).

Examples

Example 1: Degree 2 polynomial with bias term

>>> from gators.feature_generation import PolynomialFeatures
>>> import polars as pl
>>> X = pl.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> transformer = PolynomialFeatures(degree=2, include_bias=True)
>>> transformer.fit(X)
>>> transformer.transform(X)
shape: (2, 6)
┌─────┬─────┬──────┬──────┬──────┬──────┐
│ A   │ B   │ A__A │ A__B │ B__B │ bias │
├─────┼─────┼──────┼──────┼──────┼──────┤
│ 1   │ 3   │ 1    │ 3    │ 9    │ 1    │
│ 2   │ 4   │ 4    │ 8    │ 16   │ 1    │
└─────┴─────┴──────┴──────┴──────┴──────┘

Example 2: Polynomial on subset of columns

>>> transformer = PolynomialFeatures(subset=['A'], degree=2, include_bias=False)
>>> transformer.fit(X)
>>> transformer.transform(X)
shape: (2, 3)
┌─────┬─────┬─────┐
│ A   │ B   │ A__A│
├─────┼─────┼─────┤
│ 1   │ 3   │ 1   │
│ 2   │ 4   │ 4   │
└─────┴─────┴─────┘

Example 3: Interaction features only

>>> transformer = PolynomialFeatures(degree=2, interaction_only=True)
>>> transformer.fit(X)
>>> transformer.transform(X)
shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ A   │ B   │ A__B │ bias │
├─────┼─────┼──────┼──────┤
│ 1   │ 3   │ 3    │ 1    │
│ 2   │ 4   │ 8    │ 1    │
└─────┴─────┴──────┴──────┘
fit(X, y=None)[source]#

Fit the transformer by identifying columns to transform.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

PolynomialFeatures

transform(X)[source]#

Transform the input DataFrame by generating polynomial and interaction features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation.PlanRotationFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Create new columns based on the plan rotation mapping.

The data should contain numerical columns only. Use gators.encoders to replace categorical columns with numerical ones before applying PlanRotationFeatures.

Parameters:
  • subset (List[List[str]]) – List of pair-wise columns.

  • angles (List[float]) – List of rotation angles.

Examples

Basic usage with plan rotation

Imports and initialization:

>>> from gators.feature_generation import PlanRotationFeatures
>>> obj = PlanRotationFeatures(
...     subset=[['X', 'Y'], ['X', 'Z']], angles=[45.0, 60.0])

The fit, transform, and fit_transform methods accept polars dataframes:

>>> import polars as pl
>>> X = pl.DataFrame(
... {'X': [200.0, 210.0], 'Y': [140.0, 160.0], 'Z': [100.0, 125.0]})

The result is a transformed polars dataframe.

>>> obj.fit_transform(X)
shape: (2, 9)
┌───────┬───────┬───────┬────────────┬───┬────────────┬────────────┬────────────┐
│ X     ┆ Y     ┆ Z     ┆ XY_x_45.0… ┆ … ┆ XZ_y_45.0… ┆ XZ_x_60.0… ┆ XZ_y_60.0… │
│ ---   ┆ ---   ┆ ---   ┆ ---        ┆   ┆ ---        ┆ ---        ┆ ---        │
│ f64   ┆ f64   ┆ f64   ┆ f64        ┆   ┆ f64        ┆ f64        ┆ f64        │
╞═══════╪═══════╪═══════╪════════════╪═══╪════════════╪════════════╪════════════╡
│ 200.0 ┆ 140.0 ┆ 100.0 ┆ 42.426407  ┆ … ┆ 212.132034 ┆ 13.397460  ┆ 223.205081 │
│ 210.0 ┆ 160.0 ┆ 125.0 ┆ 35.355339  ┆ … ┆ 236.880772 ┆ -3.253175  ┆ 244.365335 │
└───────┴───────┴───────┴────────────┴───┴────────────┴────────────┴────────────┘
compute_column_names()[source]#

Compute column names after initialization.

fit(X, y=None)[source]#

Fit the transformer by identifying the columns to rotate.

Parameters:
  • X (DataFrame) – Input dataframe.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

PlanRotationFeatures

transform(X)[source]#

Transform the dataframe X.

Parameters:

X (DataFrame) – Input dataframe.

Returns:

Transformed dataframe.

Return type:

DataFrame

class gators.feature_generation.MathFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates new features by applying mathematical operations to groups of columns.

Parameters:
  • groups (List[List[str]]) – List of groups of column names to apply operations on.

  • operations (List[str]) –

    List of operations to apply to each group of columns. Available operations:

    • 'sum': Sum of all columns

    • 'mean': Mean of all columns

    • 'minus': Subtraction (reduces columns left to right)

    • 'mul': Product of all columns

    • 'div': Division (reduces columns left to right)

    • 'min': Minimum value across columns

    • 'max': Maximum value across columns

    • 'std': Standard deviation across columns

    • 'var': Variance across columns

    • 'median': Median across columns

    • 'range': Range (max - min)

    • 'abs_diff': Absolute difference (reduces columns left to right)

    • 'count_null': Count of null values

    • 'count_zero': Count of zero values

    • 'count_nonzero': Count of non-zero values

    Note: For division operations, consider using RatioFeatures instead, which provides safer division with automatic handling of division by zero and null values.

  • drop_columns (bool, optional) – Whether to drop the original columns after creating the new features, by default False.

  • new_column_names (Optional[List[str]], optional) – List of new column names for the created features, by default None.

Examples

>>> from gators.feature_generation import MathFeatures
>>> import polars as pl
>>> X = {'A': [1, 2, 3, 4],
...      'B': [4, 3, 2, 1],
...      'C': [1, 2, 1, 2]}
>>> X = pl.DataFrame(X)

Example 1: drop_columns=False

>>> transformer = MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum', 'mean'])
>>> transformer.fit(X)
MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum', 'mean'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 6)
┌─────┬─────┬─────┬─────────┬──────────┬─────────┐
│  A  │  B  │  C  │ A_B_sum │ A_B_mean │ B_C_sum │
│ i64 │ i64 │ i64 │   f64   │   f64    │   f64   │
├─────┼─────┼─────┼─────────┼──────────┼─────────┤
│  1  │  4  │  1  │   5.0   │   2.5    │   5.0   │
│  2  │  3  │  2  │   5.0   │   2.5    │   5.0   │
│  3  │  2  │  1  │   5.0   │   2.5    │   3.0   │
│  4  │  1  │  2  │   5.0   │   2.5    │   3.0   │
└─────┴─────┴─────┴─────────┴──────────┴─────────┘

Example 2: drop_columns=True

>>> transformer = MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum'], drop_columns=True)
>>> transformer.fit(X)
MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum'], drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (4, 2)
┌────────┬────────┐
│ A_B_sum│ B_C_sum│
│  f64   │  f64   │
├────────┼────────┤
│  5.0   │  5.0   │
│  5.0   │  5.0   │
│  5.0   │  3.0   │
│  5.0   │  3.0   │
└────────┴────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

MathFeatures

transform(X)[source]#

Transform the input DataFrame by applying the configured mathematical operations.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation.RatioFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates ratio features by dividing numerator columns by denominator columns.

This transformer creates ratio features in a 1-to-1 pairing between numerator and denominator columns. Division by zero is handled by replacing the result with null values.

Parameters:
  • numerator_columns (List[str]) – List of column names to use as numerators.

  • denominator_columns (List[str]) – List of column names to use as denominators. Must have the same length as numerator_columns.

  • new_column_names (Optional[List[str]], optional) – List of custom names for the ratio features. If None, names will be automatically generated as ‘{numerator}__div__{denominator}’, by default None.

  • drop_columns (bool, optional) – Whether to drop the original numerator and denominator columns after creating ratios, by default False.

Examples

>>> from gators.feature_generation import RatioFeatures
>>> import polars as pl
>>> X = pl.DataFrame({
...     'revenue': [100, 200, 300, 400],
...     'cost': [80, 100, 150, 0],
...     'clicks': [1000, 2000, 3000, 4000],
...     'impressions': [10000, 20000, 30000, 40000]
... })

Example 1: Basic ratio features

>>> transformer = RatioFeatures(
...     numerator_columns=['revenue', 'clicks'],
...     denominator_columns=['cost', 'impressions']
... )
>>> transformer.fit(X)
RatioFeatures(numerator_columns=['revenue', 'clicks'], denominator_columns=['cost', 'impressions'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 6)
┌─────────┬──────┬────────┬─────────────┬────────────────────┬─────────────────────────┐
│ revenue │ cost │ clicks │ impressions │ revenue__div__cost │ clicks__div__impressions│
│ i64     │ i64  │ i64    │ i64         │ f64                │ f64                     │
├─────────┼──────┼────────┼─────────────┼────────────────────┼─────────────────────────┤
│ 100     │ 80   │ 1000   │ 10000       │ 1.25               │ 0.1                     │
│ 200     │ 100  │ 2000   │ 20000       │ 2.0                │ 0.1                     │
│ 300     │ 150  │ 3000   │ 30000       │ 2.0                │ 0.1                     │
│ 400     │ 0    │ 4000   │ 40000       │ null               │ 0.1                     │
└─────────┴──────┴────────┴─────────────┴────────────────────┴─────────────────────────┘

Example 2: Custom column names

>>> transformer = RatioFeatures(
...     numerator_columns=['revenue'],
...     denominator_columns=['cost'],
...     new_column_names=['profit_margin']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 5)
┌─────────┬──────┬────────┬─────────────┬───────────────┐
│ revenue │ cost │ clicks │ impressions │ profit_margin │
│ i64     │ i64  │ i64    │ i64         │ f64           │
├─────────┼──────┼────────┼─────────────┼───────────────┤
│ 100     │ 80   │ 1000   │ 10000       │ 1.25          │
│ 200     │ 100  │ 2000   │ 20000       │ 2.0           │
│ 300     │ 150  │ 3000   │ 30000       │ 2.0           │
│ 400     │ 0    │ 4000   │ 40000       │ null          │
└─────────┴──────┴────────┴─────────────┴───────────────┘

Example 3: With drop_columns=True

>>> transformer = RatioFeatures(
...     numerator_columns=['revenue'],
...     denominator_columns=['cost'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 3)
┌────────┬─────────────┬────────────────────┐
│ clicks │ impressions │ revenue__div__cost │
│ i64    │ i64         │ f64                │
├────────┼─────────────┼────────────────────┤
│ 1000   │ 10000       │ 1.25               │
│ 2000   │ 20000       │ 2.0                │
│ 3000   │ 30000       │ 2.0                │
│ 4000   │ 40000       │ null               │
└────────┴─────────────┴────────────────────┘

Example 4: Handling null values

>>> X_with_nulls = pl.DataFrame({
...     'A': [10, None, 30, 40],
...     'B': [2, 5, None, 0]
... })
>>> transformer = RatioFeatures(
...     numerator_columns=['A'],
...     denominator_columns=['B']
... )
>>> result = transformer.fit_transform(X_with_nulls)
>>> result
shape: (4, 3)
┌──────┬──────┬──────────────┐
│ A    │ B    │ A__div__B    │
│ i64  │ i64  │ f64          │
├──────┼──────┼──────────────┤
│ 10   │ 2    │ 5.0          │
│ null │ 5    │ null         │
│ 30   │ null │ null         │
│ 40   │ 0    │ null         │
└──────┴──────┴──────────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

RatioFeatures

transform(X)[source]#

Transform the input DataFrame by creating ratio features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with ratio features.

Return type:

DataFrame

class gators.feature_generation.GroupScalingFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates group-based scaling features for numerical columns.

This transformer creates features like:

  • value / group_mean (most common: relative position vs average)

  • value / group_median (robust to outliers)

  • (value - group_mean) / group_std (z-score: standardized deviation)

  • (value - group_min) / (group_max - group_min) (min-max: 0-1 normalization)

Importance for Fraud Detection#

Group scaling features are particularly valuable in fraud detection because they capture relative deviations from group-level behavior patterns. Fraudulent transactions often exhibit unusual characteristics compared to the typical behavior within their segments.

  • mean/median ratios: Show multiplicative deviation (e.g., 10x the group average)

  • zscore: Quantifies how many standard deviations away from group mean (e.g., 3σ anomaly)

  • minmax: Shows relative position within observed range (0=min, 1=max, handles negatives)

These features are especially powerful when combined with various grouping dimensions (e.g., by merchant, customer segment, time of day, or geographic location) to capture different aspects of abnormal behavior.

Parameters:
  • subset (List[str]) – List of numerical column names to transform.

  • by (List[str]) – List of column names to use for groupby operations. Each column will be used for a separate groupby operation (e.g., ['cat1', 'cat2'] creates features grouped by cat1 and separate features grouped by cat2).

  • func (List[str]) –

    List of scaling functions to apply. Available options:

    • 'mean': value / group_mean (relative position vs average)

    • 'median': value / group_median (robust to outliers)

    • 'zscore': (value - group_mean) / group_std (standardized deviation)

    • 'minmax': (value - group_min) / (group_max - group_min) (0-1 normalization)

  • fill_value (float, default=0.0) – Value to use when the denominator is zero or null (safe division/scaling).

  • drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating scaled features.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the scaled feature columns. If None, uses the default naming pattern '{num_col}__{func}_{groupby_col}'. Must have the same length as the total number of features created (subset × by × func).

Examples

>>> from gators.feature_generation import GroupScalingFeatures
>>> import polars as pl
>>> X = {
...     'amount': [100, 200, 150, 300, 250],
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'X', 'X']
... }
>>> X = pl.DataFrame(X)

Example 1: Single groupby column with multiple scaling functions

>>> transformer = GroupScalingFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['mean', 'zscore']
... )
>>> transformer.fit(X)
GroupScalingFeatures(subset=['amount'], by=['cat1'], func=['mean', 'zscore'])
>>> result = transformer.transform(X)
>>> result
shape: (5, 5)
┌────────┬──────┬──────┬───────────────────┬─────────────────────┐
│ amount ┆ cat1 ┆ cat2 ┆ amount__mean_cat1 ┆ amount__zscore_cat1 │
│ ---    ┆ ---  ┆ ---  ┆ ---               ┆ ---                 │
│ i64    ┆ str  ┆ str  ┆ f64               ┆ f64                 │
╞════════╪══════╪══════╪═══════════════════╪═════════════════════╡
│ 100    ┆ A    ┆ X    ┆ 0.545455          ┆ -1.069045           │
│ 200    ┆ A    ┆ Y    ┆ 1.090909          ┆ 0.267261            │
│ 150    ┆ B    ┆ X    ┆ 0.666667          ┆ -0.707107           │
│ 300    ┆ B    ┆ X    ┆ 1.333333          ┆ 0.707107            │
│ 250    ┆ A    ┆ X    ┆ 1.363636          ┆ 0.801784            │
└────────┴──────┴──────┴───────────────────┴─────────────────────┘

Example 2: Multiple groupby columns

>>> X = {
...     'amount': [100, 200, 150, 300],
...     'value': [50, 100, 75, 150],
...     'cat1': ['A', 'A', 'B', 'B'],
...     'cat2': ['X', 'Y', 'X', 'Y']
... }
>>> X = pl.DataFrame(X)
>>> transformer = GroupScalingFeatures(
...     subset=['amount'],
...     by=['cat1', 'cat2'],
...     func=['mean']
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['amount', 'value', 'cat1', 'cat2', 'amount__mean_cat1', 'amount__mean_cat2']
# Creates separate features grouped by cat1 and grouped by cat2

Example 3: Min-max scaling

>>> X = {
...     'amount': [100, 200, 150, 300],
...     'cat1': ['A', 'A', 'B', 'B']
... }
>>> X = pl.DataFrame(X)
>>> transformer = GroupScalingFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['minmax']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 3)
┌────────┬──────┬─────────────────────┐
│ amount ┆ cat1 ┆ amount__minmax_cat1 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ str  ┆ f64                 │
╞════════╪══════╪═════════════════════╡
│ 100    ┆ A    ┆ 0.0                 │
│ 200    ┆ A    ┆ 1.0                 │
│ 150    ┆ B    ┆ 0.0                 │
│ 300    ┆ B    ┆ 1.0                 │
└────────┴──────┴─────────────────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

GroupScalingFeatures

transform(X)[source]#

Transform the input DataFrame by creating group scaling features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with group scaling features.

Return type:

DataFrame

class gators.feature_generation.GroupStatisticsFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates statistical aggregation features based on group-level computations.

Unlike GroupScalingFeatures, which divides values by group statistics, this transformer directly adds the group statistics as new columns.

Parameters:
  • subset (List[str]) – List of numerical column names to aggregate.

  • by (List[str]) – List of column names to use for groupby operations. Each column will be used for a separate groupby operation (e.g., [‘cat1’, ‘cat2’] creates features grouped by cat1 and separate features grouped by cat2).

  • func (List[str]) –

    List of aggregation functions to apply. Available options:

    • 'mean': Group mean

    • 'std': Group standard deviation

    • 'median': Group median

    • 'min': Group minimum

    • 'max': Group maximum

    • 'sum': Group sum

    • 'count': Group count

    • 'range': Group range (max - min)

  • drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating statistics.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the statistic columns. If None, uses default naming pattern ‘{agg}_{num_col}__per_{groupby_col}’. Must have same length as the total number of features created (subset × by × func).

Examples

>>> from gators.feature_generation import GroupStatisticsFeatures
>>> import polars as pl
>>> X = {
...     'amount': [100, 200, 150, 300, 250],
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'X', 'X']
... }
>>> X = pl.DataFrame(X)

Example 1: Basic group statistics

>>> transformer = GroupStatisticsFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['mean', 'count']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌────────┬───────┬───────┬───────────────────────┬────────────────────────┐
│ amount ┆ cat1  ┆ cat2  ┆ mean_amount__per_cat1 ┆ count_amount__per_cat1 │
│ ---    ┆ ---   ┆ ---   ┆ ---                   ┆ ---                    │
│ i64    ┆ str   ┆ str   ┆ f64                   ┆ u32                    │
╞════════╪═══════╪═══════╪═══════════════════════╪════════════════════════╡
│ 100    ┆ A     ┆ X     ┆ 183.333333            ┆ 3                      │
│ 200    ┆ A     ┆ Y     ┆ 183.333333            ┆ 3                      │
│ 150    ┆ B     ┆ X     ┆ 225.0                 ┆ 2                      │
│ 300    ┆ B     ┆ X     ┆ 225.0                 ┆ 2                      │
│ 250    ┆ A     ┆ X     ┆ 183.333333            ┆ 3                      │
└────────┴───────┴───────┴───────────────────────┴────────────────────────┘

Example 2: Multiple groupby columns

>>> X = {
...     'amount': [100, 200, 150, 300],
...     'cat1': ['A', 'A', 'B', 'B'],
...     'cat2': ['X', 'Y', 'X', 'Y']
... }
>>> X = pl.DataFrame(X)
>>> transformer = GroupStatisticsFeatures(
...     subset=['amount'],
...     by=['cat1', 'cat2'],
...     func=['mean']
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['amount', 'cat1', 'cat2', 'mean_amount__per_cat1', 'mean_amount__per_cat2']
# Creates separate features grouped by cat1 and grouped by cat2

Example 3: Multiple aggregation functions

>>> transformer = GroupStatisticsFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['mean', 'std', 'min', 'max']
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['amount', 'cat1', 'cat2', 'mean_amount__per_cat1', 'std_amount__per_cat1',
 'min_amount__per_cat1', 'max_amount__per_cat1']
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

GroupStatisticsFeatures

transform(X)[source]#

Transform the input DataFrame by creating group statistic features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with group statistic features.

Return type:

DataFrame

class gators.feature_generation.GroupLagFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates lag (previous values) and lead (next values) features within groups.

This transformer creates features like:

  • Previous transaction amount for this card

  • Next transaction amount for this card

  • Value N periods ago within group

Useful for time-series analysis and detecting changes in behavior patterns.

Parameters:
  • subset (List[str]) – List of numerical column names to create lag/lead features for.

  • by (List[str]) – List of columns to group by. Lags/leads are computed within each group.

  • lags (List[int]) – List of lag periods. Positive integers create lag features (previous values). Example: [1, 2, 3] creates lag_1, lag_2, lag_3

  • leads (List[int], default=[]) – List of lead periods. Positive integers create lead features (next values). Example: [1, 2] creates lead_1, lead_2

  • fill_value (Optional[float], default=None) – Value to use for missing lag/lead values. If None, uses null.

  • drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating lag features.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the lag/lead columns. If None, uses default naming pattern ‘{num_col}_lag{n}_{groupby_cols}’ or ‘{num_col}_lead{n}_{groupby_cols}’. Must have same length as the total number of features created.

Examples

>>> from gators.feature_generation import GroupLagFeatures
>>> import polars as pl
>>> X = {
...     'amount': [100, 200, 150, 300, 250, 180],
...     'cat1': ['A', 'A', 'B', 'B', 'A', 'B'],
...     'time': [1, 2, 1, 2, 3, 3]
... }
>>> X = pl.DataFrame(X).sort(['cat1', 'time'])

Example 1: Basic lag features

>>> transformer = GroupLagFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     lags=[1, 2]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (6, 5)
┌────────┬───────┬──────┬─────────────────────┬─────────────────────┐
│ amount ┆ cat1  ┆ time ┆ amount_lag1_cat1    ┆ amount_lag2_cat1    │
│ ---    ┆ ---   ┆ ---  ┆ ---                 ┆ ---                 │
│ i64    ┆ str   ┆ i64  ┆ i64                 ┆ i64                 │
╞════════╪═══════╪══════╪═════════════════════╪═════════════════════╡
│ 100    ┆ A     ┆ 1    ┆ null                ┆ null                │
│ 200    ┆ A     ┆ 2    ┆ 100                 ┆ null                │
│ 250    ┆ A     ┆ 3    ┆ 200                 ┆ 100                 │
│ 150    ┆ B     ┆ 1    ┆ null                ┆ null                │
│ 300    ┆ B     ┆ 2    ┆ 150                 ┆ null                │
│ 180    ┆ B     ┆ 3    ┆ 300                 ┆ 150                 │
└────────┴───────┴──────┴─────────────────────┴─────────────────────┘

Example 2: Lag and lead features

>>> transformer = GroupLagFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     lags=[1],
...     leads=[1]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (6, 5)
┌────────┬───────┬──────┬───────────────────┬────────────────────┐
│ amount ┆ cat1  ┆ time ┆ amount_lag1_cat1  ┆ amount_lead1_cat1  │
│ ---    ┆ ---   ┆ ---  ┆ ---               ┆ ---                │
│ i64    ┆ str   ┆ i64  ┆ i64               ┆ i64                │
╞════════╪═══════╪══════╪═══════════════════╪════════════════════╡
│ 100    ┆ A     ┆ 1    ┆ null              ┆ 200                │
│ 200    ┆ A     ┆ 2    ┆ 100               ┆ 250                │
│ 250    ┆ A     ┆ 3    ┆ 200               ┆ null               │
│ 150    ┆ B     ┆ 1    ┆ null              ┆ 300                │
│ 300    ┆ B     ┆ 2    ┆ 150               ┆ 180                │
│ 180    ┆ B     ┆ 3    ┆ 300               ┆ null               │
└────────┴───────┴──────┴───────────────────┴────────────────────┘

Example 3: With fill_value

>>> transformer = GroupLagFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     lags=[1],
...     fill_value=0.0
... )
>>> result = transformer.fit_transform(X)
>>> result['amount_lag1_cat1'][0]  # First row, no previous value
0.0

Notes

  • Data should be sorted by the by columns and by time before transformation

  • Lag features look backwards: lag_1 is the previous row within the group

  • Lead features look forwards: lead_1 is the next row within the group

  • First rows in each group will have null (or fill_value) for lag features

  • Last rows in each group will have null (or fill_value) for lead features

fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

GroupLagFeatures

transform(X)[source]#

Transform the input DataFrame by creating lag/lead features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with lag/lead features.

Return type:

DataFrame

class gators.feature_generation.ComparisonFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates binary comparison features between pairs of columns, or unary null checks.

Parameters:
  • subset_a (List[str]) – List of column names for the left side of comparisons (or the only column for unary operators).

  • subset_b (List[str]) – List of column names for the right side of comparisons. For unary operators (‘is_null’, ‘is_not_null’), these values are ignored.

  • operators (List[Literal[">", "<", ">=", "<=", "==", "!=", "is_null", "is_not_null"]]) – List of comparison operators to apply; must have the same length as subset_a and subset_b. Unary operators ('is_null', 'is_not_null') use only subset_a; binary operators ('>', '<', '>=', '<=', '==', '!=') use both subset_a and subset_b.

  • drop_columns (bool, default=False) – Whether to drop the original columns after creating comparisons.

Examples

>>> from gators.feature_generation import ComparisonFeatures
>>> import polars as pl
>>> X = {'A': [10, 20, 30, 40],
...      'B': [15, 10, 30, 35],
...      'C': [5, 25, 20, 50]}
>>> X = pl.DataFrame(X)

Example 1: Single comparison

>>> transformer = ComparisonFeatures(
...     subset_a=['A'],
...     subset_b=['B'],
...     operators=['>']
... )
>>> transformer.fit(X)
ComparisonFeatures(subset_a=['A'], subset_b=['B'], operators=['>'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 4)
┌──────┬──────┬──────┬─────────┐
│  A   │  B   │  C   │ A_gt_B  │
│ i64  │ i64  │ i64  │  bool   │
├──────┼──────┼──────┼─────────┤
│  10  │  15  │  5   │  false  │
│  20  │  10  │  25  │  true   │
│  30  │  30  │  20  │  false  │
│  40  │  35  │  50  │  true   │
└──────┴──────┴──────┴─────────┘

Example 2: Multiple comparisons with different operators

>>> transformer = ComparisonFeatures(
...     subset_a=['A', 'B', 'A'],
...     subset_b=['B', 'C', 'C'],
...     operators=['>', '<', '>=']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 6)
┌──────┬──────┬──────┬─────────┬─────────┬─────────┐
│  A   │  B   │  C   │ A_gt_B  │ B_lt_C  │ A_gte_C │
│ i64  │ i64  │ i64  │  bool   │  bool   │  bool   │
├──────┼──────┼──────┼─────────┼─────────┼─────────┤
│  10  │  15  │  5   │  false  │  false  │  true   │
│  20  │  10  │  25  │  true   │  true   │  false  │
│  30  │  30  │  20  │  false  │  false  │  true   │
│  40  │  35  │  50  │  true   │  true   │  false  │
└──────┴──────┴──────┴─────────┴─────────┴─────────┘

Example 3: Null checks (unary operators)

>>> data_with_nulls = pl.DataFrame({
...     'A': [10, None, 30, None],
...     'B': [15, 10, None, 35]
... })
>>> transformer = ComparisonFeatures(
...     subset_a=['A', 'B'],
...     subset_b=['', ''],  # Ignored for unary operators
...     operators=['is_null', 'is_not_null']
... )
>>> result = transformer.fit_transform(data_with_nulls)
>>> result
shape: (4, 4)
┌──────┬──────┬────────────┬────────────────┐
│  A   │  B   │ A__is_null │ B__is_not_null │
│ i64  │ i64  │  bool      │  bool          │
├──────┼──────┼────────────┼────────────────┤
│  10  │  15  │  false     │  true          │
│ null │  10  │  true      │  true          │
│  30  │ null │  false     │  false         │
│ null │  35  │  true      │  true          │
└──────┴──────┴────────────┴────────────────┘

Example 4: With drop_columns=True

>>> transformer = ComparisonFeatures(
...     subset_a=['A'],
...     subset_b=['B'],
...     operators=['>'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 2)
┌──────┬─────────┐
│  C   │ A_gt_B  │
│ i64  │  bool   │
├──────┼─────────┤
│  5   │  false  │
│  25  │  true   │
│  20  │  false  │
│  50  │  true   │
└──────┴─────────┘
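The output column names seen in the examples above (A_gt_B, B_lt_C, A_gte_C) follow a positional pairing of subset_a, subset_b, and operators. A minimal sketch of that naming scheme, with the operator-to-suffix mapping inferred from the doctest output rather than taken from the library source:

```python
# Operator suffixes inferred from the example output columns above.
OP_NAMES = {'>': 'gt', '<': 'lt', '>=': 'gte', '<=': 'lte',
            '==': 'eq', '!=': 'ne'}

def comparison_names(subset_a, subset_b, operators):
    """Pair columns positionally and build one name per comparison."""
    return [f"{a}_{OP_NAMES[op]}_{b}"
            for a, b, op in zip(subset_a, subset_b, operators)]

names = comparison_names(['A', 'B', 'A'], ['B', 'C', 'C'], ['>', '<', '>='])
print(names)  # ['A_gt_B', 'B_lt_C', 'A_gte_C']
```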
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ComparisonFeatures

transform(X)[source]#

Transform the input DataFrame by creating comparison features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with comparison features.

Return type:

DataFrame

class gators.feature_generation.ConditionFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Creates multiple independent boolean features, one for each condition.

This transformer is designed for creating simple boolean flags without combination logic. Each condition produces exactly one boolean output column. For combining multiple conditions with AND/OR logic, use RuleFeatures instead.

Use Cases:

  • Create simple boolean flags (is_adult, is_weekend, is_premium, etc.)

  • Materialize threshold-based features (is_high_value, is_frequent_user)

  • Feature engineering: Generate independent indicator variables

  • Fraud detection: Create simple risk flags before combining them

When to Use:

  • Need multiple independent boolean columns

  • Each condition stands alone (no AND/OR combination needed)

  • Want cleaner API than RuleFeatures for simple cases

  • Building feature sets for downstream transformers

When NOT to Use:

  • Need to combine conditions with AND/OR (use RuleFeatures)

  • One-off exploratory analysis (use Polars native expressions)

  • Very simple cases with 1-2 conditions (just use .with_columns())

Parameters:
  • conditions (List[Dict[str, Any]]) –

    List of condition dictionaries. Each condition creates one boolean output column.

    Each condition dictionary must contain:

    • ’column’: str - Name of the column to evaluate

    • ’op’: str - Comparison operator. Supported:

      • Binary: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’ (require ‘value’ or ‘other_column’)

      • Unary: ‘is_null’, ‘is_not_null’ (no ‘value’ or ‘other_column’ needed)

    • ’value’: Any (optional) - Scalar value to compare the column against

    • ’other_column’: str (optional) - Name of another column to compare against

    For binary operators: Either ‘value’ or ‘other_column’ must be specified, but not both. For unary operators: Neither ‘value’ nor ‘other_column’ should be specified.

    Examples:

    # Simple conditions:
    [
        {'column': 'age', 'op': '>=', 'value': 18},
        {'column': 'amount', 'op': '>', 'value': 1000}
    ]
    
    # Column comparison:
    [
        {'column': 'velocity_24h', 'op': '>', 'other_column': 'velocity_7d'}
    ]
    
    # Null checks:
    [
        {'column': 'age', 'op': 'is_null'},
        {'column': 'email', 'op': 'is_not_null'}
    ]
    

  • new_column_names (Optional[List[str]], default=None) –

    Names for the resulting boolean feature columns. If provided, must have the same length as conditions. If None, column names are auto-generated in the format:

    • Scalar comparison: {column}_{op_name}_{value} (e.g., ‘age_gte_18’)

    • Column comparison: {column}_{op_name}_{other_column} (e.g., ‘velocity_24h_gt_velocity_7d’)

    • Unary operation: {column}__{op_name} (e.g., ‘age__is_null’)

    Operator name mapping:

    • ’>’ -> ‘gt’

    • ’<’ -> ‘lt’

    • ’>=’ -> ‘gte’

    • ’<=’ -> ‘lte’

    • ’==’ -> ‘eq’

    • ’!=’ -> ‘ne’

    • ’is_null’ -> ‘is_null’

    • ’is_not_null’ -> ‘is_not_null’
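The condition semantics and auto-naming scheme documented above can be emulated on plain dict rows with Python's operator module. This is a simplified sketch for illustration, not the library's implementation; in particular it ignores Polars null-propagation for binary comparisons:

```python
import operator

# Binary operators map to functions from the operator module;
# unary null checks are handled separately, per the spec above.
OPS = {'>': operator.gt, '<': operator.lt, '>=': operator.ge,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}
OP_NAMES = {'>': 'gt', '<': 'lt', '>=': 'gte', '<=': 'lte',
            '==': 'eq', '!=': 'ne'}

def evaluate(condition, row):
    """Evaluate one condition dict against a row (a plain dict)."""
    value = row[condition['column']]
    op = condition['op']
    if op == 'is_null':
        return value is None
    if op == 'is_not_null':
        return value is not None
    if 'other_column' in condition:
        return OPS[op](value, row[condition['other_column']])
    return OPS[op](value, condition['value'])

def auto_name(condition):
    """Reproduce the auto-naming scheme documented above."""
    op = condition['op']
    if op in ('is_null', 'is_not_null'):
        return f"{condition['column']}__{op}"
    target = condition.get('other_column', condition.get('value'))
    return f"{condition['column']}_{OP_NAMES[op]}_{target}"

cond = {'column': 'age', 'op': '>=', 'value': 18}
print(auto_name(cond))              # age_gte_18
print(evaluate(cond, {'age': 25}))  # True
```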

Examples

>>> import polars as pl
>>> from gators.feature_generation import ConditionFeatures
>>> X = {
...     'age': [15, 25, 30, 17, 45],
...     'amount': [100, 1500, 500, 200, 2000],
...     'family_size': [1, 3, 1, 4, 2],
...     'fare': [50, 75, 30, 100, 80]
... }
>>> X = pl.DataFrame(X)

Example 1: Create simple boolean flags

>>> transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': '>=', 'value': 18},
...         {'column': 'amount', 'op': '>', 'value': 1000},
...         {'column': 'family_size', 'op': '==', 'value': 1}
...     ],
...     new_column_names=['is_adult', 'is_high_amount', 'is_alone']
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['age', 'amount', 'family_size', 'is_adult', 'is_high_amount', 'is_alone'])
shape: (5, 6)
┌─────┬────────┬─────────────┬──────────┬─────────────────┬──────────┐
│ age ┆ amount ┆ family_size ┆ is_adult ┆ is_high_amount  ┆ is_alone │
│ --- ┆ ---    ┆ ---         ┆ ---      ┆ ---             ┆ ---      │
│ i64 ┆ i64    ┆ i64         ┆ bool     ┆ bool            ┆ bool     │
╞═════╪════════╪═════════════╪══════════╪═════════════════╪══════════╡
│ 15  ┆ 100    ┆ 1           ┆ false    ┆ false           ┆ true     │
│ 25  ┆ 1500   ┆ 3           ┆ true     ┆ true            ┆ false    │
│ 30  ┆ 500    ┆ 1           ┆ true     ┆ false           ┆ true     │
│ 17  ┆ 200    ┆ 4           ┆ false    ┆ false           ┆ false    │
│ 45  ┆ 2000   ┆ 2           ┆ true     ┆ true            ┆ false    │
└─────┴────────┴─────────────┴──────────┴─────────────────┴──────────┘

Example 2: Column-to-column comparison

>>> fare_X = {
...     'fare': [50.0, 100.0, 30.0, 200.0, 80.0],
...     'fare_per_person': [50.0, 33.3, 30.0, 50.0, 40.0]
... }
>>> fare_X = pl.DataFrame(fare_X)
>>> fare_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'fare', 'op': '>', 'value': 100},
...         {'column': 'fare_per_person', 'op': '>', 'other_column': 'fare'}
...     ],
...     new_column_names=['is_expensive', 'paid_more_per_person']
... )
>>> result = fare_transformer.fit_transform(fare_X)
>>> result
shape: (5, 4)
┌───────┬──────────────────┬──────────────┬──────────────────────┐
│ fare  ┆ fare_per_person  ┆ is_expensive ┆ paid_more_per_person │
│ ---   ┆ ---              ┆ ---          ┆ ---                  │
│ f64   ┆ f64              ┆ bool         ┆ bool                 │
╞═══════╪══════════════════╪══════════════╪══════════════════════╡
│ 50.0  ┆ 50.0             ┆ false        ┆ false                │
│ 100.0 ┆ 33.3             ┆ false        ┆ false                │
│ 30.0  ┆ 30.0             ┆ false        ┆ false                │
│ 200.0 ┆ 50.0             ┆ true         ┆ false                │
│ 80.0  ┆ 40.0             ┆ false        ┆ false                │
└───────┴──────────────────┴──────────────┴──────────────────────┘

Example 3: Titanic-style feature engineering

>>> titanic_X = {
...     'Age': [22.0, 38.0, 26.0, 35.0, 12.0],
...     'Pclass': [3, 1, 3, 1, 3],
...     'SibSp': [1, 1, 0, 1, 0],
...     'Parch': [0, 0, 0, 0, 1]
... }
>>> titanic_X = pl.DataFrame(titanic_X)
>>> # First add family_size
>>> titanic_X = titanic_X.with_columns(
...     (pl.col('SibSp') + pl.col('Parch')).alias('family_size')
... )
>>> titanic_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'Age', 'op': '<', 'value': 18},
...         {'column': 'Pclass', 'op': '==', 'value': 1},
...         {'column': 'family_size', 'op': '==', 'value': 0}
...     ],
...     new_column_names=['is_child', 'is_first_class', 'is_alone']
... )
>>> result = titanic_transformer.fit_transform(titanic_X)
>>> result.select(['Age', 'Pclass', 'family_size', 'is_child', 'is_first_class', 'is_alone'])
shape: (5, 6)
┌──────┬────────┬─────────────┬──────────┬────────────────┬──────────┐
│ Age  ┆ Pclass ┆ family_size ┆ is_child ┆ is_first_class ┆ is_alone │
│ ---  ┆ ---    ┆ ---         ┆ ---      ┆ ---            ┆ ---      │
│ f64  ┆ i64    ┆ i64         ┆ bool     ┆ bool           ┆ bool     │
╞══════╪════════╪═════════════╪══════════╪════════════════╪══════════╡
│ 22.0 ┆ 3      ┆ 1           ┆ false    ┆ false          ┆ false    │
│ 38.0 ┆ 1      ┆ 1           ┆ false    ┆ true           ┆ false    │
│ 26.0 ┆ 3      ┆ 0           ┆ false    ┆ false          ┆ true     │
│ 35.0 ┆ 1      ┆ 1           ┆ false    ┆ true           ┆ false    │
│ 12.0 ┆ 3      ┆ 1           ┆ true     ┆ false          ┆ false    │
└──────┴────────┴─────────────┴──────────┴────────────────┴──────────┘

Example 4: Auto-generated column names

>>> auto_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': '>=', 'value': 18},
...         {'column': 'amount', 'op': '>', 'value': 1000},
...         {'column': 'family_size', 'op': '==', 'value': 1}
...     ]
...     # new_column_names not specified - will be auto-generated
... )
>>> result = auto_transformer.fit_transform(X)
>>> result.select(['age', 'amount', 'family_size', 'age_gte_18', 'amount_gt_1000', 'family_size_eq_1'])
shape: (5, 6)
┌─────┬────────┬─────────────┬────────────┬────────────────┬──────────────────┐
│ age ┆ amount ┆ family_size ┆ age_gte_18 ┆ amount_gt_1000 ┆ family_size_eq_1 │
│ --- ┆ ---    ┆ ---         ┆ ---        ┆ ---            ┆ ---              │
│ i64 ┆ i64    ┆ i64         ┆ bool       ┆ bool           ┆ bool             │
╞═════╪════════╪═════════════╪════════════╪════════════════╪══════════════════╡
│ 15  ┆ 100    ┆ 1           ┆ false      ┆ false          ┆ true             │
│ 25  ┆ 1500   ┆ 3           ┆ true       ┆ true           ┆ false            │
│ 30  ┆ 500    ┆ 1           ┆ true       ┆ false          ┆ true             │
│ 17  ┆ 200    ┆ 4           ┆ false      ┆ false          ┆ false            │
│ 45  ┆ 2000   ┆ 2           ┆ true       ┆ true           ┆ false            │
└─────┴────────┴─────────────┴────────────┴────────────────┴──────────────────┘

Example 5: Null checks (unary operators)

>>> data_with_nulls = {
...     'age': [25, None, 30, 17, None],
...     'email': ['a@test.com', 'b@test.com', None, 'd@test.com', None],
...     'amount': [100, 1500, 500, 200, 2000]
... }
>>> X_nulls = pl.DataFrame(data_with_nulls)
>>> null_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': 'is_null'},
...         {'column': 'email', 'op': 'is_not_null'},
...         {'column': 'amount', 'op': '>', 'value': 1000}
...     ],
...     new_column_names=['age_missing', 'has_email', 'is_high_amount']
... )
>>> result = null_transformer.fit_transform(X_nulls)
>>> result
shape: (5, 6)
┌──────┬─────────────┬────────┬─────────────┬───────────┬─────────────────┐
│ age  ┆ email       ┆ amount ┆ age_missing ┆ has_email ┆ is_high_amount  │
│ ---  ┆ ---         ┆ ---    ┆ ---         ┆ ---       ┆ ---             │
│ i64  ┆ str         ┆ i64    ┆ bool        ┆ bool      ┆ bool            │
╞══════╪═════════════╪════════╪═════════════╪═══════════╪═════════════════╡
│ 25   ┆ a@test.com  ┆ 100    ┆ false       ┆ true      ┆ false           │
│ null ┆ b@test.com  ┆ 1500   ┆ true        ┆ true      ┆ true            │
│ 30   ┆ null        ┆ 500    ┆ false       ┆ false     ┆ false           │
│ 17   ┆ d@test.com  ┆ 200    ┆ false       ┆ true      ┆ false           │
│ null ┆ null        ┆ 2000   ┆ true        ┆ false     ┆ true            │
└──────┴─────────────┴────────┴─────────────┴───────────┴─────────────────┘

Notes

  • Each condition produces exactly one independent boolean column

  • Auto-naming: If new_column_names is None, names are auto-generated as:

    • Scalar: {column}_{op_name}_{value} (e.g., ‘age_gte_18’)

    • Column-to-column: {column}_{op_name}_{other_column} (e.g., ‘velocity_24h_gt_velocity_7d’)

    • Unary: {column}__{op_name} (e.g., ‘age__is_null’)

  • No combination logic - use RuleFeatures if you need AND/OR

  • Simpler API than RuleFeatures for common use cases

  • Missing values (null) in comparisons typically result in null/false

  • Unary operators ‘is_null’ and ‘is_not_null’ explicitly check for null values

  • Can be used as preprocessing step before RuleFeatures for complex logic

See also

RuleFeatures

For combining multiple conditions with AND/OR logic

fit(X, y=None)[source]#

Fit the transformer by generating column names if not provided.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Any | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ConditionFeatures

transform(X)[source]#

Transform the input DataFrame by creating boolean features for each condition.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with new boolean features (one per condition).

Return type:

DataFrame

class gators.feature_generation.DistanceFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Calculates distances between geographic coordinate pairs.

This transformer computes distances between consecutive pairs of latitude/longitude coordinates using different distance metrics (euclidean, manhattan, haversine) and units (km, miles, meters, feet).

For fraud detection, distance features are valuable for:

  • Detecting location anomalies (billing vs shipping address distance)

  • Identifying suspicious IP geolocation patterns

  • Flagging transactions far from customer’s typical location

  • Calculating travel feasibility (transaction velocity checks)

Parameters:
  • lats (List[str]) – List of latitude column names. Must have at least 2 elements. Coordinates are paired sequentially: (lats[0], longs[0]) to (lats[1], longs[1]), etc.

  • longs (List[str]) – List of longitude column names. Must have same length as lats.

  • unit (Literal["km", "miles", "meters", "feet"], default="km") – Unit for distance output.

  • method (Literal["euclidean", "manhattan", "haversine"], default="haversine") –

    Distance calculation method:

    • ‘haversine’: Great-circle distance on a sphere (recommended for lat/long)

    • ‘euclidean’: Straight-line distance

    • ‘manhattan’: Sum of absolute differences (taxicab distance)

  • drop_columns (bool, default=True) – Whether to drop the original coordinate columns.

  • new_column_names (Optional[List[str]], default=None) – Custom names for distance columns. If None, uses pattern: ‘distance__{lat1}_to_{lat2}__{method}_{unit}’

Examples

>>> from gators.feature_generation import DistanceFeatures
>>> import polars as pl

Example 1: Haversine distance between two locations

>>> X = pl.DataFrame({
...     'billing_lat': [40.7128, 34.0522, 41.8781],
...     'billing_long': [-74.0060, -118.2437, -87.6298],
...     'shipping_lat': [40.7580, 34.0522, 42.3601],
...     'shipping_long': [-73.9855, -118.2437, -71.0589]
... })
>>> transformer = DistanceFeatures(
...     lats=['billing_lat', 'shipping_lat'],
...     longs=['billing_long', 'shipping_long'],
...     method='haversine',
...     unit='km'
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['distance__billing_lat_to_shipping_lat__haversine_km']
>>> result['distance__billing_lat_to_shipping_lat__haversine_km'][0]
5.376...

Example 2: Multiple distance pairs

>>> X = pl.DataFrame({
...     'home_lat': [40.7128, 34.0522],
...     'home_long': [-74.0060, -118.2437],
...     'work_lat': [40.7580, 34.0700],
...     'work_long': [-73.9855, -118.3000],
...     'shop_lat': [40.7489, 34.0800],
...     'shop_long': [-73.9680, -118.3500]
... })
>>> transformer = DistanceFeatures(
...     lats=['home_lat', 'work_lat', 'shop_lat'],
...     longs=['home_long', 'work_long', 'shop_long'],
...     method='haversine',
...     unit='miles',
...     drop_columns=False
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['home_lat', 'home_long', 'work_lat', 'work_long', 'shop_lat', 'shop_long',
 'distance__home_lat_to_work_lat__haversine_miles',
 'distance__work_lat_to_shop_lat__haversine_miles']

Example 3: Euclidean distance

>>> X = pl.DataFrame({
...     'x1': [0.0, 1.0, 2.0],
...     'y1': [0.0, 1.0, 2.0],
...     'x2': [3.0, 4.0, 5.0],
...     'y2': [4.0, 5.0, 6.0]
... })
>>> transformer = DistanceFeatures(
...     lats=['x1', 'x2'],
...     longs=['y1', 'y2'],
...     method='euclidean',
...     unit='meters'
... )
>>> result = transformer.fit_transform(X)
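For reference, the haversine metric named above can be written in a few lines of plain Python. The Earth radius constant (6371 km) is a common convention and an assumption here, not necessarily what the library uses, so results may differ slightly from the doctest output:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/long points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Billing vs. shipping coordinates from Example 1 (first row).
d = haversine_km(40.7128, -74.0060, 40.7580, -73.9855)
print(round(d, 2))  # roughly 5.3 km
```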
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

DistanceFeatures

transform(X)[source]#

Transform the input DataFrame by calculating distance features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with distance features.

Return type:

DataFrame

class gators.feature_generation.ScalarMathFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates new features by applying mathematical operations between columns and scalar values.

This transformer performs element-wise operations between a column and a scalar constant. Each operation creates one new feature column. For operations between multiple columns, use MathFeatures instead.

Use Cases:

  • Unit conversions (days to years, meters to feet, Celsius to Fahrenheit)

  • Normalization (divide by constant, multiply by scaling factor)

  • Feature scaling (percentage calculation, ratio computation)

  • Offset adjustments (add/subtract baseline values)

When to Use:

  • Need to apply arithmetic operations with fixed scalar values

  • Creating interpretable transformations (e.g., Age/365 for age_in_years)

  • Scaling features by known constants

  • Building feature sets for downstream models

When NOT to Use:

  • Operations between multiple columns (use MathFeatures)

  • Need learned scaling (use StandardScaler, MinMaxScaler)

  • Complex mathematical functions (use DataFrame.with_columns directly)

Parameters:
  • operations (List[Dict[str, Any]]) –

    List of operation dictionaries. Each operation creates one new feature column.

    Each operation dictionary must contain:

    • ’column’: str - Name of the column to operate on

    • ’op’: str - Arithmetic operator. Supported: ‘+’, ‘-’, ‘*’, ‘/’, ‘**’, ‘//’, ‘%’

    • ’scalar’: Any - Scalar value to combine with the column

  • new_column_names (Optional[List[str]], default=None) –

    Names for the resulting feature columns. If provided, must have the same length as operations. If None, column names are auto-generated in the format: {column}_{op_name}_{scalar} (e.g., ‘Age_div_365’, ‘Price_mul_1.1’)

    Operator name mapping: ‘+’ -> ‘plus’, ‘-’ -> ‘minus’, ‘*’ -> ‘mul’, ‘/’ -> ‘div’, ‘**’ -> ‘pow’, ‘//’ -> ‘floordiv’, ‘%’ -> ‘mod’
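The element-wise semantics and the operator name mapping above can be sketched with Python's operator module; this is a simplified emulation on plain lists, not the library source:

```python
import operator

# Documented name mapping paired with the corresponding Python functions.
SCALAR_OPS = {'+': ('plus', operator.add), '-': ('minus', operator.sub),
              '*': ('mul', operator.mul), '/': ('div', operator.truediv),
              '**': ('pow', operator.pow), '//': ('floordiv', operator.floordiv),
              '%': ('mod', operator.mod)}

def apply_scalar_op(values, op, scalar):
    """Apply one scalar operation element-wise; return (op_name, results)."""
    name, fn = SCALAR_OPS[op]
    return name, [fn(v, scalar) for v in values]

name, result = apply_scalar_op([25, 30, 45], '//', 10)
print(name, result)  # floordiv [2, 3, 4]
```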

Examples

>>> import polars as pl
>>> from gators.feature_generation import ScalarMathFeatures
>>> X = {
...     'Age': [25, 30, 45, 12, 65],
...     'Price': [100.0, 150.0, 200.0, 75.0, 300.0],
...     'Temperature': [20.0, 25.0, 15.0, 30.0, 22.0]
... }
>>> X = pl.DataFrame(X)

Example 1: Unit conversions with custom names

>>> transformer = ScalarMathFeatures(
...     operations=[
...         {'column': 'Age', 'op': '/', 'scalar': 365},
...         {'column': 'Temperature', 'op': '+', 'scalar': 273.15}
...     ],
...     new_column_names=['Age_years', 'Temperature_kelvin']
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['Age', 'Age_years', 'Temperature', 'Temperature_kelvin'])
shape: (5, 4)
┌─────┬───────────┬─────────────┬───────────────────┐
│ Age ┆ Age_years ┆ Temperature ┆ Temperature_kelvin│
│ --- ┆ ---       ┆ ---         ┆ ---               │
│ i64 ┆ f64       ┆ f64         ┆ f64               │
╞═════╪═══════════╪═════════════╪═══════════════════╡
│ 25  ┆ 0.068493  ┆ 20.0        ┆ 293.15            │
│ 30  ┆ 0.082192  ┆ 25.0        ┆ 298.15            │
│ 45  ┆ 0.123288  ┆ 15.0        ┆ 288.15            │
│ 12  ┆ 0.032877  ┆ 30.0        ┆ 303.15            │
│ 65  ┆ 0.178082  ┆ 22.0        ┆ 295.15            │
└─────┴───────────┴─────────────┴───────────────────┘

Example 2: Auto-generated column names

>>> auto_transformer = ScalarMathFeatures(
...     operations=[
...         {'column': 'Price', 'op': '*', 'scalar': 1.1},
...         {'column': 'Price', 'op': '/', 'scalar': 100}
...     ]
...     # new_column_names not specified - will be auto-generated
... )
>>> result = auto_transformer.fit_transform(X)
>>> result.select(['Price', 'Price_mul_1.1', 'Price_div_100'])
shape: (5, 3)
┌───────┬──────────────┬───────────────┐
│ Price ┆ Price_mul_1.1┆ Price_div_100 │
│ ---   ┆ ---          ┆ ---           │
│ f64   ┆ f64          ┆ f64           │
╞═══════╪══════════════╪═══════════════╡
│ 100.0 ┆ 110.0        ┆ 1.0           │
│ 150.0 ┆ 165.0        ┆ 1.5           │
│ 200.0 ┆ 220.0        ┆ 2.0           │
│ 75.0  ┆ 82.5         ┆ 0.75          │
│ 300.0 ┆ 330.0        ┆ 3.0           │
└───────┴──────────────┴───────────────┘

Example 3: Multiple operations (scaling, percentage, tax)

>>> multi_ops = ScalarMathFeatures(
...     operations=[
...         {'column': 'Price', 'op': '*', 'scalar': 1.2},  # 20% markup
...         {'column': 'Price', 'op': '/', 'scalar': 100},  # as percentage of 100
...         {'column': 'Age', 'op': '%', 'scalar': 10}      # age modulo 10
...     ],
...     new_column_names=['Price_with_tax', 'Price_pct', 'Age_decade_offset']
... )
>>> result = multi_ops.fit_transform(X)
>>> result.select(['Price', 'Price_with_tax', 'Price_pct', 'Age', 'Age_decade_offset'])
shape: (5, 5)
┌───────┬────────────────┬───────────┬─────┬───────────────────┐
│ Price ┆ Price_with_tax ┆ Price_pct ┆ Age ┆ Age_decade_offset │
│ ---   ┆ ---            ┆ ---       ┆ --- ┆ ---               │
│ f64   ┆ f64            ┆ f64       ┆ i64 ┆ i64               │
╞═══════╪════════════════╪═══════════╪═════╪═══════════════════╡
│ 100.0 ┆ 120.0          ┆ 1.0       ┆ 25  ┆ 5                 │
│ 150.0 ┆ 180.0          ┆ 1.5       ┆ 30  ┆ 0                 │
│ 200.0 ┆ 240.0          ┆ 2.0       ┆ 45  ┆ 5                 │
│ 75.0  ┆ 90.0           ┆ 0.75      ┆ 12  ┆ 2                 │
│ 300.0 ┆ 360.0          ┆ 3.0       ┆ 65  ┆ 5                 │
└───────┴────────────────┴───────────┴─────┴───────────────────┘

Example 4: Power and floor division

>>> power_ops = ScalarMathFeatures(
...     operations=[
...         {'column': 'Age', 'op': '**', 'scalar': 2},
...         {'column': 'Age', 'op': '//', 'scalar': 10}
...     ],
...     new_column_names=['Age_squared', 'Age_decade']
... )
>>> result = power_ops.fit_transform(X)
>>> result.select(['Age', 'Age_squared', 'Age_decade'])
shape: (5, 3)
┌─────┬─────────────┬────────────┐
│ Age ┆ Age_squared ┆ Age_decade │
│ --- ┆ ---         ┆ ---        │
│ i64 ┆ i64         ┆ i64        │
╞═════╪═════════════╪════════════╡
│ 25  ┆ 625         ┆ 2          │
│ 30  ┆ 900         ┆ 3          │
│ 45  ┆ 2025        ┆ 4          │
│ 12  ┆ 144         ┆ 1          │
│ 65  ┆ 4225        ┆ 6          │
└─────┴─────────────┴────────────┘

Notes

  • Each operation produces exactly one new feature column

  • Auto-naming: If new_column_names is None, names are auto-generated as: {column}_{op_name}_{scalar} (e.g., ‘Age_div_365’)

  • Operations are applied element-wise to each row

  • Division by zero will result in inf or null values (Polars default behavior)

  • Can chain multiple ScalarMathFeatures transformers in a pipeline

  • For learned transformations, consider sklearn scalers instead

See also

MathFeatures

For operations between multiple columns

ConditionFeatures

For creating boolean features from conditions

fit(X, y=None)[source]#

Fit the transformer by generating column names if not provided.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Any | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ScalarMathFeatures

transform(X)[source]#

Transform the input DataFrame by creating new features from scalar operations.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with new computed features (one per operation).

Return type:

DataFrame

class gators.feature_generation.RuleFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Creates multiple boolean features, each from a group of conditions combined with logical operators.

This transformer is useful for creating multiple rule-based features simultaneously, where each rule represents a distinct business logic or fraud detection pattern. Each rule group produces its own boolean output column.

Use Cases:

  • Fraud detection: Create multiple risk indicators (velocity spike, amount anomaly, etc.)

  • Business rules: Generate several eligibility/qualification flags at once

  • Feature engineering: Build a family of related boolean features

  • Production pipelines: Encapsulate multiple rule definitions in one transformer

When to Use:

  • Building production ML pipelines that need serialization

  • Creating reusable feature engineering templates

  • Working with sklearn-based systems that expect transformers

  • Need version control of feature logic (can serialize to JSON/YAML)

  • Want to create multiple related boolean features efficiently

When NOT to Use:

  • One-off exploratory analysis (use Polars native expressions)

  • Very complex nested logic within a single rule (consider Polars native)

  • Performance-critical scenarios where every microsecond counts

Parameters:
  • rules (List[List[Dict[str, Any]]]) –

    List of rule groups. Each rule group contains condition dictionaries that will be combined to create one boolean output column.

    Each condition dictionary must contain:

    • ’column’: str - Name of the column to evaluate

    • ’op’: str - Comparison operator. Supported: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’

    • ’value’: Any (optional) - Scalar value to compare the column against

    • ’other_column’: str (optional) - Name of another column to compare against

    Either ‘value’ or ‘other_column’ must be specified, but not both.

    Examples:

    # Two rules:
    [
        [{'column': 'age', 'op': '>', 'value': 18}],
        [{'column': 'amount', 'op': '>', 'value': 1000}]
    ]
    
    # Rule with multiple conditions:
    [
        [{'column': 'age', 'op': '>', 'value': 18},
         {'column': 'amount', 'op': '>', 'value': 1000}]
    ]
    

  • rule_logic (Literal['and', 'or'], default='and') –

    How to combine conditions within each rule group:

    • ’and’: All conditions in a group must be True

    • ’or’: At least one condition in a group must be True

  • new_column_names (List[str]) – Names for the resulting boolean feature columns. Must have the same length as rules. Each rule group will produce a column with the corresponding name.

  • drop_conditions (bool, default=False) – Whether to drop intermediate condition columns after combining. Recommended: True for cleaner output.

Examples

>>> import polars as pl
>>> from gators.feature_generation import RuleFeatures
>>> X = {
...     'amount': [100, 500, 1200, 50, 2000],
...     'velocity_24h': [1, 3, 5, 0, 10],
...     'velocity_7d': [5, 8, 10, 2, 15],
...     'is_new_user': [True, False, False, True, False]
... }
>>> X = pl.DataFrame(X)

Example 1: Create two risk indicators in one pass

>>> multi_risk_transformer = RuleFeatures(
...     rules=[
...         # Rule 1: Activity spike (24h > 0 AND 7d == 24h)
...         [
...             {'column': 'velocity_24h', 'op': '>', 'value': 0},
...             {'column': 'velocity_7d', 'op': '==', 'other_column': 'velocity_24h'}
...         ],
...         # Rule 2: High amount (amount > 1000)
...         [
...             {'column': 'amount', 'op': '>', 'value': 1000}
...         ]
...     ],
...     rule_logic='and',
...     new_column_names=['is_activity_spike', 'is_high_amount'],
...     drop_conditions=True
... )
>>> result = multi_risk_transformer.fit_transform(X)
>>> result.select(['velocity_24h', 'velocity_7d', 'amount',
...                'is_activity_spike', 'is_high_amount'])
shape: (5, 5)
┌──────────────┬─────────────┬────────┬────────────────────┬─────────────────┐
│ velocity_24h ┆ velocity_7d ┆ amount ┆ is_activity_spike  ┆ is_high_amount  │
│ ---          ┆ ---         ┆ ---    ┆ ---                ┆ ---             │
│ i64          ┆ i64         ┆ i64    ┆ bool               ┆ bool            │
╞══════════════╪═════════════╪════════╪════════════════════╪═════════════════╡
│ 1            ┆ 5           ┆ 100    ┆ false              ┆ false           │
│ 3            ┆ 8           ┆ 500    ┆ false              ┆ false           │
│ 5            ┆ 10          ┆ 1200   ┆ false              ┆ true            │
│ 0            ┆ 2           ┆ 50     ┆ false              ┆ false           │
│ 10           ┆ 15          ┆ 2000   ┆ false              ┆ true            │
└──────────────┴─────────────┴────────┴────────────────────┴─────────────────┘

Example 2: OR logic within a rule (high amount OR high velocity)

>>> or_transformer = RuleFeatures(
...     rules=[
...         [
...             {'column': 'amount', 'op': '>', 'value': 1000},
...             {'column': 'velocity_24h', 'op': '>=', 'value': 5}
...         ]
...     ],
...     rule_logic='or',
...     new_column_names=['is_high_risk'],
...     drop_conditions=True
... )
>>> result = or_transformer.fit_transform(X)
>>> result.select(['amount', 'velocity_24h', 'is_high_risk'])
shape: (5, 3)
┌────────┬──────────────┬──────────────┐
│ amount ┆ velocity_24h ┆ is_high_risk │
│ ---    ┆ ---          ┆ ---          │
│ i64    ┆ i64          ┆ bool         │
╞════════╪══════════════╪══════════════╡
│ 100    ┆ 1            ┆ false        │
│ 500    ┆ 3            ┆ false        │
│ 1200   ┆ 5            ┆ true         │
│ 50     ┆ 0            ┆ false        │
│ 2000   ┆ 10           ┆ true         │
└────────┴──────────────┴──────────────┘

Example 3: Multiple rules with different logic patterns

>>> complex_transformer = RuleFeatures(
...     rules=[
...         # New user AND high amount AND high velocity
...         [
...             {'column': 'is_new_user', 'op': '==', 'value': True},
...             {'column': 'amount', 'op': '>', 'value': 1000},
...             {'column': 'velocity_24h', 'op': '>', 'value': 3}
...         ],
...         # Very high velocity (simple rule)
...         [
...             {'column': 'velocity_24h', 'op': '>=', 'value': 10}
...         ]
...     ],
...     rule_logic='and',
...     new_column_names=['is_suspicious_new_user', 'is_extreme_velocity']
... )
>>> result = complex_transformer.fit_transform(X)
>>> result.select(['is_new_user', 'amount', 'velocity_24h',
...                'is_suspicious_new_user', 'is_extreme_velocity'])
shape: (5, 5)
┌─────────────┬────────┬──────────────┬─────────────────────────┬──────────────────────┐
│ is_new_user ┆ amount ┆ velocity_24h ┆ is_suspicious_new_user  ┆ is_extreme_velocity  │
│ ---         ┆ ---    ┆ ---          ┆ ---                     ┆ ---                  │
│ bool        ┆ i64    ┆ i64          ┆ bool                    ┆ bool                 │
╞═════════════╪════════╪══════════════╪═════════════════════════╪══════════════════════╡
│ true        ┆ 100    ┆ 1            ┆ false                   ┆ false                │
│ false       ┆ 500    ┆ 3            ┆ false                   ┆ false                │
│ false       ┆ 1200   ┆ 5            ┆ false                   ┆ false                │
│ true        ┆ 50     ┆ 0            ┆ false                   ┆ false                │
│ false       ┆ 2000   ┆ 10           ┆ false                   ┆ true                 │
└─────────────┴────────┴──────────────┴─────────────────────────┴──────────────────────┘

Notes

  • Each rule group produces one boolean output column

  • All conditions within a rule are evaluated independently before combining

  • Missing values (null) in comparisons typically result in null/false

  • Creates intermediate boolean columns, so use drop_conditions=True for cleaner output

  • To create a single column from multiple rules with complex logic (AND of ORs), use this transformer to create intermediate columns, then combine them manually

fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Any | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

RuleFeatures

transform(X)[source]#

Transform the input DataFrame by creating boolean features for each rule.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with new boolean features (one per rule).

Return type:

DataFrame

class gators.feature_generation.RowStatisticsFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates row-level aggregation features across groups of columns.

This transformer computes statistics (min, max, mean, median, std, range, sum) horizontally across specified column groups for each row. Unlike GroupRatioFeatures, which aggregates vertically (across rows within groups), it computes statistics across columns within each row.

Importance for Fraud Detection#

Row-level aggregation features are valuable in fraud detection because they capture relationships and patterns across related features within individual transactions. For example:

  • Computing statistics across multiple transaction amounts can reveal unusual patterns (e.g., all amounts being identical might indicate scripted fraud)

  • Aggregating across card verification fields can identify inconsistencies

  • Statistics across temporal features can detect velocity anomalies

  • Range calculations can flag suspiciously uniform or extreme value spreads

These features help models identify transactions where the distribution of values across related fields deviates from normal patterns, which is often indicative of fraudulent behavior.

Parameters:
  • column_groups (Dict[str, List[str]]) – Dictionary mapping group names to lists of column names. Each group defines a set of columns over which to compute row-level statistics. Example: {‘card_fields’: [‘card1’, ‘card2’, ‘card3’]}

  • func (List[str]) – List of aggregation functions to apply. Available options:

      • ‘min’: Row-wise minimum value

      • ‘max’: Row-wise maximum value

      • ‘mean’: Row-wise mean (average)

      • ‘median’: Row-wise median

      • ‘std’: Row-wise sample standard deviation (ddof=1)

      • ‘range’: Row-wise range (max - min)

      • ‘sum’: Row-wise sum

  • drop_columns (bool, default=False) – Whether to drop the original columns after creating the aggregation features.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the aggregation columns. If None, uses the default naming pattern ‘{group_name}__{func}’. Must have the same length as the total number of features created (len(column_groups) × len(func)).
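The default naming and the required length of new_column_names can be sketched in pure Python as:

```python
# Default naming pattern '{group_name}__{func}': one output column per
# (group, function) pair, i.e. len(column_groups) * len(func) in total.
column_groups = {"cluster_1": ["A", "B"], "cluster_2": ["C", "D"]}
func = ["min", "max", "range"]

default_names = [f"{g}__{f}" for g in column_groups for f in func]
print(default_names)
# ['cluster_1__min', 'cluster_1__max', 'cluster_1__range',
#  'cluster_2__min', 'cluster_2__max', 'cluster_2__range']
```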

Examples

>>> from gators.feature_generation import RowStatisticsFeatures
>>> import polars as pl

Example 1: Single group with multiple aggregations

>>> X = pl.DataFrame({
...     'A': [9, 9, 7],
...     'B': [3, 4, 5],
...     'C': [6, 7, 8]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={'cluster_1': ['A', 'B']},
...     func=['mean', 'std']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 5)
┌─────┬─────┬─────┬───────────────────┬──────────────────┐
│ A   ┆ B   ┆ C   ┆ cluster_1__mean   ┆ cluster_1__std   │
│ --- ┆ --- ┆ --- ┆ ---               ┆ ---              │
│ i64 ┆ i64 ┆ i64 ┆ f64               ┆ f64              │
╞═════╪═════╪═════╪═══════════════════╪══════════════════╡
│ 9   ┆ 3   ┆ 6   ┆ 6.0               ┆ 4.242641         │
│ 9   ┆ 4   ┆ 7   ┆ 6.5               ┆ 3.535534         │
│ 7   ┆ 5   ┆ 8   ┆ 6.0               ┆ 1.414214         │
└─────┴─────┴─────┴───────────────────┴──────────────────┘
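The std column is the sample standard deviation (ddof=1, polars' default). A pure-Python sketch of the row-wise statistics, checked against the first row ([9, 3]) of cluster_1 (an illustrative recomputation, not the library's code):

```python
import math

def row_stats(values, ddof=1):
    """Row-wise statistics over one row's values (sketch)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - ddof)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(var),
        "range": max(values) - min(values),
        "sum": sum(values),
    }

stats = row_stats([9, 3])  # first row of cluster_1 above
print(stats["mean"], round(stats["std"], 6))  # 6.0 4.242641
```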

Example 2: Multiple groups with different columns

>>> X = pl.DataFrame({
...     'A': [9, 9, 7],
...     'B': [3, 4, 5],
...     'C': [6, 7, 8],
...     'D': [1, 2, 3]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={
...         'cluster_1': ['A', 'B'],
...         'cluster_2': ['C', 'D']
...     },
...     func=['min', 'max', 'range']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 10)
┌─────┬─────┬─────┬─────┬──────────────┬──────────────┬─────────────────┬──────────────┬──────────────┬─────────────────┐
│ A   ┆ B   ┆ C   ┆ D   ┆ cluster_1__… ┆ cluster_1__… ┆ cluster_1__ran… ┆ cluster_2__… ┆ cluster_2__… ┆ cluster_2__ran… │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---          ┆ ---          ┆ ---             ┆ ---          ┆ ---          ┆ ---             │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64          ┆ i64          ┆ i64             ┆ i64          ┆ i64          ┆ i64             │
╞═════╪═════╪═════╪═════╪══════════════╪══════════════╪═════════════════╪══════════════╪══════════════╪═════════════════╡
│ 9   ┆ 3   ┆ 6   ┆ 1   ┆ 3            ┆ 9            ┆ 6               ┆ 1            ┆ 6            ┆ 5               │
│ 9   ┆ 4   ┆ 7   ┆ 2   ┆ 4            ┆ 9            ┆ 5               ┆ 2            ┆ 7            ┆ 5               │
│ 7   ┆ 5   ┆ 8   ┆ 3   ┆ 5            ┆ 7            ┆ 2               ┆ 3            ┆ 8            ┆ 5               │
└─────┴─────┴─────┴─────┴──────────────┴──────────────┴─────────────────┴──────────────┴──────────────┴─────────────────┘

Example 3: Using custom column names

>>> X = pl.DataFrame({
...     'amount1': [100, 200, 150],
...     'amount2': [50, 100, 75],
...     'amount3': [25, 50, 30]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={'amounts': ['amount1', 'amount2', 'amount3']},
...     func=['mean', 'std'],
...     new_column_names=['avg_amount', 'std_amount']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 5)
┌──────────┬──────────┬──────────┬────────────┬────────────┐
│ amount1  ┆ amount2  ┆ amount3  ┆ avg_amount ┆ std_amount │
│ ---      ┆ ---      ┆ ---      ┆ ---        ┆ ---        │
│ i64      ┆ i64      ┆ i64      ┆ f64        ┆ f64        │
╞══════════╪══════════╪══════════╪════════════╪════════════╡
│ 100      ┆ 50       ┆ 25       ┆ 58.333333  ┆ 38.188...  │
│ 200      ┆ 100      ┆ 50       ┆ 116.666... ┆ 76.376...  │
│ 150      ┆ 75       ┆ 30       ┆ 85.0       ┆ 60.621...  │
└──────────┴──────────┴──────────┴────────────┴────────────┘

Example 4: Fraud detection use case - card verification fields

>>> X = pl.DataFrame({
...     'card_cvv_match': [1, 0, 1, 1],
...     'card_addr_match': [1, 1, 0, 1],
...     'card_zip_match': [1, 1, 1, 0],
...     'is_fraud': [0, 1, 1, 1]
... })
>>> # Aggregate verification fields to detect inconsistencies
>>> transformer = RowStatisticsFeatures(
...     column_groups={'verification': ['card_cvv_match', 'card_addr_match', 'card_zip_match']},
...     func=['mean', 'std'],
...     drop_columns=False
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['verification__mean', 'verification__std', 'is_fraud'])
shape: (4, 3)
┌─────────────────────┬────────────────────┬──────────┐
│ verification__mean  ┆ verification__std  ┆ is_fraud │
│ ---                 ┆ ---                ┆ ---      │
│ f64                 ┆ f64                ┆ i64      │
╞═════════════════════╪════════════════════╪══════════╡
│ 1.0                 ┆ 0.0                ┆ 0        │
│ 0.666667            ┆ 0.577350           ┆ 1        │
│ 0.666667            ┆ 0.577350           ┆ 1        │
│ 0.666667            ┆ 0.577350           ┆ 1        │
└─────────────────────┴────────────────────┴──────────┘
# Notice: legitimate transaction has perfect verification (mean=1, std=0)
# Fraudulent transactions show inconsistent verification patterns
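As in Example 1, the std here is the sample standard deviation (ddof=1, polars' default); a quick pure-Python check for a row with one failed verification, [0, 1, 1] (an illustrative recomputation, not library code):

```python
import math

values = [0, 1, 1]  # one failed verification field out of three
mean = sum(values) / len(values)
# Sample standard deviation (ddof=1), matching polars' default.
std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
print(round(mean, 6), round(std, 6))  # 0.666667 0.57735
```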

fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

RowStatisticsFeatures

transform(X)[source]#

Transform the input DataFrame by creating row-level aggregation features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with row-level aggregation features.

Return type:

DataFrame