gators.feature_generation package#

Module contents#

class gators.feature_generation.IsNull[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Creates boolean features indicating whether values are null for specified columns.

Parameters:

subset (Optional[List[str]], default=None) – List of column names to check for null values. If None, all columns in the DataFrame are used.

Examples

>>> from gators.feature_generation import IsNull
>>> import polars as pl
>>> X = {'A': [1, None, 3, 4],
...      'B': [4, 3, None, 1],
...      'C': [1, 2, 1, 2]}
>>> X = pl.DataFrame(X)
>>> transformer = IsNull(subset=['A', 'B'])
>>> transformer.fit(X)
IsNull(subset=['A', 'B'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 5)
┌──────┬──────┬─────┬──────────────┬──────────────┐
│  A   │  B   │  C  │ A__is_null   │ B__is_null   │
│ i64  │ i64  │ i64 │ bool         │ bool         │
├──────┼──────┼─────┼──────────────┼──────────────┤
│  1   │  4   │  1  │ false        │ false        │
│ null │  3   │  2  │ true         │ false        │
│  3   │ null │  1  │ false        │ true         │
│  4   │  1   │  2  │ false        │ false        │
└──────┴──────┴─────┴──────────────┴──────────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

IsNull

transform(X)[source]#

Transform the input DataFrame by adding is_null indicator columns.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with additional is_null columns.

Return type:

DataFrame

class gators.feature_generation.PolynomialFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates polynomial and interaction features.

Parameters:
  • subset (Optional[List[str]], default=None) – Subset of columns to transform. If None, all columns except strings and booleans.

  • degree (int, default=2) – The degree of the polynomial features.

  • interaction_only (bool, default=False) – If True, only interaction features are produced.

  • include_bias (bool, default=True) – If True, include a bias column (column of ones).

Examples

Example 1: Degree 2 polynomial with bias term

>>> from gators.feature_generation import PolynomialFeatures
>>> import polars as pl
>>> X = pl.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> transformer = PolynomialFeatures(degree=2, include_bias=True)
>>> transformer.fit(X)
>>> transformer.transform(X)
shape: (2, 6)
┌─────┬─────┬──────┬──────┬──────┬──────┐
│ A   │ B   │ A__A │ A__B │ B__B │ bias │
├─────┼─────┼──────┼──────┼──────┼──────┤
│ 1   │ 3   │ 1    │ 3    │ 9    │ 1    │
│ 2   │ 4   │ 4    │ 8    │ 16   │ 1    │
└─────┴─────┴──────┴──────┴──────┴──────┘

Example 2: Polynomial on subset of columns

>>> transformer = PolynomialFeatures(subset=['A'], degree=2, include_bias=False)
>>> transformer.fit(X)
>>> transformer.transform(X)
shape: (2, 3)
┌─────┬─────┬─────┐
│ A   │ B   │ A__A│
├─────┼─────┼─────┤
│ 1   │ 3   │ 1   │
│ 2   │ 4   │ 4   │
└─────┴─────┴─────┘

Example 3: Interaction features only

>>> transformer = PolynomialFeatures(degree=2, interaction_only=True)
>>> transformer.fit(X)
>>> transformer.transform(X)
shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ A   │ B   │ A__B │ bias │
├─────┼─────┼──────┼──────┤
│ 1   │ 3   │ 3    │ 1    │
│ 2   │ 4   │ 8    │ 1    │
└─────┴─────┴──────┴──────┘
fit(X, y=None)[source]#

Fit the transformer by identifying columns to transform.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

PolynomialFeatures

transform(X)[source]#

Transform the input DataFrame by generating polynomial and interaction features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation.PlanRotationFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Create new columns based on the plan rotation mapping.

The data should contain numerical columns only. Use gators.encoders to replace categorical columns with numerical ones before applying PlanRotationFeatures.

Parameters:
  • subset (List[List[str]]) – List of pair-wise columns.

  • angles (List[float]) – List of rotation angles.

Examples

Basic usage with plan rotation

Imports and initialization:

>>> from gators.feature_generation import PlanRotationFeatures
>>> obj = PlanRotationFeatures(
...     subset=[['X', 'Y'], ['X', 'Z']], angles=[45.0, 60.0])

The fit, transform, and fit_transform methods accept polars dataframes:

>>> import polars as pl
>>> X = pl.DataFrame(
... {'X': [200.0, 210.0], 'Y': [140.0, 160.0], 'Z': [100.0, 125.0]})

The result is a transformed polars dataframe.

>>> obj.fit_transform(X)
shape: (2, 9)
┌───────┬───────┬───────┬────────────┬───┬────────────┬────────────┬────────────┐
│ X     ┆ Y     ┆ Z     ┆ XY_x_45.0… ┆ … ┆ XZ_y_45.0… ┆ XZ_x_60.0… ┆ XZ_y_60.0… │
│ ---   ┆ ---   ┆ ---   ┆ ---        ┆   ┆ ---        ┆ ---        ┆ ---        │
│ f64   ┆ f64   ┆ f64   ┆ f64        ┆   ┆ f64        ┆ f64        ┆ f64        │
╞═══════╪═══════╪═══════╪════════════╪═══╪════════════╪════════════╪════════════╡
│ 200.0 ┆ 140.0 ┆ 100.0 ┆ 42.426407  ┆ … ┆ 212.132034 ┆ 13.397460  ┆ 223.205081 │
│ 210.0 ┆ 160.0 ┆ 125.0 ┆ 35.355339  ┆ … ┆ 236.880772 ┆ -3.253175  ┆ 244.365335 │
└───────┴───────┴───────┴────────────┴───┴────────────┴────────────┴────────────┘
compute_column_names()[source]#

Compute column names after initialization.

fit(X, y=None)[source]#

Fit the transformer by identifying the columns to rotate.

Parameters:
  • X (DataFrame) – Input dataframe.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

PlanRotationFeatures

transform(X)[source]#

Transform the dataframe X.

Parameters:

X (DataFrame) – Input dataframe.

Returns:

Transformed dataframe.

Return type:

DataFrame

class gators.feature_generation.MathFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates new features by applying mathematical operations to groups of columns.

Parameters:
  • groups (List[List[str]]) – List of groups of column names to apply operations on.

  • operations (List[str]) –

    List of operations to apply to each group of columns. Available operations:

    • 'sum': Sum of all columns

    • 'mean': Mean of all columns

    • 'minus': Subtraction (reduces columns left to right)

    • 'mul': Product of all columns

    • 'div': Division (reduces columns left to right)

    • 'min': Minimum value across columns

    • 'max': Maximum value across columns

    • 'std': Standard deviation across columns

    • 'var': Variance across columns

    • 'median': Median across columns

    • 'range': Range (max - min)

    • 'abs_diff': Absolute difference (reduces columns left to right)

    • 'count_null': Count of null values

    • 'count_zero': Count of zero values

    • 'count_nonzero': Count of non-zero values

    Note: For division operations, consider using RatioFeatures instead, which provides safer division with automatic handling of division by zero and null values.

  • drop_columns (bool, optional) – Whether to drop the original columns after creating the new features, by default False.

  • new_column_names (Optional[List[str]], optional) – List of new column names for the created features, by default None.

Examples

>>> from gators.feature_generation import MathFeatures
>>> import polars as pl
>>> X = {'A': [1, 2, 3, 4],
...      'B': [4, 3, 2, 1],
...      'C': [1, 2, 1, 2]}
>>> X = pl.DataFrame(X)

Example 1: drop_columns=False

>>> transformer = MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum', 'mean'])
>>> transformer.fit(X)
MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum', 'mean'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 6)
┌─────┬─────┬─────┬─────────┬──────────┬─────────┐
│  A  │  B  │  C  │ A_B_sum │ A_B_mean │ B_C_sum │
│ i64 │ i64 │ i64 │   f64   │   f64    │   f64   │
├─────┼─────┼─────┼─────────┼──────────┼─────────┤
│  1  │  4  │  1  │   5.0   │   2.5    │   5.0   │
│  2  │  3  │  2  │   5.0   │   2.5    │   5.0   │
│  3  │  2  │  1  │   5.0   │   2.5    │   3.0   │
│  4  │  1  │  2  │   5.0   │   2.5    │   3.0   │
└─────┴─────┴─────┴─────────┴──────────┴─────────┘

Example 2: drop_columns=True

>>> transformer = MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum'], drop_columns=True)
>>> transformer.fit(X)
MathFeatures(groups=[['A', 'B'], ['B', 'C']], operations=['sum'], drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (4, 2)
┌────────┬────────┐
│ A_B_sum│ B_C_sum│
│  f64   │  f64   │
├────────┼────────┤
│  5.0   │  5.0   │
│  5.0   │  5.0   │
│  5.0   │  3.0   │
│  5.0   │  3.0   │
└────────┴────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

MathFeatures

transform(X)[source]#

Transform the input DataFrame by applying the configured mathematical operations.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation.RatioFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates ratio features by dividing numerator columns by denominator columns.

This transformer creates ratio features in a 1-to-1 pairing between numerator and denominator columns. Division by zero is handled by replacing the result with null values.

Parameters:
  • numerator_columns (List[str]) – List of column names to use as numerators.

  • denominator_columns (List[str]) – List of column names to use as denominators. Must have the same length as numerator_columns.

  • new_column_names (Optional[List[str]], optional) – List of custom names for the ratio features. If None, names will be automatically generated as ‘{numerator}__div__{denominator}’, by default None.

  • drop_columns (bool, optional) – Whether to drop the original numerator and denominator columns after creating ratios, by default False.

Examples

>>> from gators.feature_generation import RatioFeatures
>>> import polars as pl
>>> X = pl.DataFrame({
...     'revenue': [100, 200, 300, 400],
...     'cost': [80, 100, 150, 0],
...     'clicks': [1000, 2000, 3000, 4000],
...     'impressions': [10000, 20000, 30000, 40000]
... })

Example 1: Basic ratio features

>>> transformer = RatioFeatures(
...     numerator_columns=['revenue', 'clicks'],
...     denominator_columns=['cost', 'impressions']
... )
>>> transformer.fit(X)
RatioFeatures(numerator_columns=['revenue', 'clicks'], denominator_columns=['cost', 'impressions'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 6)
┌─────────┬──────┬────────┬─────────────┬────────────────────┬─────────────────────────┐
│ revenue │ cost │ clicks │ impressions │ revenue__div__cost │ clicks__div__impressions│
│ i64     │ i64  │ i64    │ i64         │ f64                │ f64                     │
├─────────┼──────┼────────┼─────────────┼────────────────────┼─────────────────────────┤
│ 100     │ 80   │ 1000   │ 10000       │ 1.25               │ 0.1                     │
│ 200     │ 100  │ 2000   │ 20000       │ 2.0                │ 0.1                     │
│ 300     │ 150  │ 3000   │ 30000       │ 2.0                │ 0.1                     │
│ 400     │ 0    │ 4000   │ 40000       │ null               │ 0.1                     │
└─────────┴──────┴────────┴─────────────┴────────────────────┴─────────────────────────┘

Example 2: Custom column names

>>> transformer = RatioFeatures(
...     numerator_columns=['revenue'],
...     denominator_columns=['cost'],
...     new_column_names=['profit_margin']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 5)
┌─────────┬──────┬────────┬─────────────┬───────────────┐
│ revenue │ cost │ clicks │ impressions │ profit_margin │
│ i64     │ i64  │ i64    │ i64         │ f64           │
├─────────┼──────┼────────┼─────────────┼───────────────┤
│ 100     │ 80   │ 1000   │ 10000       │ 1.25          │
│ 200     │ 100  │ 2000   │ 20000       │ 2.0           │
│ 300     │ 150  │ 3000   │ 30000       │ 2.0           │
│ 400     │ 0    │ 4000   │ 40000       │ null          │
└─────────┴──────┴────────┴─────────────┴───────────────┘

Example 3: With drop_columns=True

>>> transformer = RatioFeatures(
...     numerator_columns=['revenue'],
...     denominator_columns=['cost'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 3)
┌────────┬─────────────┬────────────────────┐
│ clicks │ impressions │ revenue__div__cost │
│ i64    │ i64         │ f64                │
├────────┼─────────────┼────────────────────┤
│ 1000   │ 10000       │ 1.25               │
│ 2000   │ 20000       │ 2.0                │
│ 3000   │ 30000       │ 2.0                │
│ 4000   │ 40000       │ null               │
└────────┴─────────────┴────────────────────┘

Example 4: Handling null values

>>> X_with_nulls = pl.DataFrame({
...     'A': [10, None, 30, 40],
...     'B': [2, 5, None, 0]
... })
>>> transformer = RatioFeatures(
...     numerator_columns=['A'],
...     denominator_columns=['B']
... )
>>> result = transformer.fit_transform(X_with_nulls)
>>> result
shape: (4, 3)
┌──────┬──────┬──────────────┐
│ A    │ B    │ A__div__B    │
│ i64  │ i64  │ f64          │
├──────┼──────┼──────────────┤
│ 10   │ 2    │ 5.0          │
│ null │ 5    │ null         │
│ 30   │ null │ null         │
│ 40   │ 0    │ null         │
└──────┴──────┴──────────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

RatioFeatures

transform(X)[source]#

Transform the input DataFrame by creating ratio features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with ratio features.

Return type:

DataFrame

class gators.feature_generation.GroupScalingFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates group-based scaling features for numerical columns.

This transformer creates features like:

  • value / group_mean (most common: relative position vs average)

  • value / group_median (robust to outliers)

  • (value - group_mean) / group_std (z-score: standardized deviation)

  • (value - group_min) / (group_max - group_min) (min-max: 0-1 normalization)

Importance for Fraud Detection#

Group scaling features are particularly valuable in fraud detection because they capture relative deviations from group-level behavior patterns. Fraudulent transactions often exhibit unusual characteristics compared to the typical behavior within their segments.

  • mean/median ratios: Show multiplicative deviation (e.g., 10x the group average)

  • zscore: Quantifies how many standard deviations away from group mean (e.g., 3σ anomaly)

  • minmax: Shows relative position within observed range (0=min, 1=max, handles negatives)

These features are especially powerful when combined with various grouping dimensions (e.g., by merchant, customer segment, time of day, or geographic location) to capture different aspects of abnormal behavior.

Parameters:
  • subset (List[str]) – List of numerical column names to transform.

  • by (List[str]) – List of column names to use for groupby operations. Each column will be used for a separate groupby operation (e.g., ['cat1', 'cat2'] creates features grouped by cat1 and separate features grouped by cat2).

  • func (List[str]) –

    List of scaling functions to apply. Available options:

    • 'mean': value / group_mean (relative position vs average)

    • 'median': value / group_median (robust to outliers)

    • 'zscore': (value - group_mean) / group_std (standardized deviation)

    • 'minmax': (value - group_min) / (group_max - group_min) (0-1 normalization)

  • fill_value (float, default=0.0) – Value to use when the denominator is zero or null (safe division/scaling).

  • drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating scaled features.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the scaled feature columns. If None, uses the default naming pattern '{num_col}__{func}_{groupby_col}'. Must have the same length as the total number of features created (subset × by × func).

Examples

>>> from gators.feature_generation import GroupScalingFeatures
>>> import polars as pl
>>> X = {
...     'amount': [100, 200, 150, 300, 250],
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'X', 'X']
... }
>>> X = pl.DataFrame(X)

Example 1: Single groupby column with multiple scaling functions

>>> transformer = GroupScalingFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['mean', 'zscore']
... )
>>> transformer.fit(X)
GroupScalingFeatures(subset=['amount'], by=['cat1'], func=['mean', 'zscore'])
>>> result = transformer.transform(X)
>>> result
shape: (5, 5)
┌────────┬──────┬──────┬───────────────────┬─────────────────────┐
│ amount ┆ cat1 ┆ cat2 ┆ amount__mean_cat1 ┆ amount__zscore_cat1 │
│ ---    ┆ ---  ┆ ---  ┆ ---               ┆ ---                 │
│ i64    ┆ str  ┆ str  ┆ f64               ┆ f64                 │
╞════════╪══════╪══════╪═══════════════════╪═════════════════════╡
│ 100    ┆ A    ┆ X    ┆ 0.545455          ┆ -1.069045           │
│ 200    ┆ A    ┆ Y    ┆ 1.090909          ┆ 0.267261            │
│ 150    ┆ B    ┆ X    ┆ 0.666667          ┆ -0.707107           │
│ 300    ┆ B    ┆ X    ┆ 1.333333          ┆ 0.707107            │
│ 250    ┆ A    ┆ X    ┆ 1.363636          ┆ 0.801784            │
└────────┴──────┴──────┴───────────────────┴─────────────────────┘

Example 2: Multiple groupby columns

>>> X = {
...     'amount': [100, 200, 150, 300],
...     'value': [50, 100, 75, 150],
...     'cat1': ['A', 'A', 'B', 'B'],
...     'cat2': ['X', 'Y', 'X', 'Y']
... }
>>> X = pl.DataFrame(X)
>>> transformer = GroupScalingFeatures(
...     subset=['amount'],
...     by=['cat1', 'cat2'],
...     func=['mean']
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['amount', 'value', 'cat1', 'cat2', 'amount__mean_cat1', 'amount__mean_cat2']
# Creates separate features grouped by cat1 and grouped by cat2

Example 3: Min-max scaling

>>> X = {
...     'amount': [100, 200, 150, 300],
...     'cat1': ['A', 'A', 'B', 'B']
... }
>>> X = pl.DataFrame(X)
>>> transformer = GroupScalingFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['minmax']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 3)
┌────────┬──────┬─────────────────────┐
│ amount ┆ cat1 ┆ amount__minmax_cat1 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ str  ┆ f64                 │
╞════════╪══════╪═════════════════════╡
│ 100    ┆ A    ┆ 0.0                 │
│ 200    ┆ A    ┆ 1.0                 │
│ 150    ┆ B    ┆ 0.0                 │
│ 300    ┆ B    ┆ 1.0                 │
└────────┴──────┴─────────────────────┘
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

GroupScalingFeatures

transform(X)[source]#

Transform the input DataFrame by creating group scaling features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with group scaling features.

Return type:

DataFrame

class gators.feature_generation.GroupStatisticsFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates statistical aggregation features based on group-level computations.

Unlike GroupScalingFeatures, which divides values by group statistics, this transformer directly adds the group statistics as new columns.

Parameters:
  • subset (List[str]) – List of numerical column names to aggregate.

  • by (List[str]) – List of column names to use for groupby operations. Each column will be used for a separate groupby operation (e.g., [‘cat1’, ‘cat2’] creates features grouped by cat1 and separate features grouped by cat2).

  • func (List[str]) –

    List of aggregation functions to apply. Available options:

    • 'mean': Group mean

    • 'std': Group standard deviation

    • 'median': Group median

    • 'min': Group minimum

    • 'max': Group maximum

    • 'sum': Group sum

    • 'count': Group count

    • 'range': Group range (max - min)

  • drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating statistics.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the statistic columns. If None, uses default naming pattern ‘{agg}_{num_col}__per_{groupby_col}’. Must have same length as the total number of features created (subset × by × func).

Examples

>>> from gators.feature_generation import GroupStatisticsFeatures
>>> import polars as pl
>>> X = {
...     'amount': [100, 200, 150, 300, 250],
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'X', 'X']
... }
>>> X = pl.DataFrame(X)

Example 1: Basic group statistics

>>> transformer = GroupStatisticsFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['mean', 'count']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌────────┬───────┬───────┬───────────────────────┬────────────────────────┐
│ amount ┆ cat1  ┆ cat2  ┆ mean_amount__per_cat1 ┆ count_amount__per_cat1 │
│ ---    ┆ ---   ┆ ---   ┆ ---                   ┆ ---                    │
│ i64    ┆ str   ┆ str   ┆ f64                   ┆ u32                    │
╞════════╪═══════╪═══════╪═══════════════════════╪════════════════════════╡
│ 100    ┆ A     ┆ X     ┆ 183.333333            ┆ 3                      │
│ 200    ┆ A     ┆ Y     ┆ 183.333333            ┆ 3                      │
│ 150    ┆ B     ┆ X     ┆ 225.0                 ┆ 2                      │
│ 300    ┆ B     ┆ X     ┆ 225.0                 ┆ 2                      │
│ 250    ┆ A     ┆ X     ┆ 183.333333            ┆ 3                      │
└────────┴───────┴───────┴───────────────────────┴────────────────────────┘

Example 2: Multiple groupby columns

>>> X = {
...     'amount': [100, 200, 150, 300],
...     'cat1': ['A', 'A', 'B', 'B'],
...     'cat2': ['X', 'Y', 'X', 'Y']
... }
>>> X = pl.DataFrame(X)
>>> transformer = GroupStatisticsFeatures(
...     subset=['amount'],
...     by=['cat1', 'cat2'],
...     func=['mean']
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['amount', 'cat1', 'cat2', 'mean_amount__per_cat1', 'mean_amount__per_cat2']
# Creates separate features grouped by cat1 and grouped by cat2

Example 3: Multiple aggregation functions

>>> transformer = GroupStatisticsFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     func=['mean', 'std', 'min', 'max']
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['amount', 'cat1', 'cat2', 'mean_amount__per_cat1', 'std_amount__per_cat1',
 'min_amount__per_cat1', 'max_amount__per_cat1']
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

GroupStatisticsFeatures

transform(X)[source]#

Transform the input DataFrame by creating group statistic features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with group statistic features.

Return type:

DataFrame

class gators.feature_generation.GroupLagFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates lag (previous values) and lead (next values) features within groups.

This transformer creates features like:

  • Previous transaction amount for this card

  • Next transaction amount for this card

  • Value N periods ago within group

Useful for time-series analysis and detecting changes in behavior patterns.

Parameters:
  • subset (List[str]) – List of numerical column names to create lag/lead features for.

  • by (List[str]) – List of columns to group by. Lags/leads are computed within each group.

  • lags (List[int]) – List of lag periods. Positive integers create lag features (previous values). Example: [1, 2, 3] creates lag_1, lag_2, lag_3

  • leads (List[int], default=[]) – List of lead periods. Positive integers create lead features (next values). Example: [1, 2] creates lead_1, lead_2

  • fill_value (Optional[float], default=None) – Value to use for missing lag/lead values. If None, uses null.

  • drop_columns (bool, default=False) – Whether to drop the original numerical columns after creating lag features.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the lag/lead columns. If None, uses default naming pattern ‘{num_col}_lag{n}_{groupby_cols}’ or ‘{num_col}_lead{n}_{groupby_cols}’. Must have same length as the total number of features created.

Examples

>>> from gators.feature_generation import GroupLagFeatures
>>> import polars as pl
>>> X = {
...     'amount': [100, 200, 150, 300, 250, 180],
...     'cat1': ['A', 'A', 'B', 'B', 'A', 'B'],
...     'time': [1, 2, 1, 2, 3, 3]
... }
>>> X = pl.DataFrame(X).sort(['cat1', 'time'])

Example 1: Basic lag features

>>> transformer = GroupLagFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     lags=[1, 2]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (6, 5)
┌────────┬───────┬──────┬─────────────────────┬─────────────────────┐
│ amount ┆ cat1  ┆ time ┆ amount_lag1_cat1    ┆ amount_lag2_cat1    │
│ ---    ┆ ---   ┆ ---  ┆ ---                 ┆ ---                 │
│ i64    ┆ str   ┆ i64  ┆ i64                 ┆ i64                 │
╞════════╪═══════╪══════╪═════════════════════╪═════════════════════╡
│ 100    ┆ A     ┆ 1    ┆ null                ┆ null                │
│ 200    ┆ A     ┆ 2    ┆ 100                 ┆ null                │
│ 250    ┆ A     ┆ 3    ┆ 200                 ┆ 100                 │
│ 150    ┆ B     ┆ 1    ┆ null                ┆ null                │
│ 300    ┆ B     ┆ 2    ┆ 150                 ┆ null                │
│ 180    ┆ B     ┆ 3    ┆ 300                 ┆ 150                 │
└────────┴───────┴──────┴─────────────────────┴─────────────────────┘

Example 2: Lag and lead features

>>> transformer = GroupLagFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     lags=[1],
...     leads=[1]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (6, 5)
┌────────┬───────┬──────┬───────────────────┬────────────────────┐
│ amount ┆ cat1  ┆ time ┆ amount_lag1_cat1  ┆ amount_lead1_cat1  │
│ ---    ┆ ---   ┆ ---  ┆ ---               ┆ ---                │
│ i64    ┆ str   ┆ i64  ┆ i64               ┆ i64                │
╞════════╪═══════╪══════╪═══════════════════╪════════════════════╡
│ 100    ┆ A     ┆ 1    ┆ null              ┆ 200                │
│ 200    ┆ A     ┆ 2    ┆ 100               ┆ 250                │
│ 250    ┆ A     ┆ 3    ┆ 200               ┆ null               │
│ 150    ┆ B     ┆ 1    ┆ null              ┆ 300                │
│ 300    ┆ B     ┆ 2    ┆ 150               ┆ 180                │
│ 180    ┆ B     ┆ 3    ┆ 300               ┆ null               │
└────────┴───────┴──────┴───────────────────┴────────────────────┘

Example 3: With fill_value

>>> transformer = GroupLagFeatures(
...     subset=['amount'],
...     by=['cat1'],
...     lags=[1],
...     fill_value=0.0
... )
>>> result = transformer.fit_transform(X)
>>> result['amount_lag1_cat1'][0]  # First row, no previous value
0.0

Notes

  • Data should be sorted by the by columns and by time before transformation

  • Lag features look backwards: lag_1 is the previous row within the group

  • Lead features look forwards: lead_1 is the next row within the group

  • First rows in each group will have null (or fill_value) for lag features

  • Last rows in each group will have null (or fill_value) for lead features

fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

GroupLagFeatures

transform(X)[source]#

Transform the input DataFrame by creating lag/lead features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with lag/lead features.

Return type:

DataFrame

class gators.feature_generation.ComparisonFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates binary comparison features between pairs of columns, or unary null checks.

Parameters:
  • subset_a (List[str]) – List of column names for the left side of comparisons (or the only column for unary operators).

  • subset_b (List[str]) – List of column names for the right side of comparisons. For unary operators (‘is_null’, ‘is_not_null’), these values are ignored.

  • operators (List[Literal[">", "<", ">=", "<=", "==", "!=", "is_null", "is_not_null"]]) – List of comparison operators to apply; must have the same length as subset_a and subset_b. Unary operators ('is_null', 'is_not_null') use only subset_a; binary operators ('>', '<', '>=', '<=', '==', '!=') use both subset_a and subset_b.

  • drop_columns (bool, default=False) – Whether to drop the original columns after creating comparisons.

Examples

>>> from gators.feature_generation import ComparisonFeatures
>>> import polars as pl
>>> X = {'A': [10, 20, 30, 40],
...      'B': [15, 10, 30, 35],
...      'C': [5, 25, 20, 50]}
>>> X = pl.DataFrame(X)

Example 1: Single comparison

>>> transformer = ComparisonFeatures(
...     subset_a=['A'],
...     subset_b=['B'],
...     operators=['>']
... )
>>> transformer.fit(X)
ComparisonFeatures(subset_a=['A'], subset_b=['B'], operators=['>'])
>>> result = transformer.transform(X)
>>> result
shape: (4, 4)
┌──────┬──────┬──────┬─────────┐
│  A   │  B   │  C   │ A_gt_B  │
│ i64  │ i64  │ i64  │  bool   │
├──────┼──────┼──────┼─────────┤
│  10  │  15  │  5   │  false  │
│  20  │  10  │  25  │  true   │
│  30  │  30  │  20  │  false  │
│  40  │  35  │  50  │  true   │
└──────┴──────┴──────┴─────────┘

Example 2: Multiple comparisons with different operators

>>> transformer = ComparisonFeatures(
...     subset_a=['A', 'B', 'A'],
...     subset_b=['B', 'C', 'C'],
...     operators=['>', '<', '>=']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 6)
┌──────┬──────┬──────┬─────────┬─────────┬─────────┐
│  A   │  B   │  C   │ A_gt_B  │ B_lt_C  │ A_gte_C │
│ i64  │ i64  │ i64  │  bool   │  bool   │  bool   │
├──────┼──────┼──────┼─────────┼─────────┼─────────┤
│  10  │  15  │  5   │  false  │  false  │  true   │
│  20  │  10  │  25  │  true   │  true   │  false  │
│  30  │  30  │  20  │  false  │  false  │  true   │
│  40  │  35  │  50  │  true   │  true   │  false  │
└──────┴──────┴──────┴─────────┴─────────┴─────────┘

Example 3: Null checks (unary operators)

>>> data_with_nulls = pl.DataFrame({
...     'A': [10, None, 30, None],
...     'B': [15, 10, None, 35]
... })
>>> transformer = ComparisonFeatures(
...     subset_a=['A', 'B'],
...     subset_b=['', ''],  # Ignored for unary operators
...     operators=['is_null', 'is_not_null']
... )
>>> result = transformer.fit_transform(data_with_nulls)
>>> result
shape: (4, 4)
┌──────┬──────┬────────────┬────────────────┐
│  A   │  B   │ A__is_null │ B__is_not_null │
│ i64  │ i64  │  bool      │  bool          │
├──────┼──────┼────────────┼────────────────┤
│  10  │  15  │  false     │  true          │
│ null │  10  │  true      │  true          │
│  30  │ null │  false     │  false         │
│ null │  35  │  true      │  true          │
└──────┴──────┴────────────┴────────────────┘

Example 4: With drop_columns=True

>>> transformer = ComparisonFeatures(
...     subset_a=['A'],
...     subset_b=['B'],
...     operators=['>'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (4, 2)
┌──────┬─────────┐
│  C   │ A_gt_B  │
│ i64  │  bool   │
├──────┼─────────┤
│  5   │  false  │
│  25  │  true   │
│  20  │  false  │
│  50  │  true   │
└──────┴─────────┘
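The output column names seen in the examples above (A_gt_B, B_lt_C, A_gte_C) follow a positional pairing of subset_a, subset_b, and operators. A minimal sketch of that naming scheme, with the operator-to-suffix mapping inferred from the doctest output rather than taken from the library source:

```python
# Operator suffixes inferred from the example output columns above.
OP_NAMES = {'>': 'gt', '<': 'lt', '>=': 'gte', '<=': 'lte',
            '==': 'eq', '!=': 'ne'}

def comparison_names(subset_a, subset_b, operators):
    """Pair columns positionally and build one name per comparison."""
    return [f"{a}_{OP_NAMES[op]}_{b}"
            for a, b, op in zip(subset_a, subset_b, operators)]

names = comparison_names(['A', 'B', 'A'], ['B', 'C', 'C'], ['>', '<', '>='])
print(names)  # ['A_gt_B', 'B_lt_C', 'A_gte_C']
```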
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ComparisonFeatures

transform(X)[source]#

Transform the input DataFrame by creating comparison features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with comparison features.

Return type:

DataFrame

class gators.feature_generation.ConditionFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Creates multiple independent boolean features, one for each condition.

This transformer is designed for creating simple boolean flags without combination logic. Each condition produces exactly one boolean output column. For combining multiple conditions with AND/OR logic, use RuleFeatures instead.

Use Cases:

  • Create simple boolean flags (is_adult, is_weekend, is_premium, etc.)

  • Materialize threshold-based features (is_high_value, is_frequent_user)

  • Feature engineering: Generate independent indicator variables

  • Fraud detection: Create simple risk flags before combining them

When to Use:

  • Need multiple independent boolean columns

  • Each condition stands alone (no AND/OR combination needed)

  • Want cleaner API than RuleFeatures for simple cases

  • Building feature sets for downstream transformers

When NOT to Use:

  • Need to combine conditions with AND/OR (use RuleFeatures)

  • One-off exploratory analysis (use Polars native expressions)

  • Very simple cases with 1-2 conditions (just use .with_columns())

Parameters:
  • conditions (List[Dict[str, Any]]) –

    List of condition dictionaries. Each condition creates one boolean output column.

    Each condition dictionary must contain:

    • ’column’: str - Name of the column to evaluate

    • ’op’: str - Comparison operator. Supported:

      • Binary: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’ (require ‘value’ or ‘other_column’)

      • Unary: ‘is_null’, ‘is_not_null’ (no ‘value’ or ‘other_column’ needed)

    • ’value’: Any (optional) - Scalar value to compare the column against

    • ’other_column’: str (optional) - Name of another column to compare against

    For binary operators: Either ‘value’ or ‘other_column’ must be specified, but not both. For unary operators: Neither ‘value’ nor ‘other_column’ should be specified.

    Examples:

    # Simple conditions:
    [
        {'column': 'age', 'op': '>=', 'value': 18},
        {'column': 'amount', 'op': '>', 'value': 1000}
    ]
    
    # Column comparison:
    [
        {'column': 'velocity_24h', 'op': '>', 'other_column': 'velocity_7d'}
    ]
    
    # Null checks:
    [
        {'column': 'age', 'op': 'is_null'},
        {'column': 'email', 'op': 'is_not_null'}
    ]
    

  • new_column_names (Optional[List[str]], default=None) –

    Names for the resulting boolean feature columns. If provided, must have the same length as conditions. If None, column names are auto-generated in the format:

    • Scalar comparison: {column}_{op_name}_{value} (e.g., ‘age_gte_18’)

    • Column comparison: {column}_{op_name}_{other_column} (e.g., ‘velocity_24h_gt_velocity_7d’)

    • Unary operation: {column}__{op_name} (e.g., ‘age__is_null’)

    Operator name mapping:

    • ’>’ -> ‘gt’

    • ’<’ -> ‘lt’

    • ’>=’ -> ‘gte’

    • ’<=’ -> ‘lte’

    • ’==’ -> ‘eq’

    • ’!=’ -> ‘ne’

    • ’is_null’ -> ‘is_null’

    • ’is_not_null’ -> ‘is_not_null’
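The condition semantics and auto-naming scheme documented above can be emulated on plain dict rows with Python's operator module. This is a simplified sketch for illustration, not the library's implementation; in particular it ignores Polars null-propagation for binary comparisons:

```python
import operator

# Binary operators map to functions from the operator module;
# unary null checks are handled separately, per the spec above.
OPS = {'>': operator.gt, '<': operator.lt, '>=': operator.ge,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}
OP_NAMES = {'>': 'gt', '<': 'lt', '>=': 'gte', '<=': 'lte',
            '==': 'eq', '!=': 'ne'}

def evaluate(condition, row):
    """Evaluate one condition dict against a row (a plain dict)."""
    value = row[condition['column']]
    op = condition['op']
    if op == 'is_null':
        return value is None
    if op == 'is_not_null':
        return value is not None
    if 'other_column' in condition:
        return OPS[op](value, row[condition['other_column']])
    return OPS[op](value, condition['value'])

def auto_name(condition):
    """Reproduce the auto-naming scheme documented above."""
    op = condition['op']
    if op in ('is_null', 'is_not_null'):
        return f"{condition['column']}__{op}"
    target = condition.get('other_column', condition.get('value'))
    return f"{condition['column']}_{OP_NAMES[op]}_{target}"

cond = {'column': 'age', 'op': '>=', 'value': 18}
print(auto_name(cond))              # age_gte_18
print(evaluate(cond, {'age': 25}))  # True
```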

Examples

>>> import polars as pl
>>> from gators.feature_generation import ConditionFeatures
>>> X = {
...     'age': [15, 25, 30, 17, 45],
...     'amount': [100, 1500, 500, 200, 2000],
...     'family_size': [1, 3, 1, 4, 2],
...     'fare': [50, 75, 30, 100, 80]
... }
>>> X = pl.DataFrame(X)

Example 1: Create simple boolean flags

>>> transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': '>=', 'value': 18},
...         {'column': 'amount', 'op': '>', 'value': 1000},
...         {'column': 'family_size', 'op': '==', 'value': 1}
...     ],
...     new_column_names=['is_adult', 'is_high_amount', 'is_alone']
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['age', 'amount', 'family_size', 'is_adult', 'is_high_amount', 'is_alone'])
shape: (5, 6)
┌─────┬────────┬─────────────┬──────────┬─────────────────┬──────────┐
│ age ┆ amount ┆ family_size ┆ is_adult ┆ is_high_amount  ┆ is_alone │
│ --- ┆ ---    ┆ ---         ┆ ---      ┆ ---             ┆ ---      │
│ i64 ┆ i64    ┆ i64         ┆ bool     ┆ bool            ┆ bool     │
╞═════╪════════╪═════════════╪══════════╪═════════════════╪══════════╡
│ 15  ┆ 100    ┆ 1           ┆ false    ┆ false           ┆ true     │
│ 25  ┆ 1500   ┆ 3           ┆ true     ┆ true            ┆ false    │
│ 30  ┆ 500    ┆ 1           ┆ true     ┆ false           ┆ true     │
│ 17  ┆ 200    ┆ 4           ┆ false    ┆ false           ┆ false    │
│ 45  ┆ 2000   ┆ 2           ┆ true     ┆ true            ┆ false    │
└─────┴────────┴─────────────┴──────────┴─────────────────┴──────────┘

Example 2: Column-to-column comparison

>>> fare_X = {
...     'fare': [50.0, 100.0, 30.0, 200.0, 80.0],
...     'fare_per_person': [50.0, 33.3, 30.0, 50.0, 40.0]
... }
>>> fare_X = pl.DataFrame(fare_X)
>>> fare_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'fare', 'op': '>', 'value': 100},
...         {'column': 'fare_per_person', 'op': '>', 'other_column': 'fare'}
...     ],
...     new_column_names=['is_expensive', 'paid_more_per_person']
... )
>>> result = fare_transformer.fit_transform(fare_X)
>>> result
shape: (5, 4)
┌───────┬──────────────────┬──────────────┬──────────────────────┐
│ fare  ┆ fare_per_person  ┆ is_expensive ┆ paid_more_per_person │
│ ---   ┆ ---              ┆ ---          ┆ ---                  │
│ f64   ┆ f64              ┆ bool         ┆ bool                 │
╞═══════╪══════════════════╪══════════════╪══════════════════════╡
│ 50.0  ┆ 50.0             ┆ false        ┆ false                │
│ 100.0 ┆ 33.3             ┆ false        ┆ false                │
│ 30.0  ┆ 30.0             ┆ false        ┆ false                │
│ 200.0 ┆ 50.0             ┆ true         ┆ false                │
│ 80.0  ┆ 40.0             ┆ false        ┆ false                │
└───────┴──────────────────┴──────────────┴──────────────────────┘

Example 3: Titanic-style feature engineering

>>> titanic_X = {
...     'Age': [22.0, 38.0, 26.0, 35.0, 12.0],
...     'Pclass': [3, 1, 3, 1, 3],
...     'SibSp': [1, 1, 0, 1, 0],
...     'Parch': [0, 0, 0, 0, 1]
... }
>>> titanic_X = pl.DataFrame(titanic_X)
>>> # First add family_size
>>> titanic_X = titanic_X.with_columns(
...     (pl.col('SibSp') + pl.col('Parch')).alias('family_size')
... )
>>> titanic_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'Age', 'op': '<', 'value': 18},
...         {'column': 'Pclass', 'op': '==', 'value': 1},
...         {'column': 'family_size', 'op': '==', 'value': 0}
...     ],
...     new_column_names=['is_child', 'is_first_class', 'is_alone']
... )
>>> result = titanic_transformer.fit_transform(titanic_X)
>>> result.select(['Age', 'Pclass', 'family_size', 'is_child', 'is_first_class', 'is_alone'])
shape: (5, 6)
┌──────┬────────┬─────────────┬──────────┬────────────────┬──────────┐
│ Age  ┆ Pclass ┆ family_size ┆ is_child ┆ is_first_class ┆ is_alone │
│ ---  ┆ ---    ┆ ---         ┆ ---      ┆ ---            ┆ ---      │
│ f64  ┆ i64    ┆ i64         ┆ bool     ┆ bool           ┆ bool     │
╞══════╪════════╪═════════════╪══════════╪════════════════╪══════════╡
│ 22.0 ┆ 3      ┆ 1           ┆ false    ┆ false          ┆ false    │
│ 38.0 ┆ 1      ┆ 1           ┆ false    ┆ true           ┆ false    │
│ 26.0 ┆ 3      ┆ 0           ┆ false    ┆ false          ┆ true     │
│ 35.0 ┆ 1      ┆ 1           ┆ false    ┆ true           ┆ false    │
│ 12.0 ┆ 3      ┆ 1           ┆ true     ┆ false          ┆ false    │
└──────┴────────┴─────────────┴──────────┴────────────────┴──────────┘

Example 4: Auto-generated column names

>>> auto_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': '>=', 'value': 18},
...         {'column': 'amount', 'op': '>', 'value': 1000},
...         {'column': 'family_size', 'op': '==', 'value': 1}
...     ]
...     # new_column_names not specified - will be auto-generated
... )
>>> result = auto_transformer.fit_transform(X)
>>> result.select(['age', 'amount', 'family_size', 'age_gte_18', 'amount_gt_1000', 'family_size_eq_1'])
shape: (5, 6)
┌─────┬────────┬─────────────┬────────────┬────────────────┬──────────────────┐
│ age ┆ amount ┆ family_size ┆ age_gte_18 ┆ amount_gt_1000 ┆ family_size_eq_1 │
│ --- ┆ ---    ┆ ---         ┆ ---        ┆ ---            ┆ ---              │
│ i64 ┆ i64    ┆ i64         ┆ bool       ┆ bool           ┆ bool             │
╞═════╪════════╪═════════════╪════════════╪════════════════╪══════════════════╡
│ 15  ┆ 100    ┆ 1           ┆ false      ┆ false          ┆ true             │
│ 25  ┆ 1500   ┆ 3           ┆ true       ┆ true           ┆ false            │
│ 30  ┆ 500    ┆ 1           ┆ true       ┆ false          ┆ true             │
│ 17  ┆ 200    ┆ 4           ┆ false      ┆ false          ┆ false            │
│ 45  ┆ 2000   ┆ 2           ┆ true       ┆ true           ┆ false            │
└─────┴────────┴─────────────┴────────────┴────────────────┴──────────────────┘

Example 5: Null checks (unary operators)

>>> data_with_nulls = {
...     'age': [25, None, 30, 17, None],
...     'email': ['a@test.com', 'b@test.com', None, 'd@test.com', None],
...     'amount': [100, 1500, 500, 200, 2000]
... }
>>> X_nulls = pl.DataFrame(data_with_nulls)
>>> null_transformer = ConditionFeatures(
...     conditions=[
...         {'column': 'age', 'op': 'is_null'},
...         {'column': 'email', 'op': 'is_not_null'},
...         {'column': 'amount', 'op': '>', 'value': 1000}
...     ],
...     new_column_names=['age_missing', 'has_email', 'is_high_amount']
... )
>>> result = null_transformer.fit_transform(X_nulls)
>>> result
shape: (5, 6)
┌──────┬─────────────┬────────┬─────────────┬───────────┬─────────────────┐
│ age  ┆ email       ┆ amount ┆ age_missing ┆ has_email ┆ is_high_amount  │
│ ---  ┆ ---         ┆ ---    ┆ ---         ┆ ---       ┆ ---             │
│ i64  ┆ str         ┆ i64    ┆ bool        ┆ bool      ┆ bool            │
╞══════╪═════════════╪════════╪═════════════╪═══════════╪═════════════════╡
│ 25   ┆ a@test.com  ┆ 100    ┆ false       ┆ true      ┆ false           │
│ null ┆ b@test.com  ┆ 1500   ┆ true        ┆ true      ┆ true            │
│ 30   ┆ null        ┆ 500    ┆ false       ┆ false     ┆ false           │
│ 17   ┆ d@test.com  ┆ 200    ┆ false       ┆ true      ┆ false           │
│ null ┆ null        ┆ 2000   ┆ true        ┆ false     ┆ true            │
└──────┴─────────────┴────────┴─────────────┴───────────┴─────────────────┘

Notes

  • Each condition produces exactly one independent boolean column

  • Auto-naming: If new_column_names is None, names are auto-generated as:

    • Scalar: {column}_{op_name}_{value} (e.g., ‘age_gte_18’)

    • Column-to-column: {column}_{op_name}_{other_column} (e.g., ‘velocity_24h_gt_velocity_7d’)

    • Unary: {column}__{op_name} (e.g., ‘age__is_null’)

  • No combination logic - use RuleFeatures if you need AND/OR

  • Simpler API than RuleFeatures for common use cases

  • Missing values (null) in comparisons typically result in null/false

  • Unary operators ‘is_null’ and ‘is_not_null’ explicitly check for null values

  • Can be used as preprocessing step before RuleFeatures for complex logic

See also

RuleFeatures

For combining multiple conditions with AND/OR logic

fit(X, y=None)[source]#

Fit the transformer by generating column names if not provided.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Any | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ConditionFeatures

transform(X)[source]#

Transform the input DataFrame by creating boolean features for each condition.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with new boolean features (one per condition).

Return type:

DataFrame

class gators.feature_generation.DistanceFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Calculates distances between geographic coordinate pairs.

This transformer computes distances between consecutive pairs of latitude/longitude coordinates using different distance metrics (euclidean, manhattan, haversine) and units (km, miles, meters, feet).

For fraud detection, distance features are valuable for:

  • Detecting location anomalies (billing vs shipping address distance)

  • Identifying suspicious IP geolocation patterns

  • Flagging transactions far from customer’s typical location

  • Calculating travel feasibility (transaction velocity checks)

Parameters:
  • lats (List[str]) – List of latitude column names. Must have at least 2 elements. Coordinates are paired sequentially: (lats[0], longs[0]) to (lats[1], longs[1]), etc.

  • longs (List[str]) – List of longitude column names. Must have same length as lats.

  • unit (Literal["km", "miles", "meters", "feet"], default="km") – Unit for distance output.

  • method (Literal["euclidean", "manhattan", "haversine"], default="haversine") –

    Distance calculation method:

    • ‘haversine’: Great-circle distance on a sphere (recommended for lat/long)

    • ‘euclidean’: Straight-line distance

    • ‘manhattan’: Sum of absolute differences (taxicab distance)

  • drop_columns (bool, default=True) – Whether to drop the original coordinate columns.

  • new_column_names (Optional[List[str]], default=None) – Custom names for distance columns. If None, uses pattern: ‘distance__{lat1}_to_{lat2}__{method}_{unit}’

Examples

>>> from gators.feature_generation import DistanceFeatures
>>> import polars as pl

Example 1: Haversine distance between two locations

>>> X = pl.DataFrame({
...     'billing_lat': [40.7128, 34.0522, 41.8781],
...     'billing_long': [-74.0060, -118.2437, -87.6298],
...     'shipping_lat': [40.7580, 34.0522, 42.3601],
...     'shipping_long': [-73.9855, -118.2437, -71.0589]
... })
>>> transformer = DistanceFeatures(
...     lats=['billing_lat', 'shipping_lat'],
...     longs=['billing_long', 'shipping_long'],
...     method='haversine',
...     unit='km'
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['distance__billing_lat_to_shipping_lat__haversine_km']
>>> result['distance__billing_lat_to_shipping_lat__haversine_km'][0]
5.376...

Example 2: Multiple distance pairs

>>> X = pl.DataFrame({
...     'home_lat': [40.7128, 34.0522],
...     'home_long': [-74.0060, -118.2437],
...     'work_lat': [40.7580, 34.0700],
...     'work_long': [-73.9855, -118.3000],
...     'shop_lat': [40.7489, 34.0800],
...     'shop_long': [-73.9680, -118.3500]
... })
>>> transformer = DistanceFeatures(
...     lats=['home_lat', 'work_lat', 'shop_lat'],
...     longs=['home_long', 'work_long', 'shop_long'],
...     method='haversine',
...     unit='miles',
...     drop_columns=False
... )
>>> result = transformer.fit_transform(X)
>>> result.columns
['home_lat', 'home_long', 'work_lat', 'work_long', 'shop_lat', 'shop_long',
 'distance__home_lat_to_work_lat__haversine_miles',
 'distance__work_lat_to_shop_lat__haversine_miles']

Example 3: Euclidean distance

>>> X = pl.DataFrame({
...     'x1': [0.0, 1.0, 2.0],
...     'y1': [0.0, 1.0, 2.0],
...     'x2': [3.0, 4.0, 5.0],
...     'y2': [4.0, 5.0, 6.0]
... })
>>> transformer = DistanceFeatures(
...     lats=['x1', 'x2'],
...     longs=['y1', 'y2'],
...     method='euclidean',
...     unit='meters'
... )
>>> result = transformer.fit_transform(X)
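For reference, the haversine metric named above can be written in a few lines of plain Python. The Earth radius constant (6371 km) is a common convention and an assumption here, not necessarily what the library uses, so results may differ slightly from the doctest output:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/long points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Billing vs. shipping coordinates from Example 1 (first row).
d = haversine_km(40.7128, -74.0060, 40.7580, -73.9855)
print(round(d, 2))  # roughly 5.3 km
```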
fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

DistanceFeatures

transform(X)[source]#

Transform the input DataFrame by calculating distance features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with distance features.

Return type:

DataFrame

class gators.feature_generation.ScalarMathFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates new features by applying mathematical operations between columns and scalar values.

This transformer performs element-wise operations between a column and a scalar constant. Each operation creates one new feature column. For operations between multiple columns, use MathFeatures instead.

Use Cases:

  • Unit conversions (days to years, meters to feet, Celsius to Fahrenheit)

  • Normalization (divide by constant, multiply by scaling factor)

  • Feature scaling (percentage calculation, ratio computation)

  • Offset adjustments (add/subtract baseline values)

When to Use:

  • Need to apply arithmetic operations with fixed scalar values

  • Creating interpretable transformations (e.g., Age/365 for age_in_years)

  • Scaling features by known constants

  • Building feature sets for downstream models

When NOT to Use:

  • Operations between multiple columns (use MathFeatures)

  • Need learned scaling (use StandardScaler, MinMaxScaler)

  • Complex mathematical functions (use DataFrame.with_columns directly)

Parameters:
  • operations (List[Dict[str, Any]]) –

    List of operation dictionaries. Each operation creates one new feature column.

    Each operation dictionary must contain:

    • ’column’: str - Name of the column to operate on

    • ’op’: str - Arithmetic operator. Supported: ‘+’, ‘-’, ‘*’, ‘/’, ‘**’, ‘//’, ‘%’

    • ’scalar’: Any - Scalar value to combine with the column

  • new_column_names (Optional[List[str]], default=None) –

    Names for the resulting feature columns. If provided, must have the same length as operations. If None, column names are auto-generated in the format: {column}_{op_name}_{scalar} (e.g., ‘Age_div_365’, ‘Price_mul_1.1’)

    Operator name mapping: ‘+’ -> ‘plus’, ‘-’ -> ‘minus’, ‘*’ -> ‘mul’, ‘/’ -> ‘div’, ‘**’ -> ‘pow’, ‘//’ -> ‘floordiv’, ‘%’ -> ‘mod’
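The element-wise semantics and the operator name mapping above can be sketched with Python's operator module; this is a simplified emulation on plain lists, not the library source:

```python
import operator

# Documented name mapping paired with the corresponding Python functions.
SCALAR_OPS = {'+': ('plus', operator.add), '-': ('minus', operator.sub),
              '*': ('mul', operator.mul), '/': ('div', operator.truediv),
              '**': ('pow', operator.pow), '//': ('floordiv', operator.floordiv),
              '%': ('mod', operator.mod)}

def apply_scalar_op(values, op, scalar):
    """Apply one scalar operation element-wise; return (op_name, results)."""
    name, fn = SCALAR_OPS[op]
    return name, [fn(v, scalar) for v in values]

name, result = apply_scalar_op([25, 30, 45], '//', 10)
print(name, result)  # floordiv [2, 3, 4]
```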

Examples

>>> import polars as pl
>>> from gators.feature_generation import ScalarMathFeatures
>>> X = {
...     'Age': [25, 30, 45, 12, 65],
...     'Price': [100.0, 150.0, 200.0, 75.0, 300.0],
...     'Temperature': [20.0, 25.0, 15.0, 30.0, 22.0]
... }
>>> X = pl.DataFrame(X)

Example 1: Unit conversions with custom names

>>> transformer = ScalarMathFeatures(
...     operations=[
...         {'column': 'Age', 'op': '/', 'scalar': 365},
...         {'column': 'Temperature', 'op': '+', 'scalar': 273.15}
...     ],
...     new_column_names=['Age_years', 'Temperature_kelvin']
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['Age', 'Age_years', 'Temperature', 'Temperature_kelvin'])
shape: (5, 4)
┌─────┬───────────┬─────────────┬───────────────────┐
│ Age ┆ Age_years ┆ Temperature ┆ Temperature_kelvin│
│ --- ┆ ---       ┆ ---         ┆ ---               │
│ i64 ┆ f64       ┆ f64         ┆ f64               │
╞═════╪═══════════╪═════════════╪═══════════════════╡
│ 25  ┆ 0.068493  ┆ 20.0        ┆ 293.15            │
│ 30  ┆ 0.082192  ┆ 25.0        ┆ 298.15            │
│ 45  ┆ 0.123288  ┆ 15.0        ┆ 288.15            │
│ 12  ┆ 0.032877  ┆ 30.0        ┆ 303.15            │
│ 65  ┆ 0.178082  ┆ 22.0        ┆ 295.15            │
└─────┴───────────┴─────────────┴───────────────────┘

Example 2: Auto-generated column names

>>> auto_transformer = ScalarMathFeatures(
...     operations=[
...         {'column': 'Price', 'op': '*', 'scalar': 1.1},
...         {'column': 'Price', 'op': '/', 'scalar': 100}
...     ]
...     # new_column_names not specified - will be auto-generated
... )
>>> result = auto_transformer.fit_transform(X)
>>> result.select(['Price', 'Price_mul_1.1', 'Price_div_100'])
shape: (5, 3)
┌───────┬──────────────┬───────────────┐
│ Price ┆ Price_mul_1.1┆ Price_div_100 │
│ ---   ┆ ---          ┆ ---           │
│ f64   ┆ f64          ┆ f64           │
╞═══════╪══════════════╪═══════════════╡
│ 100.0 ┆ 110.0        ┆ 1.0           │
│ 150.0 ┆ 165.0        ┆ 1.5           │
│ 200.0 ┆ 220.0        ┆ 2.0           │
│ 75.0  ┆ 82.5         ┆ 0.75          │
│ 300.0 ┆ 330.0        ┆ 3.0           │
└───────┴──────────────┴───────────────┘

Example 3: Multiple operations (scaling, percentage, tax)

>>> multi_ops = ScalarMathFeatures(
...     operations=[
...         {'column': 'Price', 'op': '*', 'scalar': 1.2},  # 20% markup
...         {'column': 'Price', 'op': '/', 'scalar': 100},  # as percentage of 100
...         {'column': 'Age', 'op': '%', 'scalar': 10}      # age modulo 10
...     ],
...     new_column_names=['Price_with_tax', 'Price_pct', 'Age_decade_offset']
... )
>>> result = multi_ops.fit_transform(X)
>>> result.select(['Price', 'Price_with_tax', 'Price_pct', 'Age', 'Age_decade_offset'])
shape: (5, 5)
┌───────┬────────────────┬───────────┬─────┬───────────────────┐
│ Price ┆ Price_with_tax ┆ Price_pct ┆ Age ┆ Age_decade_offset │
│ ---   ┆ ---            ┆ ---       ┆ --- ┆ ---               │
│ f64   ┆ f64            ┆ f64       ┆ i64 ┆ i64               │
╞═══════╪════════════════╪═══════════╪═════╪═══════════════════╡
│ 100.0 ┆ 120.0          ┆ 1.0       ┆ 25  ┆ 5                 │
│ 150.0 ┆ 180.0          ┆ 1.5       ┆ 30  ┆ 0                 │
│ 200.0 ┆ 240.0          ┆ 2.0       ┆ 45  ┆ 5                 │
│ 75.0  ┆ 90.0           ┆ 0.75      ┆ 12  ┆ 2                 │
│ 300.0 ┆ 360.0          ┆ 3.0       ┆ 65  ┆ 5                 │
└───────┴────────────────┴───────────┴─────┴───────────────────┘

Example 4: Power and floor division

>>> power_ops = ScalarMathFeatures(
...     operations=[
...         {'column': 'Age', 'op': '**', 'scalar': 2},
...         {'column': 'Age', 'op': '//', 'scalar': 10}
...     ],
...     new_column_names=['Age_squared', 'Age_decade']
... )
>>> result = power_ops.fit_transform(X)
>>> result.select(['Age', 'Age_squared', 'Age_decade'])
shape: (5, 3)
┌─────┬─────────────┬────────────┐
│ Age ┆ Age_squared ┆ Age_decade │
│ --- ┆ ---         ┆ ---        │
│ i64 ┆ i64         ┆ i64        │
╞═════╪═════════════╪════════════╡
│ 25  ┆ 625         ┆ 2          │
│ 30  ┆ 900         ┆ 3          │
│ 45  ┆ 2025        ┆ 4          │
│ 12  ┆ 144         ┆ 1          │
│ 65  ┆ 4225        ┆ 6          │
└─────┴─────────────┴────────────┘

Notes

  • Each operation produces exactly one new feature column

  • Auto-naming: If new_column_names is None, names are auto-generated as: {column}_{op_name}_{scalar} (e.g., ‘Age_div_365’)

  • Operations are applied element-wise to each row

  • Division by zero will result in inf or null values (Polars default behavior)

  • Can chain multiple ScalarMathFeatures transformers in a pipeline

  • For learned transformations, consider sklearn scalers instead

See also

MathFeatures

For operations between multiple columns

ConditionFeatures

For creating boolean features from conditions

fit(X, y=None)[source]#

Fit the transformer by generating column names if not provided.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Any | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ScalarMathFeatures

transform(X)[source]#

Transform the input DataFrame by creating new features from scalar operations.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with new computed features (one per operation).

Return type:

DataFrame

class gators.feature_generation.RuleFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Creates multiple boolean features, each from a group of conditions combined with logical operators.

This transformer is useful for creating multiple rule-based features simultaneously, where each rule represents a distinct business logic or fraud detection pattern. Each rule group produces its own boolean output column.

Use Cases:

  • Fraud detection: Create multiple risk indicators (velocity spike, amount anomaly, etc.)

  • Business rules: Generate several eligibility/qualification flags at once

  • Feature engineering: Build a family of related boolean features

  • Production pipelines: Encapsulate multiple rule definitions in one transformer

When to Use:

  • Building production ML pipelines that need serialization

  • Creating reusable feature engineering templates

  • Working with sklearn-based systems that expect transformers

  • Need version control of feature logic (can serialize to JSON/YAML)

  • Want to create multiple related boolean features efficiently

When NOT to Use:

  • One-off exploratory analysis (use Polars native expressions)

  • Very complex nested logic within a single rule (consider Polars native)

  • Performance-critical scenarios where every microsecond counts

Parameters:
  • rules (List[List[Dict[str, Any]]]) –

    List of rule groups. Each rule group contains condition dictionaries that will be combined to create one boolean output column.

    Each condition dictionary must contain:

    • ’column’: str - Name of the column to evaluate

    • ’op’: str - Comparison operator. Supported: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’

    • ’value’: Any (optional) - Scalar value to compare the column against

    • ’other_column’: str (optional) - Name of another column to compare against

    Either ‘value’ or ‘other_column’ must be specified, but not both.

    Examples:

    # Two rules:
    [
        [{'column': 'age', 'op': '>', 'value': 18}],
        [{'column': 'amount', 'op': '>', 'value': 1000}]
    ]
    
    # Rule with multiple conditions:
    [
        [{'column': 'age', 'op': '>', 'value': 18},
         {'column': 'amount', 'op': '>', 'value': 1000}]
    ]
    

  • rule_logic (Literal['and', 'or'], default='and') –

    How to combine conditions within each rule group:

    • ’and’: All conditions in a group must be True

    • ’or’: At least one condition in a group must be True

  • new_column_names (List[str]) – Names for the resulting boolean feature columns. Must have the same length as rules. Each rule group will produce a column with the corresponding name.

  • drop_conditions (bool, default=False) – Whether to drop intermediate condition columns after combining. Recommended: True for cleaner output.

Examples

>>> import polars as pl
>>> from gators.feature_generation import RuleFeatures
>>> X = {
...     'amount': [100, 500, 1200, 50, 2000],
...     'velocity_24h': [1, 3, 5, 0, 10],
...     'velocity_7d': [5, 8, 10, 2, 15],
...     'is_new_user': [True, False, False, True, False]
... }
>>> X = pl.DataFrame(X)

Example 1: Create two risk indicators in one pass

>>> multi_risk_transformer = RuleFeatures(
...     rules=[
...         # Rule 1: Activity spike (24h > 0 AND 7d == 24h)
...         [
...             {'column': 'velocity_24h', 'op': '>', 'value': 0},
...             {'column': 'velocity_7d', 'op': '==', 'other_column': 'velocity_24h'}
...         ],
...         # Rule 2: High amount (amount > 1000)
...         [
...             {'column': 'amount', 'op': '>', 'value': 1000}
...         ]
...     ],
...     rule_logic='and',
...     new_column_names=['is_activity_spike', 'is_high_amount'],
...     drop_conditions=True
... )
>>> result = multi_risk_transformer.fit_transform(X)
>>> result.select(['velocity_24h', 'velocity_7d', 'amount',
...                'is_activity_spike', 'is_high_amount'])
shape: (5, 5)
┌──────────────┬─────────────┬────────┬────────────────────┬─────────────────┐
│ velocity_24h ┆ velocity_7d ┆ amount ┆ is_activity_spike  ┆ is_high_amount  │
│ ---          ┆ ---         ┆ ---    ┆ ---                ┆ ---             │
│ i64          ┆ i64         ┆ i64    ┆ bool               ┆ bool            │
╞══════════════╪═════════════╪════════╪════════════════════╪═════════════════╡
│ 1            ┆ 5           ┆ 100    ┆ false              ┆ false           │
│ 3            ┆ 8           ┆ 500    ┆ false              ┆ false           │
│ 5            ┆ 10          ┆ 1200   ┆ false              ┆ true            │
│ 0            ┆ 2           ┆ 50     ┆ false              ┆ false           │
│ 10           ┆ 15          ┆ 2000   ┆ false              ┆ true            │
└──────────────┴─────────────┴────────┴────────────────────┴─────────────────┘

Example 2: OR logic within a rule (high amount OR high velocity)

>>> or_transformer = RuleFeatures(
...     rules=[
...         [
...             {'column': 'amount', 'op': '>', 'value': 1000},
...             {'column': 'velocity_24h', 'op': '>=', 'value': 5}
...         ]
...     ],
...     rule_logic='or',
...     new_column_names=['is_high_risk'],
...     drop_conditions=True
... )
>>> result = or_transformer.fit_transform(X)
>>> result.select(['amount', 'velocity_24h', 'is_high_risk'])
shape: (5, 3)
┌────────┬──────────────┬──────────────┐
│ amount ┆ velocity_24h ┆ is_high_risk │
│ ---    ┆ ---          ┆ ---          │
│ i64    ┆ i64          ┆ bool         │
╞════════╪══════════════╪══════════════╡
│ 100    ┆ 1            ┆ false        │
│ 500    ┆ 3            ┆ false        │
│ 1200   ┆ 5            ┆ true         │
│ 50     ┆ 0            ┆ false        │
│ 2000   ┆ 10           ┆ true         │
└────────┴──────────────┴──────────────┘

Example 3: Multiple rules with different logic patterns

>>> complex_transformer = RuleFeatures(
...     rules=[
...         # New user AND high amount AND high velocity
...         [
...             {'column': 'is_new_user', 'op': '==', 'value': True},
...             {'column': 'amount', 'op': '>', 'value': 1000},
...             {'column': 'velocity_24h', 'op': '>', 'value': 3}
...         ],
...         # Very high velocity (simple rule)
...         [
...             {'column': 'velocity_24h', 'op': '>=', 'value': 10}
...         ]
...     ],
...     rule_logic='and',
...     new_column_names=['is_suspicious_new_user', 'is_extreme_velocity']
... )
>>> result = complex_transformer.fit_transform(X)
>>> result.select(['is_new_user', 'amount', 'velocity_24h',
...                'is_suspicious_new_user', 'is_extreme_velocity'])
shape: (5, 5)
┌─────────────┬────────┬──────────────┬─────────────────────────┬──────────────────────┐
│ is_new_user ┆ amount ┆ velocity_24h ┆ is_suspicious_new_user  ┆ is_extreme_velocity  │
│ ---         ┆ ---    ┆ ---          ┆ ---                     ┆ ---                  │
│ bool        ┆ i64    ┆ i64          ┆ bool                    ┆ bool                 │
╞═════════════╪════════╪══════════════╪═════════════════════════╪══════════════════════╡
│ true        ┆ 100    ┆ 1            ┆ false                   ┆ false                │
│ false       ┆ 500    ┆ 3            ┆ false                   ┆ false                │
│ false       ┆ 1200   ┆ 5            ┆ false                   ┆ false                │
│ true        ┆ 50     ┆ 0            ┆ false                   ┆ false                │
│ false       ┆ 2000   ┆ 10           ┆ false                   ┆ true                 │
└─────────────┴────────┴──────────────┴─────────────────────────┴──────────────────────┘

Notes

  • Each rule group produces one boolean output column

  • All conditions within a rule are evaluated independently before combining

  • Missing values (null) in comparisons typically result in null/false

  • Creates intermediate boolean columns, so use drop_conditions=True for cleaner output

  • To create a single column from multiple rules with complex logic (AND of ORs), use this transformer to create intermediate columns, then combine them manually

fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Any | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

RuleFeatures

transform(X)[source]#

Transform the input DataFrame by creating boolean features for each rule.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with new boolean features (one per rule).

Return type:

DataFrame

class gators.feature_generation.RowStatisticsFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates row-level aggregation features across groups of columns.

This transformer computes statistics (min, max, mean, median, std, range, sum) horizontally across specified column groups for each row. Unlike GroupRatioFeatures, which aggregates vertically (across rows within groups), it computes statistics across columns within each row.

Importance for Fraud Detection#

Row-level aggregation features are valuable in fraud detection because they capture relationships and patterns across related features within individual transactions. For example:

  • Computing statistics across multiple transaction amounts can reveal unusual patterns (e.g., all amounts being identical might indicate scripted fraud)

  • Aggregating across card verification fields can identify inconsistencies

  • Statistics across temporal features can detect velocity anomalies

  • Range calculations can flag suspiciously uniform or extreme value spreads

These features help models identify transactions where the distribution of values across related fields deviates from normal patterns, which is often indicative of fraudulent behavior.

Parameters:
  • column_groups (Dict[str, List[str]]) – Dictionary mapping group names to lists of column names. Each group defines a set of columns over which to compute row-level statistics. Example: {‘card_fields’: [‘card1’, ‘card2’, ‘card3’]}

  • func (List[str]) – List of aggregation functions to apply. Available options:

      • ‘min’: Row-wise minimum value

      • ‘max’: Row-wise maximum value

      • ‘mean’: Row-wise mean (average)

      • ‘median’: Row-wise median

      • ‘std’: Row-wise sample standard deviation (ddof=1)

      • ‘range’: Row-wise range (max - min)

      • ‘sum’: Row-wise sum

  • drop_columns (bool, default=False) – Whether to drop the original columns after creating the aggregation features.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the aggregation columns. If None, uses the default naming pattern ‘{group_name}__{func}’. Must have the same length as the total number of features created (len(column_groups) × len(func)).
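The default naming and the required length of new_column_names can be sketched in pure Python as:

```python
# Default naming pattern '{group_name}__{func}': one output column per
# (group, function) pair, i.e. len(column_groups) * len(func) in total.
column_groups = {"cluster_1": ["A", "B"], "cluster_2": ["C", "D"]}
func = ["min", "max", "range"]

default_names = [f"{g}__{f}" for g in column_groups for f in func]
print(default_names)
# ['cluster_1__min', 'cluster_1__max', 'cluster_1__range',
#  'cluster_2__min', 'cluster_2__max', 'cluster_2__range']
```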

Examples

>>> from gators.feature_generation import RowStatisticsFeatures
>>> import polars as pl

Example 1: Single group with multiple aggregations

>>> X = pl.DataFrame({
...     'A': [9, 9, 7],
...     'B': [3, 4, 5],
...     'C': [6, 7, 8]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={'cluster_1': ['A', 'B']},
...     func=['mean', 'std']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 5)
┌─────┬─────┬─────┬───────────────────┬──────────────────┐
│ A   ┆ B   ┆ C   ┆ cluster_1__mean   ┆ cluster_1__std   │
│ --- ┆ --- ┆ --- ┆ ---               ┆ ---              │
│ i64 ┆ i64 ┆ i64 ┆ f64               ┆ f64              │
╞═════╪═════╪═════╪═══════════════════╪══════════════════╡
│ 9   ┆ 3   ┆ 6   ┆ 6.0               ┆ 4.242641         │
│ 9   ┆ 4   ┆ 7   ┆ 6.5               ┆ 3.535534         │
│ 7   ┆ 5   ┆ 8   ┆ 6.0               ┆ 1.414214         │
└─────┴─────┴─────┴───────────────────┴──────────────────┘
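The std column is the sample standard deviation (ddof=1, polars' default). A pure-Python sketch of the row-wise statistics, checked against the first row ([9, 3]) of cluster_1 (an illustrative recomputation, not the library's code):

```python
import math

def row_stats(values, ddof=1):
    """Row-wise statistics over one row's values (sketch)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - ddof)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(var),
        "range": max(values) - min(values),
        "sum": sum(values),
    }

stats = row_stats([9, 3])  # first row of cluster_1 above
print(stats["mean"], round(stats["std"], 6))  # 6.0 4.242641
```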

Example 2: Multiple groups with different columns

>>> X = pl.DataFrame({
...     'A': [9, 9, 7],
...     'B': [3, 4, 5],
...     'C': [6, 7, 8],
...     'D': [1, 2, 3]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={
...         'cluster_1': ['A', 'B'],
...         'cluster_2': ['C', 'D']
...     },
...     func=['min', 'max', 'range']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 10)
┌─────┬─────┬─────┬─────┬──────────────┬──────────────┬─────────────────┬──────────────┬──────────────┬─────────────────┐
│ A   ┆ B   ┆ C   ┆ D   ┆ cluster_1__… ┆ cluster_1__… ┆ cluster_1__ran… ┆ cluster_2__… ┆ cluster_2__… ┆ cluster_2__ran… │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---          ┆ ---          ┆ ---             ┆ ---          ┆ ---          ┆ ---             │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64          ┆ i64          ┆ i64             ┆ i64          ┆ i64          ┆ i64             │
╞═════╪═════╪═════╪═════╪══════════════╪══════════════╪═════════════════╪══════════════╪══════════════╪═════════════════╡
│ 9   ┆ 3   ┆ 6   ┆ 1   ┆ 3            ┆ 9            ┆ 6               ┆ 1            ┆ 6            ┆ 5               │
│ 9   ┆ 4   ┆ 7   ┆ 2   ┆ 4            ┆ 9            ┆ 5               ┆ 2            ┆ 7            ┆ 5               │
│ 7   ┆ 5   ┆ 8   ┆ 3   ┆ 5            ┆ 7            ┆ 2               ┆ 3            ┆ 8            ┆ 5               │
└─────┴─────┴─────┴─────┴──────────────┴──────────────┴─────────────────┴──────────────┴──────────────┴─────────────────┘

Example 3: Using custom column names

>>> X = pl.DataFrame({
...     'amount1': [100, 200, 150],
...     'amount2': [50, 100, 75],
...     'amount3': [25, 50, 30]
... })
>>> transformer = RowStatisticsFeatures(
...     column_groups={'amounts': ['amount1', 'amount2', 'amount3']},
...     func=['mean', 'std'],
...     new_column_names=['avg_amount', 'std_amount']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (3, 5)
┌──────────┬──────────┬──────────┬────────────┬────────────┐
│ amount1  ┆ amount2  ┆ amount3  ┆ avg_amount ┆ std_amount │
│ ---      ┆ ---      ┆ ---      ┆ ---        ┆ ---        │
│ i64      ┆ i64      ┆ i64      ┆ f64        ┆ f64        │
╞══════════╪══════════╪══════════╪════════════╪════════════╡
│ 100      ┆ 50       ┆ 25       ┆ 58.333333  ┆ 38.188...  │
│ 200      ┆ 100      ┆ 50       ┆ 116.666... ┆ 76.376...  │
│ 150      ┆ 75       ┆ 30       ┆ 85.0       ┆ 60.621...  │
└──────────┴──────────┴──────────┴────────────┴────────────┘

Example 4: Fraud detection use case - card verification fields

>>> X = pl.DataFrame({
...     'card_cvv_match': [1, 0, 1, 1],
...     'card_addr_match': [1, 1, 0, 1],
...     'card_zip_match': [1, 1, 1, 0],
...     'is_fraud': [0, 1, 1, 1]
... })
>>> # Aggregate verification fields to detect inconsistencies
>>> transformer = RowStatisticsFeatures(
...     column_groups={'verification': ['card_cvv_match', 'card_addr_match', 'card_zip_match']},
...     func=['mean', 'std'],
...     drop_columns=False
... )
>>> result = transformer.fit_transform(X)
>>> result.select(['verification__mean', 'verification__std', 'is_fraud'])
shape: (4, 3)
┌─────────────────────┬────────────────────┬──────────┐
│ verification__mean  ┆ verification__std  ┆ is_fraud │
│ ---                 ┆ ---                ┆ ---      │
│ f64                 ┆ f64                ┆ i64      │
╞═════════════════════╪════════════════════╪══════════╡
│ 1.0                 ┆ 0.0                ┆ 0        │
│ 0.666667            ┆ 0.577350           ┆ 1        │
│ 0.666667            ┆ 0.577350           ┆ 1        │
│ 0.666667            ┆ 0.577350           ┆ 1        │
└─────────────────────┴────────────────────┴──────────┘
# Notice: legitimate transaction has perfect verification (mean=1, std=0)
# Fraudulent transactions show inconsistent verification patterns
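As in Example 1, the std here is the sample standard deviation (ddof=1, polars' default); a quick pure-Python check for a row with one failed verification, [0, 1, 1] (an illustrative recomputation, not library code):

```python
import math

values = [0, 1, 1]  # one failed verification field out of three
mean = sum(values) / len(values)
# Sample standard deviation (ddof=1), matching polars' default.
std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
print(round(mean, 6), round(std, 6))  # 0.666667 0.57735
```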

fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

RowStatisticsFeatures

transform(X)[source]#

Transform the input DataFrame by creating row-level aggregation features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with row-level aggregation features.

Return type:

DataFrame