IEEE-CIS Fraud Detection with Gators#

This notebook demonstrates how to use the gators library for advanced feature engineering in fraud detection. We’ll predict fraudulent transactions by applying comprehensive feature transformations to a large-scale dataset with 590,000+ transactions and over 430 raw features.

Table of Contents#

  1. Import Libraries

  2. Load Data

  3. Build Feature Engineering Pipeline

  4. Train Model

  5. Analyze Feature Importance

  6. Summary

Key Features Demonstrated:#

  • Null indicators: Missing value patterns (critical with 75-99% missingness in some features)

  • DateTime features: Time-based patterns from transaction timestamps

  • Transaction amount: Discretization and ratio features

  • Email domain parsing: Split and compare purchaser vs recipient emails

  • Group aggregations: Card/device statistics (velocity features)

  • Interaction features: Categorical feature combinations (e.g., ProductCD × card type)

  • Rare category encoding: Handle high-cardinality categoricals

  • WOE encoding: Weight of Evidence optimized for binary classification

  • Correlation filtering: Remove redundant features

Dataset: IEEE-CIS Fraud Detection (Kaggle)

Reference notebook: https://www.kaggle.com/code/ysjf13/cis-fraud-detection-visualize-feature-engineering#Feature-Engineering

1. Import Libraries#

Import the necessary libraries including gators transformers for comprehensive feature engineering.

[1]:
import polars as pl
from datetime import datetime

from IPython.display import display

from gators.pipeline import Pipeline
from gators.encoders import RareCategoryEncoder, WOEEncoder
from gators.discretizers import GeometricDiscretizer
from gators.data_cleaning import (
    CorrelationFilter,
    DropConstantColumns
)
from gators.feature_generation import (
    IsNull,
    ComparisonFeatures,
    GroupScalingFeatures
)
from gators.feature_generation_dt import (
    TimeBinFeatures,
    OrdinalFeatures,
    CyclicFeatures,
    DurationToDatetime,
    BusinessTimeFeatures,
    HolidayFeatures
)
from gators.feature_generation_str import (
    Split,
    InteractionFeatures,
)
from gators.imputers import StringImputer, NumericImputer

2. Load Data#

Load the IEEE-CIS Fraud Detection datasets (transaction and identity) and merge them.

[2]:
# Load transaction and identity datasets
X_train = pl.read_csv('../../../kaggle/fraud/train_transaction.csv', null_values='NA')
identity_train = pl.read_csv('../../../kaggle/fraud/train_identity.csv', null_values='NA')
X_train = X_train.join(identity_train, on='TransactionID', how='left')

X_test = pl.read_csv('../../../kaggle/fraud/test_transaction.csv', null_values='NA')
identity_test = pl.read_csv('../../../kaggle/fraud/test_identity.csv', null_values='NA')
X_test = X_test.join(identity_test, on='TransactionID', how='left')

# Separate target variable and IDs
y_train = X_train['isFraud']
X_test_ids = X_test['TransactionID']
X_train = X_train.drop(['isFraud', 'TransactionID'])
X_test = X_test.drop('TransactionID')

# Fix column names (replace dashes with underscores)
to_rename = {c: c.replace('-', '_') for c in X_test.columns if '-' in c}
X_test = X_test.rename(to_rename)

print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print(f"Fraud rate: {y_train.mean():.2%}")
print(f"Features: {X_train.shape[1]}")
Training samples: 590,540
Test samples: 506,691
Fraud rate: 3.50%
Features: 432

3. Build Feature Engineering Pipeline#

Create a comprehensive pipeline that demonstrates advanced fraud detection features:

Missing Value Indicators:#

  • IsNull: Create binary indicators for all features (missing patterns are strong fraud signals)
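The idea reduces to one binary column per input column. A minimal stdlib sketch (the `__is_null` suffix here is illustrative, not necessarily gators’ exact naming):

```python
# Illustrative sketch of null-indicator features: one binary column per
# input column, flagging whether the value is missing.
rows = [
    {"card1": 1234, "P_emaildomain": None},
    {"card1": None, "P_emaildomain": "gmail.com"},
]
indicators = [
    {f"{col}__is_null": float(val is None) for col, val in row.items()}
    for row in rows
]
print(indicators[0])  # {'card1__is_null': 0.0, 'P_emaildomain__is_null': 1.0}
```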

DateTime Features:#

  • DurationToDatetime: Convert TransactionDT (seconds since reference date) to datetime

  • TimeBinFeatures: Extract time bins (part_of_day, rush_hour)

  • CyclicFeatures: Create sine/cosine features for cyclical time patterns (hour, day_of_week)

  • BusinessTimeFeatures: Is business hours indicator

  • HolidayFeatures: Holiday-related features (is_holiday, days_to/from_holiday)

  • OrdinalFeatures: Extract ordinal time components (hour, minute, day_of_week)
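The core conversions above are easy to reproduce with the standard library: TransactionDT is a duration in seconds added to the competition’s reference date, ordinal features read components off the resulting timestamp, and cyclic features map periodic components onto the unit circle so hour 23 and hour 0 end up close together:

```python
import math
from datetime import datetime, timedelta

START_DATE = datetime(2017, 11, 30)

# TransactionDT is seconds elapsed since the reference date
transaction_dt = 86400 + 14 * 3600           # 1 day + 14 hours
ts = START_DATE + timedelta(seconds=transaction_dt)
print(ts)                                    # 2017-12-01 14:00:00

# Ordinal components
hour, day_of_week = ts.hour, ts.weekday()    # 14, 4 (Friday)

# Cyclic (sine/cosine) encoding of the hour
hour_sin = math.sin(2 * math.pi * hour / 24)
hour_cos = math.cos(2 * math.pi * hour / 24)
```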

Transaction Amount Features:#

  • GeometricDiscretizer: Bin transaction amounts geometrically (captures fraud patterns across amount ranges)
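Geometric bins are log-spaced: each bin covers a constant multiplicative range, so small and large amounts get comparable resolution. A sketch of the idea (how GeometricDiscretizer actually picks its range and edges is an assumption here):

```python
import math

lo, hi, num_bins = 1.0, 10_000.0, 5
ratio = (hi / lo) ** (1.0 / num_bins)        # each bin is ~6.31x wider
edges = [lo * ratio**i for i in range(num_bins + 1)]
# edges ~ [1.0, 6.31, 39.8, 251.2, 1584.9, 10000.0]

def bin_of(amount: float) -> int:
    """Index of the geometric bin containing `amount`."""
    return min(int(math.log(amount / lo, ratio)), num_bins - 1)

print(bin_of(100.0))  # 2 -- mid-range amounts land in the middle bins
```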

Email Domain Features:#

  • Split: Split email domains by ‘.’ (e.g., ‘gmail.com’ → ‘gmail’, ‘com’)

  • ComparisonFeatures: Compare purchaser vs recipient email domains (mismatches indicate potential fraud)
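What these two steps produce can be sketched in plain Python (the ‘MISSING’ padding mirrors the StringImputer fill value; gators’ own padding behavior for short domains is an assumption):

```python
def split_domain(domain, by=".", max_splits=3):
    # Split an email domain into components, padding short domains
    parts = (domain or "MISSING").split(by)
    parts += ["MISSING"] * (max_splits - len(parts))
    return parts[:max_splits]

p = split_domain("gmail.com")    # ['gmail', 'com', 'MISSING']
r = split_domain("hotmail.com")  # ['hotmail', 'com', 'MISSING']

# Component-wise comparison: a purchaser/recipient mismatch on the
# first component is a potential fraud signal
matches = [a == b for a, b in zip(p, r)]
print(matches)  # [False, True, True]
```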

Group Aggregation Features:#

  • GroupScalingFeatures: Scale each transaction’s numeric values by statistics of its group

    • Groups by: ProductCD, card features, device info, email domains

    • Functions: mean, zscore, minmax

    • Creates velocity features like “transaction amount / mean amount for this card”
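A “velocity” feature of this kind reduces to a ratio against a per-group statistic; a minimal stdlib sketch of the mean-scaling case:

```python
from collections import defaultdict

# (card id, transaction amount) pairs
rows = [("card_a", 100.0), ("card_a", 300.0), ("card_b", 50.0)]

# Per-card mean amount
totals, counts = defaultdict(float), defaultdict(int)
for card, amt in rows:
    totals[card] += amt
    counts[card] += 1
means = {card: totals[card] / counts[card] for card in totals}

# Ratio of each transaction to its card's mean: values far from 1.0
# flag unusual behavior relative to that card's typical pattern
velocity = [amt / means[card] for card, amt in rows]
print(velocity)  # [0.5, 1.5, 1.0]
```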

Data Cleaning:#

  • NumericImputer: Fill missing numeric values with mean

  • StringImputer: Fill missing categorical values with ‘MISSING’

  • RareCategoryEncoder: Group rare categories (< 1% frequency) to reduce noise

Encoding:#

  • WOEEncoder: Weight of Evidence encoding for all categorical features (optimal for binary classification)
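For a single category, Weight of Evidence is the log ratio of that category’s share of one class to its share of the other; sign conventions and smoothing vary by implementation, so gators’ exact formula is not assumed here:

```python
import math

# A category seen in 30 of 100 fraud cases and 200 of 2000 legit cases
fraud_in_cat, fraud_total = 30, 100
legit_in_cat, legit_total = 200, 2000

# WOE with the fraud (event) rate in the numerator
woe = math.log((fraud_in_cat / fraud_total) / (legit_in_cat / legit_total))
print(round(woe, 3))  # 1.099 -- log(3): over-represented among fraud
```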

Feature Selection:#

  • DropConstantColumns: Remove columns with zero variance

  • CorrelationFilter: Remove highly correlated features (correlation > 0.75)
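A greedy version of correlation filtering (keep the first feature encountered, drop any later feature too correlated with one already kept) can be sketched as follows; whether CorrelationFilter uses exactly this greedy order is an assumption:

```python
def correlation_filter(corr, names, max_corr=0.75):
    """Greedy filter: keep a feature only if its |correlation| with
    every already-kept feature is at or below max_corr."""
    kept = []  # indices of retained features
    for i in range(len(names)):
        if all(abs(corr[i][j]) <= max_corr for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# "a" and "b" are highly correlated (0.9), so "b" is dropped
corr = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.2],
        [0.1, 0.2, 1.0]]
print(correlation_filter(corr, ["a", "b", "c"]))  # ['a', 'c']
```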

[3]:
# Define reference date and feature subsets
START_DATE = datetime(2017, 11, 30)

# Numeric features for group aggregations
subset_numeric = ["id_01", "id_02", "id_03", "id_04", "id_05", "id_06", "id_07",
                  "id_08", "id_09", "id_10", "id_11", "id_13", "id_14", "D15"]

# Categorical features (will be created/updated by pipeline)
string_columns = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain',
                  'id_12', 'id_15', 'id_16', 'id_28', 'id_29', 'id_30', 'id_31',
                  'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
                  'DeviceType', 'DeviceInfo',
                  'P_emaildomain__split_._0', 'P_emaildomain__split_._1', 'P_emaildomain__split_._2',
                  'R_emaildomain__split_._0', 'R_emaildomain__split_._1', 'R_emaildomain__split_._2',
                  'TransactionAmt__discretize_geom']

# DateTime components
cyclic_components = ['day_of_week', 'hour', 'minute', 'second']
ordinal_components = ['day_of_week', 'hour', 'minute', 'second']

# Build the pipeline
steps = [
    # 1. Missing value indicators
    ('IsNull', IsNull()),

    # 2. DateTime features
    ('DurationToDatetime', DurationToDatetime(
        subset=['TransactionDT'],
        reference_date=START_DATE,
        unit='s',
        drop_columns=True
    )),
    ('dt_timebin', TimeBinFeatures(
        subset=['TransactionDT__datetime'],
        bin_types=['part_of_day', 'rush_hour']
    )),
    ('dt_cyclic', CyclicFeatures(
        subset=['TransactionDT__datetime'],
        angles=[180 * i / 4 for i in range(8)],
        components=cyclic_components
    )),
    ('dt_business', BusinessTimeFeatures(subset=['TransactionDT__datetime'])),
    ('dt_holiday', HolidayFeatures(
        subset=['TransactionDT__datetime'],
        features=['is_holiday', 'days_to_holiday', 'days_from_holiday']
    )),
    ('dt_ordinal', OrdinalFeatures(
        subset=['TransactionDT__datetime'],
        components=ordinal_components,
        drop_columns=True
    )),

    # 3. Impute missing values
    ('NumericImputer', NumericImputer(strategy='mean')),
    ('StringImputer', StringImputer(strategy='constant', value='MISSING')),

    # 4. Handle rare categories
    ('RareCategoryEncoder', RareCategoryEncoder(min_count=0.01)),

    # 5. Email domain parsing
    ('Split', Split(
        subset=['P_emaildomain', 'R_emaildomain'],
        by='.',
        max_splits=3,
        drop_columns=False
    )),
    ('ComparisonFeatures', ComparisonFeatures(
        subset_a=['P_emaildomain', 'P_emaildomain__split_._0', 'P_emaildomain__split_._1', 'P_emaildomain__split_._2'],
        subset_b=['R_emaildomain', 'R_emaildomain__split_._0', 'R_emaildomain__split_._1', 'R_emaildomain__split_._2'],
        operators=['==', '==', '==', '==']
    )),

    # 6. Transaction amount discretization
    ('GeometricDiscretizer', GeometricDiscretizer(
        subset=['TransactionAmt'],
        num_bins=5,
        inplace=False
    )),

    # 7. Group aggregation features (velocity features)
    ('GroupScalingFeatures', GroupScalingFeatures(
        subset=subset_numeric,
        by=string_columns,
        func=['mean', 'zscore', 'minmax']
    )),

    # 8. Interaction features between string columns
    ('InteractionFeatures', InteractionFeatures(
        subset=string_columns,
    )),

    # 9. Weight of Evidence encoding
    ('WOEEncoder', WOEEncoder()),

    # 10. Feature selection
    ('DropConstantColumns', DropConstantColumns()),
    ('CorrelationFilter', CorrelationFilter(max_corr=0.75)),
]

print("Building feature engineering pipeline...")
pipe = Pipeline(steps=steps, verbose=True)

print("\nFitting and transforming training data...")
X_train_transformed = pipe.fit_transform(X_train, y_train)

print("\nTransforming test data...")
X_test_transformed = pipe.transform(X_test)

print(f"\n{'='*60}")
print(f"Original features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
print(f"Feature change: {X_train_transformed.shape[1] - X_train.shape[1]:+d} features")
print(f"{'='*60}")
Building feature engineering pipeline...

Fitting and transforming training data...
[Pipeline] Fitting and transforming step 1/18: IsNull
[Pipeline] Fitting and transforming step 2/18: DurationToDatetime
[Pipeline] Fitting and transforming step 3/18: dt_timebin
[Pipeline] Fitting and transforming step 4/18: dt_cyclic
[Pipeline] Fitting and transforming step 5/18: dt_business
[Pipeline] Fitting and transforming step 6/18: dt_holiday
[Pipeline] Fitting and transforming step 7/18: dt_ordinal
[Pipeline] Fitting and transforming step 8/18: NumericImputer
[Pipeline] Fitting and transforming step 9/18: StringImputer
[Pipeline] Fitting and transforming step 10/18: RareCategoryEncoder
[Pipeline] Fitting and transforming step 11/18: Split
[Pipeline] Fitting and transforming step 12/18: ComparisonFeatures
[Pipeline] Fitting and transforming step 13/18: GeometricDiscretizer
[Pipeline] Fitting and transforming step 14/18: GroupScalingFeatures
[Pipeline] Fitting and transforming step 15/18: InteractionFeatures
[Pipeline] Fitting and transforming step 16/18: WOEEncoder
[Pipeline] Fitting and transforming step 17/18: DropConstantColumns
[Pipeline] Fitting and transforming step 18/18: CorrelationFilter

Transforming test data...
[Pipeline] Transforming step 1/18: IsNull
[Pipeline] Transforming step 2/18: DurationToDatetime
[Pipeline] Transforming step 3/18: dt_timebin
[Pipeline] Transforming step 4/18: dt_cyclic
[Pipeline] Transforming step 5/18: dt_business
[Pipeline] Transforming step 6/18: dt_holiday
[Pipeline] Transforming step 7/18: dt_ordinal
[Pipeline] Transforming step 8/18: NumericImputer
[Pipeline] Transforming step 9/18: StringImputer
[Pipeline] Transforming step 10/18: RareCategoryEncoder
[Pipeline] Transforming step 11/18: Split
[Pipeline] Transforming step 12/18: ComparisonFeatures
[Pipeline] Transforming step 13/18: GeometricDiscretizer
[Pipeline] Transforming step 14/18: GroupScalingFeatures
[Pipeline] Transforming step 15/18: InteractionFeatures
[Pipeline] Transforming step 16/18: WOEEncoder
[Pipeline] Transforming step 17/18: DropConstantColumns
[Pipeline] Transforming step 18/18: CorrelationFilter

============================================================
Original features: 432
Engineered features: 229
Feature change: -203 features
============================================================

4. Train Model#

Train a LightGBM classifier with hyperparameters tuned for this fraud detection task, and inspect the class imbalance.

[4]:
from lightgbm import LGBMClassifier

# Calculate class imbalance ratio
imbalance_ratio = round((y_train == 0).sum() / (y_train == 1).sum(), 2)
print(f"Imbalance ratio (legitimate/fraud): {imbalance_ratio}:1")
print(f"Fraud rate: {y_train.mean():.2%}\n")

# Define model parameters
params = {
    'num_leaves': 256,
    'min_child_samples': 79,
    'objective': 'binary',
    'max_depth': 13,
    'learning_rate': 0.03,
    'boosting_type': 'gbdt',
    'subsample_freq': 3,
    'subsample': 0.9,
    'bagging_seed': 11,
    'metric': 'auc',
    'verbosity': -1,
    'reg_alpha': 0.3,
    'reg_lambda': 0.3,
    'colsample_bytree': 0.9,
    'random_state': 0,
    'n_jobs': -1,
}

print("Training LightGBM model...")
lgbm = LGBMClassifier(**params)
lgbm.fit(X_train_transformed.to_pandas(), y_train.to_pandas())

print("="*60)
print(f"Training accuracy: {lgbm.score(X_train_transformed.to_pandas(), y_train.to_pandas()):.4f}")
print("="*60)
print("\nModel training completed successfully!")
Imbalance ratio (legitimate/fraud): 27.58:1
Fraud rate: 3.50%

Training LightGBM model...
============================================================
Training accuracy: 0.9801
============================================================

Model training completed successfully!
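Note that the imbalance ratio computed above is printed but never passed to the model. If you want LightGBM to upweight the minority class, it accepts `scale_pos_weight` (a standard LightGBM parameter); whether it improves AUC here would need validation:

```python
# scale_pos_weight multiplies the loss contribution of positive (fraud)
# examples; a common starting point is the negative/positive ratio.
imbalance_ratio = 27.58  # legitimate / fraud, as computed above

params = {"objective": "binary", "metric": "auc"}
params_weighted = {**params, "scale_pos_weight": imbalance_ratio}
print(params_weighted["scale_pos_weight"])  # 27.58
```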

5. Analyze Feature Importance#

Examine which engineered features contribute most to fraud prediction. Features with the __ separator are engineered by gators transformers.

[5]:
# Extract feature importances
fi = pl.DataFrame({
    "feature": X_train_transformed.columns,
    "importance": lgbm.feature_importances_
})
fi = fi.sort("importance", descending=True)
fi = fi.with_columns((pl.col("importance") / pl.col("importance").max()).alias("importance_norm"))

# Flag engineered features (contain __ separator)
fi = fi.with_columns(pl.col("feature").str.contains("__").alias("is_engineered"))
fi = fi.with_row_index("rank")
fi = fi.with_columns((pl.col("rank") + 1).alias("rank"))

print("Top 20 Most Important Features for Fraud Detection:")
print("="*90)
display(fi.head(20))

print(f"\n{'='*90}")
top_20 = fi.head(20)
engineered_count = top_20.filter(pl.col("is_engineered")).shape[0]
original_count = 20 - engineered_count
print(f"Engineered features in top 20: {engineered_count}")
print(f"Original features in top 20: {original_count}")
print(f"\nTop 10 Engineered Features:")
print("="*90)
display(fi.filter(pl.col("is_engineered")).head(10))
Top 20 Most Important Features for Fraud Detection:
==========================================================================================
shape: (20, 5)
rank   feature                              importance   importance_norm   is_engineered
u32    str                                  i32          f64               bool
1      "card1"                              1625         1.0               false
2      "card2"                              1353         0.832615          false
3      "addr1"                              1148         0.706462          false
4      "C1"                                 1072         0.659692          false
5      "D1"                                 912          0.561231          false
…      …                                    …            …                 …
16     "P_emaildomain__TransactionAmt_…     353          0.217231          true
17     "id_03__minmax_R_emaildomain"        325          0.2               true
18     "C5"                                 320          0.196923          false
19     "card6"                              309          0.190154          false
20     "id_20"                              304          0.187077          false

==========================================================================================
Engineered features in top 20: 4
Original features in top 20: 16

Top 10 Engineered Features:
==========================================================================================
shape: (10, 5)
rank   feature                              importance   importance_norm   is_engineered
u32    str                                  i32          f64               bool
7      "TransactionDT__datetime__days_…     820          0.504615          true
8      "TransactionDT__datetime__days_…     738          0.454154          true
16     "P_emaildomain__TransactionAmt_…     353          0.217231          true
17     "id_03__minmax_R_emaildomain"        325          0.2               true
22     "id_09__minmax_P_emaildomain"        268          0.164923          true
23     "id_07__minmax_P_emaildomain"        255          0.156923          true
26     "TransactionDT__datetime__hour_…     238          0.146462          true
27     "TransactionDT__datetime__hour_…     232          0.142769          true
35     "id_05__minmax_P_emaildomain"        167          0.102769          true
38     "card4__P_emaildomain__split_._…     146          0.089846          true

Summary#

This notebook showcased the gators library’s extensive capabilities for feature engineering in fraud detection:

Key Accomplishments:#

  1. Missing Value Intelligence: Created IsNull indicators for all 432 features. Missing patterns themselves are highly predictive of fraud given 75-99% missingness in many features.

  2. DateTime Feature Engineering:

    • Converted transaction timestamps to datetime

    • Extracted time bins, cyclic patterns, business hours, holidays

    • Time-based patterns are critical fraud signals (e.g., unusual hours)

  3. Transaction Amount Engineering:

    • Geometric discretization into spending brackets

    • Ratio to group statistics (velocity features)

    • Transaction amount is one of the strongest fraud signals

  4. Email Domain Parsing:

    • Split email domains into components

    • Compared purchaser vs recipient emails

    • Mismatches are strong fraud indicators

  5. Group Aggregation Features (Velocity Features):

    • Calculated ratios to card/device/email group statistics

    • Example: “transaction amount / mean amount for this card”

    • Captures unusual behavior relative to typical patterns

  6. Advanced Encoding:

    • Rare category grouping for high-cardinality features

    • WOE encoding optimized for binary classification

    • Automatically calculates optimal encodings based on fraud/legitimate distribution

  7. Feature Selection:

    • Removed constant columns (zero variance)

    • Removed highly correlated features (> 0.75 correlation)

    • Reduces overfitting and improves model generalization

Feature Engineering Impact:#

  • Original features: 432

  • Final features after selection: 229

  • The pipeline generates many intermediate features (null indicators, datetime components, group scalings, string interactions), then DropConstantColumns and CorrelationFilter prune the redundant ones, leaving a compact, information-dense feature set.

Most Important Feature Types:#

  1. V-features - Vesta’s pre-engineered fraud features (V1-V339)

  2. Card features - card1, card2, card4, card6 and their interactions

  3. Group ratio features - Transaction ratios to group statistics (velocity features)

  4. IsNull indicators - Missingness patterns

  5. Transaction amount features - Discretized bins and ratios

  6. DateTime features - Hour, day, time bin patterns

  7. Device-Email patterns - Cross-device and email domain combinations

The gators library enabled efficient creation of fraud-detection features through a declarative pipeline approach. The GroupScalingFeatures and WOEEncoder steps are particularly powerful for fraud detection, producing features that sharpen the separation between fraudulent and legitimate transactions.

Key Insights:#

  • Engineered features (with __ separator) account for 4 of the top 20 features, including the two holiday-distance features at ranks 7 and 8

  • Group aggregation features (velocity features) are highly predictive

  • Missing value patterns are strong fraud signals

  • Temporal patterns (hour, day) indicate fraud behavior

  • Email and device mismatches flag suspicious transactions
