San Francisco Crime Classification with Gators#

This notebook demonstrates how to use the gators library for feature engineering in a multiclass classification problem. We’ll predict crime categories from the San Francisco Crime dataset using various gators transformers.

Key Features Demonstrated:#

  • Data preprocessing and cleaning

  • DateTime feature engineering (cyclic, ordinal, business time, holidays)

  • String feature engineering (pattern detection)

  • Spatial feature engineering (coordinate rotation)

  • Group-based imputation

  • Feature encoding and interaction

1. Import Libraries#

Import the necessary libraries including gators transformers for feature engineering.

[1]:
import polars as pl
import pandas as pd
from IPython.display import display

from gators.pipeline import Pipeline
from gators.encoders import OneHotEncoder, CountEncoder
from gators.data_cleaning import DropColumns, CastColumns
from gators.feature_generation import PlanRotationFeatures
from gators.imputers import GroupByImputer
from gators.feature_generation_dt import (
    DatetimeCyclicFeatures,
    DatetimeOrdinalFeatures,
    BusinessTimeFeatures,
    TimeBinFeatures,
    HolidayFeatures,
)
from gators.feature_generation_str import (
    Contains,
    InteractionFeatures,
)

from xgboost import XGBClassifier

2. Load and Preprocess Data#

Load the San Francisco Crime dataset and perform initial preprocessing:

  • Remove redundant columns

  • Handle outlier coordinate values (replace invalid coordinates with null)

  • Create a distance-to-center feature using San Francisco’s geographic center

[2]:
# Load data
X_train = pl.read_parquet('../../../kaggle/sf/train.parquet')
X_test = pl.read_parquet('../../../kaggle/sf/test.parquet')

# Drop unnecessary columns
X_train = X_train.drop(["Descript", "Resolution", "DayOfWeek"])
X_test = X_test.drop(["DayOfWeek"])

# Extract target variable
target = 'Category'
y_train = X_train[target]
X_train = X_train.drop(target)

# Replace invalid coordinate values with null
X_train = X_train.with_columns([
    pl.when(pl.col('X') == -120.5).then(None).otherwise(pl.col('X')).alias('X'),
    pl.when(pl.col('Y') == 90.0).then(None).otherwise(pl.col('Y')).alias('Y')
])

X_test = X_test.with_columns([
    pl.when(pl.col('X') == -120.5).then(None).otherwise(pl.col('X')).alias('X'),
    pl.when(pl.col('Y') == 90.0).then(None).otherwise(pl.col('Y')).alias('Y')
])

# Create distance to San Francisco center feature
sf_center_x, sf_center_y = -122.4194, 37.7749
X_train = X_train.with_columns([
    (((pl.col('X') - sf_center_x)**2 + (pl.col('Y') - sf_center_y)**2)**0.5).alias('distance_to_center')
])
X_test = X_test.with_columns([
    (((pl.col('X') - sf_center_x)**2 + (pl.col('Y') - sf_center_y)**2)**0.5).alias('distance_to_center')
])

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
Training data shape: (878049, 6)
Test data shape: (884262, 7)

3. Prepare Target Variable#

Convert crime category labels to numeric format for model training.

[3]:
# Define all crime categories
crime_categories = [
    "ARSON", "ASSAULT", "BAD CHECKS", "BRIBERY", "BURGLARY",
    "DISORDERLY CONDUCT", "DRIVING UNDER THE INFLUENCE", "DRUG/NARCOTIC",
    "DRUNKENNESS", "EMBEZZLEMENT", "EXTORTION", "FAMILY OFFENSES",
    "FORGERY/COUNTERFEITING", "FRAUD", "GAMBLING", "KIDNAPPING",
    "LARCENY/THEFT", "LIQUOR LAWS", "LOITERING", "MISSING PERSON",
    "NON-CRIMINAL", "OTHER OFFENSES", "PORNOGRAPHY/OBSCENE MAT",
    "PROSTITUTION", "RECOVERED VEHICLE", "ROBBERY", "RUNAWAY",
    "SECONDARY CODES", "SEX OFFENSES FORCIBLE", "SEX OFFENSES NON FORCIBLE",
    "STOLEN PROPERTY", "SUICIDE", "SUSPICIOUS OCC", "TREA", "TRESPASS",
    "VANDALISM", "VEHICLE THEFT", "WARRANTS", "WEAPON LAWS"
]

# Convert target to numeric labels
mapping = {cat: str(i) for i, cat in enumerate(crime_categories)}
y_train = y_train.replace(mapping).cast(pl.Int64)

print(f"Number of classes: {len(crime_categories)}")
print(f"Target shape: {y_train.shape}")
Number of classes: 39
Target shape: (878049,)

4. Build Feature Engineering Pipeline#

The gators Pipeline orchestrates multiple feature transformers:

Data Cleaning & Casting:#

  • GroupByImputer: Fill missing X/Y coordinates using mean values grouped by police district (a plain-Polars sketch of the idea follows this list)

  • CastColumns: Convert date strings to datetime type
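The group-based imputation can be pictured with a small plain-Polars sketch, reusing the pl import and the X_train frame from the cells above. This is only an illustration of the idea, not the gators implementation; inside the pipeline the transformer presumably learns the per-district means from the training data and reuses them on the test set.

# Illustrative sketch only: fill missing coordinates with the mean of the same
# police district, computed via a window expression over the raw training frame.
imputed_sketch = X_train.with_columns([
    pl.col('X').fill_null(pl.col('X').mean().over('PdDistrict')).alias('X'),
    pl.col('Y').fill_null(pl.col('Y').mean().over('PdDistrict')).alias('Y'),
])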

Datetime Features:#

  • TimeBinFeatures: Create time-of-day bins (morning, afternoon, evening, night)

  • DatetimeCyclicFeatures: Generate cyclic features for temporal patterns (sin/cos transformations; see the hour-of-day example after this list)

  • BusinessTimeFeatures: Extract business-related time features (weekday vs weekend)

  • HolidayFeatures: Identify whether crimes occurred on holidays

  • DatetimeOrdinalFeatures: Create ordinal features (month, day, hour, etc.)
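For intuition, a hand-rolled cyclic encoding of the hour component looks roughly like this. It is illustrative only: it assumes Dates is still an ISO-formatted string at this point (the CastColumns step handles the conversion inside the pipeline), and the gators transformer additionally applies the phase angles configured below.

import math

# Illustrative sketch only: map hour-of-day onto the unit circle so that
# 23:00 and 00:00 end up close to each other.
cyclic_sketch = X_train.select(
    pl.col('Dates').str.to_datetime().dt.hour().alias('hour')
).with_columns([
    (2 * math.pi * pl.col('hour') / 24).sin().alias('hour_sin'),
    (2 * math.pi * pl.col('hour') / 24).cos().alias('hour_cos'),
])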

String Features:#

  • Contains: Detect patterns in address strings (e.g., contains ‘/’, ‘Block’, ‘AV’, ‘ST’)

  • InteractionFeatures: Create interactions between categorical features (e.g., day_of_week × part_of_day); both string transformations are sketched below
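Both transformations can be sketched in plain Polars. The contains flags mirror the column names that show up in the feature-importance output later; the interaction column here is purely illustrative, since the pipeline below interacts derived datetime columns rather than the raw ones used in this sketch.

# Illustrative sketch only: binary pattern flags plus a categorical interaction
# built by string concatenation.
string_sketch = X_train.with_columns([
    pl.col('Address').str.contains('Block').cast(pl.Int8).alias('Address__contains_Block'),
    pl.col('Address').str.contains('/').cast(pl.Int8).alias('Address__contains_/'),
    pl.concat_str([pl.col('PdDistrict'), pl.col('Address')], separator='__')
        .alias('PdDistrict__Address'),
])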

Spatial Features:#

  • PlanRotationFeatures: Rotate X/Y coordinates at multiple angles to capture spatial patterns (the rotation formula is sketched below)
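The rotation itself is the standard 2-D rotation formula; for a single angle it can be hand-rolled as below. This is illustrative only, and the library's sign convention may differ; the column names simply mimic the XY_x315 / XY_y90 style seen in the feature importances later.

import math

# Illustrative sketch only: rotate the (X, Y) coordinates by 45 degrees.
# PlanRotationFeatures produces one such pair of columns per configured angle.
theta = math.radians(45)
rotation_sketch = X_train.with_columns([
    (pl.col('X') * math.cos(theta) + pl.col('Y') * math.sin(theta)).alias('XY_x45'),
    (pl.col('Y') * math.cos(theta) - pl.col('X') * math.sin(theta)).alias('XY_y45'),
])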

Encoding:#

  • OneHotEncoder: Encode categorical variables as binary features

  • CountEncoder: Encode categories by their frequency (both encoders are sketched below)
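For intuition, both encodings can also be sketched in plain Polars. This is illustrative only: the gators encoders are fitted on the training data inside the pipeline (and run after the raw Address column has been dropped), whereas this sketch works directly on the raw columns.

# Illustrative sketch only: one-hot encode the police district and count-encode
# the raw address column.
onehot_sketch = X_train.to_dummies(columns=['PdDistrict'])

address_counts = X_train.group_by('Address').len().rename({'len': 'Address__count'})
count_sketch = X_train.join(address_counts, on='Address', how='left')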

[4]:
# Define datetime components to extract
ordinal_components = [
    "month", "week", "day_of_week", "day_of_month",
    "day_of_year", "hour", "minute", "weekend"
]

cyclic_components = [
    "month", "week", "day_of_week", "day_of_month",
    "day_of_year", "hour", "minute"
]

# Build the pipeline with gators transformers
steps = [
    # Impute missing coordinates using group averages
    ("group_imputer", GroupByImputer(
        group_by_column='PdDistrict',
        strategy='mean',
        subset=['X', 'Y']
    )),

    # Convert dates to datetime type
    ("cast", CastColumns(subset=['Dates'], dtype=pl.Datetime, inplace=True)),

    # Extract time-of-day bins
    ("dt_timebin", TimeBinFeatures(subset=['Dates'])),

    # Create cyclic datetime features (captures cyclical nature of time)
    ("dt_cyclic", DatetimeCyclicFeatures(
        subset=['Dates'],
        angles=[180 * i / 4 for i in range(8)],
        components=cyclic_components
    )),

    # Extract business time features
    ("dt_business", BusinessTimeFeatures(subset=['Dates'])),

    # Add holiday indicator
    ("dt_holiday", HolidayFeatures(subset=['Dates'], features=['is_holiday'])),

    # Extract ordinal datetime components
    ("dt_ordinal", DatetimeOrdinalFeatures(subset=['Dates'], components=ordinal_components)),

    # Extract patterns from address strings
    ("contains", Contains(contains_dict={'Address': ['/', 'Block', 'AV', 'ST']})),

    # Drop original columns after feature extraction
    ("drop", DropColumns(subset=["Address", "Dates"])),

    # Create interaction features
    ("interaction_time", InteractionFeatures(subset=['Dates__day_of_week', 'Dates__part_of_day'])),
    ("interaction_district", InteractionFeatures(subset=['PdDistrict', 'Dates__part_of_day'])),

    # Rotate coordinates to capture spatial patterns
    ("plan_rotation", PlanRotationFeatures(
        columns=[['X', 'Y']],
        angles=[180 * i / 4 for i in range(8)]
    )),

    # Encode categorical variables
    ("onehot_encoder", OneHotEncoder()),
    ("count_encoder", CountEncoder()),
]

# Create and fit the pipeline
pipeline = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipeline.fit_transform(X_train, y_train)
X_test_transformed = pipeline.transform(X_test)

print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
[Pipeline] Fitting and transforming step 1/14: group_imputer
[Pipeline] Fitting and transforming step 2/14: cast
[Pipeline] Fitting and transforming step 3/14: dt_timebin
[Pipeline] Fitting and transforming step 4/14: dt_cyclic
[Pipeline] Fitting and transforming step 5/14: dt_business
[Pipeline] Fitting and transforming step 6/14: dt_holiday
[Pipeline] Fitting and transforming step 7/14: dt_ordinal
[Pipeline] Fitting and transforming step 8/14: contains
[Pipeline] Fitting and transforming step 9/14: drop
[Pipeline] Fitting and transforming step 10/14: interaction_time
[Pipeline] Fitting and transforming step 11/14: interaction_district
[Pipeline] Fitting and transforming step 12/14: plan_rotation
[Pipeline] Fitting and transforming step 13/14: onehot_encoder
[Pipeline] Fitting and transforming step 14/14: count_encoder
[Pipeline] Transforming step 1/14: group_imputer
[Pipeline] Transforming step 2/14: cast
[Pipeline] Transforming step 3/14: dt_timebin
[Pipeline] Transforming step 4/14: dt_cyclic
[Pipeline] Transforming step 5/14: dt_business
[Pipeline] Transforming step 6/14: dt_holiday
[Pipeline] Transforming step 7/14: dt_ordinal
[Pipeline] Transforming step 8/14: contains
[Pipeline] Transforming step 9/14: drop
[Pipeline] Transforming step 10/14: interaction_time
[Pipeline] Transforming step 11/14: interaction_district
[Pipeline] Transforming step 12/14: plan_rotation
[Pipeline] Transforming step 13/14: onehot_encoder
[Pipeline] Transforming step 14/14: count_encoder

Original features: 6
Engineered features: 189

5. Train Model and Generate Predictions#

Train an XGBoost classifier with the engineered features and generate predictions.

[5]:
# Columns present in the transformed test set but not in the training set
[c for c in X_test_transformed.columns if c not in X_train_transformed.columns]
[5]:
['Id']
[6]:
# Define model parameters
xgb_params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 100,
    'min_child_weight': 5,
    'subsample': 0.85,
    'colsample_bytree': 0.85,
    'max_delta_step': 2,
    'tree_method': 'hist',
    'eval_metric': 'mlogloss',
    'random_state': 0,
    'n_jobs': -1
}

# Train the model
estimator = XGBClassifier(**xgb_params)
estimator.fit(X_train_transformed, y_train)

print("Model training completed!")

# Generate predictions
submission = pd.DataFrame(
    estimator.predict_proba(X_test_transformed.drop("Id")),
    index=X_test["Id"].to_pandas(),
    columns=crime_categories
)
submission.to_csv("submission.csv.zip", compression="zip")
print("Submission file created successfully!")
Model training completed!
Submission file created successfully!

6. Feature Importance Analysis#

Analyze which features contribute most to the model’s predictions.

[7]:
# Extract feature importances
feat_imp = pl.DataFrame({
    "feature": estimator.feature_names_in_,
    "importance": estimator.feature_importances_
})
feat_imp = feat_imp.sort("importance", descending=True)

# Display top 20 most important features
print("Top 20 Most Important Features:")
display(feat_imp.head(20))
Top 20 Most Important Features:
shape: (20, 2)
feature                       importance
str                           f32
"Address__contains_Block"     0.047442
"Address__contains_/"         0.042024
"Dates__minute__sin90"        0.039019
"XY_x315"                     0.033719
"Dates__minute__sin270"       0.032337
…                             …
"PdDistrict__PARK"            0.012953
"XY_y90"                      0.012377
"Dates__hour__sin90"          0.012303
"PdDistrict__NORTHERN"        0.011099
"XY_x90"                      0.011072

Summary#

This notebook demonstrates the power of the gators library for feature engineering in machine learning pipelines:

Key Takeaways:#

  1. Modular Pipeline: Easily compose complex feature engineering workflows using gators transformers

  2. Datetime Engineering: Rich set of temporal features (cyclic, ordinal, business time, holidays)

  3. Spatial Features: Coordinate rotation and distance-based features for geographic data

  4. String Processing: Pattern extraction from text fields

  5. Smart Imputation: Group-based imputation leveraging categorical relationships

  6. Seamless Integration: Works with Polars DataFrames and scikit-learn/XGBoost models

With a single declarative pipeline, gators turned 6 raw columns into 189 engineered features, keeping the feature engineering for this multiclass classification problem compact and reproducible.