San Francisco Crime Classification with Gators#
This notebook demonstrates how to use the gators library for feature engineering in a multiclass classification problem. We’ll predict crime categories from the San Francisco Crime dataset using various gators transformers.
Key Features Demonstrated:#
Data preprocessing and cleaning
DateTime feature engineering (cyclic, ordinal, business time, holidays)
String feature engineering (pattern detection)
Spatial feature engineering (coordinate rotation)
Group-based imputation
Feature encoding and interaction
1. Import Libraries#
Import the necessary libraries including gators transformers for feature engineering.
[1]:
import polars as pl
import pandas as pd
from IPython.display import display
from gators.pipeline import Pipeline
from gators.encoders import OneHotEncoder, CountEncoder
from gators.data_cleaning import DropColumns, CastColumns
from gators.feature_generation import PlanRotationFeatures
from gators.imputers import GroupByImputer
from gators.feature_generation_dt import (
DatetimeCyclicFeatures,
DatetimeOrdinalFeatures,
BusinessTimeFeatures,
TimeBinFeatures,
HolidayFeatures,
)
from gators.feature_generation_str import (
Contains,
InteractionFeatures,
)
from xgboost import XGBClassifier
2. Load and Preprocess Data#
Load the San Francisco Crime dataset and perform initial preprocessing:
Remove redundant columns
Handle outlier coordinate values (replace invalid coordinates with null)
Create a distance-to-center feature using San Francisco’s geographic center
[2]:
# Load data
X_train = pl.read_parquet('../../../kaggle/sf/train.parquet')
X_test = pl.read_parquet('../../../kaggle/sf/test.parquet')
# Drop unnecessary columns (the test set keeps "Id" for the submission file)
X_train = X_train.drop(["Descript", "Resolution", "DayOfWeek"])
X_test = X_test.drop(["DayOfWeek"])
# Extract target variable
target = 'Category'
y_train = X_train[target]
X_train = X_train.drop(target)
# Replace invalid coordinate values with null
# (the dataset marks unknown locations with X = -120.5, Y = 90.0)
X_train = X_train.with_columns([
pl.when(pl.col('X') == -120.5).then(None).otherwise(pl.col('X')).alias('X'),
pl.when(pl.col('Y') == 90.0).then(None).otherwise(pl.col('Y')).alias('Y')
])
X_test = X_test.with_columns([
pl.when(pl.col('X') == -120.5).then(None).otherwise(pl.col('X')).alias('X'),
pl.when(pl.col('Y') == 90.0).then(None).otherwise(pl.col('Y')).alias('Y')
])
# Create distance-to-center feature (Euclidean distance in degree space:
# a rough proxy, adequate as a model feature, not a physical distance)
sf_center_x, sf_center_y = -122.4194, 37.7749
X_train = X_train.with_columns([
(((pl.col('X') - sf_center_x)**2 + (pl.col('Y') - sf_center_y)**2)**0.5).alias('distance_to_center')
])
X_test = X_test.with_columns([
(((pl.col('X') - sf_center_x)**2 + (pl.col('Y') - sf_center_y)**2)**0.5).alias('distance_to_center')
])
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
Training data shape: (878049, 6)
Test data shape: (884262, 7)
3. Prepare Target Variable#
Convert crime category labels to numeric format for model training.
[3]:
# Define all crime categories
crime_categories = [
"ARSON", "ASSAULT", "BAD CHECKS", "BRIBERY", "BURGLARY",
"DISORDERLY CONDUCT", "DRIVING UNDER THE INFLUENCE", "DRUG/NARCOTIC",
"DRUNKENNESS", "EMBEZZLEMENT", "EXTORTION", "FAMILY OFFENSES",
"FORGERY/COUNTERFEITING", "FRAUD", "GAMBLING", "KIDNAPPING",
"LARCENY/THEFT", "LIQUOR LAWS", "LOITERING", "MISSING PERSON",
"NON-CRIMINAL", "OTHER OFFENSES", "PORNOGRAPHY/OBSCENE MAT",
"PROSTITUTION", "RECOVERED VEHICLE", "ROBBERY", "RUNAWAY",
"SECONDARY CODES", "SEX OFFENSES FORCIBLE", "SEX OFFENSES NON FORCIBLE",
"STOLEN PROPERTY", "SUICIDE", "SUSPICIOUS OCC", "TREA", "TRESPASS",
"VANDALISM", "VEHICLE THEFT", "WARRANTS", "WEAPON LAWS"
]
# Convert target to numeric labels
# (replace preserves the string dtype, hence mapping to str(i) and casting afterwards)
mapping = {cat: str(i) for i, cat in enumerate(crime_categories)}
y_train = y_train.replace(mapping).cast(pl.Int64)
print(f"Number of classes: {len(crime_categories)}")
print(f"Target shape: {y_train.shape}")
Number of classes: 39
Target shape: (878049,)
4. Build Feature Engineering Pipeline#
The gators Pipeline orchestrates multiple feature transformers:
Data Cleaning & Casting:#
GroupByImputer: Fill missing X/Y coordinates using mean values grouped by police district (sketched just after this list)
CastColumns: Convert date strings to datetime type
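For intuition, the group-based fill can be written directly in polars. A minimal sketch, assuming GroupByImputer behaves like a per-group mean fill (the transformer itself learns the group means during fit and reuses them at transform time, whereas this expression recomputes them on the fly):

import polars as pl

# Fill missing coordinates with the mean of the same police district
X_filled = X_train.with_columns(
    pl.col("X").fill_null(pl.col("X").mean().over("PdDistrict")),
    pl.col("Y").fill_null(pl.col("Y").mean().over("PdDistrict")),
)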
Datetime Features:#
TimeBinFeatures: Create time-of-day bins (morning, afternoon, evening, night)
DatetimeCyclicFeatures: Generate cyclic features for temporal patterns via sin/cos transformations (see the sketch after this list)
BusinessTimeFeatures: Extract business-related time features (weekday vs weekend)
HolidayFeatures: Identify whether crimes occurred on holidays
DatetimeOrdinalFeatures: Create ordinal features (month, day, hour, etc.)
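Cyclic encoding maps a periodic component onto the unit circle so that boundary values stay close together (hour 23 sits next to hour 0). A minimal sketch of the standard sin/cos transform for the hour component, assuming Dates has already been cast to a datetime column; the gators transformer generalizes this with the phase angles passed in the pipeline below, which is where output names like Dates__minute__sin90 come from:

import math
import polars as pl

# Map hour-of-day (period 24) to a point on the unit circle
angle = 2 * math.pi * pl.col("Dates").dt.hour() / 24
df = X_train.with_columns(
    angle.sin().alias("Dates__hour__sin"),
    angle.cos().alias("Dates__hour__cos"),
)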
String Features:#
Contains: Detect patterns in address strings (e.g., contains 'Block', 'AV', 'ST')
InteractionFeatures: Create interactions between categorical features (e.g., day_of_week × part_of_day); both are sketched below
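Both string transformers reduce to simple column operations. An illustrative plain-polars equivalent (the output names mimic the gators naming convention; this is not the library's actual implementation):

import polars as pl

# Binary flag: does the address contain the token 'Block'?
X_flags = X_train.with_columns(
    pl.col("Address").str.contains("Block", literal=True)
    .cast(pl.Int8)
    .alias("Address__contains_Block")
)
# An interaction feature is just the concatenation of two categorical columns
X_inter = X_flags.with_columns(
    (pl.col("PdDistrict") + "__" + pl.col("Address")).alias("PdDistrict__x__Address")
)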
Spatial Features:#
PlanRotationFeatures: Rotate X/Y coordinates at multiple angles to capture spatial patterns
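Geometrically, rotating the plane by an angle theta maps (x, y) to (x*cos(theta) + y*sin(theta), y*cos(theta) - x*sin(theta)); projecting the coordinates onto several rotated axes gives a tree model split directions aligned with streets that do not run north to south. A sketch for a single 45-degree rotation (the transformer applies all eight angles at once; the output names below mirror those in the feature-importance table, e.g. XY_x315):

import math
import polars as pl

theta = math.radians(45)
X_rot = X_train.with_columns(
    (pl.col("X") * math.cos(theta) + pl.col("Y") * math.sin(theta)).alias("XY_x45"),
    (pl.col("Y") * math.cos(theta) - pl.col("X") * math.sin(theta)).alias("XY_y45"),
)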
Encoding:#
OneHotEncoder: Encode categorical variables as binary features
CountEncoder: Encode categories by their frequency
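For intuition, count encoding replaces each category with its frequency in the training data. A minimal plain-polars sketch (illustrative only; the gators encoder fits the counts once on the training set and reuses them at transform time):

import polars as pl

# Replace each district label with how often it appears in the training set
counts = X_train.group_by("PdDistrict").len().rename({"len": "PdDistrict__count"})
X_enc = X_train.join(counts, on="PdDistrict", how="left")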
[4]:
# Define datetime components to extract
ordinal_components = [
"month", "week", "day_of_week", "day_of_month",
"day_of_year", "hour", "minute", "weekend"
]
cyclic_components = [
"month", "week", "day_of_week", "day_of_month",
"day_of_year", "hour", "minute"
]
# Build the pipeline with gators transformers
steps = [
# Impute missing coordinates using group averages
("group_imputer", GroupByImputer(
group_by_column='PdDistrict',
strategy='mean',
subset=['X', 'Y']
)),
# Convert dates to datetime type
("cast", CastColumns(subset=['Dates'], dtype=pl.Datetime, inplace=True)),
# Extract time-of-day bins
("dt_timebin", TimeBinFeatures(subset=['Dates'])),
# Create cyclic datetime features (sin/cos at several phase angles,
# yielding columns like Dates__minute__sin90)
("dt_cyclic", DatetimeCyclicFeatures(
subset=['Dates'],
angles=[180 * i / 4 for i in range(8)],
components=cyclic_components
)),
# Extract business time features
("dt_business", BusinessTimeFeatures(subset=['Dates'])),
# Add holiday indicator
("dt_holiday", HolidayFeatures(subset=['Dates'], features=['is_holiday'])),
# Extract ordinal datetime components
("dt_ordinal", DatetimeOrdinalFeatures(subset=['Dates'], components=ordinal_components)),
# Extract patterns from address strings
("contains", Contains(contains_dict={'Address': ['/', 'Block', 'AV', 'ST']})),
# Drop original columns after feature extraction
("drop", DropColumns(subset=["Address", "Dates"])),
# Create interaction features
("interaction_time", InteractionFeatures(subset=['Dates__day_of_week', 'Dates__part_of_day'])),
("interaction_district", InteractionFeatures(subset=['PdDistrict', 'Dates__part_of_day'])),
# Rotate coordinates to capture spatial patterns
("plan_rotation", PlanRotationFeatures(
columns=[['X', 'Y']],
angles=[180 * i / 4 for i in range(8)]
)),
# Encode categorical variables
("onehot_encoder", OneHotEncoder()),
("count_encoder", CountEncoder()),
]
# Create and fit the pipeline
pipeline = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipeline.fit_transform(X_train, y_train)
X_test_transformed = pipeline.transform(X_test)
print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
[Pipeline] Fitting and transforming step 1/14: group_imputer
[Pipeline] Fitting and transforming step 2/14: cast
[Pipeline] Fitting and transforming step 3/14: dt_timebin
[Pipeline] Fitting and transforming step 4/14: dt_cyclic
[Pipeline] Fitting and transforming step 5/14: dt_business
[Pipeline] Fitting and transforming step 6/14: dt_holiday
[Pipeline] Fitting and transforming step 7/14: dt_ordinal
[Pipeline] Fitting and transforming step 8/14: contains
[Pipeline] Fitting and transforming step 9/14: drop
[Pipeline] Fitting and transforming step 10/14: interaction_time
[Pipeline] Fitting and transforming step 11/14: interaction_district
[Pipeline] Fitting and transforming step 12/14: plan_rotation
[Pipeline] Fitting and transforming step 13/14: onehot_encoder
[Pipeline] Fitting and transforming step 14/14: count_encoder
[Pipeline] Transforming step 1/14: group_imputer
[Pipeline] Transforming step 2/14: cast
[Pipeline] Transforming step 3/14: dt_timebin
[Pipeline] Transforming step 4/14: dt_cyclic
[Pipeline] Transforming step 5/14: dt_business
[Pipeline] Transforming step 6/14: dt_holiday
[Pipeline] Transforming step 7/14: dt_ordinal
[Pipeline] Transforming step 8/14: contains
[Pipeline] Transforming step 9/14: drop
[Pipeline] Transforming step 10/14: interaction_time
[Pipeline] Transforming step 11/14: interaction_district
[Pipeline] Transforming step 12/14: plan_rotation
[Pipeline] Transforming step 13/14: onehot_encoder
[Pipeline] Transforming step 14/14: count_encoder
Original features: 6
Engineered features: 189
5. Train Model and Generate Predictions#
First verify that the train and test feature sets align (the test set should only carry the extra Id column used for the submission), then train an XGBoost classifier on the engineered features and generate class-probability predictions.
[5]:
# Sanity check: columns present in the transformed test set but not in train
[c for c in X_test_transformed.columns if c not in X_train_transformed.columns]
[5]:
['Id']
[6]:
# Define model parameters
xgb_params = {
'max_depth': 6,
'learning_rate': 0.1,
'n_estimators': 100,
'min_child_weight': 5,
'subsample': 0.85,
'colsample_bytree': 0.85,
'max_delta_step': 2,
'tree_method': 'hist',
'eval_metric': 'mlogloss',
'random_state': 0,
'n_jobs': -1
}
# Train the model
estimator = XGBClassifier(**xgb_params)
estimator.fit(X_train_transformed, y_train)
print("Model training completed!")
# Generate predictions
submission = pd.DataFrame(
estimator.predict_proba(X_test_transformed.drop("Id")),
index=X_test["Id"].to_pandas(),
columns=crime_categories
)
submission.to_csv("submission.csv.zip", compression="zip")
print("Submission file created successfully!")
Model training completed!
Submission file created successfully!
6. Feature Importance Analysis#
Analyze which features contribute most to the model’s predictions.
[7]:
# Extract feature importances
feat_imp = pl.DataFrame({
"feature": estimator.feature_names_in_,
"importance": estimator.feature_importances_
})
feat_imp = feat_imp.sort("importance", descending=True)
# Display top 20 most important features
print("Top 20 Most Important Features:")
display(feat_imp.head(20))
Top 20 Most Important Features:
| feature | importance |
|---|---|
| str | f32 |
| "Address__contains_Block" | 0.047442 |
| "Address__contains_/" | 0.042024 |
| "Dates__minute__sin90" | 0.039019 |
| "XY_x315" | 0.033719 |
| "Dates__minute__sin270" | 0.032337 |
| … | … |
| "PdDistrict__PARK" | 0.012953 |
| "XY_y90" | 0.012377 |
| "Dates__hour__sin90" | 0.012303 |
| "PdDistrict__NORTHERN" | 0.011099 |
| "XY_x90" | 0.011072 |
Summary#
This notebook showed how the gators library streamlines feature engineering in machine-learning pipelines:
Key Takeaways:#
Modular Pipeline: Easily compose complex feature engineering workflows using gators transformers
Datetime Engineering: Rich set of temporal features (cyclic, ordinal, business time, holidays)
Spatial Features: Coordinate rotation and distance-based features for geographic data
String Processing: Pattern extraction from text fields
Smart Imputation: Group-based imputation leveraging categorical relationships
Seamless Integration: Works with Polars DataFrames and scikit-learn/XGBoost models
With a single declarative pipeline, gators turned the 6 raw input columns into 189 model-ready features for multiclass classification, keeping the feature engineering code compact and reproducible.