Titanic Survival Prediction with Gators
This notebook demonstrates how to use the gators library for advanced feature engineering in a binary classification problem. We’ll predict passenger survival on the Titanic using a comprehensive set of feature transformations.
Key Features Demonstrated:
Null indicator features
String parsing and extraction (names, titles)
Mathematical feature engineering
Conditional features
Custom discretization (age binning)
Rare category encoding
Weight of Evidence (WOE) encoding
Feature interactions
1. Import Libraries
Import the necessary libraries, including the gators transformers used for feature engineering.
[1]:
import polars as pl
from IPython.display import display
from gators.pipeline import Pipeline
from gators.encoders import RareCategoryEncoder, WOEEncoder
from gators.discretizers import CustomDiscretizer
from gators.data_cleaning import DropColumns, CastColumns, RenameColumns
from gators.feature_generation import (
    IsNull,
    MathFeatures,
    ConditionFeatures,
    ScalarMathFeatures,
)
from gators.imputers import StringImputer, NumericImputer
from gators.feature_generation_str import (
    Length,
    SplitExtract,
    ExtractSubstring,
    InteractionFeatures,
)
from xgboost import XGBClassifier
2. Load Data
Load the Titanic dataset and prepare training and test sets.
[2]:
# Load train and test data
train = pl.read_csv('../../../kaggle/titanic/train.csv', null_values='NA')
test = pl.read_csv('../../../kaggle/titanic/test.csv', null_values='NA')
# Prepare training data
train = train.drop("PassengerId")
y_train = train['Survived']
X_train = train.drop('Survived')
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(test)}")
print(f"\nFeatures: {X_train.columns}")
Training samples: 891
Test samples: 418
Features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
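Before building the pipeline, it helps to check which columns actually contain nulls, since several of the steps below (IsNull and the two imputers) exist to handle them; for example:
# Count nulls per column to motivate the IsNull and imputer steps
display(X_train.null_count())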
3. Build Feature Engineering Pipeline
Create a comprehensive pipeline that demonstrates the power of gators transformers:
Missing Value Indicators:
IsNull: Create binary indicators for missing Age and Cabin values
String Feature Engineering:
Length: Calculate ticket string length
SplitExtract: Extract passenger title from name (e.g., ‘Mr.’, ‘Mrs.’, ‘Miss.’)
ExtractSubstring: Extract cabin deck letter from cabin number
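For intuition, here is roughly what the two chained SplitExtract steps compute, written as a plain polars expression (an illustration of the logic, not the gators internals): a name such as ‘Braund, Mr. Owen Harris’ is split on ', ' to keep ‘Mr. Owen Harris’, which is then split on '.' to keep the title ‘Mr’.
# Plain-polars sketch of the two-step title extraction
titles = X_train.select(
    pl.col('Name')
    .str.split(', ').list.get(1)   # 'Mr. Owen Harris'
    .str.split('.').list.get(0)    # 'Mr'
    .alias('Title')
)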
Mathematical Features:
MathFeatures: Sum SibSp and Parch to get family size components
ScalarMathFeatures: Add 1 to include the passenger (FamilySize = SibSp + Parch + 1)
MathFeatures: Calculate fare per person (Fare / FamilySize)
Conditional Features:
ConditionFeatures: Create ‘IsAlone’ indicator for passengers traveling solo
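Taken together, the family-size and solo-traveler steps amount to the following plain-polars logic (a sketch of what the transformers compute, not how gators implements it):
# Sketch of the FamilySize, FarePerPerson, and IsAlone logic in plain polars
family = X_train.with_columns(
    (pl.col('SibSp') + pl.col('Parch') + 1).alias('FamilySize')
).with_columns(
    (pl.col('Fare') / pl.col('FamilySize')).alias('FarePerPerson'),
    (pl.col('FamilySize') == 1).cast(pl.Int8).alias('IsAlone'),
)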
Data Cleaning:
RenameColumns: Give intuitive names to extracted features
DropColumns: Remove raw columns after feature extraction
NumericImputer: Fill missing numeric values with mean
StringImputer: Fill missing categorical values with ‘MISSING’
Discretization:
CustomDiscretizer: Bin ages into meaningful life-stage groups (child, teen, young adult, adult, senior)
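The bin edges [0, 12, 18, 35, 60, 100] produce five intervals. In plain polars the same binning could be written as a when/then chain (the labels here are illustrative; gators generates its own bin labels):
# Equivalent plain-polars binning for Age
age_group = (
    pl.when(pl.col('Age') <= 12).then(pl.lit('child'))
    .when(pl.col('Age') <= 18).then(pl.lit('teen'))
    .when(pl.col('Age') <= 35).then(pl.lit('young adult'))
    .when(pl.col('Age') <= 60).then(pl.lit('adult'))
    .otherwise(pl.lit('senior'))
    .alias('AgeGroup')
)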
Encoding:
RareCategoryEncoder: Group infrequent categories to reduce noise
CastColumns: Convert Pclass to string for categorical treatment
InteractionFeatures: Create feature interactions (e.g., Pclass × Age group)
WOEEncoder: Apply Weight of Evidence encoding for all categorical features
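Weight of Evidence replaces each category c with WOE(c) = ln(P(x = c | y = 1) / P(x = c | y = 0)), so categories over-represented among survivors get positive values and categories over-represented among non-survivors get negative ones. A minimal sketch of the computation for a single column (not the gators implementation, which must also handle unseen and zero-count categories):
# Sketch: Weight of Evidence values for one categorical column
def woe_for_column(df: pl.DataFrame, col: str, y: pl.Series) -> pl.DataFrame:
    stats = (
        df.select(pl.col(col))
        .with_columns(y.alias('y'))
        .group_by(col)
        .agg(
            pl.col('y').sum().alias('pos'),        # survivors per category
            (1 - pl.col('y')).sum().alias('neg'),  # non-survivors per category
        )
    )
    total_pos = y.sum()
    total_neg = len(y) - total_pos
    return stats.with_columns(
        ((pl.col('pos') / total_pos) / (pl.col('neg') / total_neg))
        .log()
        .alias('woe')
    )

# Example: inspect the WOE values the encoder would assign to Sex
display(woe_for_column(X_train, 'Sex', y_train))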
[3]:
# Define the feature engineering pipeline
steps = [
    # Create missing value indicators
    ('IsNull', IsNull(subset=['Age', 'Cabin'])),
    # String feature engineering
    ('Length', Length(subset=['Ticket'])),
    ('SplitExtractName', SplitExtract(subset=['Name'], by=', ', n=1)),
    ('SplitExtractTitle', SplitExtract(subset=['Name__split_,__1'], by='.', n=0)),
    # Calculate family size
    ('MathFeatures', MathFeatures(
        groups=[['SibSp', 'Parch']],
        operations=['sum'],
        new_column_names=['Dummy']
    )),
    ('ScalarMathFeatures', ScalarMathFeatures(
        operations=[{'column': 'Dummy_sum', 'op': '+', 'scalar': 1}],
        new_column_names=["FamilySize"]
    )),
    # Extract cabin deck
    ('ExtractSubstring', ExtractSubstring(subset=['Cabin'], start=0, end=1)),
    # Rename for clarity
    ('RenameColumns', RenameColumns(column_mapping={
        'Name__split_,__1__split_._0': 'Title',
        'Cabin__start0_end1': 'CabinDeck'
    })),
    # Handle rare categories
    ('RareCategoryEncoder', RareCategoryEncoder()),
    # Calculate fare per person
    ('MathFeatures2', MathFeatures(
        groups=[['Fare', 'FamilySize']],
        operations=['div'],
        new_column_names=['FarePerPerson']
    )),
    # Create 'traveling alone' indicator (a family size of 1 means solo)
    ('ConditionFeatures', ConditionFeatures(
        conditions=[{"column": "FamilySize", "op": "==", "value": 1}],
        new_column_names=['IsAlone']
    )),
    # Drop raw and intermediate columns after feature extraction
    ('DropColumns', DropColumns(subset=[
        'Name', 'Name__split_,__1', 'Cabin', 'Ticket', 'Dummy_sum'
    ])),
    # Impute missing values
    ('NumericImputer', NumericImputer(strategy='mean')),
    ('StringImputer', StringImputer(strategy='constant', value='MISSING')),
    # Discretize age into bins
    ('CustomDiscretizer', CustomDiscretizer(
        bins={'Age': [0, 12, 18, 35, 60, 100]},
        inplace=True
    )),
    # Convert passenger class to categorical
    ('CastColumns', CastColumns(subset=["Pclass"], dtype=pl.String)),
    # Create feature interactions
    ('InteractionFeatures', InteractionFeatures(
        subset=['Pclass', 'Age', 'CabinDeck', 'Embarked']
    )),
    # Apply Weight of Evidence encoding
    ('WOEEncoder', WOEEncoder()),
]
# Build and apply the pipeline
pipe = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipe.fit_transform(X_train, y_train)
X_test_transformed = pipe.transform(test)
print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
[Pipeline] Fitting and transforming step 1/18: IsNull
[Pipeline] Fitting and transforming step 2/18: Length
[Pipeline] Fitting and transforming step 3/18: SplitExtractName
[Pipeline] Fitting and transforming step 4/18: SplitExtractTitle
[Pipeline] Fitting and transforming step 5/18: MathFeatures
[Pipeline] Fitting and transforming step 6/18: ScalarMathFeatures
[Pipeline] Fitting and transforming step 7/18: ExtractSubstring
[Pipeline] Fitting and transforming step 8/18: RenameColumns
[Pipeline] Fitting and transforming step 9/18: RareCategoryEncoder
[Pipeline] Fitting and transforming step 10/18: MathFeatures2
[Pipeline] Fitting and transforming step 11/18: ConditionFeatures
[Pipeline] Fitting and transforming step 12/18: DropColumns
[Pipeline] Fitting and transforming step 13/18: NumericImputer
[Pipeline] Fitting and transforming step 14/18: StringImputer
[Pipeline] Fitting and transforming step 15/18: CustomDiscretizer
[Pipeline] Fitting and transforming step 16/18: CastColumns
[Pipeline] Fitting and transforming step 17/18: InteractionFeatures
[Pipeline] Fitting and transforming step 18/18: WOEEncoder
[Pipeline] Transforming step 1/18: IsNull
[Pipeline] Transforming step 2/18: Length
[Pipeline] Transforming step 3/18: SplitExtractName
[Pipeline] Transforming step 4/18: SplitExtractTitle
[Pipeline] Transforming step 5/18: MathFeatures
[Pipeline] Transforming step 6/18: ScalarMathFeatures
[Pipeline] Transforming step 7/18: ExtractSubstring
[Pipeline] Transforming step 8/18: RenameColumns
[Pipeline] Transforming step 9/18: RareCategoryEncoder
[Pipeline] Transforming step 10/18: MathFeatures2
[Pipeline] Transforming step 11/18: ConditionFeatures
[Pipeline] Transforming step 12/18: DropColumns
[Pipeline] Transforming step 13/18: NumericImputer
[Pipeline] Transforming step 14/18: StringImputer
[Pipeline] Transforming step 15/18: CustomDiscretizer
[Pipeline] Transforming step 16/18: CastColumns
[Pipeline] Transforming step 17/18: InteractionFeatures
[Pipeline] Transforming step 18/18: WOEEncoder
Original features: 10
Engineered features: 21
4. Train Model
Train an XGBoost classifier with parameters tuned for the Titanic dataset.
[4]:
# Define model parameters
imbalance_ratio = round((y_train == 0).sum() / (y_train == 1).sum(), 3)
params = {
    'n_estimators': 150,
    'max_depth': 4,
    'learning_rate': 0.03,
    'subsample': 0.85,
    'colsample_bytree': 0.85,
    'min_child_weight': 1,
    'gamma': 0.05,
    'reg_alpha': 0.5,
    'reg_lambda': 1.5,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'scale_pos_weight': imbalance_ratio,
    'random_state': 42
}
# Train the model
estimator = XGBClassifier(**params)
estimator.fit(X_train_transformed, y_train)
print("Model training completed successfully!")
print("Accuracy on training set: {:.3f}".format(estimator.score(X_train_transformed, y_train)))
Model training completed successfully!
Accuracy on training set: 0.880
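Training-set accuracy is optimistic: both the WOE encoding and the model have already seen these labels. A quick cross-validation sketch with scikit-learn (assuming it is installed) gives a less biased estimate; a fully rigorous version would refit the gators pipeline inside each fold to avoid target leakage from the WOE encoding:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    XGBClassifier(**params),
    X_train_transformed.to_pandas(),  # scikit-learn expects pandas/numpy input
    y_train.to_numpy(),
    cv=5,
    scoring='accuracy',
)
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))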
5. Analyze Feature Importance
Examine which engineered features contribute most to survival predictions.
[5]:
# Extract and display feature importances
feature_importance = pl.DataFrame({
    "feature": estimator.feature_names_in_,
    "importance": estimator.feature_importances_
}).sort("importance", descending=True)
print("Top 10 Most Important Features:")
display(feature_importance.head(10))
Top 10 Most Important Features:
| feature | importance |
|---|---|
| str | f32 |
| "Sex" | 0.252712 |
| "Title" | 0.2188 |
| "Pclass__Embarked" | 0.096314 |
| "FamilySize" | 0.053011 |
| "Pclass" | 0.047811 |
| "Age__CabinDeck" | 0.045299 |
| "Cabin__is_null" | 0.04449 |
| "Pclass__CabinDeck" | 0.038667 |
| "Pclass__Age" | 0.025938 |
| "FarePerPerson_div" | 0.020867 |
6. Generate Predictions
Generate survival predictions for the test set and create a submission file.
[6]:
# Generate predictions
y_pred = estimator.predict(X_test_transformed.drop("PassengerId"))
# Create submission file
submission = pl.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": y_pred
})
submission.write_csv("titanic_submission.csv")
print("Submission file created successfully!")
print(f"Predicted survival rate: {y_pred.mean():.2%}")
Submission file created successfully!
Predicted survival rate: 40.91%
Summary
This notebook showcases the gators library’s extensive capabilities for feature engineering in binary classification:
Key Accomplishments:
String Processing: Extracted titles from names and cabin decks from cabin numbers
Domain Knowledge Features: Created FamilySize, IsAlone, and FarePerPerson features
Missing Value Intelligence: Created IsNull indicators before imputation to preserve information
Smart Discretization: Binned ages into meaningful life stage categories
Advanced Encoding: Applied WOE encoding, mapping each category to its target log-odds
Feature Interactions: Generated interaction terms between key categorical features
Rare Category Handling: Automatically grouped infrequent categories to reduce noise
The gators library enabled the creation of 21 engineered features from just 10 original ones, demonstrating how domain knowledge can be efficiently encoded through a declarative pipeline approach. The WOEEncoder is particularly well suited to binary classification, since it computes encodings directly from the target distribution.