Titanic Survival Prediction with Gators#

This notebook demonstrates how to use the gators library for advanced feature engineering in a binary classification problem: predicting passenger survival on the Titanic.

Table of Contents#

  1. Import Libraries

  2. Load Data

  3. Build Feature Engineering Pipeline

  4. Train Model

  5. Analyze Feature Importance

  6. Generate Predictions

  7. Summary

Key Features Demonstrated:#

  • Null indicator features

  • String parsing and extraction (names, titles)

  • Mathematical feature engineering

  • Conditional features

  • Custom discretization (age binning)

  • Rare category encoding

  • Weight of Evidence (WOE) encoding

  • Feature interactions

Dataset: Kaggle Titanic - Machine Learning from Disaster

1. Import Libraries#

Import the required libraries, including the gators transformers used to build the feature engineering pipeline.

[1]:
import polars as pl
from IPython.display import display

from gators.pipeline import Pipeline
from gators.encoders import RareCategoryEncoder, WOEEncoder
from gators.discretizers import CustomDiscretizer
from gators.data_cleaning import DropColumns, CastColumns, RenameColumns
from gators.feature_generation import (
    IsNull,
    MathFeatures,
    ConditionFeatures,
    ScalarMathFeatures
)
from gators.imputers import StringImputer, NumericImputer
from gators.feature_generation_str import (
    Length,
    SplitExtract,
    ExtractSubstring,
    InteractionFeatures,
)

from xgboost import XGBClassifier

2. Load Data#

Load the Titanic dataset and prepare training and test sets.

[2]:
# Load train and test data
train = pl.read_csv('../../../kaggle/titanic/train.csv', null_values='NA')
test = pl.read_csv('../../../kaggle/titanic/test.csv', null_values='NA')

# Prepare training data
train = train.drop("PassengerId")
y_train = train['Survived']
X_train = train.drop('Survived')

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(test)}")
print(f"\nFeatures: {X_train.columns}")
Training samples: 891
Test samples: 418

Features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

3. Build Feature Engineering Pipeline#

Create a comprehensive pipeline that demonstrates the power of gators transformers:

Missing Value Indicators:#

  • IsNull: Create binary indicators for missing Age and Cabin values
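
A null indicator is just a per-row boolean flag. As a point of reference, here is a minimal plain-polars sketch of the idea (illustrative only; the exact column names gators produces appear later in the feature importance table):

# Illustrative only: null-indicator flags in plain polars
X_train.with_columns(
    pl.col('Age').is_null().alias('Age__is_null'),
    pl.col('Cabin').is_null().alias('Cabin__is_null'),
)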

String Feature Engineering:#

  • Length: Calculate ticket string length

  • SplitExtract: Extract the passenger title from the name (e.g., ‘Mr’, ‘Mrs’, ‘Miss’; see the sketch after this list)

  • ExtractSubstring: Extract cabin deck letter from cabin number
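
To make the string steps concrete, here is a rough plain-polars equivalent of the three extractions (a sketch of the idea, not the gators implementation; gators generates its own column names, which are renamed later in the pipeline):

# 'Braund, Mr. Owen Harris' -> 'Mr. Owen Harris' -> 'Mr'
X_train.with_columns(
    pl.col('Name').str.split(', ').list.get(1)    # part after the comma
      .str.split('.').list.get(0)                 # token before the first period
      .alias('Title'),
    pl.col('Cabin').str.slice(0, 1).alias('CabinDeck'),   # 'C85' -> 'C'
    pl.col('Ticket').str.len_chars().alias('Ticket__length'),
)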

Mathematical Features:#

  • MathFeatures: Sum SibSp and Parch (siblings/spouses plus parents/children aboard)

  • ScalarMathFeatures: Add 1 to include the passenger (FamilySize = SibSp + Parch + 1)

  • MathFeatures: Calculate fare per person (Fare / FamilySize)
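
In plain polars terms, these three steps amount to the following (a sketch using the same column names as the pipeline):

# FamilySize counts relatives aboard plus the passenger themselves
X_train.with_columns(
    (pl.col('SibSp') + pl.col('Parch') + 1).alias('FamilySize'),
).with_columns(
    (pl.col('Fare') / pl.col('FamilySize')).alias('FarePerPerson'),
)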

Conditional Features:#

  • ConditionFeatures: Create ‘IsAlone’ indicator for passengers traveling solo
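
The indicator is a single when/then expression; a minimal sketch, assuming a frame df that already contains the FamilySize column computed above:

# IsAlone is true exactly when FamilySize == 1 (no relatives aboard)
df.with_columns(
    pl.when(pl.col('FamilySize') == 1).then(1).otherwise(0).alias('IsAlone'),
)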

Data Cleaning:#

  • RenameColumns: Give intuitive names to extracted features

  • DropColumns: Remove raw columns after feature extraction

  • NumericImputer: Fill missing numeric values with mean

  • StringImputer: Fill missing categorical values with ‘MISSING’
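
Both imputers reduce to fill_null; the key point is that the fill values (e.g., the mean) are learned from the training set and reused on the test set, which is what fitting the pipeline takes care of. A minimal sketch:

# Numeric columns: fill with the training-set mean; string columns: fill with a sentinel
X_train.with_columns(
    pl.col('Age').fill_null(pl.col('Age').mean()),
    pl.col('Embarked').fill_null('MISSING'),
)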

Discretization:#

  • CustomDiscretizer: Bin ages into meaningful life-stage groups (child, teen, young adult, adult, senior; see the sketch below)
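
A rough equivalent using polars’ cut (bin edges match the pipeline cell below; the labels are illustrative, and cut leaves the outermost intervals open-ended where the pipeline uses explicit 0 and 100 bounds):

# (-inf, 12] child, (12, 18] teen, (18, 35] young adult, (35, 60] adult, (60, inf] senior
X_train.with_columns(
    pl.col('Age').cut(
        breaks=[12, 18, 35, 60],
        labels=['child', 'teen', 'young adult', 'adult', 'senior'],
    )
)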

Encoding:#

  • RareCategoryEncoder: Group infrequent categories to reduce noise

  • CastColumns: Convert Pclass to string for categorical treatment

  • InteractionFeatures: Create feature interactions (e.g., Pclass × Age group)

  • WOEEncoder: Apply Weight of Evidence encoding for all categorical features
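
Weight of Evidence encodes each category c as WOE(c) = ln(P(c | y=1) / P(c | y=0)): the log ratio of the category’s share among survivors to its share among non-survivors. Categories over-represented among survivors get positive values, under-represented ones negative values. A minimal sketch of that computation in plain polars (not the gators internals):

def woe_table(df: pl.DataFrame, col: str, y: pl.Series) -> pl.DataFrame:
    """Per-category Weight of Evidence for a binary target."""
    data = df.select(col).with_columns(y.alias('y'))
    pos_total = (y == 1).sum()  # number of survivors
    neg_total = (y == 0).sum()  # number of non-survivors
    return (
        data.group_by(col)
        .agg(
            pos=(pl.col('y') == 1).sum(),
            neg=(pl.col('y') == 0).sum(),
        )
        .with_columns(
            # log of (share of positives / share of negatives) per category
            woe=((pl.col('pos') / pos_total) / (pl.col('neg') / neg_total)).log()
        )
    )

woe_table(X_train, 'Sex', y_train)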

[3]:
# Define the feature engineering pipeline
steps = [
    # Create missing value indicators
    ('IsNull', IsNull(subset=['Age', 'Cabin'])),

    # String feature engineering
    ('Length', Length(subset=['Ticket'])),
    ('SplitExtractName', SplitExtract(subset=['Name'], by=', ', n=1)),
    ('SplitExtractTitle', SplitExtract(subset=['Name__split_,__1'], by='.', n=0)),

    # Calculate family size
    ('MathFeatures', MathFeatures(
        groups=[['SibSp', 'Parch']],
        operations=['sum'],
        new_column_names=['Dummy']
    )),
    ('ScalarMathFeatures', ScalarMathFeatures(
        operations=[{'column': 'Dummy_sum', 'op': '+', 'scalar': 1}],
        new_column_names=["FamilySize"]
    )),

    # Extract cabin deck
    ('ExtractSubstring', ExtractSubstring(subset=['Cabin'], start=0, end=1)),

    # Rename for clarity
    ('RenameColumns', RenameColumns(column_mapping={
        'Name__split_,__1__split_._0': 'Title',
        'Cabin__start0_end1': 'CabinDeck'
    })),

    # Handle rare categories
    ('RareCategoryEncoder', RareCategoryEncoder()),

    # Calculate fare per person
    ('MathFeatures2', MathFeatures(
        groups=[['Fare', 'FamilySize']],
        operations=['div'],
        new_column_names=['FarePerPerson']
    )),

    # Create 'traveling alone' indicator (true when FamilySize == 1)
    ('ConditionFeatures', ConditionFeatures(
        conditions=[{"column": "FamilySize", "op": "==", "value": 1}],
        new_column_names=['IsAlone']
    )),

    # Drop raw columns
    ('DropColumns', DropColumns(subset=['Cabin', 'Ticket', 'Dummy_sum'])),

    # Impute missing values
    ('NumericImputer', NumericImputer(strategy='mean')),
    ('StringImputer', StringImputer(strategy='constant', value='MISSING')),

    # Discretize age into bins
    ('CustomDiscretizer', CustomDiscretizer(
        bins={'Age': [0, 12, 18, 35, 60, 100]},
        inplace=True
    )),

    # Convert passenger class to categorical
    ('CastColumns', CastColumns(subset=["Pclass"], dtype=pl.String)),

    # Create feature interactions
    ('InteractionFeatures', InteractionFeatures(
        subset=['Pclass', 'Age', 'CabinDeck', 'Embarked']
    )),

    # Apply Weight of Evidence encoding
    ('WOEEncoder', WOEEncoder()),
]

# Build and apply the pipeline
pipe = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipe.fit_transform(X_train, y_train)
X_test_transformed = pipe.transform(test)

print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
[Pipeline] Fitting and transforming step 1/18: IsNull
[Pipeline] Fitting and transforming step 2/18: Length
[Pipeline] Fitting and transforming step 3/18: SplitExtractName
[Pipeline] Fitting and transforming step 4/18: SplitExtractTitle
[Pipeline] Fitting and transforming step 5/18: MathFeatures
[Pipeline] Fitting and transforming step 6/18: ScalarMathFeatures
[Pipeline] Fitting and transforming step 7/18: ExtractSubstring
[Pipeline] Fitting and transforming step 8/18: RenameColumns
[Pipeline] Fitting and transforming step 9/18: RareCategoryEncoder
[Pipeline] Fitting and transforming step 10/18: MathFeatures2
[Pipeline] Fitting and transforming step 11/18: ConditionFeatures
[Pipeline] Fitting and transforming step 12/18: DropColumns
[Pipeline] Fitting and transforming step 13/18: NumericImputer
[Pipeline] Fitting and transforming step 14/18: StringImputer
[Pipeline] Fitting and transforming step 15/18: CustomDiscretizer
[Pipeline] Fitting and transforming step 16/18: CastColumns
[Pipeline] Fitting and transforming step 17/18: InteractionFeatures
[Pipeline] Fitting and transforming step 18/18: WOEEncoder
[Pipeline] Transforming step 1/18: IsNull
[Pipeline] Transforming step 2/18: Length
[Pipeline] Transforming step 3/18: SplitExtractName
[Pipeline] Transforming step 4/18: SplitExtractTitle
[Pipeline] Transforming step 5/18: MathFeatures
[Pipeline] Transforming step 6/18: ScalarMathFeatures
[Pipeline] Transforming step 7/18: ExtractSubstring
[Pipeline] Transforming step 8/18: RenameColumns
[Pipeline] Transforming step 9/18: RareCategoryEncoder
[Pipeline] Transforming step 10/18: MathFeatures2
[Pipeline] Transforming step 11/18: ConditionFeatures
[Pipeline] Transforming step 12/18: DropColumns
[Pipeline] Transforming step 13/18: NumericImputer
[Pipeline] Transforming step 14/18: StringImputer
[Pipeline] Transforming step 15/18: CustomDiscretizer
[Pipeline] Transforming step 16/18: CastColumns
[Pipeline] Transforming step 17/18: InteractionFeatures
[Pipeline] Transforming step 18/18: WOEEncoder

Original features: 10
Engineered features: 21

4. Train Model#

Train an XGBoost classifier with parameters tuned for the Titanic dataset.

[4]:
# Ratio of negative to positive samples, used to rebalance classes via scale_pos_weight
imbalance_ratio = round((y_train == 0).sum() / (y_train == 1).sum(), 3)

# Define model parameters
params = {
    'n_estimators': 150,
    'max_depth': 4,
    'learning_rate': 0.03,
    'subsample': 0.85,
    'colsample_bytree': 0.85,
    'min_child_weight': 1,
    'gamma': 0.05,
    'reg_alpha': 0.5,
    'reg_lambda': 1.5,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'scale_pos_weight': imbalance_ratio,
    'random_state': 42
}

# Train the model
estimator = XGBClassifier(**params)
estimator.fit(X_train_transformed, y_train)


print("Model training completed successfully!")
print("Accuracy on training set: {:.3f}".format(estimator.score(X_train_transformed, y_train)))
Model training completed successfully!
Accuracy on training set: 0.880

5. Analyze Feature Importance#

Examine which engineered features contribute most to survival predictions.

[5]:
# Extract and display feature importances
feature_importance = pl.DataFrame({
    "feature": estimator.feature_names_in_,
    "importance": estimator.feature_importances_
}).sort("importance", descending=True)

print("Top 10 Most Important Features:")
display(feature_importance.head(10))
Top 10 Most Important Features:
shape: (10, 2)
┌─────────────────────┬────────────┐
│ feature             ┆ importance │
│ ---                 ┆ ---        │
│ str                 ┆ f32        │
╞═════════════════════╪════════════╡
│ "Sex"               ┆ 0.252712   │
│ "Title"             ┆ 0.2188     │
│ "Pclass__Embarked"  ┆ 0.096314   │
│ "FamilySize"        ┆ 0.053011   │
│ "Pclass"            ┆ 0.047811   │
│ "Age__CabinDeck"    ┆ 0.045299   │
│ "Cabin__is_null"    ┆ 0.04449    │
│ "Pclass__CabinDeck" ┆ 0.038667   │
│ "Pclass__Age"       ┆ 0.025938   │
│ "FarePerPerson_div" ┆ 0.020867   │
└─────────────────────┴────────────┘

6. Generate Predictions#

Generate survival predictions for the test set and create a submission file.

[6]:
# Generate predictions
y_pred = estimator.predict(X_test_transformed.drop("PassengerId"))

# Create submission file
submission = pl.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": y_pred
})
submission.write_csv("titanic_submission.csv")

print("Submission file created successfully!")
print(f"Predicted survival rate: {y_pred.mean():.2%}")
Submission file created successfully!
Predicted survival rate: 40.91%

Summary#

This notebook showcases the gators library’s extensive capabilities for feature engineering in binary classification:

Key Accomplishments:#

  1. String Processing: Extracted titles from names and cabin decks from cabin numbers

  2. Domain Knowledge Features: Created FamilySize, IsAlone, and FarePerPerson features

  3. Missing Value Intelligence: Created IsNull indicators before imputation to preserve information

  4. Smart Discretization: Binned ages into meaningful life stage categories

  5. Advanced Encoding: Applied WOE encoding to turn categorical features into target-informed numeric values

  6. Feature Interactions: Generated interaction terms between key categorical features

  7. Rare Category Handling: Automatically grouped infrequent categories to reduce noise

The gators library enabled the creation of 21 engineered features from the 10 original features, demonstrating how domain knowledge can be encoded efficiently through a declarative pipeline. The WOEEncoder is particularly well suited to binary classification, since it derives each category’s encoding directly from the target distribution.