Titanic Survival Prediction using Iguanas with Feature Engineering#
This notebook demonstrates a complete end-to-end example of using Iguanas for rule-based classification on the Kaggle Titanic dataset, with advanced feature engineering using the Gators library.
The workflow includes:
Loading and exploring the data
Feature engineering using Gators (new columns, transformations, encoding)
Generating candidate rules using XGBoost
Filtering and selecting high-quality rules
Combining rules using different strategies
Generating predictions for submission
1. Import Libraries#
[1]:
import numpy as np
import polars as pl
from gators.data_cleaning import CastColumns, DropColumns, RenameColumns
from gators.discretizers import CustomDiscretizer
from gators.encoders import RareCategoryEncoder, WOEEncoder
from gators.feature_generation import ConditionFeatures, IsNull, MathFeatures, ScalarMathFeatures
from gators.feature_generation_str import (
    ExtractSubstring,
    Length,
    SplitExtract,
)
from gators.imputers import NumericImputer, StringImputer
from gators.pipeline import Pipeline
from xgboost import XGBClassifier
from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_combination import (
    combine_rules_beam_search,
    combine_rules_cumulative,
    combine_rules_greedy,
)
from iguanas.rule_evaluation import apply_rules
from iguanas.rule_generation import rule_grid_search
from iguanas.rule_selection import filter_correlated_rules
2. Load and Prepare Data#
Load the Titanic training data and separate features from the target variable (Survived).
[2]:
train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]
3. Feature Engineering with Gators#
Build a comprehensive feature engineering pipeline using the Gators library. This pipeline will:
Create missing value indicators for Age and Cabin
Extract string features (name titles, cabin deck, ticket length)
Calculate family size and fare per person
Create categorical bins for age
Cast Pclass to a string so it is treated as a categorical feature
Apply Weight of Evidence (WOE) encoding for all categorical variables
Key transformations in this pipeline:
Missing value handling: Create indicators for missing Age/Cabin, then impute
Feature extraction: Extract passenger titles from names, cabin deck letters, ticket lengths
Feature creation: Calculate family size, fare per person, traveling alone indicator
Discretization: Convert continuous Age into categorical bins
Type casting: Cast Pclass to a string so it is encoded as a categorical feature
Encoding: Apply WOE encoding to convert all categorical features to numeric values
[3]:
# Define the feature engineering pipeline
steps = [
    # Create missing value indicators
    ("IsNull", IsNull(subset=["Age", "Cabin"])),
    # String feature engineering
    ("Length", Length(subset=["Ticket"])),
    ("SplitExtractName", SplitExtract(subset=["Name"], by=", ", n=1)),
    ("SplitExtractTitle", SplitExtract(subset=["Name__split_,__1"], by=".", n=0)),
    # Calculate family size (SibSp + Parch + 1)
    (
        "MathFeatures",
        MathFeatures(groups=[["SibSp", "Parch"]], operations=["sum"], new_column_names=["Dummy"]),
    ),
    (
        "ScalarMathFeatures",
        ScalarMathFeatures(
            operations=[{"column": "Dummy_sum", "op": "+", "scalar": 1}],
            new_column_names=["FamilySize"],
        ),
    ),
    # Extract cabin deck (first letter of cabin)
    ("ExtractSubstring", ExtractSubstring(subset=["Cabin"], start=0, end=1)),
    # Rename for clarity
    (
        "RenameColumns",
        RenameColumns(
            column_mapping={
                "Name__split_,__1__split_._0": "Title",
                "Cabin__start0_end1": "CabinDeck",
            }
        ),
    ),
    # Handle rare categories (group infrequent values)
    ("RareCategoryEncoder", RareCategoryEncoder(min_count=0.01)),
    # Calculate fare per person
    (
        "MathFeatures2",
        MathFeatures(
            groups=[["Fare", "FamilySize"]], operations=["div"], new_column_names=["FarePerPerson"]
        ),
    ),
    # Flag FamilySize > 1. Note: despite the name "IsAlone", this condition is
    # true for passengers travelling with family, not for those travelling alone.
    (
        "ConditionFeatures",
        ConditionFeatures(
            conditions=[{"column": "FamilySize", "op": ">", "value": 1}],
            new_column_names=["IsAlone"],
        ),
    ),
    # Drop raw columns no longer needed
    ("DropColumns", DropColumns(subset=["Cabin", "Ticket", "Dummy_sum"])),
    # Impute missing values
    ("NumericImputer", NumericImputer(strategy="mean")),
    ("StringImputer", StringImputer(strategy="constant", value="MISSING")),
    # Discretize age into bins
    ("CustomDiscretizer", CustomDiscretizer(bins={"Age": [0, 12, 18, 35, 60, 100]}, inplace=True)),
    # Convert passenger class to categorical
    ("CastColumns", CastColumns(subset=["Pclass"], dtype=pl.String)),
    # Apply Weight of Evidence encoding (converts all categorical features to numeric)
    ("WOEEncoder", WOEEncoder()),
]
# Build and fit the pipeline
pipe = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipe.fit_transform(X_train, y_train)
print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
[Pipeline] fit+transform 1/17 · IsNull | in: rows=891 cols=10 nulls=866 → out: rows=891 cols=12 nulls=866 (0.001s)
[Pipeline] fit+transform 2/17 · Length | in: rows=891 cols=12 nulls=866 → out: rows=891 cols=13 nulls=866 (0.001s)
[Pipeline] fit+transform 3/17 · SplitExtractName | in: rows=891 cols=13 nulls=866 → out: rows=891 cols=13 nulls=866 (0.001s)
[Pipeline] fit+transform 4/17 · SplitExtractTitle | in: rows=891 cols=13 nulls=866 → out: rows=891 cols=13 nulls=866 (0.001s)
[Pipeline] fit+transform 5/17 · MathFeatures | in: rows=891 cols=13 nulls=866 → out: rows=891 cols=14 nulls=866 (0.000s)
[Pipeline] fit+transform 6/17 · ScalarMathFeatures | in: rows=891 cols=14 nulls=866 → out: rows=891 cols=15 nulls=866 (0.000s)
[Pipeline] fit+transform 7/17 · ExtractSubstring | in: rows=891 cols=15 nulls=866 → out: rows=891 cols=16 nulls=1553 (0.000s)
[Pipeline] fit+transform 8/17 · RenameColumns | in: rows=891 cols=16 nulls=1553 → out: rows=891 cols=16 nulls=1553 (0.000s)
[Pipeline] fit+transform 9/17 · RareCategoryEncoder | in: rows=891 cols=16 nulls=1553 → out: rows=891 cols=16 nulls=1551 (0.002s)
[Pipeline] fit+transform 10/17 · MathFeatures2 | in: rows=891 cols=16 nulls=1551 → out: rows=891 cols=17 nulls=1551 (0.000s)
[Pipeline] fit+transform 11/17 · ConditionFeatures | in: rows=891 cols=17 nulls=1551 → out: rows=891 cols=18 nulls=1551 (0.000s)
[Pipeline] fit+transform 12/17 · DropColumns | in: rows=891 cols=18 nulls=1551 → out: rows=891 cols=15 nulls=864 (0.000s)
[Pipeline] fit+transform 13/17 · NumericImputer | in: rows=891 cols=15 nulls=864 → out: rows=891 cols=15 nulls=687 (0.000s)
[Pipeline] fit+transform 14/17 · StringImputer | in: rows=891 cols=15 nulls=687 → out: rows=891 cols=15 nulls=0 (0.000s)
[Pipeline] fit+transform 15/17 · CustomDiscretizer | in: rows=891 cols=15 nulls=0 → out: rows=891 cols=15 nulls=0 (0.000s)
[Pipeline] fit+transform 16/17 · CastColumns | in: rows=891 cols=15 nulls=0 → out: rows=891 cols=15 nulls=0 (0.000s)
[Pipeline] fit+transform 17/17 · WOEEncoder | in: rows=891 cols=15 nulls=0 → out: rows=891 cols=15 nulls=0 (0.005s)
Original features: 10
Engineered features: 15
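To make the final encoding step more concrete, the sketch below computes Weight of Evidence for a single categorical column in plain Python. It is an illustration under stated assumptions, not the Gators `WOEEncoder` implementation: the helper name `woe_table`, the additive `smoothing` constant, and the sign convention (log of the event share over the non-event share) are all choices made for this example.

```python
import math
from collections import Counter


def woe_table(categories, target, smoothing=0.5):
    """Return {category: WOE} for a binary target (1 = event, 0 = non-event).

    WOE(c) = ln(share of events in c / share of non-events in c).
    `smoothing` is a hypothetical additive constant that avoids log(0).
    """
    events = Counter(c for c, y in zip(categories, target) if y == 1)
    non_events = Counter(c for c, y in zip(categories, target) if y == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    table = {}
    for c in sorted(set(categories)):
        event_share = (events[c] + smoothing) / (total_events + smoothing)
        non_event_share = (non_events[c] + smoothing) / (total_non_events + smoothing)
        table[c] = math.log(event_share / non_event_share)
    return table


# Toy data (not the Titanic file): "female" rows survive more often than "male" rows.
sex = ["female", "female", "female", "male", "male", "male", "male", "female"]
survived = [1, 1, 0, 0, 0, 1, 0, 1]
woe = woe_table(sex, survived)
print(woe)
```

Under this convention a positive value means the category is over-represented among survivors. This is why the rules generated later are thresholds on encoded values (for example `X["Sex"] >= 1.52977`) rather than on raw labels.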
4. Generate Candidate Rules#
Use an XGBoost-based grid search to generate candidate rules from the engineered features. The rule_grid_search function trains models with different scale_pos_weight values (here, 50 log-spaced values from 1 to 1000) and extracts rules from the decision trees.
[4]:
estimator = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss", random_state=0)
rules = rule_grid_search(
    estimator, X_train_transformed, y_train, scale_pos_weights=np.logspace(0, 3, 50)
)
[5]:
print(f"Number of rules generated: {len(rules)}")
Number of rules generated: 1971
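As a rough illustration of how a decision tree can yield rules, the sketch below walks a toy tree and emits one rule string per positive leaf. The nested-dict tree layout, the `extract_rules` helper, and the `min_score` cut-off are hypothetical; this is neither XGBoost's internal tree format nor the Iguanas extraction code.

```python
def extract_rules(node, conditions=(), min_score=0.0):
    """Walk a toy binary tree and return one rule string per positive leaf.

    Internal nodes: {"feature", "threshold", "left", "right"}, where rows with
    feature < threshold go left. Leaves: {"score": float}.
    """
    if "score" in node:
        # Keep only root-to-leaf paths that end in a leaf voting for the positive class.
        if node["score"] > min_score and conditions:
            return [" & ".join(conditions)]
        return []
    feature, threshold = node["feature"], node["threshold"]
    left_cond = f'(X["{feature}"] < {threshold})'
    right_cond = f'(X["{feature}"] >= {threshold})'
    return (
        extract_rules(node["left"], conditions + (left_cond,), min_score)
        + extract_rules(node["right"], conditions + (right_cond,), min_score)
    )


# Toy tree with made-up thresholds and leaf scores.
toy_tree = {
    "feature": "Title",
    "threshold": 0.25,
    "left": {"score": -0.4},
    "right": {
        "feature": "FamilySize",
        "threshold": 5.0,
        "left": {"score": 0.6},
        "right": {"score": -0.2},
    },
}

toy_rules = extract_rules(toy_tree)
print(toy_rules)
```

Each path from the root to a positive leaf becomes a conjunction of threshold conditions, which matches the shape of the rule strings shown later in this notebook.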
5. Select High-Quality Rules#
Apply the generated rules to the training data, compute performance metrics, and filter based on:
Minimum precision (> 0.15)
Minimum recall (> 0.15)
Maximum correlation between rules (< 0.8)
This ensures we keep only the most useful and diverse rules.
[6]:
R = apply_rules(X_train_transformed, rules.select("rule").to_series().to_list())
M = compute_metrics(R, y_train)
M = M.filter((pl.col("precision") > 0.15) & (pl.col("recall") > 0.15)).sort(
    "accuracy", descending=True
)
importance = dict(zip(M["rule"], M["f0.5"], strict=False))
uncorrelated_rules = filter_correlated_rules(
    R[M["rule"].to_list()], importance=importance, max_corr=0.8
)
[7]:
num_rules = len(uncorrelated_rules)
print(f"Number of selected rules: {num_rules}")
Number of selected rules: 33
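One plausible way to filter correlated rules is a greedy pass in descending importance that drops any rule too correlated with one already kept. The sketch below shows that idea on toy 0/1 flags. It is an assumption about the general technique, not the `filter_correlated_rules` implementation; the helper names `pearson` and `filter_correlated` are made up for this example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length 0/1 sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def filter_correlated(flags, importance, max_corr=0.8):
    """Keep rules in descending importance, skipping any rule whose absolute
    correlation with an already-kept rule exceeds max_corr."""
    kept = []
    for name in sorted(flags, key=lambda r: importance[r], reverse=True):
        if all(abs(pearson(flags[name], flags[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


toy_flags = {
    "rule_a": [1, 1, 0, 0, 1, 0],
    "rule_b": [1, 1, 0, 0, 1, 0],  # identical to rule_a, so correlation is 1.0
    "rule_c": [0, 1, 1, 0, 0, 1],
}
toy_importance = {"rule_a": 0.80, "rule_b": 0.75, "rule_c": 0.60}
selected = filter_correlated(toy_flags, toy_importance)
print(selected)
```

Here `rule_b` is dropped because it duplicates the more important `rule_a`, while `rule_c` survives because it flags a different set of rows.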
6. Combine Rules#
Test different rule combination strategies to find the best performing ruleset.
6.1 Cumulative Combination#
Combines rules cumulatively (rule1 OR rule2 OR … OR ruleN):
[8]:
R_combined = combine_rules_cumulative(
    R[uncorrelated_rules], output_names=[f"combined_rule_{i}" for i in range(1, num_rules + 1)]
)
M_combined = compute_metrics(R_combined, y_train).sort("accuracy", descending=True)
M_combined.head(3)
[8]:
| rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "combined_rule_4" | 252 | 57 | 492 | 90 | 0.815534 | 0.736842 | 0.835017 | 34.680135 | 10.382514 | 0.810443 | 0.798479 | 0.774194 | 0.759388 | 0.751342 | 1 |
| "combined_rule_5" | 252 | 59 | 490 | 90 | 0.810289 | 0.736842 | 0.832772 | 34.904602 | 10.746812 | 0.805566 | 0.794451 | 0.771822 | 0.757982 | 0.750447 | 1 |
| "combined_rule_6" | 252 | 59 | 490 | 90 | 0.810289 | 0.736842 | 0.832772 | 34.904602 | 10.746812 | 0.805566 | 0.794451 | 0.771822 | 0.757982 | 0.750447 | 1 |
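The cumulative strategy can be pictured as a running element-wise OR over the rule columns. The following is a minimal sketch on toy boolean flags, assuming the rules are combined in the order of the input columns; the helper `cumulative_or` is hypothetical and is not the `combine_rules_cumulative` implementation.

```python
from itertools import accumulate

# One list of per-row flags for each rule, in combination order (toy data).
rule_flags = [
    [True, False, False, False],
    [False, True, False, False],
    [True, False, True, False],
]


def cumulative_or(rows):
    """Return the running element-wise OR: rows[0], rows[0] | rows[1], and so on."""
    return list(accumulate(rows, lambda acc, row: [a or b for a, b in zip(acc, row)]))


combined = cumulative_or(rule_flags)
for i, flags in enumerate(combined, start=1):
    print(f"combined_rule_{i}", flags)
```

Because each step can only add flagged rows, recall never decreases along the sequence while precision may fall. This is consistent with the best accuracy in the table above occurring part-way through (`combined_rule_4`) rather than at the last combination.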
6.2 Greedy Search#
Uses a greedy algorithm to iteratively select the best rule combination:
[9]:
R_greedy = combine_rules_greedy(R[uncorrelated_rules], y_train, metric="accuracy")
M_greedy = compute_metrics(R_greedy, y_train)
M_greedy
[9]:
| rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "((X["Title"] >= 0.25029) & (X[… | 255 | 58 | 491 | 87 | 0.814696 | 0.745614 | 0.837262 | 35.129068 | 10.564663 | 0.81028 | 0.799875 | 0.778626 | 0.765589 | 0.758477 | 5 |
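A greedy combination can be sketched as forward selection: start from an empty ruleset and repeatedly add the single rule that most improves the chosen metric, stopping when nothing helps. The code below is an illustrative assumption about the technique, not the `combine_rules_greedy` implementation; the helpers `accuracy` and `greedy_or` and the toy data are invented for this example.

```python
def accuracy(flags, y):
    """Fraction of rows where the boolean flag matches the 0/1 label."""
    return sum(int(f) == t for f, t in zip(flags, y)) / len(y)


def greedy_or(rule_flags, y):
    """Forward selection: repeatedly add the rule whose OR with the current
    selection gives the highest accuracy; stop when accuracy no longer improves."""
    current = [False] * len(y)
    chosen, best = [], accuracy(current, y)
    while True:
        candidates = {
            name: [a or b for a, b in zip(current, flags)]
            for name, flags in rule_flags.items()
            if name not in chosen
        }
        if not candidates:
            break
        name = max(candidates, key=lambda r: accuracy(candidates[r], y))
        score = accuracy(candidates[name], y)
        if score <= best:
            break
        current, chosen, best = candidates[name], chosen + [name], score
    return chosen, best


y_toy = [1, 1, 1, 0, 0, 0]
toy_rule_flags = {
    "r1": [True, True, False, False, False, False],
    "r2": [False, False, True, False, False, False],
    "r3": [True, True, True, True, True, False],
}
chosen, best = greedy_or(toy_rule_flags, y_toy)
print(chosen, best)
```

On this toy data the search picks `r1`, then `r2`, and stops because adding `r3` would introduce false positives.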
6.3 Beam Search#
Uses beam search to explore rule combinations up to a maximum number of rules:
[10]:
R_beam = combine_rules_beam_search(R[uncorrelated_rules], y_train, metric="accuracy", max_rules=10)
M_beam = compute_metrics(R_beam, y_train)
M_beam.head(3)
[10]:
| rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "((X["Title"] >= 0.25029) & (X[… | 255 | 58 | 491 | 87 | 0.814696 | 0.745614 | 0.837262 | 35.129068 | 10.564663 | 0.81028 | 0.799875 | 0.778626 | 0.765589 | 0.758477 | 4 |
| "((X["Title"] >= 0.25029) & (X[… | 255 | 58 | 491 | 87 | 0.814696 | 0.745614 | 0.837262 | 35.129068 | 10.564663 | 0.81028 | 0.799875 | 0.778626 | 0.765589 | 0.758477 | 5 |
| "((X["Title"] >= 0.25029) & (X[… | 255 | 58 | 491 | 87 | 0.814696 | 0.745614 | 0.837262 | 35.129068 | 10.564663 | 0.81028 | 0.799875 | 0.778626 | 0.765589 | 0.758477 | 5 |
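Beam search generalises the greedy idea by keeping several partial rulesets alive at each step instead of one. The sketch below is a simplified illustration under assumptions: the `beam_width` parameter, the helper names, and the toy data are hypothetical, and the real `combine_rules_beam_search` may differ in scoring, tie-breaking, and what it returns.

```python
def accuracy(flags, y):
    """Fraction of rows where the boolean flag matches the 0/1 label."""
    return sum(int(f) == t for f, t in zip(flags, y)) / len(y)


def beam_search_or(rule_flags, y, beam_width=2, max_rules=3):
    """Keep the beam_width best OR-combinations at each size up to max_rules
    and return (rule names, accuracy) for the best combination seen."""
    empty = [False] * len(y)
    beam = [((), empty)]  # each entry: (sorted rule names, combined flags)
    best = ((), accuracy(empty, y))
    for _ in range(max_rules):
        expanded = {}
        for names, flags in beam:
            for name, rf in rule_flags.items():
                if name in names:
                    continue
                key = tuple(sorted(names + (name,)))  # de-duplicate orderings
                expanded[key] = [a or b for a, b in zip(flags, rf)]
        if not expanded:
            break
        ranked = sorted(expanded.items(), key=lambda kv: accuracy(kv[1], y), reverse=True)
        beam = ranked[:beam_width]
        top_names, top_flags = beam[0]
        top_score = accuracy(top_flags, y)
        if top_score > best[1]:
            best = (top_names, top_score)
    return best


y_toy = [1, 1, 1, 1, 1, 1, 0, 0]
toy_rule_flags = {
    "rA": [True, True, True, True, True, False, True, False],  # best alone, one false positive
    "rB": [True, True, True, False, False, False, False, False],
    "rC": [False, False, False, True, True, True, False, False],
}
best_names, best_score = beam_search_or(toy_rule_flags, y_toy)
print(best_names, best_score)
```

In this toy case `rA` is the best single rule, but any ruleset containing it inherits its false positive and can reach at most 0.875 accuracy. Keeping `rB` in the beam lets the search find `rB | rC`, which classifies every row correctly.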
7. Analyze the Best Ruleset#
Generate a detailed report for the best-performing ruleset from the beam search combination:
[11]:
for r in M_beam["rule"][0].split(" | "):
    print(r)
((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0))
((X["FarePerPerson_div"] >= 9.5) & (X["Sex"] >= 1.52977) & (X["Ticket__length"] >= 5.0) & (X["Fare"] < 151.55))
((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447))
((X["Title"] >= 0.77539) & (X["Fare"] >= 31.3875) & (X["Fare"] < 151.55) & (X["Ticket__length"] < 7.0))
[12]:
ruleset = M_beam["rule"][0]
print(f"Selected ruleset: {ruleset}")
report = generate_rule_performance_report(ruleset, X_train_transformed, y_train)
report
Selected ruleset: ((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0)) | ((X["FarePerPerson_div"] >= 9.5) & (X["Sex"] >= 1.52977) & (X["Ticket__length"] >= 5.0) & (X["Fare"] < 151.55)) | ((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447)) | ((X["Title"] >= 0.77539) & (X["Fare"] >= 31.3875) & (X["Fare"] < 151.55) & (X["Ticket__length"] < 7.0))
[12]:
| rule_index | rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "0" | "((X["Title"] >= 0.25029) & (X[… | 255 | 58 | 491 | 87 | 0.814696 | 0.745614 | 0.837262 | 35.129068 | 10.564663 | 0.81028 | 0.799875 | 0.778626 | 0.765589 | 0.758477 | 4 |
| "0.0" | "(X['Title'] >= 0.25029) & (X['… | 239 | 57 | 492 | 103 | 0.807432 | 0.69883 | 0.820426 | 33.2211 | 10.382514 | 0.800118 | 0.783093 | 0.749216 | 0.729 | 0.718149 | 1 |
| "0.1" | "(X['FarePerPerson_div'] >= 9.5… | 127 | 6 | 543 | 215 | 0.954887 | 0.371345 | 0.751964 | 14.927048 | 1.092896 | 0.874089 | 0.726545 | 0.534737 | 0.457341 | 0.423051 | 1 |
| "0.2" | "(X['Title'] >= 0.77539) & (X['… | 166 | 9 | 540 | 176 | 0.948571 | 0.48538 | 0.792368 | 19.640853 | 1.639344 | 0.898154 | 0.796545 | 0.642166 | 0.571202 | 0.537913 | 1 |
| "0.3" | "(X['Title'] >= 0.77539) & (X['… | 59 | 1 | 548 | 283 | 0.983333 | 0.172515 | 0.681257 | 6.734007 | 0.182149 | 0.770353 | 0.506873 | 0.293532 | 0.231163 | 0.206583 | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "0.2.1" | "(X['Pclass'] >= 0.36447)" | 223 | 177 | 372 | 119 | 0.5575 | 0.652047 | 0.667789 | 44.893378 | 32.240437 | 0.562296 | 0.57415 | 0.601078 | 0.619709 | 0.630656 | 1 |
| "0.3.0" | "(X['Title'] >= 0.77539)" | 249 | 98 | 451 | 93 | 0.717579 | 0.72807 | 0.785634 | 38.945006 | 17.850638 | 0.718188 | 0.719653 | 0.722787 | 0.72481 | 0.725948 | 1 |
| "0.3.1" | "(X['Fare'] >= 31.3875)" | 129 | 86 | 463 | 213 | 0.6 | 0.377193 | 0.664422 | 24.130191 | 15.664845 | 0.579852 | 0.536606 | 0.463196 | 0.425851 | 0.407454 | 1 |
| "0.3.2" | "(X['Fare'] < 151.55)" | 322 | 540 | 9 | 20 | 0.37355 | 0.94152 | 0.371493 | 96.74523 | 98.360656 | 0.387293 | 0.424802 | 0.534884 | 0.641434 | 0.721973 | 1 |
| "0.3.3" | "(X['Ticket__length'] < 7.0)" | 252 | 401 | 148 | 90 | 0.385911 | 0.736842 | 0.448934 | 73.28844 | 73.041894 | 0.397034 | 0.42654 | 0.506533 | 0.575747 | 0.623454 | 1 |
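The metric columns in the report can be reproduced from the four confusion counts. The sketch below does this for the first row (TP=255, FP=58, TN=491, FN=87). The helper `metrics_from_counts` is hypothetical, and the reading of `good_flagged(%)` as the share of non-survivors that the rule flags is inferred from the numbers rather than taken from the Iguanas documentation.

```python
def metrics_from_counts(tp, fp, tn, fn, betas=(0.5, 1, 2)):
    """Derive precision, recall, accuracy, flag rates, and F-beta scores
    from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    out = {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / total,
        "flagged(%)": 100 * (tp + fp) / total,
        # Assumed meaning: percentage of negatives (non-survivors) that are flagged.
        "good_flagged(%)": 100 * fp / (fp + tn),
    }
    for beta in betas:
        b2 = beta**2
        # F-beta weights recall beta times as much as precision.
        out[f"f{beta}"] = (1 + b2) * precision * recall / (b2 * precision + recall)
    return out


row = metrics_from_counts(tp=255, fp=58, tn=491, fn=87)
for name, value in row.items():
    print(f"{name}: {value:.6f}")
```

The values agree with the first report row to the displayed precision, for example precision 0.814696, recall 0.745614, and accuracy 0.837262.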
8. Generate Predictions on Test Data#
Apply the same preprocessing pipeline to the test data, then use the best ruleset to generate predictions:
[13]:
X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
X_test_transformed = pipe.transform(X_test)
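# Evaluate the rule string with X bound to the transformed test data.
# Binding the name avoids rewriting the string, which would also alter any column name containing "X".
y_pred = eval(ruleset, {"X": X_test_transformed})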
[Pipeline] transform 1/17 · IsNull | in: rows=418 cols=11 nulls=414 → out: rows=418 cols=13 nulls=414 (0.000s)
[Pipeline] transform 2/17 · Length | in: rows=418 cols=13 nulls=414 → out: rows=418 cols=14 nulls=414 (0.000s)
[Pipeline] transform 3/17 · SplitExtractName | in: rows=418 cols=14 nulls=414 → out: rows=418 cols=14 nulls=414 (0.001s)
[Pipeline] transform 4/17 · SplitExtractTitle | in: rows=418 cols=14 nulls=414 → out: rows=418 cols=14 nulls=414 (0.001s)
[Pipeline] transform 5/17 · MathFeatures | in: rows=418 cols=14 nulls=414 → out: rows=418 cols=15 nulls=414 (0.000s)
[Pipeline] transform 6/17 · ScalarMathFeatures | in: rows=418 cols=15 nulls=414 → out: rows=418 cols=16 nulls=414 (0.000s)
[Pipeline] transform 7/17 · ExtractSubstring | in: rows=418 cols=16 nulls=414 → out: rows=418 cols=17 nulls=741 (0.000s)
[Pipeline] transform 8/17 · RenameColumns | in: rows=418 cols=17 nulls=741 → out: rows=418 cols=17 nulls=741 (0.000s)
[Pipeline] transform 9/17 · RareCategoryEncoder | in: rows=418 cols=17 nulls=741 → out: rows=418 cols=17 nulls=741 (0.001s)
[Pipeline] transform 10/17 · MathFeatures2 | in: rows=418 cols=17 nulls=741 → out: rows=418 cols=18 nulls=742 (0.000s)
[Pipeline] transform 11/17 · ConditionFeatures | in: rows=418 cols=18 nulls=742 → out: rows=418 cols=19 nulls=742 (0.000s)
[Pipeline] transform 12/17 · DropColumns | in: rows=418 cols=19 nulls=742 → out: rows=418 cols=16 nulls=415 (0.000s)
[Pipeline] transform 13/17 · NumericImputer | in: rows=418 cols=16 nulls=415 → out: rows=418 cols=16 nulls=327 (0.000s)
[Pipeline] transform 14/17 · StringImputer | in: rows=418 cols=16 nulls=327 → out: rows=418 cols=16 nulls=0 (0.000s)
[Pipeline] transform 15/17 · CustomDiscretizer | in: rows=418 cols=16 nulls=0 → out: rows=418 cols=16 nulls=0 (0.000s)
[Pipeline] transform 16/17 · CastColumns | in: rows=418 cols=16 nulls=0 → out: rows=418 cols=16 nulls=0 (0.000s)
[Pipeline] transform 17/17 · WOEEncoder | in: rows=418 cols=16 nulls=0 → out: rows=418 cols=16 nulls=0 (0.001s)
[14]:
# Create submission file (Kaggle leaderboard score: 0.78)
# Note: up from 0.60 without feature engineering (+0.18 absolute, about +30% relative)
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")