Titanic Survival Prediction using Iguanas with Feature Engineering

This notebook walks through an end-to-end example of rule-based classification with Iguanas on the Kaggle Titanic dataset, with feature engineering done using the Gators library.

The workflow includes:

Loading and exploring the data
Feature engineering using Gators (new columns, transformations, encoding)
Generating candidate rules using XGBoost
Filtering and selecting high-quality rules
Combining rules using different strategies
Generating predictions for submission

1. Import Libraries

[1]:

import numpy as np
import polars as pl
from gators.data_cleaning import CastColumns, DropColumns, RenameColumns
from gators.discretizers import CustomDiscretizer
from gators.encoders import RareCategoryEncoder, WOEEncoder
from gators.feature_generation import ConditionFeatures, IsNull, MathFeatures, ScalarMathFeatures
from gators.feature_generation_str import (
    ExtractSubstring,
    Length,
    SplitExtract,
)
from gators.imputers import NumericImputer, StringImputer
from gators.pipeline import Pipeline
from xgboost import XGBClassifier

from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_combination import (
    combine_rules_beam_search,
    combine_rules_cumulative,
    combine_rules_greedy,
)
from iguanas.rule_evaluation import apply_rules
from iguanas.rule_generation import rule_grid_search
from iguanas.rule_selection import filter_correlated_rules

2. Load and Prepare Data

Load the Titanic training data and separate features from the target variable (Survived).

[2]:

train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]

3. Feature Engineering with Gators

Build a feature engineering pipeline using the Gators library. This pipeline will:

Create missing value indicators for Age and Cabin
Extract string features (name titles, cabin deck, ticket length)
Calculate family size and fare per person
Group rare categories
Create categorical bins for age
Apply Weight of Evidence (WOE) encoding for all categorical variables

Key transformations in this pipeline:

Missing value handling: Create indicators for missing Age/Cabin, then impute (mean for numeric columns, the constant "MISSING" for string columns)
Feature extraction: Extract passenger titles from names, cabin deck letters, ticket lengths
Feature creation: Calculate family size, fare per person, traveling alone indicator
Rare categories: Group infrequent category values before encoding
Discretization: Convert continuous Age into categorical bins
Encoding: Apply WOE encoding to convert all categorical features to numeric values
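
To make the derived features concrete, the following plain-Python sketch hand-computes a few of them for a single passenger. It does not use Gators: the helper name `derive_features`, the sample record, and the integer bin index are illustrative assumptions, and Gators may label bins or treat bin edges differently.

```python
from bisect import bisect_right

AGE_BINS = [0, 12, 18, 35, 60, 100]  # same edges as the CustomDiscretizer step


def derive_features(passenger: dict) -> dict:
    """Hand-compute a few engineered features for one passenger (illustrative only)."""
    # "Braund, Mr. Owen Harris" -> "Mr. Owen Harris" -> "Mr"
    title = passenger["Name"].split(", ")[1].split(".")[0]
    family_size = passenger["SibSp"] + passenger["Parch"] + 1
    cabin = passenger["Cabin"]
    age = passenger["Age"]
    return {
        "Title": title,
        "FamilySize": family_size,
        "FarePerPerson": passenger["Fare"] / family_size,
        "CabinDeck": cabin[0] if cabin else None,
        "Ticket__length": len(passenger["Ticket"]),
        # Index of the age interval; the edge handling here is an assumption.
        "AgeBin": None if age is None else bisect_right(AGE_BINS, age) - 1,
    }


# Sample record in the shape of the Titanic columns used above.
sample = {
    "Name": "Braund, Mr. Owen Harris",
    "SibSp": 1,
    "Parch": 0,
    "Fare": 7.25,
    "Cabin": None,
    "Ticket": "A/5 21171",
    "Age": 22.0,
}
features = derive_features(sample)
print(features)
```

The two chained splits mirror the two SplitExtract steps in the pipeline: the first keeps the part after ", ", the second keeps the part before ".".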

[3]:

# Define the feature engineering pipeline
steps = [
    # Create missing value indicators
    ("IsNull", IsNull(subset=["Age", "Cabin"])),
    # String feature engineering
    ("Length", Length(subset=["Ticket"])),
    ("SplitExtractName", SplitExtract(subset=["Name"], by=", ", n=1)),
    ("SplitExtractTitle", SplitExtract(subset=["Name__split_,__1"], by=".", n=0)),
    # Calculate family size (SibSp + Parch + 1)
    (
        "MathFeatures",
        MathFeatures(groups=[["SibSp", "Parch"]], operations=["sum"], new_column_names=["Dummy"]),
    ),
    (
        "ScalarMathFeatures",
        ScalarMathFeatures(
            operations=[{"column": "Dummy_sum", "op": "+", "scalar": 1}],
            new_column_names=["FamilySize"],
        ),
    ),
    # Extract cabin deck (first letter of cabin)
    ("ExtractSubstring", ExtractSubstring(subset=["Cabin"], start=0, end=1)),
    # Rename for clarity
    (
        "RenameColumns",
        RenameColumns(
            column_mapping={
                "Name__split_,__1__split_._0": "Title",
                "Cabin__start0_end1": "CabinDeck",
            }
        ),
    ),
    # Handle rare categories (group infrequent values)
    ("RareCategoryEncoder", RareCategoryEncoder(min_count=0.01)),
    # Calculate fare per person
    (
        "MathFeatures2",
        MathFeatures(
            groups=[["Fare", "FamilySize"]], operations=["div"], new_column_names=["FarePerPerson"]
        ),
    ),
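    # Flag passengers with FamilySize > 1. Note: despite the column name IsAlone,
    # this condition is true for passengers traveling WITH family, not alone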
    (
        "ConditionFeatures",
        ConditionFeatures(
            conditions=[{"column": "FamilySize", "op": ">", "value": 1}],
            new_column_names=["IsAlone"],
        ),
    ),
    # Drop raw columns no longer needed
    ("DropColumns", DropColumns(subset=["Cabin", "Ticket", "Dummy_sum"])),
    # Impute missing values
    ("NumericImputer", NumericImputer(strategy="mean")),
    ("StringImputer", StringImputer(strategy="constant", value="MISSING")),
    # Discretize age into bins
    ("CustomDiscretizer", CustomDiscretizer(bins={"Age": [0, 12, 18, 35, 60, 100]}, inplace=True)),
    # Convert passenger class to categorical
    ("CastColumns", CastColumns(subset=["Pclass"], dtype=pl.String)),
    # Apply Weight of Evidence encoding (converts all categorical features to numeric)
    ("WOEEncoder", WOEEncoder()),
]

# Build and fit the pipeline
pipe = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipe.fit_transform(X_train, y_train)

print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")

[Pipeline] fit+transform   1/17 · IsNull  |  in: rows=891  cols=10  nulls=866  →  out: rows=891  cols=12  nulls=866  (0.001s)
[Pipeline] fit+transform   2/17 · Length  |  in: rows=891  cols=12  nulls=866  →  out: rows=891  cols=13  nulls=866  (0.001s)
[Pipeline] fit+transform   3/17 · SplitExtractName  |  in: rows=891  cols=13  nulls=866  →  out: rows=891  cols=13  nulls=866  (0.001s)
[Pipeline] fit+transform   4/17 · SplitExtractTitle  |  in: rows=891  cols=13  nulls=866  →  out: rows=891  cols=13  nulls=866  (0.001s)
[Pipeline] fit+transform   5/17 · MathFeatures  |  in: rows=891  cols=13  nulls=866  →  out: rows=891  cols=14  nulls=866  (0.000s)
[Pipeline] fit+transform   6/17 · ScalarMathFeatures  |  in: rows=891  cols=14  nulls=866  →  out: rows=891  cols=15  nulls=866  (0.000s)
[Pipeline] fit+transform   7/17 · ExtractSubstring  |  in: rows=891  cols=15  nulls=866  →  out: rows=891  cols=16  nulls=1553  (0.000s)
[Pipeline] fit+transform   8/17 · RenameColumns  |  in: rows=891  cols=16  nulls=1553  →  out: rows=891  cols=16  nulls=1553  (0.000s)
[Pipeline] fit+transform   9/17 · RareCategoryEncoder  |  in: rows=891  cols=16  nulls=1553  →  out: rows=891  cols=16  nulls=1551  (0.002s)
[Pipeline] fit+transform   10/17 · MathFeatures2  |  in: rows=891  cols=16  nulls=1551  →  out: rows=891  cols=17  nulls=1551  (0.000s)
[Pipeline] fit+transform   11/17 · ConditionFeatures  |  in: rows=891  cols=17  nulls=1551  →  out: rows=891  cols=18  nulls=1551  (0.000s)
[Pipeline] fit+transform   12/17 · DropColumns  |  in: rows=891  cols=18  nulls=1551  →  out: rows=891  cols=15  nulls=864  (0.000s)
[Pipeline] fit+transform   13/17 · NumericImputer  |  in: rows=891  cols=15  nulls=864  →  out: rows=891  cols=15  nulls=687  (0.000s)
[Pipeline] fit+transform   14/17 · StringImputer  |  in: rows=891  cols=15  nulls=687  →  out: rows=891  cols=15  nulls=0  (0.000s)
[Pipeline] fit+transform   15/17 · CustomDiscretizer  |  in: rows=891  cols=15  nulls=0  →  out: rows=891  cols=15  nulls=0  (0.000s)
[Pipeline] fit+transform   16/17 · CastColumns  |  in: rows=891  cols=15  nulls=0  →  out: rows=891  cols=15  nulls=0  (0.000s)
[Pipeline] fit+transform   17/17 · WOEEncoder  |  in: rows=891  cols=15  nulls=0  →  out: rows=891  cols=15  nulls=0  (0.005s)

Original features: 10
Engineered features: 15
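
As a rough illustration of the final WOE step, the sketch below computes a Weight of Evidence mapping in plain Python. It assumes the common definition ln(share of positives in the category / share of negatives in the category) with additive smoothing; the function name `woe_mapping`, the smoothing constant, and the toy data are assumptions, and the Gators WOEEncoder may use a different sign convention or regularization.

```python
import math
from collections import Counter


def woe_mapping(categories, target, smoothing=0.5):
    """Map each category to ln(share of positives / share of negatives)."""
    pos = Counter(c for c, y in zip(categories, target) if y == 1)
    neg = Counter(c for c, y in zip(categories, target) if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    levels = set(pos) | set(neg)
    k = len(levels)
    mapping = {}
    for c in levels:
        # Smoothing keeps the ratio finite when a category has no positives or no negatives.
        p = (pos[c] + smoothing) / (n_pos + smoothing * k)
        q = (neg[c] + smoothing) / (n_neg + smoothing * k)
        mapping[c] = math.log(p / q)
    return mapping


# Made-up toy data, not taken from the Titanic file.
sex = ["female", "female", "female", "male", "male", "male", "male", "male"]
survived = [1, 1, 0, 0, 0, 0, 1, 0]
woe = woe_mapping(sex, survived)
encoded = [woe[s] for s in sex]
print(woe)
```

Under this convention, categories with a higher survival rate get larger values, which is why the rules later in the notebook can threshold encoded columns such as Title, Sex, and Pclass with numeric comparisons.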

4. Generate Candidate Rules

Use XGBoost-based grid search to generate candidate rules from the engineered features. The rule_grid_search function trains models with different scale_pos_weight values (here, 50 values log-spaced from 1 to 1000) and extracts rules from the decision trees.

[4]:

estimator = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss", random_state=0)
rules = rule_grid_search(
    estimator, X_train_transformed, y_train, scale_pos_weights=np.logspace(0, 3, 50)
)

[5]:

print(f"Number of rules generated: {len(rules)}")

Number of rules generated: 1971
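
The internals of rule extraction are not shown in this notebook. One plausible reading, sketched below in plain Python, is that every root-to-leaf path ending in a positive leaf becomes a conjunction of threshold conditions. The nested-dict tree format, the `extract_rules` helper, and the "positive leaf value" criterion are all assumptions for illustration, not the Iguanas implementation.

```python
def extract_rules(node, conditions=()):
    """Return one rule string per leaf with a positive score (hypothetical criterion)."""
    if "leaf" in node:
        if node["leaf"] > 0 and conditions:
            return ["(" + " & ".join(conditions) + ")"]
        return []
    feature, threshold = node["feature"], node["threshold"]
    # Left branch: feature < threshold; right branch: feature >= threshold.
    left = extract_rules(node["left"], conditions + (f'(X["{feature}"] < {threshold})',))
    right = extract_rules(node["right"], conditions + (f'(X["{feature}"] >= {threshold})',))
    return left + right


# Toy tree with made-up thresholds and leaf scores.
toy_tree = {
    "feature": "Title",
    "threshold": 0.25,
    "left": {"leaf": -0.4},
    "right": {
        "feature": "FamilySize",
        "threshold": 5.0,
        "left": {"leaf": 0.6},
        "right": {"leaf": -0.2},
    },
}
toy_rules = extract_rules(toy_tree)
print(toy_rules)
```

Repeating this over every tree of every model in the scale_pos_weight grid is one way a search like this could end up with a large pool of candidates (1971 in the run above), many of them overlapping.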

5. Select High-Quality Rules

Apply the generated rules to the training data, compute performance metrics, and filter based on:

Minimum precision (> 0.15)
Minimum recall (> 0.15)
Maximum correlation between rules (< 0.8)

This ensures we keep only the most useful and diverse rules.

[6]:

R = apply_rules(X_train_transformed, rules.select("rule").to_series().to_list())
M = compute_metrics(R, y_train)
M = M.filter((pl.col("precision") > 0.15) & (pl.col("recall") > 0.15)).sort(
    "accuracy", descending=True
)
importance = dict(zip(M["rule"], M["f0.5"], strict=False))
uncorrelated_rules = filter_correlated_rules(
    R[M["rule"].to_list()], importance=importance, max_corr=0.8
)

[7]:

num_rules = len(uncorrelated_rules)
print(f"Number of selected rules: {num_rules}")

Number of selected rules: 33
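
The correlation filter can be pictured as a greedy pass over rules sorted by importance, keeping a rule only if it is not too correlated with anything already kept. The sketch below shows that idea in plain Python; it is an assumption about the general technique, not the source of filter_correlated_rules, and the helper names, the use of absolute Pearson correlation, and the toy flags are all hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length 0/1 sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def filter_correlated(flags, importance, max_corr=0.8):
    """Keep rules in descending importance, dropping any too correlated with a kept rule."""
    kept = []
    for name in sorted(flags, key=importance.get, reverse=True):
        if all(abs(pearson(flags[name], flags[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# Toy rule outputs: rule_b is a near-duplicate of rule_a.
flags = {
    "rule_a": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "rule_b": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "rule_c": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
}
toy_importance = {"rule_a": 0.9, "rule_b": 0.8, "rule_c": 0.5}
kept = filter_correlated(flags, toy_importance)
print(kept)
```

In the notebook the importance passed to the filter is each rule's F0.5 score, so when two rules flag nearly the same passengers, the more precision-oriented one is the one that survives.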

6. Combine Rules

Test different rule combination strategies to find the best performing ruleset.

6.1 Cumulative Combination

Combines rules cumulatively (rule1 OR rule2 OR … OR ruleN):

[8]:

R_combined = combine_rules_cumulative(
    R[uncorrelated_rules], output_names=[f"combined_rule_{i}" for i in range(1, num_rules + 1)]
)
M_combined = compute_metrics(R_combined, y_train).sort("accuracy", descending=True)
M_combined.head(3)

[8]:

shape: (3, 16)

rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"combined_rule_4"	252	57	492	90	0.815534	0.736842	0.835017	34.680135	10.382514	0.810443	0.798479	0.774194	0.759388	0.751342	1
"combined_rule_5"	252	59	490	90	0.810289	0.736842	0.832772	34.904602	10.746812	0.805566	0.794451	0.771822	0.757982	0.750447	1
"combined_rule_6"	252	59	490	90	0.810289	0.736842	0.832772	34.904602	10.746812	0.805566	0.794451	0.771822	0.757982	0.750447	1
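
A minimal plain-Python sketch of the cumulative idea: take the rules in their given order and keep a running OR, scoring each prefix. The helper names and toy flags are assumptions; the real function operates on a Polars DataFrame and may order or name things differently.

```python
def cumulative_or(rule_flags):
    """Return the running ORs: [r1, r1|r2, r1|r2|r3, ...]."""
    combined, running = [], None
    for flags in rule_flags:
        if running is None:
            running = list(flags)
        else:
            running = [a or b for a, b in zip(running, flags)]
        combined.append(running)
    return combined


def accuracy(pred, y):
    """Fraction of rows where the flag matches the label."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)


# Toy labels and rule outputs (made up for illustration).
y = [1, 1, 1, 0, 0, 0]
rule_flags = [
    [True, False, False, False, False, False],
    [False, True, False, False, False, False],
    [False, False, True, True, True, False],
]
scores = [accuracy(c, y) for c in cumulative_or(rule_flags)]
print(scores)
```

Each added rule can only increase the set of flagged rows, so recall never decreases along the sequence while false positives accumulate. Accuracy can therefore peak at an intermediate prefix, as in the toy scores here and in the table above, where combined_rule_4 ranks first.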

6.2 Greedy Search

Uses a greedy algorithm to iteratively select the best rule combination:

[9]:

R_greedy = combine_rules_greedy(R[uncorrelated_rules], y_train, metric="accuracy")
M_greedy = compute_metrics(R_greedy, y_train)
M_greedy

[9]:

shape: (1, 16)

rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"((X["Title"] >= 0.25029) & (X[…	255	58	491	87	0.814696	0.745614	0.837262	35.129068	10.564663	0.81028	0.799875	0.778626	0.765589	0.758477	5
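
The greedy strategy can be sketched as forward selection: start with nothing flagged, repeatedly OR in whichever remaining rule improves the metric most, and stop when no rule helps. This is a plain-Python illustration under that assumption; `greedy_or`, the stopping criterion, and the toy data are hypothetical and not taken from combine_rules_greedy.

```python
def accuracy(pred, y):
    """Fraction of rows where the flag matches the label."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)


def greedy_or(rule_flags, y, score):
    """Add the rule that most improves the score until nothing improves it."""
    chosen, current = [], [False] * len(y)
    best = score(current, y)
    while True:
        candidates = []
        for name, flags in rule_flags.items():
            if name in chosen:
                continue
            trial = [a or b for a, b in zip(current, flags)]
            candidates.append((score(trial, y), name, trial))
        if not candidates:
            break
        top_score, top_name, top_trial = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no remaining rule improves the metric
        chosen.append(top_name)
        current, best = top_trial, top_score
    return chosen, best


# Toy labels and rule outputs (made up for illustration).
y = [1, 1, 1, 0, 0, 0]
rule_flags = {
    "r1": [True, True, False, False, False, False],
    "r2": [False, False, True, False, False, False],
    "r3": [True, True, True, True, True, False],
}
chosen, best = greedy_or(rule_flags, y, accuracy)
print(chosen, best)
```

Greedy selection commits to each choice, so it returns a single ruleset, consistent with the one-row result above.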

6.3 Beam Search

Uses beam search to explore rule combinations up to a maximum number of rules:

[10]:

R_beam = combine_rules_beam_search(R[uncorrelated_rules], y_train, metric="accuracy", max_rules=10)
M_beam = compute_metrics(R_beam, y_train)
M_beam.head(3)

[10]:

shape: (3, 16)

rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"((X["Title"] >= 0.25029) & (X[…	255	58	491	87	0.814696	0.745614	0.837262	35.129068	10.564663	0.81028	0.799875	0.778626	0.765589	0.758477	4
"((X["Title"] >= 0.25029) & (X[…	255	58	491	87	0.814696	0.745614	0.837262	35.129068	10.564663	0.81028	0.799875	0.778626	0.765589	0.758477	5
"((X["Title"] >= 0.25029) & (X[…	255	58	491	87	0.814696	0.745614	0.837262	35.129068	10.564663	0.81028	0.799875	0.778626	0.765589	0.758477	5
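
Beam search generalizes the greedy approach by carrying several partial combinations forward at each step instead of one. The plain-Python sketch below illustrates the technique; the `beam_search_or` helper, its `beam_width` parameter, and the toy data are assumptions, and combine_rules_beam_search may differ in how it scores, prunes, and reports candidates.

```python
def accuracy(pred, y):
    """Fraction of rows where the flag matches the label."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)


def beam_search_or(rule_flags, y, score, beam_width=2, max_rules=3):
    """Grow OR-combinations one rule at a time, keeping the best few at each size."""
    beam = [((), [False] * len(y))]
    results = {}
    for _ in range(max_rules):
        expanded = {}
        for combo, flags in beam:
            for name, new_flags in rule_flags.items():
                if name in combo:
                    continue
                new_combo = tuple(sorted(combo + (name,)))
                if new_combo not in expanded:
                    expanded[new_combo] = [a or b for a, b in zip(flags, new_flags)]
        if not expanded:
            break
        ranked = sorted(expanded.items(), key=lambda kv: score(kv[1], y), reverse=True)
        beam = ranked[:beam_width]  # prune to the best candidates of this size
        for combo, flags in beam:
            results[combo] = score(flags, y)
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


# Toy labels and rule outputs (made up for illustration).
y = [1, 1, 1, 0, 0, 0]
rule_flags = {
    "r1": [True, True, False, False, False, False],
    "r2": [False, False, True, False, False, False],
    "r3": [True, True, True, True, True, False],
}
ranked_results = beam_search_or(rule_flags, y, accuracy)
print(ranked_results[0])
```

Because several candidates survive each step, the search returns a ranked list of rulesets. In the run above the top beam result matches the greedy accuracy (0.837262) with 4 rules instead of 5.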

7. Analyze the Best Ruleset

Generate a detailed report for the best performing ruleset from the beam search combination:

[11]:

for r in M_beam["rule"][0].split(" | "):
    print(r)

((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0))
((X["FarePerPerson_div"] >= 9.5) & (X["Sex"] >= 1.52977) & (X["Ticket__length"] >= 5.0) & (X["Fare"] < 151.55))
((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447))
((X["Title"] >= 0.77539) & (X["Fare"] >= 31.3875) & (X["Fare"] < 151.55) & (X["Ticket__length"] < 7.0))

[12]:

ruleset = M_beam["rule"][0]
print(f"Selected ruleset: {ruleset}")
report = generate_rule_performance_report(ruleset, X_train_transformed, y_train)
report

Selected ruleset: ((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0)) | ((X["FarePerPerson_div"] >= 9.5) & (X["Sex"] >= 1.52977) & (X["Ticket__length"] >= 5.0) & (X["Fare"] < 151.55)) | ((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447)) | ((X["Title"] >= 0.77539) & (X["Fare"] >= 31.3875) & (X["Fare"] < 151.55) & (X["Ticket__length"] < 7.0))

[12]:

shape: (17, 17)

rule_index	rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"0"	"((X["Title"] >= 0.25029) & (X[…	255	58	491	87	0.814696	0.745614	0.837262	35.129068	10.564663	0.81028	0.799875	0.778626	0.765589	0.758477	4
"0.0"	"(X['Title'] >= 0.25029) & (X['…	239	57	492	103	0.807432	0.69883	0.820426	33.2211	10.382514	0.800118	0.783093	0.749216	0.729	0.718149	1
"0.1"	"(X['FarePerPerson_div'] >= 9.5…	127	6	543	215	0.954887	0.371345	0.751964	14.927048	1.092896	0.874089	0.726545	0.534737	0.457341	0.423051	1
"0.2"	"(X['Title'] >= 0.77539) & (X['…	166	9	540	176	0.948571	0.48538	0.792368	19.640853	1.639344	0.898154	0.796545	0.642166	0.571202	0.537913	1
"0.3"	"(X['Title'] >= 0.77539) & (X['…	59	1	548	283	0.983333	0.172515	0.681257	6.734007	0.182149	0.770353	0.506873	0.293532	0.231163	0.206583	1
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"0.2.1"	"(X['Pclass'] >= 0.36447)"	223	177	372	119	0.5575	0.652047	0.667789	44.893378	32.240437	0.562296	0.57415	0.601078	0.619709	0.630656	1
"0.3.0"	"(X['Title'] >= 0.77539)"	249	98	451	93	0.717579	0.72807	0.785634	38.945006	17.850638	0.718188	0.719653	0.722787	0.72481	0.725948	1
"0.3.1"	"(X['Fare'] >= 31.3875)"	129	86	463	213	0.6	0.377193	0.664422	24.130191	15.664845	0.579852	0.536606	0.463196	0.425851	0.407454	1
"0.3.2"	"(X['Fare'] < 151.55)"	322	540	9	20	0.37355	0.94152	0.371493	96.74523	98.360656	0.387293	0.424802	0.534884	0.641434	0.721973	1
"0.3.3"	"(X['Ticket__length'] < 7.0)"	252	401	148	90	0.385911	0.736842	0.448934	73.28844	73.041894	0.397034	0.42654	0.506533	0.575747	0.623454	1
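
The metric columns in these tables can be reproduced from the confusion counts alone. The sketch below recomputes the values for the full ruleset (row "0": TP=255, FP=58, TN=491, FN=87). The `rule_metrics` helper is illustrative, and the formulas for flagged(%) and good_flagged(%) are inferred from the numbers in the table rather than taken from the Iguanas source.

```python
def rule_metrics(tp, fp, tn, fn, betas=(0.25, 0.5, 1, 1.5, 2)):
    """Compute rule metrics from confusion counts (assumes tp + fp > 0 and tp + fn > 0)."""
    total = tp + fp + tn + fn
    metrics = {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / total,
        # Share of all rows flagged by the rule.
        "flagged(%)": 100 * (tp + fp) / total,
        # Share of negative rows flagged by the rule (the false positive rate).
        "good_flagged(%)": 100 * fp / (fp + tn),
    }
    for beta in betas:
        b2 = beta**2
        # F-beta: beta < 1 weights precision more, beta > 1 weights recall more.
        metrics[f"f{beta}"] = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
    return metrics


m = rule_metrics(tp=255, fp=58, tn=491, fn=87)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

Comparing the component rows with row "0" shows why the OR combination helps: the individual conjunctions are precise but have low recall (for example 0.954887 precision at 0.371345 recall for rule "0.1"), and their union recovers recall while keeping precision above 0.81.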

8. Generate Predictions on Test Data

Apply the same preprocessing pipeline to the test data, then use the best ruleset to generate predictions:

[13]:

X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
X_test_transformed = pipe.transform(X_test)
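# Evaluate the ruleset string with X bound to the transformed test data.
# An explicit namespace avoids text substitution, which would also rewrite
# any "X" that happened to appear inside a column name.
y_pred = eval(ruleset, {"X": X_test_transformed})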

[Pipeline] transform   1/17 · IsNull  |  in: rows=418  cols=11  nulls=414  →  out: rows=418  cols=13  nulls=414  (0.000s)
[Pipeline] transform   2/17 · Length  |  in: rows=418  cols=13  nulls=414  →  out: rows=418  cols=14  nulls=414  (0.000s)
[Pipeline] transform   3/17 · SplitExtractName  |  in: rows=418  cols=14  nulls=414  →  out: rows=418  cols=14  nulls=414  (0.001s)
[Pipeline] transform   4/17 · SplitExtractTitle  |  in: rows=418  cols=14  nulls=414  →  out: rows=418  cols=14  nulls=414  (0.001s)
[Pipeline] transform   5/17 · MathFeatures  |  in: rows=418  cols=14  nulls=414  →  out: rows=418  cols=15  nulls=414  (0.000s)
[Pipeline] transform   6/17 · ScalarMathFeatures  |  in: rows=418  cols=15  nulls=414  →  out: rows=418  cols=16  nulls=414  (0.000s)
[Pipeline] transform   7/17 · ExtractSubstring  |  in: rows=418  cols=16  nulls=414  →  out: rows=418  cols=17  nulls=741  (0.000s)
[Pipeline] transform   8/17 · RenameColumns  |  in: rows=418  cols=17  nulls=741  →  out: rows=418  cols=17  nulls=741  (0.000s)
[Pipeline] transform   9/17 · RareCategoryEncoder  |  in: rows=418  cols=17  nulls=741  →  out: rows=418  cols=17  nulls=741  (0.001s)
[Pipeline] transform   10/17 · MathFeatures2  |  in: rows=418  cols=17  nulls=741  →  out: rows=418  cols=18  nulls=742  (0.000s)
[Pipeline] transform   11/17 · ConditionFeatures  |  in: rows=418  cols=18  nulls=742  →  out: rows=418  cols=19  nulls=742  (0.000s)
[Pipeline] transform   12/17 · DropColumns  |  in: rows=418  cols=19  nulls=742  →  out: rows=418  cols=16  nulls=415  (0.000s)
[Pipeline] transform   13/17 · NumericImputer  |  in: rows=418  cols=16  nulls=415  →  out: rows=418  cols=16  nulls=327  (0.000s)
[Pipeline] transform   14/17 · StringImputer  |  in: rows=418  cols=16  nulls=327  →  out: rows=418  cols=16  nulls=0  (0.000s)
[Pipeline] transform   15/17 · CustomDiscretizer  |  in: rows=418  cols=16  nulls=0  →  out: rows=418  cols=16  nulls=0  (0.000s)
[Pipeline] transform   16/17 · CastColumns  |  in: rows=418  cols=16  nulls=0  →  out: rows=418  cols=16  nulls=0  (0.000s)
[Pipeline] transform   17/17 · WOEEncoder  |  in: rows=418  cols=16  nulls=0  →  out: rows=418  cols=16  nulls=0  (0.001s)

[14]:

# Create submission file (Kaggle leaderboard score: 0.78)
# Note: 0.18 higher than without feature engineering (0.60), roughly a 30% relative improvement
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")
