Titanic Survival Prediction using Iguanas with Feature Engineering#

This notebook demonstrates a complete end-to-end example of using Iguanas for rule-based classification on the Kaggle Titanic dataset, with advanced feature engineering using the Gators library.

The workflow includes:

  1. Loading and exploring the data

  2. Feature engineering using Gators (new columns, transformations, encoding)

  3. Generating candidate rules using XGBoost

  4. Filtering and selecting high-quality rules

  5. Combining rules using different strategies

  6. Generating predictions for submission

1. Import Libraries#

[1]:
import numpy as np
import polars as pl
from gators.data_cleaning import CastColumns, DropColumns, RenameColumns
from gators.discretizers import CustomDiscretizer
from gators.encoders import RareCategoryEncoder, WOEEncoder
from gators.feature_generation import ConditionFeatures, IsNull, MathFeatures, ScalarMathFeatures
from gators.feature_generation_str import (
    ExtractSubstring,
    Length,
    SplitExtract,
)
from gators.imputers import NumericImputer, StringImputer
from gators.pipeline import Pipeline
from xgboost import XGBClassifier

from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_combination import (
    combine_rules_beam_search,
    combine_rules_cumulative,
    combine_rules_greedy,
)
from iguanas.rule_evaluation import apply_rules
from iguanas.rule_generation import rule_grid_search
from iguanas.rule_selection import filter_correlated_rules

2. Load and Prepare Data#

Load the Titanic training data and separate features from the target variable (Survived).

[2]:
train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]

3. Feature Engineering with Gators#

Build a comprehensive feature engineering pipeline using the Gators library. This pipeline will:

  • Create missing value indicators for Age and Cabin

  • Extract string features (name titles, cabin deck, ticket length)

  • Calculate family size and fare per person

  • Create categorical bins for age

  • Group rare categories and impute missing values

  • Apply Weight of Evidence (WOE) encoding for all categorical variables

Key transformations in this pipeline:

  1. Missing value handling: Create indicators for missing Age/Cabin, then impute

  2. Feature extraction: Extract passenger titles from names, cabin deck letters, ticket lengths

  3. Feature creation: Calculate family size, fare per person, and an IsAlone flag derived from family size

  4. Discretization: Convert continuous Age into categorical bins

  5. Rare categories: Group infrequent category values before encoding

  6. Encoding: Apply WOE encoding to convert all categorical features to numeric values
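
The string and arithmetic steps listed above can be illustrated on a single record without Gators. The following is a minimal sketch in plain Python, assuming a passenger record in the Kaggle Titanic format; the helper name `derive_features` and the `CabinMissing` key are hypothetical, and the Gators transformers may differ in details such as output column names and null handling.

```python
def derive_features(row: dict) -> dict:
    """Mimic a few of the pipeline steps on one passenger record (illustrative only)."""
    # "Braund, Mr. Owen Harris" -> "Mr. Owen Harris" -> "Mr"
    after_comma = row["Name"].split(", ")[1]
    title = after_comma.split(".")[0]
    # Family size counts siblings/spouses, parents/children, and the passenger
    family_size = row["SibSp"] + row["Parch"] + 1
    cabin = row["Cabin"]
    return {
        "Title": title,
        "FamilySize": family_size,
        "FarePerPerson": row["Fare"] / family_size,
        "CabinDeck": cabin[0] if cabin else None,  # first letter of the cabin
        "CabinMissing": cabin is None,             # hypothetical indicator name
        "Ticket__length": len(row["Ticket"]),
    }


passenger = {
    "Name": "Braund, Mr. Owen Harris",
    "SibSp": 1,
    "Parch": 0,
    "Fare": 7.25,
    "Cabin": None,
    "Ticket": "A/5 21171",
}
features = derive_features(passenger)
print(features)
```

The pipeline performs the same derivations column-wise on the whole DataFrame, which is why it needs intermediate columns and a rename step.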

[3]:
# Define the feature engineering pipeline
steps = [
    # Create missing value indicators
    ("IsNull", IsNull(subset=["Age", "Cabin"])),
    # String feature engineering
    ("Length", Length(subset=["Ticket"])),
    ("SplitExtractName", SplitExtract(subset=["Name"], by=", ", n=1)),
    ("SplitExtractTitle", SplitExtract(subset=["Name__split_,__1"], by=".", n=0)),
    # Calculate family size (SibSp + Parch + 1)
    (
        "MathFeatures",
        MathFeatures(groups=[["SibSp", "Parch"]], operations=["sum"], new_column_names=["Dummy"]),
    ),
    (
        "ScalarMathFeatures",
        ScalarMathFeatures(
            operations=[{"column": "Dummy_sum", "op": "+", "scalar": 1}],
            new_column_names=["FamilySize"],
        ),
    ),
    # Extract cabin deck (first letter of cabin)
    ("ExtractSubstring", ExtractSubstring(subset=["Cabin"], start=0, end=1)),
    # Rename for clarity
    (
        "RenameColumns",
        RenameColumns(
            column_mapping={
                "Name__split_,__1__split_._0": "Title",
                "Cabin__start0_end1": "CabinDeck",
            }
        ),
    ),
    # Handle rare categories (group infrequent values)
    ("RareCategoryEncoder", RareCategoryEncoder(min_count=0.01)),
    # Calculate fare per person
    (
        "MathFeatures2",
        MathFeatures(
            groups=[["Fare", "FamilySize"]], operations=["div"], new_column_names=["FarePerPerson"]
        ),
    ),
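    # Flag passengers with FamilySize > 1. Note: this condition is true for passengers
    # traveling with family, so the column named IsAlone is the inverse of its name.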
    (
        "ConditionFeatures",
        ConditionFeatures(
            conditions=[{"column": "FamilySize", "op": ">", "value": 1}],
            new_column_names=["IsAlone"],
        ),
    ),
    # Drop raw columns no longer needed
    ("DropColumns", DropColumns(subset=["Cabin", "Ticket", "Dummy_sum"])),
    # Impute missing values
    ("NumericImputer", NumericImputer(strategy="mean")),
    ("StringImputer", StringImputer(strategy="constant", value="MISSING")),
    # Discretize age into bins
    ("CustomDiscretizer", CustomDiscretizer(bins={"Age": [0, 12, 18, 35, 60, 100]}, inplace=True)),
    # Convert passenger class to categorical
    ("CastColumns", CastColumns(subset=["Pclass"], dtype=pl.String)),
    # Apply Weight of Evidence encoding (converts all categorical features to numeric)
    ("WOEEncoder", WOEEncoder()),
]

# Build and fit the pipeline
pipe = Pipeline(steps=steps, verbose=True)
X_train_transformed = pipe.fit_transform(X_train, y_train)

print(f"\nOriginal features: {X_train.shape[1]}")
print(f"Engineered features: {X_train_transformed.shape[1]}")
[Pipeline] fit+transform   1/17 · IsNull  |  in: rows=891  cols=10  nulls=866  →  out: rows=891  cols=12  nulls=866  (0.001s)
[Pipeline] fit+transform   2/17 · Length  |  in: rows=891  cols=12  nulls=866  →  out: rows=891  cols=13  nulls=866  (0.001s)
[Pipeline] fit+transform   3/17 · SplitExtractName  |  in: rows=891  cols=13  nulls=866  →  out: rows=891  cols=13  nulls=866  (0.001s)
[Pipeline] fit+transform   4/17 · SplitExtractTitle  |  in: rows=891  cols=13  nulls=866  →  out: rows=891  cols=13  nulls=866  (0.001s)
[Pipeline] fit+transform   5/17 · MathFeatures  |  in: rows=891  cols=13  nulls=866  →  out: rows=891  cols=14  nulls=866  (0.000s)
[Pipeline] fit+transform   6/17 · ScalarMathFeatures  |  in: rows=891  cols=14  nulls=866  →  out: rows=891  cols=15  nulls=866  (0.000s)
[Pipeline] fit+transform   7/17 · ExtractSubstring  |  in: rows=891  cols=15  nulls=866  →  out: rows=891  cols=16  nulls=1553  (0.000s)
[Pipeline] fit+transform   8/17 · RenameColumns  |  in: rows=891  cols=16  nulls=1553  →  out: rows=891  cols=16  nulls=1553  (0.000s)
[Pipeline] fit+transform   9/17 · RareCategoryEncoder  |  in: rows=891  cols=16  nulls=1553  →  out: rows=891  cols=16  nulls=1551  (0.002s)
[Pipeline] fit+transform   10/17 · MathFeatures2  |  in: rows=891  cols=16  nulls=1551  →  out: rows=891  cols=17  nulls=1551  (0.000s)
[Pipeline] fit+transform   11/17 · ConditionFeatures  |  in: rows=891  cols=17  nulls=1551  →  out: rows=891  cols=18  nulls=1551  (0.000s)
[Pipeline] fit+transform   12/17 · DropColumns  |  in: rows=891  cols=18  nulls=1551  →  out: rows=891  cols=15  nulls=864  (0.000s)
[Pipeline] fit+transform   13/17 · NumericImputer  |  in: rows=891  cols=15  nulls=864  →  out: rows=891  cols=15  nulls=687  (0.000s)
[Pipeline] fit+transform   14/17 · StringImputer  |  in: rows=891  cols=15  nulls=687  →  out: rows=891  cols=15  nulls=0  (0.000s)
[Pipeline] fit+transform   15/17 · CustomDiscretizer  |  in: rows=891  cols=15  nulls=0  →  out: rows=891  cols=15  nulls=0  (0.000s)
[Pipeline] fit+transform   16/17 · CastColumns  |  in: rows=891  cols=15  nulls=0  →  out: rows=891  cols=15  nulls=0  (0.000s)
[Pipeline] fit+transform   17/17 · WOEEncoder  |  in: rows=891  cols=15  nulls=0  →  out: rows=891  cols=15  nulls=0  (0.005s)

Original features: 10
Engineered features: 15
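
The final WOEEncoder step replaces each category with a single number. As a rough illustration of the idea, the sketch below computes a Weight of Evidence value per category as the log ratio of the category's share among positives to its share among negatives. The sign convention, the smoothing constant `eps`, and the helper name `woe_table` are assumptions of this sketch; the Gators implementation may differ.

```python
import math
from collections import Counter


def woe_table(categories, target, eps=0.5):
    """Return {category: WOE}, with WOE = ln(P(cat | y=1) / P(cat | y=0)).

    eps is an additive smoothing constant (an assumption of this sketch)
    that avoids division by zero for categories seen in only one class.
    """
    pos = Counter(c for c, y in zip(categories, target) if y == 1)
    neg = Counter(c for c, y in zip(categories, target) if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    table = {}
    for cat in set(categories):
        share_pos = (pos[cat] + eps) / (n_pos + eps)
        share_neg = (neg[cat] + eps) / (n_neg + eps)
        table[cat] = math.log(share_pos / share_neg)
    return table


# Toy data: "female" is mostly positive, "male" mostly negative
sex = ["female", "female", "female", "male", "male", "male", "male", "female"]
survived = [1, 1, 1, 0, 0, 0, 1, 0]
woe = woe_table(sex, survived)
encoded = [woe[s] for s in sex]  # numeric column that replaces the strings
print(woe)
```

Under this convention, a higher encoded value would mean a category with a larger share of survivors, which is what lets tree-based rule generation place numeric thresholds on originally categorical columns.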

4. Generate Candidate Rules#

Use an XGBoost-based grid search to generate candidate rules from the engineered features. The rule_grid_search function trains models with different scale_pos_weight values (here, 50 values log-spaced between 1 and 1000) and extracts rules from the decision trees.
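
To make "extracts rules from the decision trees" concrete, the sketch below turns the root-to-leaf paths of a toy tree into rule strings written in the `X["column"]` syntax used by the rules in this notebook. The nested-dict tree format and the helper `tree_to_rules` are hypothetical and chosen only for illustration; this is not how Iguanas or XGBoost represent trees internally.

```python
def tree_to_rules(node, conditions=()):
    """Yield one rule string per leaf that predicts the positive class."""
    if "leaf" in node:
        if node["leaf"] == 1 and conditions:
            yield " & ".join(f"({c})" for c in conditions)
        return
    col, thr = node["feature"], node["threshold"]
    # Left branch: feature < threshold; right branch: feature >= threshold
    yield from tree_to_rules(node["left"], conditions + (f'X["{col}"] < {thr}',))
    yield from tree_to_rules(node["right"], conditions + (f'X["{col}"] >= {thr}',))


toy_tree = {
    "feature": "Title",
    "threshold": 0.25,
    "left": {"leaf": 0},
    "right": {
        "feature": "FamilySize",
        "threshold": 5.0,
        "left": {"leaf": 1},
        "right": {"leaf": 0},
    },
}
toy_rules = list(tree_to_rules(toy_tree))
print(toy_rules)
```

Repeating this over many trees and many class weights yields a large pool of candidate rules, most of which are redundant or weak, hence the selection step that follows.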

[4]:
estimator = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss", random_state=0)
rules = rule_grid_search(
    estimator, X_train_transformed, y_train, scale_pos_weights=np.logspace(0, 3, 50)
)
[5]:
print(f"Number of rules generated: {len(rules)}")
Number of rules generated: 1971

5. Select High-Quality Rules#

Apply the generated rules to the training data, compute performance metrics, and filter based on:

  • Minimum precision (> 0.15)

  • Minimum recall (> 0.15)

  • Maximum correlation between rules (< 0.8)

This ensures we keep only the most useful and diverse rules.
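
The precision, recall, and F-beta values used for filtering all follow from a rule's confusion-matrix counts. The sketch below recomputes them from the counts that this notebook's output reports for combined_rule_4 (TP=252, FP=57, TN=492, FN=90). The helper names `rule_metrics` and `passes_filter` are hypothetical; this is not the Iguanas `compute_metrics` implementation.

```python
def rule_metrics(tp, fp, tn, fn, beta=0.5):
    """Compute basic rule metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    b2 = beta**2
    denom = b2 * precision + recall
    # F-beta weights precision more heavily than recall when beta < 1
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f_beta": f_beta}


def passes_filter(m, min_precision=0.15, min_recall=0.15):
    """Mirror the notebook's thresholds: both metrics must exceed 0.15."""
    return m["precision"] > min_precision and m["recall"] > min_recall


m = rule_metrics(tp=252, fp=57, tn=492, fn=90)
print({k: round(v, 6) for k, v in m.items()})
print(passes_filter(m))
```

With beta=0.5 the F-score favors precision, which suits a setting where each flagged row should be trustworthy.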

[6]:
R = apply_rules(X_train_transformed, rules.select("rule").to_series().to_list())
M = compute_metrics(R, y_train)
M = M.filter((pl.col("precision") > 0.15) & (pl.col("recall") > 0.15)).sort(
    "accuracy", descending=True
)
importance = dict(zip(M["rule"], M["f0.5"], strict=False))
uncorrelated_rules = filter_correlated_rules(
    R[M["rule"].to_list()], importance=importance, max_corr=0.8
)
[7]:
num_rules = len(uncorrelated_rules)
print(f"Number of selected rules: {num_rules}")
Number of selected rules: 33
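
The correlation filter in the previous cell reduces the candidate pool to a small set of distinct rules. One common way to do this is a greedy pass: visit rules in order of decreasing importance and keep a rule only if its correlation with every rule already kept stays at or below the threshold. The sketch below implements that idea with Pearson correlation on toy flag vectors; the greedy strategy, the correlation measure, and the helper names are assumptions, and `filter_correlated_rules` may work differently.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def filter_correlated(flags, importance, max_corr=0.8):
    """Keep rules greedily by importance, skipping near-duplicates."""
    kept = []
    for name in sorted(flags, key=importance.get, reverse=True):
        if all(abs(pearson(flags[name], flags[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


flags = {
    "rule_a": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "rule_b": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # flags almost the same rows as rule_a
    "rule_c": [0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
}
importance = {"rule_a": 0.80, "rule_b": 0.75, "rule_c": 0.60}
kept = filter_correlated(flags, importance)
print(kept)
```

Here rule_b is dropped because it overlaps heavily with the more important rule_a, while rule_c survives despite its lower importance because it flags different rows.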

6. Combine Rules#

Test different rule combination strategies to find the best performing ruleset.

6.1 Cumulative Combination#

Combines rules cumulatively (rule1 OR rule2 OR … OR ruleN):
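
A minimal sketch of the cumulative idea, using plain Python lists of booleans rather than the Iguanas API: the i-th combined rule flags a row if any of the first i rules flags it. The helper name `combine_cumulative` and the toy flags are hypothetical.

```python
from itertools import accumulate


def combine_cumulative(rule_flags):
    """Return the running OR over per-rule flag lists, in the given order."""
    return list(
        accumulate(rule_flags, lambda acc, cur: [a or c for a, c in zip(acc, cur)])
    )


rule_flags = [
    [True, False, False, False],  # rule 1
    [False, True, False, False],  # rule 2
    [True, False, True, False],   # rule 3
]
combined = combine_cumulative(rule_flags)
for i, flags in enumerate(combined, start=1):
    print(f"combined_rule_{i}: {flags}")
```

Each additional rule can only add flagged rows, so recall never decreases along the sequence while precision may fall; evaluating every prefix shows where that trade-off is best.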

[8]:
R_combined = combine_rules_cumulative(
    R[uncorrelated_rules], output_names=[f"combined_rule_{i}" for i in range(1, num_rules + 1)]
)
M_combined = compute_metrics(R_combined, y_train).sort("accuracy", descending=True)
M_combined.head(3)
[8]:
shape: (3, 16)
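| rule (str) | TP (i64) | FP (i64) | TN (i64) | FN (i64) | precision (f64) | recall (f64) | accuracy (f64) | flagged(%) (f64) | good_flagged(%) (f64) | f0.25 (f64) | f0.5 (f64) | f1 (f64) | f1.5 (f64) | f2 (f64) | num_rules (u32) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "combined_rule_4" | 252 | 57 | 492 | 90 | 0.815534 | 0.736842 | 0.835017 | 34.680135 | 10.382514 | 0.810443 | 0.798479 | 0.774194 | 0.759388 | 0.751342 | 1 |
| "combined_rule_5" | 252 | 59 | 490 | 90 | 0.810289 | 0.736842 | 0.832772 | 34.904602 | 10.746812 | 0.805566 | 0.794451 | 0.771822 | 0.757982 | 0.750447 | 1 |
| "combined_rule_6" | 252 | 59 | 490 | 90 | 0.810289 | 0.736842 | 0.832772 | 34.904602 | 10.746812 | 0.805566 | 0.794451 | 0.771822 | 0.757982 | 0.750447 | 1 |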

7. Analyze the Best Ruleset#

Generate a detailed report for the best-performing ruleset from the beam search combination (M_beam):

[11]:
for r in M_beam["rule"][0].split(" | "):
    print(r)
((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0))
((X["FarePerPerson_div"] >= 9.5) & (X["Sex"] >= 1.52977) & (X["Ticket__length"] >= 5.0) & (X["Fare"] < 151.55))
((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447))
((X["Title"] >= 0.77539) & (X["Fare"] >= 31.3875) & (X["Fare"] < 151.55) & (X["Ticket__length"] < 7.0))
[12]:
ruleset = M_beam["rule"][0]
print(f"Selected ruleset: {ruleset}")
report = generate_rule_performance_report(ruleset, X_train_transformed, y_train)
report
Selected ruleset: ((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0)) | ((X["FarePerPerson_div"] >= 9.5) & (X["Sex"] >= 1.52977) & (X["Ticket__length"] >= 5.0) & (X["Fare"] < 151.55)) | ((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447)) | ((X["Title"] >= 0.77539) & (X["Fare"] >= 31.3875) & (X["Fare"] < 151.55) & (X["Ticket__length"] < 7.0))
[12]:
shape: (17, 17)
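| rule_index (str) | rule (str) | TP (i64) | FP (i64) | TN (i64) | FN (i64) | precision (f64) | recall (f64) | accuracy (f64) | flagged(%) (f64) | good_flagged(%) (f64) | f0.25 (f64) | f0.5 (f64) | f1 (f64) | f1.5 (f64) | f2 (f64) | num_rules (u32) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "0" | "((X["Title"] >= 0.25029) & (X[… | 255 | 58 | 491 | 87 | 0.814696 | 0.745614 | 0.837262 | 35.129068 | 10.564663 | 0.81028 | 0.799875 | 0.778626 | 0.765589 | 0.758477 | 4 |
| "0.0" | "(X['Title'] >= 0.25029) & (X['… | 239 | 57 | 492 | 103 | 0.807432 | 0.69883 | 0.820426 | 33.2211 | 10.382514 | 0.800118 | 0.783093 | 0.749216 | 0.729 | 0.718149 | 1 |
| "0.1" | "(X['FarePerPerson_div'] >= 9.5… | 127 | 6 | 543 | 215 | 0.954887 | 0.371345 | 0.751964 | 14.927048 | 1.092896 | 0.874089 | 0.726545 | 0.534737 | 0.457341 | 0.423051 | 1 |
| "0.2" | "(X['Title'] >= 0.77539) & (X['… | 166 | 9 | 540 | 176 | 0.948571 | 0.48538 | 0.792368 | 19.640853 | 1.639344 | 0.898154 | 0.796545 | 0.642166 | 0.571202 | 0.537913 | 1 |
| "0.3" | "(X['Title'] >= 0.77539) & (X['… | 59 | 1 | 548 | 283 | 0.983333 | 0.172515 | 0.681257 | 6.734007 | 0.182149 | 0.770353 | 0.506873 | 0.293532 | 0.231163 | 0.206583 | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "0.2.1" | "(X['Pclass'] >= 0.36447)" | 223 | 177 | 372 | 119 | 0.5575 | 0.652047 | 0.667789 | 44.893378 | 32.240437 | 0.562296 | 0.57415 | 0.601078 | 0.619709 | 0.630656 | 1 |
| "0.3.0" | "(X['Title'] >= 0.77539)" | 249 | 98 | 451 | 93 | 0.717579 | 0.72807 | 0.785634 | 38.945006 | 17.850638 | 0.718188 | 0.719653 | 0.722787 | 0.72481 | 0.725948 | 1 |
| "0.3.1" | "(X['Fare'] >= 31.3875)" | 129 | 86 | 463 | 213 | 0.6 | 0.377193 | 0.664422 | 24.130191 | 15.664845 | 0.579852 | 0.536606 | 0.463196 | 0.425851 | 0.407454 | 1 |
| "0.3.2" | "(X['Fare'] < 151.55)" | 322 | 540 | 9 | 20 | 0.37355 | 0.94152 | 0.371493 | 96.74523 | 98.360656 | 0.387293 | 0.424802 | 0.534884 | 0.641434 | 0.721973 | 1 |
| "0.3.3" | "(X['Ticket__length'] < 7.0)" | 252 | 401 | 148 | 90 | 0.385911 | 0.736842 | 0.448934 | 73.28844 | 73.041894 | 0.397034 | 0.42654 | 0.506533 | 0.575747 | 0.623454 | 1 |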

8. Generate Predictions on Test Data#

Apply the same preprocessing pipeline to the test data, then use the best ruleset to generate predictions:
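
Because each ruleset is a plain Python expression over `X`, it can be evaluated by binding the name `X` explicitly instead of rewriting the string. The sketch below shows the idea row by row with dictionaries; Python booleans support `&` and `|`, so the same expression syntax works on scalars. The helper `apply_ruleset`, the shortened ruleset, and the row values are illustrative assumptions, not data from this notebook. Passing an empty `__builtins__` only limits accidental name lookups and is not a security boundary, so this should only be used with trusted rule strings.

```python
def apply_ruleset(ruleset, rows):
    """Evaluate a rule expression against each row (a dict keyed by column name)."""
    code = compile(ruleset, "<ruleset>", "eval")
    # Only the name X is exposed to the expression
    return [bool(eval(code, {"__builtins__": {}}, {"X": row})) for row in rows]


toy_ruleset = (
    '((X["Title"] >= 0.25029) & (X["FamilySize"] < 5.0))'
    ' | ((X["Title"] >= 0.77539) & (X["Pclass"] >= 0.36447))'
)
rows = [
    {"Title": 0.9, "FamilySize": 6.0, "Pclass": 0.5},    # second rule fires
    {"Title": 0.3, "FamilySize": 2.0, "Pclass": -0.4},   # first rule fires
    {"Title": -0.5, "FamilySize": 1.0, "Pclass": 0.5},   # no rule fires
]
predictions = [int(p) for p in apply_ruleset(toy_ruleset, rows)]
print(predictions)
```

Binding `X` through the evaluation namespace avoids string substitution, which could also alter any column name that happens to contain a capital "X".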

[13]:
X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
X_test_transformed = pipe.transform(X_test)
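# Bind X explicitly rather than using str.replace("X", ...), which would also
# rewrite any capital "X" that appears inside a column name.
y_pred = eval(ruleset, {"X": X_test_transformed})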
[Pipeline] transform   1/17 · IsNull  |  in: rows=418  cols=11  nulls=414  →  out: rows=418  cols=13  nulls=414  (0.000s)
[Pipeline] transform   2/17 · Length  |  in: rows=418  cols=13  nulls=414  →  out: rows=418  cols=14  nulls=414  (0.000s)
[Pipeline] transform   3/17 · SplitExtractName  |  in: rows=418  cols=14  nulls=414  →  out: rows=418  cols=14  nulls=414  (0.001s)
[Pipeline] transform   4/17 · SplitExtractTitle  |  in: rows=418  cols=14  nulls=414  →  out: rows=418  cols=14  nulls=414  (0.001s)
[Pipeline] transform   5/17 · MathFeatures  |  in: rows=418  cols=14  nulls=414  →  out: rows=418  cols=15  nulls=414  (0.000s)
[Pipeline] transform   6/17 · ScalarMathFeatures  |  in: rows=418  cols=15  nulls=414  →  out: rows=418  cols=16  nulls=414  (0.000s)
[Pipeline] transform   7/17 · ExtractSubstring  |  in: rows=418  cols=16  nulls=414  →  out: rows=418  cols=17  nulls=741  (0.000s)
[Pipeline] transform   8/17 · RenameColumns  |  in: rows=418  cols=17  nulls=741  →  out: rows=418  cols=17  nulls=741  (0.000s)
[Pipeline] transform   9/17 · RareCategoryEncoder  |  in: rows=418  cols=17  nulls=741  →  out: rows=418  cols=17  nulls=741  (0.001s)
[Pipeline] transform   10/17 · MathFeatures2  |  in: rows=418  cols=17  nulls=741  →  out: rows=418  cols=18  nulls=742  (0.000s)
[Pipeline] transform   11/17 · ConditionFeatures  |  in: rows=418  cols=18  nulls=742  →  out: rows=418  cols=19  nulls=742  (0.000s)
[Pipeline] transform   12/17 · DropColumns  |  in: rows=418  cols=19  nulls=742  →  out: rows=418  cols=16  nulls=415  (0.000s)
[Pipeline] transform   13/17 · NumericImputer  |  in: rows=418  cols=16  nulls=415  →  out: rows=418  cols=16  nulls=327  (0.000s)
[Pipeline] transform   14/17 · StringImputer  |  in: rows=418  cols=16  nulls=327  →  out: rows=418  cols=16  nulls=0  (0.000s)
[Pipeline] transform   15/17 · CustomDiscretizer  |  in: rows=418  cols=16  nulls=0  →  out: rows=418  cols=16  nulls=0  (0.000s)
[Pipeline] transform   16/17 · CastColumns  |  in: rows=418  cols=16  nulls=0  →  out: rows=418  cols=16  nulls=0  (0.000s)
[Pipeline] transform   17/17 · WOEEncoder  |  in: rows=418  cols=16  nulls=0  →  out: rows=418  cols=16  nulls=0  (0.001s)
[14]:
# Create submission file (Kaggle leaderboard score: 0.78)
# Note: about 0.18 higher than the score without feature engineering (0.60)
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")