Titanic Survival Prediction using Iguanas

This notebook walks through an end-to-end example of using Iguanas for rule-based classification on the Kaggle Titanic dataset.

The workflow includes:

  1. Generating candidate rules using XGBoost

  2. Filtering and selecting high-quality rules

  3. Combining rules using different strategies

  4. Generating predictions for submission

1. Import Libraries

[1]:
import numpy as np
import polars as pl
from xgboost import XGBClassifier

from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_combination import (
    combine_rules_beam_search,
    combine_rules_cumulative,
    combine_rules_greedy,
)
from iguanas.rule_evaluation import apply_rules
from iguanas.rule_generation import rule_grid_search
from iguanas.rule_selection import filter_correlated_rules

2. Load and Prepare Data

Load the Titanic training data and separate features from the target variable (Survived).

[2]:
train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]

Extract numeric columns for rule generation:

[3]:
num_columns = [col for col, dtype in X_train.schema.items() if dtype in [pl.Int64, pl.Float64]]
num_columns
[3]:
['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

3. Generate Candidate Rules

Use XGBoost-based grid search to generate candidate rules. The rule_grid_search function trains models with different scale_pos_weight values (here, 50 values log-spaced between 0.001 and 1000) and extracts rules from the resulting decision trees. Duplicate rules are then dropped.

[4]:
estimator = XGBClassifier(n_estimators=10, max_depth=4, eval_metric="logloss", random_state=0)
rules = rule_grid_search(
    estimator,
    X_train[num_columns].to_pandas(),
    y_train.to_pandas(),
    scale_pos_weights=np.logspace(-3, 3, 50),
)
rules_df = rules.unique("rule")
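To see what this grid looks like, the following sketch (NumPy only, independent of Iguanas) builds the same array and summarises it. The variable names are illustrative. Because the 50 exponents are spaced evenly between -3 and 3, half of the weights fall below 1 and half above; the scale_pos_weight values reported in the rule table appear to be points on this grid.

```python
import numpy as np

# Same grid as in the rule_grid_search call above.
scale_pos_weights = np.logspace(-3, 3, 50)

n_weights = len(scale_pos_weights)
smallest, largest = scale_pos_weights[0], scale_pos_weights[-1]

# Weights below 1 down-weight the positive class; weights above 1 up-weight it.
n_below_one = int((scale_pos_weights < 1).sum())

# Neighbouring weights differ by a constant factor on a log-spaced grid.
ratios = scale_pos_weights[1:] / scale_pos_weights[:-1]

print(n_weights, smallest, largest, n_below_one)
```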
[5]:
rules_df.head(3)
[5]:
shape: (3, 4)
rule | tree | transformation | scale_pos_weight
str | i64 | str | f64
"(X["Pclass"] < 3.0) & (X["Fare… | 3 | "Baseline" | 59.636233
"(X["Fare"] >= 15.2458) & (X["F… | 7 | "Baseline" | 33.932218
"(X["Fare"] < 15.2458) & (X["Ag… | 5 | "Baseline" | 0.655129
[6]:
rules = rules_df.select("rule").to_series().to_list()
print(f"Number of unique rules: {len(rules)}")
Number of unique rules: 175
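The idea of extracting rules from decision trees can be illustrated without XGBoost or Iguanas. The sketch below is not the library's implementation: it assumes a hand-written toy tree stored as nested dictionaries (the keys "feature", "threshold", "left", "right" and "leaf", and all leaf scores, are hypothetical) and turns each root-to-leaf path into a conjunction written in the same string format as the rules above.

```python
def extract_rules(node, conditions=()):
    """Yield (rule_string, leaf_score) for every root-to-leaf path."""
    if "leaf" in node:
        yield " & ".join(f"({c})" for c in conditions), node["leaf"]
        return
    feature, threshold = node["feature"], node["threshold"]
    # Left branch: feature < threshold. Right branch: feature >= threshold.
    yield from extract_rules(
        node["left"], conditions + (f'X["{feature}"] < {threshold}',)
    )
    yield from extract_rules(
        node["right"], conditions + (f'X["{feature}"] >= {threshold}',)
    )


# Hypothetical toy tree; the leaf scores are made up for illustration.
toy_tree = {
    "feature": "Pclass",
    "threshold": 3.0,
    "left": {
        "feature": "Fare",
        "threshold": 13.7917,
        "left": {"leaf": -0.2},
        "right": {"leaf": 0.4},
    },
    "right": {"leaf": -0.3},
}

rules_with_scores = list(extract_rules(toy_tree))
# Keep only paths that end in a leaf voting for the positive class.
positive_rules = [rule for rule, score in rules_with_scores if score > 0]
print(positive_rules)
```

Keeping only the positively scored leaves is one plausible selection criterion; the source does not state which paths rule_grid_search retains.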

4. Filter High-Quality Rules

Apply the generated rules to the training data, compute performance metrics, and filter based on:

  • Minimum precision (> 0.15)

  • Minimum recall (> 0.15)

  • Maximum correlation between rules (< 0.8)

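The rules that pass the precision and recall thresholds are sorted by accuracy, and each rule's F0.5 score is passed to filter_correlated_rules as its importance. The aim is to keep rules that are both useful and diverse.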

[7]:
R = apply_rules(X_train[num_columns], rules)
M = compute_metrics(R, y_train)
M = M.filter((pl.col("precision") > 0.15) & (pl.col("recall") > 0.15)).sort(
    "accuracy", descending=True
)
importance = dict(zip(M["rule"], M["f0.5"], strict=False))
uncorrelated_rules = filter_correlated_rules(
    R[M["rule"].to_list()], importance=importance, max_corr=0.8
)
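Correlation-based filtering can be sketched with a simple greedy procedure. The code below is a generic illustration under stated assumptions, not the implementation of filter_correlated_rules: the function name greedy_correlation_filter, the toy flags, and the importance values are all hypothetical. Rules are visited in decreasing importance, and a rule is kept only if its absolute Pearson correlation with every rule already kept does not exceed max_corr.

```python
import numpy as np


def greedy_correlation_filter(rule_flags, importance, max_corr=0.8):
    """Return the names of the rules kept by a greedy correlation filter.

    rule_flags maps a rule name to a 0/1 array of flags per row.
    Constant columns give a NaN correlation and are dropped by the comparison.
    """
    kept = []
    for name in sorted(rule_flags, key=importance.get, reverse=True):
        flags = rule_flags[name]
        correlations = [
            abs(np.corrcoef(flags, rule_flags[other])[0, 1]) for other in kept
        ]
        if all(corr <= max_corr for corr in correlations):
            kept.append(name)
    return kept


# Hypothetical flags for three rules over ten rows.
toy_flags = {
    "rule_a": np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]),
    "rule_b": np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]),
    "rule_c": np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]),
}
toy_importance = {"rule_a": 0.6, "rule_b": 0.7, "rule_c": 0.5}

kept_rules = greedy_correlation_filter(toy_flags, toy_importance, max_corr=0.8)
print(kept_rules)
```

In this toy example rule_a and rule_b overlap heavily (correlation of about 0.82), so only the more important of the two survives, while the weakly correlated rule_c is kept.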

5. Combine Rules

Test different rule combination strategies to find the best-performing ruleset.

5.1 Cumulative Combination

Combines rules cumulatively (rule1 OR rule2 OR … OR ruleN):

[8]:
num_rules = R.shape[1]
[9]:
num_rules = len(uncorrelated_rules)
R_combined = combine_rules_cumulative(
    R[uncorrelated_rules], output_names=[f"combined_rule_{i}" for i in range(1, num_rules + 1)]
)
M_combined = compute_metrics(R_combined, y_train).sort("accuracy", descending=True)
M_combined.head(3)
[9]:
shape: (3, 16)
rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules
str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32
"combined_rule_1" | 174 | 87 | 447 | 152 | 0.666667 | 0.533742 | 0.722093 | 30.348837 | 16.292135 | 0.657041 | 0.635036 | 0.592845 | 0.568627 | 0.555911 | 1
"combined_rule_2" | 176 | 91 | 443 | 150 | 0.659176 | 0.539877 | 0.719767 | 31.046512 | 17.041199 | 0.650718 | 0.631277 | 0.593592 | 0.571714 | 0.560153 | 1
"combined_rule_3" | 176 | 91 | 443 | 150 | 0.659176 | 0.539877 | 0.719767 | 31.046512 | 17.041199 | 0.650718 | 0.631277 | 0.593592 | 0.571714 | 0.560153 | 1

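The cumulative OR described in this subsection can be sketched with NumPy alone. This is an illustration of the idea rather than the code behind combine_rules_cumulative; the toy flag matrix is hypothetical. Column k of the result flags a row if any of the first k + 1 rules flags it, so the number of flagged rows can only stay the same or grow as rules are added.

```python
import numpy as np

# Hypothetical flags: rows are passengers, columns are rules in priority order.
R_toy = np.array(
    [
        [True, False, False],
        [False, True, False],
        [False, False, True],
        [False, False, False],
    ]
)

# Column k holds rule_1 OR rule_2 OR ... OR rule_(k+1).
R_cumulative = np.logical_or.accumulate(R_toy, axis=1)

# Rows flagged by each cumulative ruleset.
flagged_counts = R_cumulative.sum(axis=0)
print(flagged_counts)
```

The same monotonic behaviour is visible in the output above, where TP rises from 174 to 176 between combined_rule_1 and combined_rule_2 while FP also rises from 87 to 91.
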
6. Analyze the Best Ruleset

Generate a detailed report for the best-performing ruleset found by beam search combination (M_beam):

[12]:
for r in M_beam["rule"][0].split(" | "):
    print(r)
((X["Pclass"] < 3.0) & (X["Fare"] >= 13.7917) & (X["Age"] < 61.0))
((X["Fare"] >= 10.5) & (X["SibSp"] < 3.0) & (X["Age"] < 18.0))
((X["Fare"] >= 52.5542) & (X["SibSp"] < 2.0) & (X["Age"] < 63.0))
((X["Fare"] >= 75.25))
[13]:
ruleset = M_beam["rule"][0]
report = generate_rule_performance_report(ruleset, X_train, y_train)
report.head(5)
[13]:
shape: (5, 17)
rule_index | rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules
str | str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32
"0" | "((X["Pclass"] < 3.0) & (X["Far… | 206 | 101 | 416 | 110 | 0.67101 | 0.651899 | 0.746699 | 36.854742 | 19.535783 | 0.669855 | 0.667098 | 0.661316 | 0.657662 | 0.655633 | 4
"0.0" | "(X['Pclass'] < 3.0) & (X['Fare… | 174 | 87 | 447 | 152 | 0.666667 | 0.533742 | 0.722093 | 30.348837 | 16.292135 | 0.657041 | 0.635036 | 0.592845 | 0.568627 | 0.555911 | 1
"0.1" | "(X['Fare'] >= 10.5) & (X['SibS… | 51 | 12 | 503 | 259 | 0.809524 | 0.164516 | 0.671515 | 7.636364 | 2.330097 | 0.657815 | 0.453737 | 0.273458 | 0.217949 | 0.195702 | 1
"0.2" | "(X['Fare'] >= 52.5542) & (X['S… | 89 | 27 | 519 | 244 | 0.767241 | 0.267267 | 0.691695 | 13.196815 | 4.945055 | 0.691183 | 0.558344 | 0.396437 | 0.334296 | 0.30732 | 1
"0.3" | "(X['Fare'] >= 75.25)" | 74 | 23 | 526 | 268 | 0.762887 | 0.216374 | 0.673401 | 10.886644 | 4.189435 | 0.664203 | 0.506849 | 0.33713 | 0.277553 | 0.25256 | 1

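The precision, recall, accuracy and F-beta columns in the report above are consistent with the standard definitions. As a check, the following sketch recomputes them in plain Python from the confusion counts of the "0.3" row; the helper fbeta is an illustrative name, not an Iguanas function.

```python
def fbeta(precision, recall, beta):
    """Standard F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Confusion counts copied from the "0.3" row of the report.
TP, FP, TN, FN = 74, 23, 526, 268

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + FP + TN + FN)

f_half = fbeta(precision, recall, 0.5)
f1 = fbeta(precision, recall, 1.0)
f2 = fbeta(precision, recall, 2.0)

print(precision, recall, accuracy, f_half, f1, f2)
```

The low F2 score relative to F0.5 reflects that this single rule is precise but flags only a small share of the survivors.
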
7. Generate Predictions on Test Data

Apply the best ruleset to the test data and create a submission file:

[14]:
X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
# Evaluate the rule string with the name X bound to the test data.
y_pred = eval(ruleset, {"X": X_test})
[15]:
# Create submission file (Kaggle leaderboard score: 0.61004)
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")
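The prediction step works because each rule is a Python expression in which X refers to the feature table. The sketch below shows the same mechanism on toy data, binding X through the namespace argument of eval. It assumes a dictionary of NumPy arrays as a stand-in for the test DataFrame (X_toy and its values are hypothetical), and it should only ever be used on rule strings from a trusted source, since eval executes arbitrary code.

```python
import numpy as np

# Hypothetical stand-in for the test data: column name -> array of values.
X_toy = {
    "Fare": np.array([80.0, 7.25, 60.0]),
    "SibSp": np.array([0, 1, 1]),
    "Age": np.array([30.0, 22.0, 70.0]),
}

# Two of the rules from the best ruleset, joined with OR.
rule = (
    '((X["Fare"] >= 52.5542) & (X["SibSp"] < 2.0) & (X["Age"] < 63.0))'
    ' | ((X["Fare"] >= 75.25))'
)

# Bind the name X explicitly; no other names are visible to the expression.
flags = eval(rule, {"X": X_toy})
predictions = flags.astype(int)
print(predictions)
```

Missing values are not covered by this sketch: NumPy comparisons against NaN evaluate to False, which may differ from how a DataFrame library treats nulls, so rows with a missing Age may need explicit handling before a submission is written.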