Titanic Survival Prediction using Iguanas#
This notebook demonstrates a complete end-to-end example of using Iguanas for rule-based classification on the Kaggle Titanic dataset.
The workflow includes:
Generating candidate rules using XGBoost
Filtering and selecting high-quality rules
Combining rules using different strategies
Generating predictions for submission
1. Import Libraries#
[1]:
import numpy as np
import polars as pl
from xgboost import XGBClassifier
from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_combination import (
    combine_rules_beam_search,
    combine_rules_cumulative,
    combine_rules_greedy,
)
from iguanas.rule_evaluation import apply_rules
from iguanas.rule_generation import rule_grid_search
from iguanas.rule_selection import filter_correlated_rules
2. Load and Prepare Data#
Load the Titanic training data and separate features from the target variable (Survived).
[2]:
train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]
Extract numeric columns for rule generation:
[3]:
num_columns = [col for col, dtype in X_train.schema.items() if dtype in [pl.Int64, pl.Float64]]
num_columns
[3]:
['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
3. Generate Candidate Rules#
Use XGBoost-based grid search to generate candidate rules. The rule_grid_search function trains models with different scale_pos_weight values and extracts rules from the decision trees.
[4]:
estimator = XGBClassifier(n_estimators=10, max_depth=4, eval_metric="logloss", random_state=0)
rules = rule_grid_search(
    estimator,
    X_train[num_columns].to_pandas(),
    y_train.to_pandas(),
    scale_pos_weights=np.logspace(-3, 3, 50),
)
rules_df = rules.unique("rule")
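To make the search grid concrete, here is a dependency-free sketch of the values that `np.logspace(-3, 3, 50)` produces: 50 weights spaced evenly on a log10 scale from 0.001 to 1000. The helper name `logspace` is a stand-in written for this illustration, not part of Iguanas. In XGBoost, `scale_pos_weight` values below 1 down-weight the positive class and values above 1 up-weight it, so the grid sweeps from models that rarely predict survival to models that predict it readily.

```python
def logspace(start_exp: float, stop_exp: float, num: int) -> list[float]:
    """Return `num` values evenly spaced on a log10 scale, endpoints included."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


weights = logspace(-3, 3, 50)

# Neighbouring weights differ by a constant factor, not a constant amount.
ratio = weights[1] / weights[0]

print(f"{len(weights)} weights from {weights[0]:.3g} to {weights[-1]:.3g}")
print(f"each weight is about {ratio:.4f} times the previous one")
```

The `scale_pos_weight` values that appear in the rules table in this section (for example 59.636233 and 0.655129) are points on this grid.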
[5]:
rules_df.head(3)
[5]:
| rule | tree | transformation | scale_pos_weight |
|---|---|---|---|
| str | i64 | str | f64 |
| "(X["Pclass"] < 3.0) & (X["Fare… | 3 | "Baseline" | 59.636233 |
| "(X["Fare"] >= 15.2458) & (X["F… | 7 | "Baseline" | 33.932218 |
| "(X["Fare"] < 15.2458) & (X["Ag… | 5 | "Baseline" | 0.655129 |
[6]:
rules = rules_df.select("rule").to_series().to_list()
print(f"Number of unique rules: {len(rules)}")
Number of unique rules: 175
4. Filter High-Quality Rules#
Apply the generated rules to the training data, compute performance metrics, and filter based on:
Minimum precision (> 0.15)
Minimum recall (> 0.15)
Maximum correlation between rules (< 0.8)
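Together, the precision and recall thresholds drop rules that are too weak to be useful, and the correlation filter removes near-duplicate rules. Each rule's F0.5 score is passed to the filter as its importance.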
[7]:
R = apply_rules(X_train[num_columns], rules)
M = compute_metrics(R, y_train)
M = M.filter((pl.col("precision") > 0.15) & (pl.col("recall") > 0.15)).sort(
    "accuracy", descending=True
)
importance = dict(zip(M["rule"], M["f0.5"], strict=False))
uncorrelated_rules = filter_correlated_rules(
    R[M["rule"].to_list()], importance=importance, max_corr=0.8
)
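The filtering step above can be sketched in plain Python on toy data. This is an illustration under assumptions, not the Iguanas implementation: the helper names are hypothetical, and the correlation filter is assumed to visit rules in descending importance and drop any rule whose absolute Pearson correlation with an already-kept rule exceeds `max_corr`. How `filter_correlated_rules` actually resolves correlated pairs is not shown in this notebook.

```python
from math import sqrt


def precision_recall(flags, y):
    """Precision and recall of one rule's 0/1 flags against 0/1 labels."""
    tp = sum(1 for f, t in zip(flags, y) if f and t)
    fp = sum(1 for f, t in zip(flags, y) if f and not t)
    fn = sum(1 for f, t in zip(flags, y) if not f and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def pearson(a, b):
    """Pearson correlation of two equal-length 0/1 vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb) if va and vb else 0.0


def drop_correlated(rule_flags, importance, max_corr=0.8):
    """Assumed strategy: keep rules in importance order, skip near-duplicates."""
    kept = []
    for name in sorted(rule_flags, key=importance.get, reverse=True):
        if all(abs(pearson(rule_flags[name], rule_flags[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# Toy labels and rule outputs (made up for illustration).
y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rule_flags = {
    "rule_a": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "rule_b": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # nearly the same as rule_a
    "rule_c": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    "rule_d": [0, 0, 0, 0, 0, 1, 1, 1, 0, 1],  # never correct
}

metrics = {name: precision_recall(flags, y) for name, flags in rule_flags.items()}
strong = {n: rule_flags[n] for n, (p, r) in metrics.items() if p > 0.15 and r > 0.15}
importance = {n: f_beta(*metrics[n]) for n in strong}
kept = drop_correlated(strong, importance, max_corr=0.8)
print(kept)
```

In this toy case `rule_d` fails the precision and recall thresholds, and `rule_b` is dropped because it is highly correlated with the more important `rule_a`.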
5. Combine Rules#
Test different rule combination strategies to find the best performing ruleset.
5.1 Cumulative Combination#
Combines rules cumulatively (rule1 OR rule2 OR … OR ruleN):
[8]:
num_rules = R.shape[1]  # number of candidate rules before filtering (reassigned in the next cell)
[9]:
num_rules = len(uncorrelated_rules)
R_combined = combine_rules_cumulative(
    R[uncorrelated_rules], output_names=[f"combined_rule_{i}" for i in range(1, num_rules + 1)]
)
M_combined = compute_metrics(R_combined, y_train).sort("accuracy", descending=True)
M_combined.head(3)
[9]:
| rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "combined_rule_1" | 174 | 87 | 447 | 152 | 0.666667 | 0.533742 | 0.722093 | 30.348837 | 16.292135 | 0.657041 | 0.635036 | 0.592845 | 0.568627 | 0.555911 | 1 |
| "combined_rule_2" | 176 | 91 | 443 | 150 | 0.659176 | 0.539877 | 0.719767 | 31.046512 | 17.041199 | 0.650718 | 0.631277 | 0.593592 | 0.571714 | 0.560153 | 1 |
| "combined_rule_3" | 176 | 91 | 443 | 150 | 0.659176 | 0.539877 | 0.719767 | 31.046512 | 17.041199 | 0.650718 | 0.631277 | 0.593592 | 0.571714 | 0.560153 | 1 |
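As a minimal sketch of what a cumulative combination means, the snippet below builds the k-th output as the row-wise OR of the first k rules. It assumes this is the behaviour behind `combine_rules_cumulative`, which the notebook does not spell out; `combine_cumulative` is a hypothetical stand-in operating on plain lists of booleans.

```python
from itertools import accumulate


def combine_cumulative(rule_flags: list[list[bool]]) -> list[list[bool]]:
    """Return one combined rule per input rule: rule_1 OR ... OR rule_k."""

    def row_or(acc: list[bool], nxt: list[bool]) -> list[bool]:
        return [a or b for a, b in zip(acc, nxt)]

    return list(accumulate(rule_flags, row_or))


# Three toy rules evaluated on four rows.
rules = [
    [True, False, False, False],
    [False, True, False, False],
    [True, False, True, False],
]

combined = combine_cumulative(rules)
for k, flags in enumerate(combined, start=1):
    print(f"combined_rule_{k}: {flags}")
```

Because each step can only add flagged rows, recall never decreases along the sequence, while precision may move in either direction.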
5.2 Greedy Search#
Uses a greedy algorithm to build a rule combination iteratively, here optimizing accuracy:
[10]:
R_greedy = combine_rules_greedy(R[uncorrelated_rules], y_train, metric="accuracy")
M_greedy = compute_metrics(R_greedy, y_train)
M_greedy
[10]:
| rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "((X["Pclass"] < 3.0) & (X["Far… | 206 | 101 | 416 | 110 | 0.67101 | 0.651899 | 0.746699 | 36.854742 | 19.535783 | 0.669855 | 0.667098 | 0.661316 | 0.657662 | 0.655633 | 4 |
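The idea of a greedy combination can be sketched as forward selection: start from an empty ruleset, repeatedly add the rule whose OR with the current ruleset improves the metric most, and stop when nothing helps. This is an assumed strategy for illustration only; the internals of `combine_rules_greedy` are not shown in this notebook, and `greedy_or` and `accuracy` are hypothetical helpers.

```python
def accuracy(flags, y):
    """Share of rows where the flag agrees with the label."""
    return sum(bool(f) == bool(t) for f, t in zip(flags, y)) / len(y)


def greedy_or(rule_flags, y):
    """Forward selection of rules combined with OR, maximizing accuracy."""
    combined = [False] * len(y)
    best = accuracy(combined, y)
    remaining = dict(rule_flags)
    chosen = []
    while remaining:
        # Score every remaining rule when OR-ed onto the current ruleset.
        scored = {
            name: accuracy([c or f for c, f in zip(combined, flags)], y)
            for name, flags in remaining.items()
        }
        name = max(scored, key=scored.get)
        if scored[name] <= best:
            break  # no remaining rule improves the metric
        best = scored[name]
        combined = [c or f for c, f in zip(combined, remaining.pop(name))]
        chosen.append(name)
    return chosen, best


# Toy labels and rule outputs (made up for illustration).
y = [1, 1, 1, 0, 0, 0, 1, 0]
rule_flags = {
    "r1": [1, 1, 1, 0, 0, 0, 0, 0],
    "r2": [0, 0, 1, 0, 0, 0, 1, 0],
    "r3": [0, 0, 0, 1, 1, 0, 0, 0],  # only adds false positives
}

chosen, score = greedy_or(rule_flags, y)
print(chosen, score)
```

Greedy selection is cheap, but it commits to each choice and cannot undo an early pick that blocks a better combination later.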
5.3 Beam Search#
Uses beam search to explore rule combinations of up to max_rules rules (12 here), again optimizing accuracy:
[11]:
R_beam = combine_rules_beam_search(R[uncorrelated_rules], y_train, metric="accuracy", max_rules=12)
M_beam = compute_metrics(R_beam, y_train)
M_beam.head(3)
[11]:
| rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "((X["Pclass"] < 3.0) & (X["Far… | 206 | 101 | 416 | 110 | 0.67101 | 0.651899 | 0.746699 | 36.854742 | 19.535783 | 0.669855 | 0.667098 | 0.661316 | 0.657662 | 0.655633 | 4 |
| "((X["Pclass"] < 3.0) & (X["Far… | 206 | 101 | 416 | 110 | 0.67101 | 0.651899 | 0.746699 | 36.854742 | 19.535783 | 0.669855 | 0.667098 | 0.661316 | 0.657662 | 0.655633 | 4 |
| "((X["Pclass"] < 3.0) & (X["Far… | 206 | 101 | 416 | 110 | 0.67101 | 0.651899 | 0.746699 | 36.854742 | 19.535783 | 0.669855 | 0.667098 | 0.661316 | 0.657662 | 0.655633 | 4 |
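To illustrate why beam search can help, here is a small sketch that keeps the best few partial rulesets at each size instead of a single one. It is an assumption-laden toy, not the Iguanas algorithm: `beam_search_or` and its `beam_width` parameter are hypothetical, and only `max_rules` mirrors an argument used in the cell above.

```python
def accuracy(flags, y):
    """Share of rows where the flag agrees with the label."""
    return sum(bool(f) == bool(t) for f, t in zip(flags, y)) / len(y)


def beam_search_or(rule_flags, y, beam_width=2, max_rules=12):
    """Search OR-combinations, keeping the `beam_width` best at each size."""

    def score(combo):
        flags = [any(rule_flags[r][i] for r in combo) for i in range(len(y))]
        return accuracy(flags, y)

    beam = [((), score(()))]
    best = beam[0]
    for _ in range(max_rules):
        candidates = {}
        for combo, _ in beam:
            for name in rule_flags:
                if name not in combo:
                    # Sort names so the same set of rules is scored only once.
                    new = tuple(sorted(combo + (name,)))
                    if new not in candidates:
                        candidates[new] = score(new)
        if not candidates:
            break  # every rule is already in every beam entry
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        beam = ranked[:beam_width]
        if beam[0][1] > best[1]:
            best = beam[0]
    return best


# Toy data: "A" is the best single rule, but it carries a false positive
# that no later rule can remove, so the best pair is "B" with "C".
y = [1, 1, 1, 1, 0, 0, 0, 0]
rule_flags = {
    "A": [1, 1, 1, 1, 1, 0, 0, 0],
    "B": [1, 1, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 1, 1, 0, 0, 0, 0],
}

best_combo, best_score = beam_search_or(rule_flags, y, beam_width=2)
print(best_combo, best_score)
```

A purely greedy search on this toy data would pick `A` first and stall at 0.875 accuracy, whereas a beam of width 2 also keeps `B` alive and reaches the perfect pair.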
6. Analyze the Best Ruleset#
Generate a detailed report for the best-performing ruleset found by beam search:
[12]:
# Print each component rule of the best ruleset on its own line
for r in M_beam["rule"][0].split(" | "):
    print(r)
((X["Pclass"] < 3.0) & (X["Fare"] >= 13.7917) & (X["Age"] < 61.0))
((X["Fare"] >= 10.5) & (X["SibSp"] < 3.0) & (X["Age"] < 18.0))
((X["Fare"] >= 52.5542) & (X["SibSp"] < 2.0) & (X["Age"] < 63.0))
((X["Fare"] >= 75.25))
[13]:
ruleset = M_beam["rule"][0]
report = generate_rule_performance_report(ruleset, X_train, y_train)
report.head(5)
[13]:
| rule_index | rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32 |
| "0" | "((X["Pclass"] < 3.0) & (X["Far… | 206 | 101 | 416 | 110 | 0.67101 | 0.651899 | 0.746699 | 36.854742 | 19.535783 | 0.669855 | 0.667098 | 0.661316 | 0.657662 | 0.655633 | 4 |
| "0.0" | "(X['Pclass'] < 3.0) & (X['Fare… | 174 | 87 | 447 | 152 | 0.666667 | 0.533742 | 0.722093 | 30.348837 | 16.292135 | 0.657041 | 0.635036 | 0.592845 | 0.568627 | 0.555911 | 1 |
| "0.1" | "(X['Fare'] >= 10.5) & (X['SibS… | 51 | 12 | 503 | 259 | 0.809524 | 0.164516 | 0.671515 | 7.636364 | 2.330097 | 0.657815 | 0.453737 | 0.273458 | 0.217949 | 0.195702 | 1 |
| "0.2" | "(X['Fare'] >= 52.5542) & (X['S… | 89 | 27 | 519 | 244 | 0.767241 | 0.267267 | 0.691695 | 13.196815 | 4.945055 | 0.691183 | 0.558344 | 0.396437 | 0.334296 | 0.30732 | 1 |
| "0.3" | "(X['Fare'] >= 75.25)" | 74 | 23 | 526 | 268 | 0.762887 | 0.216374 | 0.673401 | 10.886644 | 4.189435 | 0.664203 | 0.506849 | 0.33713 | 0.277553 | 0.25256 | 1 |
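The metric columns in the report can be reproduced from the four confusion counts. The sketch below uses the standard definitions of precision, recall, accuracy, and F-beta; the formulas for `flagged(%)` and `good_flagged(%)` are inferred from the numbers in the table (share of all rows flagged, and share of negatives flagged) rather than taken from the Iguanas source. The helper `rule_metrics` is hypothetical.

```python
def rule_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Recompute report metrics from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f_beta(beta: float) -> float:
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / total,
        "flagged(%)": 100 * (tp + fp) / total,    # inferred definition
        "good_flagged(%)": 100 * fp / (fp + tn),  # inferred definition
        "f0.5": f_beta(0.5),
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }


# Counts from the report row for the rule X['Fare'] >= 75.25.
m = rule_metrics(tp=74, fp=23, tn=526, fn=268)
for name, value in m.items():
    print(f"{name:>16}: {value:.6f}")
```

The printed values agree with the corresponding row of the report to the six decimals shown there.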
7. Generate Predictions on Test Data#
Apply the best ruleset to the test data and create a submission file:
[14]:
X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
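# Evaluate the ruleset string with X bound to the test data.
# eval is acceptable here only because the rule strings were generated locally.
y_pred = eval(ruleset, {"X": X_test})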
[15]:
# Create submission file (Kaggle leaderboard score: 0.61004)
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")
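Since the ruleset is just a Python expression over `X`, applying it to new data amounts to evaluating that expression with `X` bound to the new table. The sketch below shows the mechanism without any DataFrame library: `Column` is a hypothetical stand-in that supports only the operators appearing in the rule strings, and the ruleset and data are made up. Passing an explicit namespace to `eval` avoids editing the rule string, but `eval` should still only be used on rule strings you generated yourself.

```python
class Column:
    """Tiny stand-in for a DataFrame column (illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __lt__(self, other):
        return Column(v < other for v in self.values)

    def __ge__(self, other):
        return Column(v >= other for v in self.values)

    def __and__(self, other):
        return Column(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):
        return Column(a or b for a, b in zip(self.values, other.values))


# A made-up ruleset in the same string format as the rules in this notebook.
ruleset = '((X["Fare"] >= 75.25)) | ((X["Pclass"] < 3.0) & (X["Age"] < 18.0))'

# Made-up rows standing in for unseen passengers.
X_new = {
    "Fare": Column([80.0, 7.25, 30.0]),
    "Pclass": Column([1, 3, 2]),
    "Age": Column([40.0, 10.0, 12.0]),
}

# Bind X explicitly and expose no builtins to the evaluated expression.
flags = eval(ruleset, {"__builtins__": {}}, {"X": X_new})
survived = [int(f) for f in flags.values]
print(survived)
```

With a real DataFrame the same call pattern applies, with the DataFrame itself bound to `X`.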