Titanic Survival Prediction using Iguanas

This notebook walks through an end-to-end example of using Iguanas for rule-based classification on the Kaggle Titanic dataset.

The workflow includes:

Generating candidate rules using XGBoost
Filtering and selecting high-quality rules
Combining rules using different strategies
Generating predictions for submission

1. Import Libraries

[1]:

import numpy as np
import polars as pl
from xgboost import XGBClassifier

from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_combination import (
    combine_rules_beam_search,
    combine_rules_cumulative,
    combine_rules_greedy,
)
from iguanas.rule_evaluation import apply_rules
from iguanas.rule_generation import rule_grid_search
from iguanas.rule_selection import filter_correlated_rules

2. Load and Prepare Data

Load the Titanic training data and separate features from the target variable (Survived).

[2]:

train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]

Extract numeric columns for rule generation:

[3]:

num_columns = [col for col, dtype in X_train.schema.items() if dtype in [pl.Int64, pl.Float64]]
num_columns

[3]:

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

3. Generate Candidate Rules

Use XGBoost-based grid search to generate candidate rules. The rule_grid_search function trains models with different scale_pos_weight values and extracts rules from the decision trees.

[4]:

estimator = XGBClassifier(n_estimators=10, max_depth=4, eval_metric="logloss", random_state=0)
rules = rule_grid_search(
    estimator,
    X_train[num_columns].to_pandas(),
    y_train.to_pandas(),
    scale_pos_weights=np.logspace(-3, 3, 50),
)
rules_df = rules.unique("rule")

[5]:

rules_df.head(3)

[5]:

shape: (3, 4)

rule	tree	transformation	scale_pos_weight
str	i64	str	f64
"(X["Pclass"] < 3.0) & (X["Fare…	3	"Baseline"	59.636233
"(X["Fare"] >= 15.2458) & (X["F…	7	"Baseline"	33.932218
"(X["Fare"] < 15.2458) & (X["Ag…	5	"Baseline"	0.655129

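To make the search grid concrete: np.logspace(-3, 3, 50) produces 50 weights spaced evenly on a log scale between 0.001 and 1000. The sketch below rebuilds the same grid in plain Python (the names num and grid are illustrative only). The scale_pos_weight values in the table above, such as 59.636233 and 0.655129, are points on this grid.

```python
# Plain-Python equivalent of np.logspace(-3, 3, 50):
# 50 exponents evenly spaced from -3 to 3, each used as a power of 10.
num = 50
grid = [10 ** (-3 + 6 * i / (num - 1)) for i in range(num)]

print(len(grid))                 # number of candidate weights
print(grid[0], grid[-1])         # smallest and largest weight
print(grid[23], grid[39])        # two grid points that appear in the table above
print(sum(w < 1 for w in grid))  # weights that down-weight the positive class
```

Weights below 1 make the positive class count for less during training and weights above 1 make it count for more, so sweeping the grid yields trees, and therefore rules, with different precision/recall trade-offs.
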
[6]:

rules = rules_df.select("rule").to_series().to_list()
print(f"Number of unique rules: {len(rules)}")

Number of unique rules: 175

4. Filter High-Quality Rules

Apply the generated rules to the training data, compute performance metrics, and filter based on:

Minimum precision (> 0.15)
Minimum recall (> 0.15)
Maximum correlation between rules (< 0.8)

This keeps only rules that are individually useful and not redundant with one another.

[7]:

R = apply_rules(X_train[num_columns], rules)
M = compute_metrics(R, y_train)
M = M.filter((pl.col("precision") > 0.15) & (pl.col("recall") > 0.15)).sort(
    "accuracy", descending=True
)
importance = dict(zip(M["rule"], M["f0.5"], strict=False))
uncorrelated_rules = filter_correlated_rules(
    R[M["rule"].to_list()], importance=importance, max_corr=0.8
)
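
One common way to implement this kind of filter is a greedy pass: visit rules from most to least important and keep a rule only if its correlation with every rule already kept stays within the threshold. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not the actual implementation of filter_correlated_rules, and the helper names pearson and greedy_filter are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length 0/1 sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / sqrt(var_a * var_b)

def greedy_filter(columns, importance, max_corr):
    """Keep rules in descending importance, skipping near-duplicates."""
    kept = []
    for name in sorted(columns, key=importance.get, reverse=True):
        # Assumption: absolute correlation, with the boundary value allowed.
        if all(abs(pearson(columns[name], columns[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept

# Toy rule outputs: rule_b fires on exactly the same rows as rule_a.
columns = {
    "rule_a": [1, 1, 0, 0, 1, 0],
    "rule_b": [1, 1, 0, 0, 1, 0],
    "rule_c": [0, 1, 1, 0, 0, 1],
}
importance = {"rule_a": 0.70, "rule_b": 0.65, "rule_c": 0.50}
kept = greedy_filter(columns, importance, max_corr=0.8)
print(kept)
```

In the toy data rule_b duplicates rule_a and is dropped in favour of the more important rule, while the weakly correlated rule_c survives.
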

5. Combine Rules

Test different rule combination strategies to find the best performing ruleset.

5.1 Cumulative Combination

Combines rules cumulatively (rule1 OR rule2 OR … OR ruleN):

[8]:

num_rules = R.shape[1]

[9]:

num_rules = len(uncorrelated_rules)
R_combined = combine_rules_cumulative(
    R[uncorrelated_rules], output_names=[f"combined_rule_{i}" for i in range(1, num_rules + 1)]
)
M_combined = compute_metrics(R_combined, y_train).sort("accuracy", descending=True)
M_combined.head(3)

[9]:

shape: (3, 16)

rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"combined_rule_1"	174	87	447	152	0.666667	0.533742	0.722093	30.348837	16.292135	0.657041	0.635036	0.592845	0.568627	0.555911	1
"combined_rule_2"	176	91	443	150	0.659176	0.539877	0.719767	31.046512	17.041199	0.650718	0.631277	0.593592	0.571714	0.560153	1
"combined_rule_3"	176	91	443	150	0.659176	0.539877	0.719767	31.046512	17.041199	0.650718	0.631277	0.593592	0.571714	0.560153	1

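The cumulative strategy can be pictured as a running OR over the rule columns: the k-th combined column flags a row if any of the first k rules flags it. A minimal plain-Python sketch of that idea follows; the helper cumulative_or is hypothetical and works on lists of booleans rather than a Polars DataFrame.

```python
def cumulative_or(rule_columns):
    """Return one combined column per prefix of the input rule columns."""
    combined = []
    running = [False] * len(rule_columns[0])
    for column in rule_columns:
        # OR the next rule into everything accumulated so far.
        running = [r or c for r, c in zip(running, column)]
        combined.append(running)
    return combined

rule_1 = [True, False, False, False]
rule_2 = [False, True, False, False]
rule_3 = [True, False, True, False]

combined = cumulative_or([rule_1, rule_2, rule_3])
for i, col in enumerate(combined, start=1):
    print(f"combined_rule_{i}", col)
```

Because OR-ing in another rule can only flag more rows, TP and FP never decrease from one combined rule to the next, so recall is non-decreasing while precision may move either way.
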
5.2 Greedy Search

Uses a greedy algorithm to iteratively select the best rule combination:

[10]:

R_greedy = combine_rules_greedy(R[uncorrelated_rules], y_train, metric="accuracy")
M_greedy = compute_metrics(R_greedy, y_train)
M_greedy

[10]:

shape: (1, 16)

rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"((X["Pclass"] < 3.0) & (X["Far…	206	101	416	110	0.67101	0.651899	0.746699	36.854742	19.535783	0.669855	0.667098	0.661316	0.657662	0.655633	4

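As an illustration of what a greedy combination can look like, the sketch below performs forward selection: starting from an empty ruleset, it repeatedly ORs in the rule that most improves the metric and stops when no rule helps. This is an assumed reading of the strategy, not necessarily how combine_rules_greedy is implemented, and the helpers accuracy and greedy_or are hypothetical.

```python
def accuracy(pred, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def greedy_or(rules, y, metric=accuracy):
    """Forward selection: repeatedly OR in the rule that most improves metric."""
    selected = []
    current = [False] * len(y)
    best = metric(current, y)
    while True:
        candidate, candidate_pred = None, None
        for name, column in rules.items():
            if name in selected:
                continue
            trial = [c or r for c, r in zip(current, column)]
            score = metric(trial, y)
            if score > best:
                best, candidate, candidate_pred = score, name, trial
        if candidate is None:  # no remaining rule improves the metric
            return selected, best
        selected.append(candidate)
        current = candidate_pred

# Toy data: three positives followed by three negatives.
y = [True, True, True, False, False, False]
rules = {
    "a": [True, True, False, False, False, False],
    "b": [False, False, True, False, False, False],
    "c": [True, True, True, True, True, False],
}
selected, best = greedy_or(rules, y)
print(selected, best)
```

Greedy selection commits to each choice permanently, so it is fast but can miss combinations in which an individually weak rule pays off only alongside others.
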
5.3 Beam Search

Uses beam search to explore rule combinations up to a maximum number of rules:

[11]:

R_beam = combine_rules_beam_search(R[uncorrelated_rules], y_train, metric="accuracy", max_rules=12)
M_beam = compute_metrics(R_beam, y_train)
M_beam.head(3)

[11]:

shape: (3, 16)

rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"((X["Pclass"] < 3.0) & (X["Far…	206	101	416	110	0.67101	0.651899	0.746699	36.854742	19.535783	0.669855	0.667098	0.661316	0.657662	0.655633	4
"((X["Pclass"] < 3.0) & (X["Far…	206	101	416	110	0.67101	0.651899	0.746699	36.854742	19.535783	0.669855	0.667098	0.661316	0.657662	0.655633	4
"((X["Pclass"] < 3.0) & (X["Far…	206	101	416	110	0.67101	0.651899	0.746699	36.854742	19.535783	0.669855	0.667098	0.661316	0.657662	0.655633	4

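Beam search generalizes the greedy idea by carrying several partial combinations forward instead of one. The sketch below keeps the best beam_width combinations of each size and extends each by one rule until max_rules is reached. It is an illustrative assumption about the technique, not the implementation of combine_rules_beam_search; beam_width and the helper names are hypothetical.

```python
def accuracy(pred, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def or_columns(rules, names, n_rows):
    """OR together the boolean columns of the named rules."""
    pred = [False] * n_rows
    for name in names:
        pred = [p or r for p, r in zip(pred, rules[name])]
    return pred

def beam_search_or(rules, y, max_rules=3, beam_width=2, metric=accuracy):
    """Return (score, rule_names) pairs, best first, for the combinations explored."""
    beam = [frozenset()]  # start from the empty combination
    scores = {}           # every combination scored so far
    for _ in range(max_rules):
        # Extend each combination in the beam by one unused rule.
        expansions = {combo | {name} for combo in beam for name in rules if name not in combo}
        if not expansions:
            break
        for combo in expansions:
            scores[combo] = metric(or_columns(rules, combo, len(y)), y)
        # Keep only the best few combinations of this size for the next round.
        beam = sorted(expansions, key=lambda c: (-scores[c], sorted(c)))[:beam_width]
    ranked = [(s, sorted(c)) for c, s in scores.items()]
    return sorted(ranked, key=lambda t: (-t[0], len(t[1]), t[1]))

y = [True, True, True, False, False, False]
rules = {
    "a": [True, True, False, False, False, False],
    "b": [False, False, True, False, False, False],
    "c": [True, True, True, True, True, False],
}
results = beam_search_or(rules, y, max_rules=2, beam_width=2)
for score, names in results[:3]:
    print(round(score, 3), names)
```

With beam_width=1 this reduces to a greedy search; wider beams trade extra computation for a broader exploration of rule combinations.
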
6. Analyze the Best Ruleset

Generate a detailed report for the best-performing ruleset found by beam search:

[12]:

for r in M_beam["rule"][0].split(" | "):
    print(r)

((X["Pclass"] < 3.0) & (X["Fare"] >= 13.7917) & (X["Age"] < 61.0))
((X["Fare"] >= 10.5) & (X["SibSp"] < 3.0) & (X["Age"] < 18.0))
((X["Fare"] >= 52.5542) & (X["SibSp"] < 2.0) & (X["Age"] < 63.0))
((X["Fare"] >= 75.25))

[13]:

ruleset = M_beam["rule"][0]
report = generate_rule_performance_report(ruleset, X_train, y_train)
report.head(5)

[13]:

shape: (5, 17)

rule_index	rule	TP	FP	TN	FN	precision	recall	accuracy	flagged(%)	good_flagged(%)	f0.25	f0.5	f1	f1.5	f2	num_rules
str	str	i64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	u32
"0"	"((X["Pclass"] < 3.0) & (X["Far…	206	101	416	110	0.67101	0.651899	0.746699	36.854742	19.535783	0.669855	0.667098	0.661316	0.657662	0.655633	4
"0.0"	"(X['Pclass'] < 3.0) & (X['Fare…	174	87	447	152	0.666667	0.533742	0.722093	30.348837	16.292135	0.657041	0.635036	0.592845	0.568627	0.555911	1
"0.1"	"(X['Fare'] >= 10.5) & (X['SibS…	51	12	503	259	0.809524	0.164516	0.671515	7.636364	2.330097	0.657815	0.453737	0.273458	0.217949	0.195702	1
"0.2"	"(X['Fare'] >= 52.5542) & (X['S…	89	27	519	244	0.767241	0.267267	0.691695	13.196815	4.945055	0.691183	0.558344	0.396437	0.334296	0.30732	1
"0.3"	"(X['Fare'] >= 75.25)"	74	23	526	268	0.762887	0.216374	0.673401	10.886644	4.189435	0.664203	0.506849	0.33713	0.277553	0.25256	1

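The metric columns in the report can be reproduced from the four confusion counts. The sketch below does this for the combined ruleset (row "0": TP=206, FP=101, TN=416, FN=110). The function rule_metrics is a hypothetical helper, and the formula used for good_flagged(%), FP / (FP + TN), is inferred from the numbers in the table rather than taken from the Iguanas documentation.

```python
def rule_metrics(tp, fp, tn, fn):
    """Derive the report's metric columns from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f_beta(beta):
        # F-beta weights recall beta times as much as precision.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / total,
        "flagged(%)": 100 * (tp + fp) / total,
        # Inferred: share of negatives (non-survivors) that the rule flags.
        "good_flagged(%)": 100 * fp / (fp + tn),
        "f0.5": f_beta(0.5),
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }

m = rule_metrics(tp=206, fp=101, tn=416, fn=110)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

F-scores with beta below 1, such as f0.5, favour precision, which fits their use as the importance measure when filtering correlated rules.
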
7. Generate Predictions on Test Data

Apply the best ruleset to the test data and create a submission file:

[14]:

X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
# Bind the name X to the test frame instead of rewriting the rule string.
y_pred = eval(ruleset, {"X": X_test})
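
Because the ruleset is an ordinary Python expression over a variable named X, it can be evaluated against any object that supports X["column"] lookups by passing that object in an explicit namespace. This avoids editing the string: a plain text replacement of "X" would also alter any column name containing a capital X (none of the Titanic columns used here do). The sketch below applies the four rules printed earlier to single passengers represented as dicts; predict_row and the two sample passengers are illustrative.

```python
# The best ruleset from above, joined with " | " as in the report.
ruleset = (
    '((X["Pclass"] < 3.0) & (X["Fare"] >= 13.7917) & (X["Age"] < 61.0)) | '
    '((X["Fare"] >= 10.5) & (X["SibSp"] < 3.0) & (X["Age"] < 18.0)) | '
    '((X["Fare"] >= 52.5542) & (X["SibSp"] < 2.0) & (X["Age"] < 63.0)) | '
    '((X["Fare"] >= 75.25))'
)

def predict_row(rule, row):
    """Evaluate a rule string with the name X bound to a single passenger."""
    # eval executes arbitrary code, so only use it on trusted rule strings.
    return bool(eval(rule, {"X": row}))

# Hypothetical passengers with no missing values.
first_class = {"Pclass": 1, "Fare": 71.2833, "Age": 38.0, "SibSp": 1}
third_class = {"Pclass": 3, "Fare": 7.25, "Age": 22.0, "SibSp": 1}

print(predict_row(ruleset, first_class))
print(predict_row(ruleset, third_class))
```

The first passenger satisfies the first rule (first class, fare above 13.7917, age under 61) and is predicted to survive; the second matches none of the four rules.
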

[15]:

# Create submission file (Kaggle leaderboard score: 0.61004)
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")
