Titanic Survival Prediction using Iguanas

This notebook walks through an end-to-end example of rule-based classification with Iguanas on the Kaggle Titanic dataset.

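The workflow includes:

  1. Loading the training data

  2. Generating the best single rule for the given metric

  3. Generating the best ruleset for the given metric

  4. Applying the best ruleset to the test data and writing a submission file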

1. Import Libraries

[1]:
import numpy as np
import polars as pl
from xgboost import XGBClassifier

from iguanas.metrics import compute_metrics
from iguanas.rule_analysis import generate_rule_performance_report
from iguanas.rule_classifier import RuleClassifier
from iguanas.ruleset_classifier import RulesetClassifier

2. Load Data

Load the Titanic training data and separate features from the target variable (Survived).

[2]:
train = pl.read_csv("../../../../../kaggle/titanic/train.csv").drop("PassengerId")
X_train = train.drop("Survived")
y_train = train["Survived"]

3. Generate best rule based on given metric

[3]:
estimator = XGBClassifier(n_estimators=10, max_depth=4, eval_metric="logloss", random_state=0)
rule_est = RuleClassifier(
    estimator=estimator,
    scale_pos_weights=np.logspace(-3, 3, 10),
)
_ = rule_est.fit(X_train, y_train)
y_pred_train = rule_est.predict(X_train)
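
For orientation, `np.logspace(-3, 3, 10)` in the cell above yields ten values spaced evenly on a log scale between 10^-3 and 10^3. The sketch below rebuilds approximately the same grid with plain Python so that it runs without NumPy. How `RuleClassifier` uses each value is not shown in this notebook; reading them as candidate values for XGBoost's `scale_pos_weight` is an assumption based on the parameter name.

```python
# Rebuild the grid produced by np.logspace(-3, 3, 10) using plain Python.
start_exp, stop_exp, num = -3, 3, 10

# The exponents are evenly spaced between start_exp and stop_exp (inclusive).
step = (stop_exp - start_exp) / (num - 1)
grid = [10 ** (start_exp + i * step) for i in range(num)]

for weight in grid:
    print(f"{weight:.4g}")
```

Values below 1 down-weight the positive class (survivors) and values above 1 up-weight it, assuming the usual meaning of a positive-class weight.
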
[4]:
M_train = compute_metrics(y_pred_train, y_train)
M_train
[4]:
shape: (1, 16)
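rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules
str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32
"(X["Pclass"] < 3.0) & (X["Fare… | 176 | 91 | 443 | 150 | 0.659176 | 0.539877 | 0.719767 | 31.046512 | 17.041199 | 0.650718 | 0.631277 | 0.593592 | 0.571714 | 0.560153 | 1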
[5]:
report = generate_rule_performance_report(rule_est._best_rule_, X_train, y_train)
report
[5]:
shape: (4, 17)
rule_index | rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules
str | str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32
"0" | "(X["Pclass"] < 3.0) & (X["Fare… | 176 | 91 | 443 | 150 | 0.659176 | 0.539877 | 0.719767 | 31.046512 | 17.041199 | 0.650718 | 0.631277 | 0.593592 | 0.571714 | 0.560153 | 1
"0.0" | "(X['Pclass'] < 3.0)" | 223 | 177 | 372 | 119 | 0.5575 | 0.652047 | 0.667789 | 44.893378 | 32.240437 | 0.562296 | 0.57415 | 0.601078 | 0.619709 | 0.630656 | 1
"0.1" | "(X['Fare'] >= 13.7917)" | 234 | 224 | 325 | 108 | 0.510917 | 0.684211 | 0.627385 | 51.402918 | 40.801457 | 0.518644 | 0.538178 | 0.585 | 0.619552 | 0.640745 | 1
"0.2" | "(X['Age'] < 64.0)" | 289 | 412 | 12 | 1 | 0.412268 | 0.996552 | 0.421569 | 98.179272 | 97.169811 | 0.426995 | 0.467033 | 0.583249 | 0.693942 | 0.776464 | 1

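The metric columns can be reproduced from the four confusion-matrix counts alone. The sketch below does this for row "0" of the report above. The formulas for `flagged(%)` and `good_flagged(%)` are assumptions inferred from the reported numbers, not taken from the Iguanas source.

```python
# Confusion-matrix counts copied from row "0" of the report above.
tp, fp, tn, fn = 176, 91, 443, 150
total = tp + fp + tn + fn


def f_beta(beta: float) -> float:
    """F-beta score written directly in terms of confusion-matrix counts."""
    b2 = beta**2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


metrics = {
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
    "accuracy": (tp + tn) / total,
    # Assumed definition: percentage of all evaluated rows that the rule flags.
    "flagged(%)": 100 * (tp + fp) / total,
    # Assumed definition: percentage of actual negatives that the rule flags.
    "good_flagged(%)": 100 * fp / (fp + tn),
}
for beta in (0.25, 0.5, 1, 1.5, 2):
    metrics[f"f{beta}"] = f_beta(beta)

for name, value in metrics.items():
    print(f"{name:>16}: {value:.6f}")
```

Smaller values of beta weight precision more heavily and larger values weight recall more heavily, which is why a broad condition such as the `Age` one scores poorly on `f0.25` but well on `f2`.
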
4. Generate best ruleset based on given metric

[6]:
estimator = XGBClassifier(n_estimators=10, max_depth=4, eval_metric="logloss", random_state=0)
ruleset_est = RulesetClassifier(
    estimator=estimator,
    scale_pos_weights=np.logspace(-3, 3, 50),
    opt_metric="accuracy",
    metric_thresholds=[{"name": "accuracy", "operator": ">=", "value": 0.5}],
    max_rules=5,
)
_ = ruleset_est.fit(X_train, y_train)
y_pred_train = ruleset_est.predict(X_train)
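
Each entry in `metric_thresholds` names a metric, a comparison operator, and a value. The sketch below shows one way such a specification could be checked against computed metrics. The helper `passes_thresholds` and the `OPERATORS` table are hypothetical and written for illustration; they are not part of Iguanas, and the accuracy values passed in are made up.

```python
import operator

# Map the operator strings used in a threshold spec to Python functions.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def passes_thresholds(metrics: dict, thresholds: list) -> bool:
    """Hypothetical helper: True only if every threshold holds."""
    return all(
        OPERATORS[t["operator"]](metrics[t["name"]], t["value"])
        for t in thresholds
    )


# Same threshold spec as in the cell above.
thresholds = [{"name": "accuracy", "operator": ">=", "value": 0.5}]

print(passes_thresholds({"accuracy": 0.75}, thresholds))
print(passes_thresholds({"accuracy": 0.42}, thresholds))
```
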
[7]:
M_train = compute_metrics(y_pred_train, y_train)
M_train
[7]:
shape: (1, 16)
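rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules
str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32
"((X["Pclass"] < 3.0) & (X["Far… | 193 | 92 | 423 | 117 | 0.677193 | 0.622581 | 0.746667 | 34.545455 | 17.864078 | 0.673717 | 0.665517 | 0.648739 | 0.638422 | 0.632787 | 2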
[8]:
report = generate_rule_performance_report(ruleset_est._best_ruleset_, X_train, y_train)
report
[8]:
shape: (9, 17)
rule_index | rule | TP | FP | TN | FN | precision | recall | accuracy | flagged(%) | good_flagged(%) | f0.25 | f0.5 | f1 | f1.5 | f2 | num_rules
str | str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | u32
"0" | "((X["Pclass"] < 3.0) & (X["Far… | 193 | 92 | 423 | 117 | 0.677193 | 0.622581 | 0.746667 | 34.545455 | 17.864078 | 0.673717 | 0.665517 | 0.648739 | 0.638422 | 0.632787 | 2
"0.0" | "(X['Pclass'] < 3.0) & (X['Fare… | 174 | 87 | 447 | 152 | 0.666667 | 0.533742 | 0.722093 | 30.348837 | 16.292135 | 0.657041 | 0.635036 | 0.592845 | 0.568627 | 0.555911 | 1
"0.1" | "(X['Fare'] >= 11.1333) & (X['S… | 43 | 6 | 509 | 267 | 0.877551 | 0.13871 | 0.669091 | 5.939394 | 1.165049 | 0.66819 | 0.424901 | 0.239554 | 0.187207 | 0.166796 | 1
"0.0.0" | "(X['Pclass'] < 3.0)" | 223 | 177 | 372 | 119 | 0.5575 | 0.652047 | 0.667789 | 44.893378 | 32.240437 | 0.562296 | 0.57415 | 0.601078 | 0.619709 | 0.630656 | 1
"0.0.1" | "(X['Fare'] >= 13.7917)" | 234 | 224 | 325 | 108 | 0.510917 | 0.684211 | 0.627385 | 51.402918 | 40.801457 | 0.518644 | 0.538178 | 0.585 | 0.619552 | 0.640745 | 1
"0.0.2" | "(X['Age'] < 61.0)" | 285 | 407 | 17 | 5 | 0.41185 | 0.982759 | 0.422969 | 96.918768 | 95.990566 | 0.426421 | 0.465991 | 0.580448 | 0.688918 | 0.769438 | 1
"0.1.0" | "(X['Fare'] >= 11.1333)" | 266 | 261 | 288 | 76 | 0.504744 | 0.777778 | 0.621773 | 59.147026 | 47.540984 | 0.515386 | 0.542857 | 0.612198 | 0.666795 | 0.701847 | 1
"0.1.1" | "(X['SibSp'] < 3.0)" | 335 | 510 | 39 | 7 | 0.39645 | 0.979532 | 0.419753 | 94.837262 | 92.896175 | 0.410835 | 0.450027 | 0.564448 | 0.674357 | 0.756891 | 1
"0.1.2" | "(X['Age'] < 16.0)" | 49 | 34 | 390 | 241 | 0.590361 | 0.168966 | 0.614846 | 11.62465 | 8.018868 | 0.514833 | 0.393891 | 0.262735 | 0.216519 | 0.197104 | 1

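The dotted `rule_index` values in the report above appear to encode a hierarchy: "0" is the full ruleset, "0.0" and "0.1" are its component rules, and the three-part indices are the individual conditions of each rule. That reading is inferred from the table, not from Iguanas documentation. The sketch below groups the indices by parent to make the structure visible.

```python
# rule_index values copied from the report above.
rule_indices = [
    "0", "0.0", "0.1",
    "0.0.0", "0.0.1", "0.0.2",
    "0.1.0", "0.1.1", "0.1.2",
]


def parent(index: str):
    """Return the parent index, or None for the top-level entry."""
    return index.rsplit(".", 1)[0] if "." in index else None


# Group every index under its parent.
children = {}
for idx in rule_indices:
    children.setdefault(parent(idx), []).append(idx)


def show(idx: str, depth: int = 0) -> None:
    """Print the index and its descendants, indented by depth."""
    print("  " * depth + idx)
    for child in children.get(idx, []):
        show(child, depth + 1)


for root in children[None]:
    show(root)
```
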
5. Generate Predictions on Test Data

Apply the best ruleset to the test data and create a submission file:

[9]:
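X_test = pl.read_csv("../../../../../kaggle/titanic/test.csv")
# The ruleset is a string expression over a DataFrame named X.
# Bind X to the test data through eval's namespace rather than rewriting the string,
# which would also alter any column name that happens to contain "X".
y_pred = eval(ruleset_est._best_ruleset_, {"X": X_test})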
[10]:
# Create submission file (Kaggle leaderboard score: 0.612)
pl.DataFrame({"PassengerId": X_test["PassengerId"], "Survived": y_pred}).with_columns(
    pl.col("Survived").cast(pl.Int64)
).write_csv("submission_titanic.csv")
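
Rule strings refer to the data by the name `X`, so evaluating one on new data amounts to binding `X` to that data. The sketch below shows the idea on plain Python dictionaries, one per passenger, standing in for a DataFrame. The rule string is written in the style of the reports above but is not the fitted ruleset, and the passenger values are made up.

```python
# A rule string in the style shown in the reports; it refers to the data as X.
rule = '(X["Pclass"] < 3.0) & (X["Fare"] >= 13.7917)'

# Made-up passengers, one dict per row, standing in for a DataFrame.
passengers = [
    {"Pclass": 1, "Fare": 80.0},
    {"Pclass": 3, "Fare": 80.0},
    {"Pclass": 2, "Fare": 10.0},
]

# Bind X through eval's namespace argument, one row at a time.
flags = [int(eval(rule, {"X": row})) for row in passengers]
print(flags)
```

Because `eval` executes arbitrary Python, this pattern is only appropriate for rule strings produced by your own fitted model, never for untrusted input.
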