Ruleset Classifier

Classes

RulesetClassifier

class iguanas.ruleset_classifier.RulesetClassifier

Bases: pydantic.main.BaseModel, sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

End-to-end rule-based classification pipeline.

The best ruleset is selected through the following steps:

  1. Rule generation: candidate rules are extracted from XGBoost decision trees trained across a sweep of scale_pos_weight values.

  2. Performance filtering: rules that fail any condition in metric_thresholds are discarded.

  3. Correlation filtering: among rules that are correlated above max_corr, only the one with the highest ranking_metric score is kept.

  4. Greedy combination: starting from the single best rule, rules are added one at a time. Each iteration picks the candidate that yields the largest improvement in ranking_metric when combined (via combine_operator) with the already-selected rules. Selection stops when no candidate improves the metric by at least min_improvement, or when max_rules rules have been selected.

The resulting combined rule expression is stored in _best_ruleset_ as a string (e.g. "(rule_A) | (rule_B) | (rule_C)").
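The greedy combination in step 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `greedy_combine` and `f1_score` are hypothetical helpers, each rule is represented only by its per-sample boolean output, and the " & " separator for the "and" case is a guess (the source shows only the " | " form).

```python
from typing import Callable, Dict, List, Sequence, Tuple


def f1_score(pred: Sequence[bool], y: Sequence[bool]) -> float:
    """Plain F1 on boolean predictions (stand-in for ranking_metric)."""
    tp = sum(1 for p, t in zip(pred, y) if p and t)
    fp = sum(1 for p, t in zip(pred, y) if p and not t)
    fn = sum(1 for p, t in zip(pred, y) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_combine(
    rule_preds: Dict[str, List[bool]],
    y: Sequence[bool],
    metric: Callable[[Sequence[bool], Sequence[bool]], float] = f1_score,
    combine_operator: str = "or",
    max_rules: int = 10,
    min_improvement: float = 0.01,
) -> Tuple[List[str], str, float]:
    """Hypothetical sketch of the greedy selection described in step 4."""
    if combine_operator not in ("or", "and"):
        raise ValueError("combine_operator must be 'or' or 'and'")

    def merge(a: Sequence[bool], b: Sequence[bool]) -> List[bool]:
        if combine_operator == "or":
            return [x or z for x, z in zip(a, b)]
        return [x and z for x, z in zip(a, b)]

    # Start from the single best rule.
    first = max(rule_preds, key=lambda name: metric(rule_preds[name], y))
    selected = [first]
    current = list(rule_preds[first])
    current_score = metric(current, y)

    while len(selected) < max_rules:
        best_name, best_preds, best_score = None, None, current_score
        for name, preds in rule_preds.items():
            if name in selected:
                continue
            combined = merge(current, preds)
            score = metric(combined, y)
            if best_name is None or score > best_score:
                best_name, best_preds, best_score = name, combined, score
        # Stop when nothing is left or the best gain is too small.
        if best_name is None or best_score - current_score < min_improvement:
            break
        selected.append(best_name)
        current, current_score = best_preds, best_score

    # Assumption: " & " joins rules for "and"; only " | " appears in the source.
    sep = " | " if combine_operator == "or" else " & "
    expression = sep.join(f"({name})" for name in selected)
    return selected, expression, current_score
```

With three made-up rules where rule_A covers two positives, rule_B covers the third, and rule_C mostly fires on negatives, the sketch selects rule_A, then rule_B, and stops because adding rule_C would lower the score.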

Parameters:
  • estimator (XGBClassifier) – XGBoost classifier used for rule generation.

  • scale_pos_weights (np.ndarray | list[float], default=np.array([1.0])) – Array of scale_pos_weight values swept during rule generation.

  • ranking_metric (str, default="accuracy") – Metric used to rank and select candidate rules. Must be a column produced by compute_metrics (e.g. "f1", "precision", "recall").

  • max_rules (int, default=10) – Maximum number of rules the greedy search may select. Must be > 0.

  • metric_thresholds (list[dict[str, Any]] | None, default=None) – List of threshold dicts used to filter candidate rules. Each dict must have keys "name" (metric column), "operator" (one of ">=", ">", "<=", "<", "==", "!="), and "value" (numeric threshold). All conditions are combined with AND. If None, the default threshold of apply_and_filter_by_performance is used.

  • max_corr (float, default=0.8) – Maximum pairwise correlation allowed between rules; correlated pairs are pruned to keep only the highest-ranked one. Must be in [0, 1].

  • combine_operator (str, default="or") – Boolean operator used to combine selected rules: "or" or "and".

  • min_improvement (float, default=0.01) – Minimum improvement in ranking_metric required to add a new rule to the combined ruleset during greedy selection.
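To make the metric_thresholds format concrete, the sketch below builds a threshold list in the documented shape ("name", "operator", "value") and applies it with AND semantics. The helper `passes_thresholds` and the metric values are hypothetical and made up for illustration; the library's own filtering is done by apply_and_filter_by_performance, whose internals are not shown in this page.

```python
import operator
from typing import Any, Dict, List

# Map the documented operator strings to Python comparison functions.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}


def passes_thresholds(
    metrics: Dict[str, float], metric_thresholds: List[Dict[str, Any]]
) -> bool:
    """Return True only if every threshold condition holds (AND)."""
    return all(
        OPERATORS[t["operator"]](metrics[t["name"]], t["value"])
        for t in metric_thresholds
    )


# Threshold list in the shape described for metric_thresholds.
metric_thresholds = [
    {"name": "precision", "operator": ">=", "value": 0.8},
    {"name": "recall", "operator": ">", "value": 0.1},
]

# Made-up per-rule metrics, for illustration only.
candidate_metrics = {
    "rule_A": {"precision": 0.90, "recall": 0.30},
    "rule_B": {"precision": 0.70, "recall": 0.60},
    "rule_C": {"precision": 0.85, "recall": 0.05},
}

kept = [
    name
    for name, metrics in candidate_metrics.items()
    if passes_thresholds(metrics, metric_thresholds)
]
```

Here only rule_A satisfies both conditions: rule_B fails the precision threshold and rule_C fails the recall threshold.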

fit(X: polars.DataFrame, y: polars.Series) → iguanas.ruleset_classifier.RulesetClassifier

Generate, filter, and select rules from training data.

Parameters:
  • X (pl.DataFrame) – Feature DataFrame. Only numeric columns are used for rule generation.

  • y (pl.Series) – Binary target series.

Returns:

Fitted pipeline instance (self).

Return type:

RulesetClassifier

predict(X: polars.DataFrame) → polars.Series

Predict binary labels for each sample.

A sample is positive if any (OR) or all (AND) selected rules fire, depending on combine_operator.
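The any/all behaviour can be illustrated without the library. In this sketch, `combine_rule_outputs` is a hypothetical helper and each inner list holds the outcomes of the selected rules for one sample; the real method works on a polars DataFrame and returns a polars Series.

```python
from typing import List


def combine_rule_outputs(
    fired_per_sample: List[List[bool]], combine_operator: str = "or"
) -> List[bool]:
    """Reduce per-rule outcomes to one label per sample."""
    if combine_operator == "or":
        reducer = any  # positive if at least one rule fires
    elif combine_operator == "and":
        reducer = all  # positive only if every rule fires
    else:
        raise ValueError("combine_operator must be 'or' or 'and'")
    return [reducer(row) for row in fired_per_sample]


# Three samples, two selected rules each (made-up outcomes).
fired = [[False, False], [True, False], [True, True]]
labels_or = combine_rule_outputs(fired, "or")
labels_and = combine_rule_outputs(fired, "and")
```

The second sample, where only one of the two rules fires, is positive under "or" and negative under "and".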

Parameters:

X (pl.DataFrame) – Feature DataFrame with the same columns seen during fit.

Returns:

Boolean series named “prediction”.

Return type:

pl.Series

predict_proba(X: polars.DataFrame) → polars.Series

Predict rule-coverage probability for each sample.

Probability is a piecewise-linear function of the number of selected rules that fire for each sample:

  • 0 rules fired → 0.0

  • 1 rule fired → 0.5

  • all rules fired → 1.0

  • between 1 and all: linearly interpolated in [0.5, 1.0]
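The mapping above can be written as a small function. This is a sketch, not the library's code: `coverage_probability` is a hypothetical name, and the case of a ruleset with a single rule is an assumption, since the description does not say whether one fired rule out of one maps to 0.5 or 1.0 (the sketch treats it as "all rules fired").

```python
def coverage_probability(n_fired: int, n_rules: int) -> float:
    """Piecewise-linear mapping from fired-rule count to [0.0, 1.0]."""
    if n_rules < 1:
        raise ValueError("n_rules must be at least 1")
    if not 0 <= n_fired <= n_rules:
        raise ValueError("n_fired must be between 0 and n_rules")
    if n_fired == 0:
        return 0.0
    if n_rules == 1:
        # Assumption: with a single selected rule, firing counts as "all fired".
        return 1.0
    # One fired rule gives 0.5; each further rule adds an equal share up to 1.0.
    return 0.5 + 0.5 * (n_fired - 1) / (n_rules - 1)
```

For a ruleset of five rules, three fired rules sit halfway between 0.5 and 1.0, so the sketch returns 0.75.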

Parameters:

X (pl.DataFrame) – Feature DataFrame with the same columns seen during fit.

Returns:

Float64 series named “proba” with values in [0.0, 1.0].

Return type:

pl.Series

fit_predict(X: polars.DataFrame, y: polars.Series) → polars.Series

Fit the pipeline and return binary predictions on the same data.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → iguanas.ruleset_classifier.RulesetClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object