Rule Selection

Functions

extract_feature_names_from_rule

iguanas.rule_selection.extract_feature_names_from_rule(rule: str) → list[str]

Extract column names from a rule string with X["column_name"] patterns.

Parameters: rule (str) – Rule string containing X["column_name"] patterns.
Returns: List of unique column names extracted from the rule, in order of appearance.
Return type: list[str]

Examples

>>> rule = '(X["a"] >= 419) & (X["b"] < 1.0)'
>>> extract_feature_names_from_rule(rule)
['a', 'b']
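
The extraction described above can be approximated with a regular expression. The following is a minimal sketch, not the library's source: the helper name `sketch_extract_feature_names` and the pattern are assumptions made for illustration, and only double-quoted column names are matched.

```python
import re

# Hypothetical pattern: matches X["column_name"] and captures the name.
_COLUMN_PATTERN = re.compile(r'X\["([^"]+)"\]')


def sketch_extract_feature_names(rule: str) -> list[str]:
    """Return unique column names in order of first appearance."""
    # dict.fromkeys preserves insertion order and drops later duplicates.
    return list(dict.fromkeys(_COLUMN_PATTERN.findall(rule)))


rule = '(X["a"] >= 419) & (X["b"] < 1.0) & (X["a"] < 900)'
print(sketch_extract_feature_names(rule))  # ['a', 'b']
```

A column that appears more than once, such as "a" here, is reported only at its first position.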

filter_rules_by_feature_overlap

iguanas.rule_selection.filter_rules_by_feature_overlap(R: polars.DataFrame, importance: dict[str, float], min_difference: Annotated[int, annotated_types.Gt(gt=0)] = 1, rule_column: str = 'rule') → polars.DataFrame

Filter out rules that are too similar based on column usage, keeping the most important.

Uses a greedy algorithm that processes rules sequentially, so similarity is not resolved transitively. For example, if rules A and C are each similar to B but not to each other, B is filtered out and both A and C are kept.

Rules with identical column sets are always considered similar, regardless of the min_difference value, because their maximum one-sided difference is 0.

Parameters:

R (pl.DataFrame) – DataFrame with a column containing rule strings (X["column_name"] patterns).
importance (dict) – Dictionary mapping rule strings to their importance values. Keys are rule strings matching those in R[rule_column]; values are the importance of each rule. Rules missing from the dictionary default to 0.0.
min_difference (PositiveInt, default=1) – Minimum number of different columns required between two rules. If two rules differ by fewer than this many columns, only the one with the higher importance is kept. Must be >= 1.
rule_column (str, default="rule") – Name of the column containing rule strings.

Returns:

Filtered DataFrame with similar rules removed (keeping highest importance).

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> rules_X = pl.DataFrame({
...     'rule': ['(X["a"] > 1) & (X["b"] < 2)',
...              '(X["a"] > 1) & (X["c"] < 3)',
...              '(X["a"] > 1) & (X["b"] < 2)'],
...     'score': [0.9, 0.85, 0.8]
... })
>>> importance = {'(X["a"] > 1) & (X["b"] < 2)': 0.7,
...               '(X["a"] > 1) & (X["c"] < 3)': 0.9}
>>> filter_rules_by_feature_overlap(rules_X, importance, min_difference=1)
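
The greedy filtering described for this function can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the names `columns_of` and `sketch_filter_by_overlap` are hypothetical, rules are visited in descending importance, and two rules count as similar when their maximum one-sided column difference is below min_difference. It works on a plain list of rule strings rather than a polars DataFrame.

```python
import re


def columns_of(rule: str) -> frozenset[str]:
    """Set of column names referenced as X["name"] in a rule string."""
    return frozenset(re.findall(r'X\["([^"]+)"\]', rule))


def sketch_filter_by_overlap(rules, importance, min_difference=1):
    # Assumption: most important rules are visited first; rules missing
    # from `importance` default to 0.0. sorted() is stable, so ties keep
    # their input order.
    ordered = sorted(rules, key=lambda r: importance.get(r, 0.0), reverse=True)
    kept = []
    for rule in ordered:
        cols = columns_of(rule)
        # Similar = max one-sided set difference below min_difference.
        too_similar = any(
            max(len(cols - columns_of(k)), len(columns_of(k) - cols)) < min_difference
            for k in kept
        )
        if not too_similar:
            kept.append(rule)
    return kept


rules = [
    '(X["a"] > 1) & (X["b"] < 2)',
    '(X["a"] > 1) & (X["c"] < 3)',
    '(X["a"] > 1) & (X["b"] < 2)',
]
importance = {rules[0]: 0.7, rules[1]: 0.9}
print(sketch_filter_by_overlap(rules, importance, min_difference=1))
print(sketch_filter_by_overlap(rules, importance, min_difference=2))
```

Under these assumptions, min_difference=1 drops only the duplicate {a, b} rule, while min_difference=2 also drops the remaining {a, b} rule because it differs from the more important {a, c} rule by a single column on each side.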

filter_correlated_rules

iguanas.rule_selection.filter_correlated_rules(R: polars.DataFrame, importance: dict, max_corr: float = 0.95, use_abs: bool = True) → list[str]

Filter highly correlated columns, keeping only the most important.

Accepts either a boolean predictions DataFrame (correlation is computed internally) or a pre-computed float correlation matrix. For each pair of columns with correlation above max_corr threshold, keeps only the column with higher importance value.

Parameters:

R (pl.DataFrame) – Either a boolean DataFrame of rule predictions (one column per rule, one row per sample) or a pre-computed n×n float correlation matrix. When boolean, Pearson correlations are computed automatically.
importance (dict) – Dictionary mapping rule names (column names) to their importance values.
max_corr (float, default=0.95) – Maximum correlation threshold. Pairs with correlation above this value will be filtered to keep only the most important rule.
use_abs (bool, default=True) – If True, compares the absolute value of the correlation against max_corr, treating strong negative correlations (e.g. -0.97) the same as strong positive ones. If False, only positive correlations above max_corr trigger filtering.

Returns:

List of selected columns to keep.

Return type:

list[str]

Raises:

ValueError – If length of importance dict doesn’t match number of columns in R.

Examples

>>> import polars as pl
>>> R = pl.DataFrame({
...     "rule_A": [True, False, True, False],
...     "rule_B": [True, False, True, False],  # identical to rule_A
...     "rule_C": [False, True, False, True],  # exact negation of rule_A
... })
>>> importance = {"rule_A": 0.8, "rule_B": 0.6, "rule_C": 0.9}
>>> filter_correlated_rules(R, importance, max_corr=0.9, use_abs=False)
['rule_A', 'rule_C']
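
The effect of max_corr and use_abs can be illustrated with a pure-Python sketch. It is not the library's code: `pearson` and `sketch_filter_correlated` are hypothetical helpers, predictions are passed as a dict of boolean lists instead of a polars DataFrame, more important rules are assumed to be visited first, and constant columns are assumed to have correlation 0.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric/boolean sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / sqrt(vx * vy)


def sketch_filter_correlated(preds, importance, max_corr=0.95, use_abs=True):
    if len(importance) != len(preds):
        raise ValueError("importance must have one entry per column")
    # Assumption: visit rules from most to least important, so the less
    # important member of each correlated pair is the one dropped.
    ordered = sorted(preds, key=lambda name: importance[name], reverse=True)
    dropped = set()
    for i, keeper in enumerate(ordered):
        if keeper in dropped:
            continue
        for other in ordered[i + 1:]:
            if other in dropped:
                continue
            corr = pearson(preds[keeper], preds[other])
            if (abs(corr) if use_abs else corr) > max_corr:
                dropped.add(other)
    # Report survivors in their original column order.
    return [name for name in preds if name not in dropped]


preds = {
    "r1": [True, False, True, False],
    "r2": [True, False, True, False],   # identical to r1
    "r3": [True, True, False, False],   # uncorrelated with r1
    "r4": [False, False, True, True],   # exact negation of r3
}
importance = {"r1": 0.8, "r2": 0.6, "r3": 0.9, "r4": 0.5}
print(sketch_filter_correlated(preds, importance, max_corr=0.9))
print(sketch_filter_correlated(preds, importance, max_corr=0.9, use_abs=False))
```

In this sketch r2 is always dropped in favour of r1. The negated rule r4 is dropped only when use_abs is True, because its correlation with r3 is -1.0.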

select_best_rule_per_column_combination

iguanas.rule_selection.select_best_rule_per_column_combination(metrics: polars.DataFrame, ranking_metric: str = 'precision') → list[str]

Select the rule with the highest metric score for each unique column combination.

Parameters:

metrics (pl.DataFrame) – DataFrame containing rule performance metrics. Must have a "rule" column and the metric column named in ranking_metric.
ranking_metric (str, default="precision") – Name of the metric column to use for selecting the best rule in each group.

Returns:

List of rule strings containing only the best rule for each column combination.

Return type:

list[str]

Examples

>>> import polars as pl
>>> metrics = pl.DataFrame({
...     "rule": ['(X["a"] > 1)', '(X["a"] > 2)', '(X["b"] < 3)'],
...     "precision": [0.95, 0.98, 0.96]
... })
>>> select_best_rule_per_column_combination(metrics, ranking_metric="precision")
# Returns the rule with highest precision for column "a" and the rule for column "b"
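
The grouping logic can be sketched without polars. The helper `sketch_select_best` below is hypothetical and rests on assumptions not stated in the documentation: a column combination is treated as an unordered set of column names, ties keep the first rule seen, and results follow the order in which each combination first appears.

```python
import re


def sketch_select_best(rows, ranking_metric="precision"):
    """Pick the highest-scoring rule for each set of referenced columns."""
    best = {}  # column combination -> best row seen so far
    for row in rows:
        key = frozenset(re.findall(r'X\["([^"]+)"\]', row["rule"]))
        # Strict '>' means the first rule wins a tie (assumption).
        if key not in best or row[ranking_metric] > best[key][ranking_metric]:
            best[key] = row
    return [row["rule"] for row in best.values()]


rows = [
    {"rule": '(X["a"] > 1)', "precision": 0.95},
    {"rule": '(X["a"] > 2)', "precision": 0.98},
    {"rule": '(X["b"] < 3)', "precision": 0.96},
]
print(sketch_select_best(rows))  # ['(X["a"] > 2)', '(X["b"] < 3)']
```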