Rule Selection
Functions
extract_feature_names_from_rule
- iguanas.rule_selection.extract_feature_names_from_rule(rule: str) -> list[str]
Extract column names from a rule string with X["column_name"] patterns.
- Parameters:
rule (str) – Rule string containing X["column_name"] patterns.
- Returns:
List of unique column names extracted from the rule, in order of appearance.
- Return type:
list[str]
Examples
>>> rule = '(X["a"] >= 419) & (X["b"] < 1.0)'
>>> extract_feature_names_from_rule(rule)
['a', 'b']
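The extraction described above can be approximated with a regular expression. This is a minimal sketch, not the library's implementation: the name extract_names_sketch and the pattern are assumptions made for illustration, and only double-quoted X["..."] references are handled.

```python
import re

# Assumed pattern: captures the text between X[" and "].
FEATURE_PATTERN = re.compile(r'X\["([^"]+)"\]')


def extract_names_sketch(rule: str) -> list[str]:
    """Return the unique column names in `rule`, in order of first appearance."""
    names = FEATURE_PATTERN.findall(rule)
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(names))


rule = '(X["a"] >= 419) & (X["b"] < 1.0) & (X["a"] < 900)'
print(extract_names_sketch(rule))
```

The repeated reference to column "a" appears only once in the result, which matches the documented "unique, in order of appearance" behaviour.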
filter_rules_by_feature_overlap
- iguanas.rule_selection.filter_rules_by_feature_overlap(R: polars.DataFrame, importance: dict[str, float], min_difference: Annotated[int, annotated_types.Gt(gt=0)] = 1, rule_column: str = 'rule') -> polars.DataFrame
Filter out rules that are too similar based on column usage, keeping the most important.
Uses a greedy algorithm that processes rules sequentially. As a result, transitively similar rules can both be kept: if A is similar to B and B is filtered out, a rule C that is similar to B but not to A is kept alongside A.
Rules with identical column sets are always considered similar regardless of the min_difference value, because their maximum one-sided difference is 0.
- Parameters:
R (pl.DataFrame) – DataFrame with a column containing rule strings (X["column_name"] patterns).
importance (dict) – Dictionary mapping rule strings to their importance values. Keys are rule strings matching those in R[rule_column]; values are the importance of each rule. Rules missing from the dictionary default to 0.0.
min_difference (PositiveInt, default=1) – Minimum number of different columns required between two rules. If two rules differ by fewer than this many columns, only the one with highest importance is kept. Must be >= 1.
rule_column (str, default="rule") – Name of the column containing rule strings.
- Returns:
Filtered DataFrame with similar rules removed (keeping highest importance).
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> rules_X = pl.DataFrame({
...     'rule': ['(X["a"] > 1) & (X["b"] < 2)',
...              '(X["a"] > 1) & (X["c"] < 3)',
...              '(X["a"] > 1) & (X["b"] < 2)'],
...     'score': [0.9, 0.85, 0.8]
... })
>>> importance = {'(X["a"] > 1) & (X["b"] < 2)': 0.7,
...               '(X["a"] > 1) & (X["c"] < 3)': 0.9}
>>> filter_rules_by_feature_overlap(rules_X, importance, min_difference=1)
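The greedy behaviour and the transitive case noted in the description can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the library's code: the helper names are hypothetical, rules are assumed to be visited in descending order of importance, and two rules are assumed to be "similar" when the larger of their two one-sided column differences is below min_difference.

```python
import re


def columns_of(rule: str) -> frozenset[str]:
    """Hypothetical helper: the set of columns referenced as X["name"]."""
    return frozenset(re.findall(r'X\["([^"]+)"\]', rule))


def filter_by_overlap_sketch(rules, importance, min_difference=1):
    """Greedy filter: keep a rule only if it differs enough from every kept rule."""
    # Assumption: most important rules are considered first; missing keys count as 0.0.
    ranked = sorted(rules, key=lambda r: importance.get(r, 0.0), reverse=True)
    kept = []
    for rule in ranked:
        cols = columns_of(rule)
        too_similar = any(
            # Max one-sided difference between the two column sets.
            max(len(cols - columns_of(k)), len(columns_of(k) - cols)) < min_difference
            for k in kept
        )
        if not too_similar:
            kept.append(rule)
    return kept


A = '(X["a"] > 1) & (X["b"] < 2)'
B = '(X["a"] > 1) & (X["b"] < 2) & (X["c"] > 0)'
C = '(X["a"] > 1) & (X["b"] < 2) & (X["c"] > 0) & (X["d"] > 5)'
importance = {A: 0.9, B: 0.8, C: 0.7}

# B is dropped (one column away from A); C is two columns away from A, so it stays,
# even though it is only one column away from the dropped rule B.
print(filter_by_overlap_sketch([A, B, C], importance, min_difference=2))
```

Because each candidate is compared only against rules already kept, C is never compared with B, which is how A and C both survive.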
select_best_rule_per_column_combination
- iguanas.rule_selection.select_best_rule_per_column_combination(metrics: polars.DataFrame, ranking_metric: str = 'precision') -> list[str]
Select the rule with the highest metric score for each unique column combination.
- Parameters:
metrics (pl.DataFrame) – DataFrame containing rule performance metrics. Must have a "rule" column and the metric specified in ranking_metric.
ranking_metric (str, default="precision") – Name of the metric column to use for selecting the best rule in each group.
- Returns:
Filtered rules with only the best rule for each column combination.
- Return type:
list[str]
Examples
>>> import polars as pl
>>> metrics = pl.DataFrame({
...     "rule": ['(X["a"] > 1)', '(X["a"] > 2)', '(X["b"] < 3)'],
...     "precision": [0.95, 0.98, 0.96]
... })
>>> select_best_rule_per_column_combination(metrics, ranking_metric="precision")
# Returns the rule with highest precision for column "a" and the rule for column "b"
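The grouping step can be sketched without polars by keying each rule on the set of columns it references. The function name, the (rule, score) input shape, the tie-breaking (first seen wins), and the output order (first appearance of each column combination) are all assumptions for illustration; the source does not specify them.

```python
import re


def best_per_combination_sketch(scored_rules: list[tuple[str, float]]) -> list[str]:
    """Keep the highest-scoring rule for each distinct set of referenced columns."""
    best: dict[frozenset[str], tuple[str, float]] = {}
    for rule, score in scored_rules:
        # Rules that use the same columns share a key, whatever their thresholds.
        key = frozenset(re.findall(r'X\["([^"]+)"\]', rule))
        if key not in best or score > best[key][1]:
            best[key] = (rule, score)
    return [rule for rule, _ in best.values()]


scored = [
    ('(X["a"] > 1)', 0.95),
    ('(X["a"] > 2)', 0.98),
    ('(X["b"] < 3)', 0.96),
]
print(best_per_combination_sketch(scored))
```

With the same data as the example above, the two rules on column "a" compete and the one with precision 0.98 wins, while the single rule on column "b" is kept as is.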