Rule Combination
Functions
combine_rules_full_search
- iguanas.rule_combination.combine_rules_full_search(R: polars.DataFrame, n: int = 3, max_combinations_per_n: int = 200000, batch_size: int = 50000, operator: str = 'or') → polars.DataFrame
Combine rules using logical operations to create new composite rules.
Generates all possible combinations of 2 to n rules and creates new columns where each combination is evaluated using the specified logical operation (OR/AND). The combined rule name reflects the operation between component rules.
Optimized for speed using batch processing and vectorized operations.
- Parameters:
R (pl.DataFrame) – DataFrame containing rule columns to be combined. Each column should represent a boolean or binary rule evaluation. All columns will be used as candidate rules.
n (int, default=3) – Maximum number of rules to combine. Generates all combinations from size 2 up to size n.
max_combinations_per_n (int, default=200_000) – Maximum number of combinations to generate per combination size. If exceeded, only the first max_combinations_per_n are used.
batch_size (int, default=50_000) – Number of combinations to process in each batch to manage memory.
operator (str, default='or') – Boolean operator to apply: ‘or’ for OR operations (any True), ‘and’ for AND operations (all True).
- Returns:
DataFrame containing the original rules plus all generated combined rules. Combined rule columns are named using the pattern:
“(rule1) | (rule2) | …” for OR operations
“(rule1) & (rule2) & …” for AND operations
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from iguanas.rule_combination import combine_rules_full_search
>>> R = pl.DataFrame({"rule_A": [1, 0, 1], "rule_B": [0, 1, 1]})
>>> combine_rules_full_search(R, n=2, operator='or')
# Returns DataFrame with original columns plus "(rule_A) | (rule_B)"
>>> combine_rules_full_search(R, n=2, operator='and')
# Returns DataFrame with original columns plus "(rule_A) & (rule_B)"
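The enumeration and naming scheme described above can be sketched without the library. This is a minimal, standard-library illustration, not the iguanas implementation: the function name `combine_full_search` and the dict-of-lists input are assumptions made for the example, and batching is omitted.

```python
from itertools import combinations, islice


def combine_full_search(rules, n=3, max_combinations_per_n=200_000, operator="or"):
    """Hypothetical sketch. rules maps rule name -> list of 0/1 or bool values."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    reduce_row = any if operator == "or" else all
    sep = " | " if operator == "or" else " & "
    names = list(rules)
    # Start from the original rules, normalized to booleans.
    out = {name: [bool(v) for v in rules[name]] for name in names}
    for size in range(2, n + 1):
        # Keep only the first max_combinations_per_n combinations of this size.
        for combo in islice(combinations(names, size), max_combinations_per_n):
            label = sep.join(f"({name})" for name in combo)
            rows = zip(*(out[name] for name in combo))
            out[label] = [reduce_row(row) for row in rows]
    return out


R = {"rule_A": [1, 0, 1], "rule_B": [0, 1, 1]}
or_result = combine_full_search(R, n=2, operator="or")
and_result = combine_full_search(R, n=2, operator="and")
print(or_result["(rule_A) | (rule_B)"])   # [True, True, True]
print(and_result["(rule_A) & (rule_B)"])  # [False, False, True]
```

The number of combinations of size k grows as C(m, k) for m candidate rules, which is why a per-size cap such as max_combinations_per_n matters for wide rule sets.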
combine_rules_cumulative
- iguanas.rule_combination.combine_rules_cumulative(R: polars.DataFrame, output_names: list[str] | None = None, operator: str = 'or') → polars.DataFrame
Compute horizontal cumulative boolean operations across all columns.
- Parameters:
R (pl.DataFrame) – Input DataFrame. All columns will be used in the cumulative operation.
output_names (list[str] | None, default=None) – List of names for the output columns. If None, generates names based on operator. Must have the same length as R.columns.
operator (str, default='or') – Boolean operator to apply:
‘or’: cumulative OR (any True)
‘and’: cumulative AND (all True)
- Returns:
DataFrame with boolean values:
If operator=‘or’: True if at least one condition is True up to that position
If operator=‘and’: True if all conditions are True up to that position
- Return type:
pl.DataFrame
- Raises:
ValueError – If operator is not ‘or’ or ‘and’, or if the length of output_names does not match the number of columns in R.
Examples
>>> import polars as pl
>>> from iguanas.rule_combination import combine_rules_cumulative
>>> R = pl.DataFrame({
...     "rule_A": [True, False, True],
...     "rule_B": [False, True, True],
...     "rule_C": [True, True, False],
... })
>>> combine_rules_cumulative(R, operator="or")
# Column 1 holds rule_A; column 2 holds rule_A | rule_B; column 3 holds the OR of all three
>>> combine_rules_cumulative(R, operator="and", output_names=["step1", "step2", "step3"])
# Named columns, each True only if all rules up to that position are True
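To make the "up to that position" semantics concrete, here is a small standard-library sketch that scans each row from left to right. It is an illustration only: `cumulative_combine` and the default output names (`cum_or_1`, ...) are hypothetical, since the naming the library generates is not specified here.

```python
from itertools import accumulate
from operator import and_, or_


def cumulative_combine(rules, operator="or", output_names=None):
    """Hypothetical sketch. rules maps rule name -> list of bools."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    names = list(rules)
    if output_names is None:
        # Assumed naming scheme, for illustration only.
        output_names = [f"cum_{operator}_{i + 1}" for i in range(len(names))]
    if len(output_names) != len(names):
        raise ValueError("output_names must have one entry per input column")
    op = or_ if operator == "or" else and_
    # For every row, position k holds the running OR/AND of columns 1..k.
    scanned = [list(accumulate(row, op)) for row in zip(*(rules[n] for n in names))]
    return {out: [row[i] for row in scanned] for i, out in enumerate(output_names)}


R = {
    "rule_A": [True, False, True],
    "rule_B": [False, True, True],
    "rule_C": [True, True, False],
}
or_cols = cumulative_combine(R, operator="or")
and_cols = cumulative_combine(R, operator="and", output_names=["step1", "step2", "step3"])
print(or_cols["cum_or_2"])  # [True, True, True]
print(and_cols["step2"])    # [False, False, True]
```

Because the scan runs in column order, reordering the columns of R changes every intermediate output column, though not the final one.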
combine_rules_greedy
- iguanas.rule_combination.combine_rules_greedy(R: polars.DataFrame, y: polars.Series, metric: str = 'f1', max_rules: int = 5, operator: str = 'or', weights: polars.Series | None = None, min_improvement: float = 0.0) → polars.DataFrame
Greedily select rules that maximize a performance metric.
Starts with the best single rule, then iteratively adds rules that provide the largest metric improvement. Stops when no rule improves the metric by at least min_improvement or when max_rules is reached.
- Parameters:
R (pl.DataFrame) – DataFrame containing boolean rule columns. All columns will be used as candidate rules.
y (pl.Series) – Boolean target series indicating true labels.
metric (str, default="f1") – Performance metric to optimize. Must be a column name produced by compute_metrics (e.g., “f1”, “accuracy”, “precision”, “recall”).
max_rules (int, default=5) – Maximum number of rules to select.
operator (str, default="or") – Boolean operator for combining rules: ‘or’ or ‘and’.
weights (pl.Series | None, default=None) – Optional sample weights for weighted metric computation.
min_improvement (float, default=0.0) – Minimum metric improvement required to add a new rule.
- Returns:
DataFrame with a single column containing the combined rule. The column name lists the selected rules joined by the operator.
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from iguanas.rule_combination import combine_rules_greedy
>>> R = pl.DataFrame({"rule_A": [True, False, True],
...                   "rule_B": [False, True, True],
...                   "rule_C": [True, True, False]})
>>> y = pl.Series([True, True, False])
>>> result_R = combine_rules_greedy(
...     R, y, metric="f1", max_rules=2
... )
>>> print(result_R.columns)
# e.g., ['(rule_B) | (rule_A)']
- Raises:
ValueError – If operator is not ‘or’ or ‘and’, or if metric column not found.
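The greedy loop described for combine_rules_greedy can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `greedy_combine` and `f1_score` are hypothetical helpers, only unweighted F1 is supported, and a rule is accepted only when its gain is strictly positive and at least min_improvement (the exact tie handling in iguanas is not documented here).

```python
def f1_score(pred, y):
    """Unweighted F1 from two equal-length boolean sequences."""
    tp = sum(p and t for p, t in zip(pred, y))
    fp = sum(p and not t for p, t in zip(pred, y))
    fn = sum(t and not p for p, t in zip(pred, y))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_combine(rules, y, max_rules=5, operator="or", min_improvement=0.0):
    """Hypothetical sketch. rules maps rule name -> list of bools."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    sep = " | " if operator == "or" else " & "
    selected, combined, best = [], None, 0.0
    remaining = list(rules)
    while remaining and len(selected) < max_rules:
        candidates = []
        for name in remaining:
            col = [bool(v) for v in rules[name]]
            if combined is not None:
                pairs = zip(combined, col)
                col = [(a or b) if operator == "or" else (a and b) for a, b in pairs]
            candidates.append((f1_score(col, y), name, col))
        score, name, col = max(candidates, key=lambda c: c[0])
        gain = score - best
        # The first rule is always taken; later rules must improve the metric.
        if selected and (gain <= 0 or gain < min_improvement):
            break
        selected.append(name)
        remaining.remove(name)
        combined, best = col, score
    return sep.join(f"({name})" for name in selected), best, combined


y = [True, True, True, False, False]
rules = {
    "rule_A": [True, True, False, False, False],
    "rule_B": [False, False, True, False, False],
    "rule_C": [True, True, True, True, True],
}
label, score, column = greedy_combine(rules, y, max_rules=3)
print(label, score)  # (rule_A) | (rule_B) 1.0
```

In this toy data rule_A is picked first (F1 0.8), rule_B lifts the combination to 1.0, and rule_C is rejected because it would add two false positives.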
combine_rules_beam_search
- iguanas.rule_combination.combine_rules_beam_search(R: polars.DataFrame, y: polars.Series, metric: str = 'f1', beam_width: int = 4, max_rules: int = 5, operator: str = 'or', weights: polars.Series | None = None, min_improvement: float = 0.0, return_top_k: int = 10) → polars.DataFrame
Find top rule combinations using beam search.
Maintains beam_width best partial combinations at each depth level, exploring a broader set of combinations than greedy search while remaining more efficient than exhaustive search.
- Parameters:
R (pl.DataFrame) – DataFrame containing boolean rule columns. All columns will be used as candidate rules.
y (pl.Series) – Boolean target series indicating true labels.
metric (str, default="f1") – Performance metric to optimize. Must be a column name produced by compute_metrics (e.g., “accuracy”, “f1”, “precision”, “recall”).
beam_width (int, default=4) – Number of best candidates to keep at each depth level.
max_rules (int, default=5) – Maximum number of rules in a combination.
operator (str, default="or") – Boolean operator for combining rules: ‘or’ or ‘and’.
weights (pl.Series | None, default=None) – Optional sample weights for weighted metric computation.
min_improvement (float, default=0.0) – Minimum metric improvement required over parent combination to add a new rule. Acts as a pruning criterion to avoid expanding combinations that don’t provide sufficient benefit.
return_top_k (int, default=10) – Number of top combinations to return.
- Returns:
DataFrame containing columns for the top rule combinations found. Each column represents one combination, with the column name showing the combined rule expression.
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from iguanas.rule_combination import combine_rules_beam_search
>>> R = pl.DataFrame({"rule_A": [True, False, True],
...                   "rule_B": [False, True, True],
...                   "rule_C": [True, True, False]})
>>> y = pl.Series([True, True, False])
>>> result_R = combine_rules_beam_search(
...     R, y, metric="f1", beam_width=3, max_rules=2
... )
>>> print(result_R.columns)
# Shows top rule combinations
- Raises:
ValueError – If operator is not ‘or’ or ‘and’, or if metric column not found.
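As a rough picture of how beam search differs from the greedy approach, the sketch below keeps the beam_width best combinations at each depth and expands only those. It is a plain-Python illustration with hypothetical names (`beam_search_combine`, `f1_score`), unweighted F1 only, and an assumed policy of ranking every evaluated, unpruned combination for the final top-k; the library may differ on all of these points.

```python
def f1_score(pred, y):
    """Unweighted F1 from two equal-length boolean sequences."""
    tp = sum(p and t for p, t in zip(pred, y))
    fp = sum(p and not t for p, t in zip(pred, y))
    fn = sum(t and not p for p, t in zip(pred, y))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def beam_search_combine(rules, y, beam_width=4, max_rules=5, operator="or",
                        min_improvement=0.0, return_top_k=10):
    """Hypothetical sketch. rules maps rule name -> list of bools."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    sep = " | " if operator == "or" else " & "
    cols = {name: [bool(v) for v in col] for name, col in rules.items()}
    # Depth 1: every single rule is a candidate, keyed by its set of rule names.
    scored = {frozenset([n]): (f1_score(c, y), (n,), c) for n, c in cols.items()}
    beam = sorted(scored.values(), key=lambda e: e[0], reverse=True)[:beam_width]
    for _ in range(max_rules - 1):
        children = {}
        for parent_score, combo, col in beam:
            for name in cols:
                key = frozenset(combo + (name,))
                if name in combo or key in scored or key in children:
                    continue
                pairs = zip(col, cols[name])
                new_col = [(a or b) if operator == "or" else (a and b) for a, b in pairs]
                score = f1_score(new_col, y)
                if score - parent_score < min_improvement:
                    continue  # prune: not enough gain over the parent
                children[key] = (score, combo + (name,), new_col)
        if not children:
            break
        scored.update(children)
        # Only the best beam_width children are expanded at the next depth.
        beam = sorted(children.values(), key=lambda e: e[0], reverse=True)[:beam_width]
    ranked = sorted(scored.values(), key=lambda e: e[0], reverse=True)[:return_top_k]
    return [(sep.join(f"({n})" for n in combo), score) for score, combo, _ in ranked]


y = [True, True, True, False, False]
rules = {
    "rule_A": [True, True, False, False, False],
    "rule_B": [False, False, True, False, False],
    "rule_C": [True, True, True, True, True],
}
top = beam_search_combine(rules, y, beam_width=2, max_rules=2, return_top_k=3)
print(top[0])  # ('(rule_A) | (rule_B)', 1.0)
```

With beam_width=1 the expansion step reduces to the greedy search; larger widths trade extra metric evaluations for a lower chance of discarding a combination that only pays off at a later depth.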
combine_rules_a_star
- iguanas.rule_combination.combine_rules_a_star(R: polars.DataFrame, y: polars.Series, metric: str = 'f1', max_rules: int = 5, operator: str = 'or', weights: polars.Series | None = None, min_improvement: float = 0.0, return_top_k: int = 10) → polars.DataFrame
Find top rule combinations using A* search algorithm.
Uses A* to efficiently explore the space of rule combinations, finding optimal or near-optimal combinations by balancing actual performance (g) with estimated potential (h). More thorough than greedy or beam search when finding the globally best combination is important.
- A* Cost Function:
g(n): Negative metric value (better metrics = lower cost)
h(n): Optimistic estimate of best possible improvement from remaining rules
f(n): g(n) + h(n) (total estimated cost)
- Parameters:
R (pl.DataFrame) – DataFrame containing boolean rule columns. All columns will be used as candidate rules.
y (pl.Series) – Boolean target series indicating true labels.
metric (str, default="f1") – Performance metric to optimize. Must be a column name produced by compute_metrics (e.g., “f1”, “accuracy”, “precision”, “recall”).
max_rules (int, default=5) – Maximum number of rules in a combination.
operator (str, default="or") – Boolean operator for combining rules: ‘or’ or ‘and’.
weights (pl.Series | None, default=None) – Optional sample weights for weighted metric computation.
min_improvement (float, default=0.0) – Minimum metric improvement required over parent combination to expand a node. Acts as a pruning criterion.
return_top_k (int, default=10) – Number of top combinations to return. Set to 1 for single best.
- Returns:
DataFrame containing columns for the top rule combinations found. Each column represents one combination, with the column name showing the combined rule expression. Ordered by metric value (best first).
- Return type:
pl.DataFrame
Examples
>>> import polars as pl
>>> from iguanas.rule_combination import combine_rules_a_star
>>> R = pl.DataFrame({"rule_A": [True, False, True],
...                   "rule_B": [False, True, True],
...                   "rule_C": [True, True, False]})
>>> y = pl.Series([True, True, False])
>>> # Find single best combination
>>> best = combine_rules_a_star(R, y, metric="f1", return_top_k=1)
>>> # Find top 5 combinations
>>> top_5 = combine_rules_a_star(R, y, metric="f1", return_top_k=5)
- Raises:
ValueError – If operator is not ‘or’ or ‘and’, or if metric column not found.
Notes
A* is guaranteed to find the optimal solution if the heuristic is admissible (never overestimates the true cost). The heuristic used here estimates the best possible improvement from remaining rules, which is optimistic and thus admissible.
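One way to picture an admissible heuristic in this setting is the best-first sketch below. It is not the heuristic iguanas uses, which is not spelled out here. It assumes OR combination and unweighted F1, and bounds a node optimistically by pretending the remaining rules recover every positive they cover without adding any false positive. Since F1 rises with true positives and falls with false positives, no descendant can exceed that bound. All names (`a_star_combine`, `optimistic_f1`) are hypothetical, and min_improvement pruning is left out.

```python
import heapq
from itertools import count


def a_star_combine(rules, y, max_rules=5, return_top_k=1):
    """Hypothetical sketch for operator='or' and F1 only."""
    names = list(rules)
    cols = [[bool(v) for v in rules[name]] for name in names]
    y = [bool(t) for t in y]
    positives = sum(y)

    def f1(tp, fp):
        denom = tp + fp + positives  # same as 2*TP + FP + FN
        return 2 * tp / denom if denom else 0.0

    def tp_fp(col):
        tp = sum(p and t for p, t in zip(col, y))
        return tp, sum(col) - tp

    def optimistic_f1(col, next_idx):
        # Upper bound: later rules add all positives they cover, no false positives.
        union = list(col)
        for other in cols[next_idx:]:
            union = [a or b for a, b in zip(union, other)]
        return f1(tp_fp(union)[0], tp_fp(col)[1])

    tie = count()  # unique tie-breaker so the heap never compares columns
    frontier = []
    for i, col in enumerate(cols):
        heapq.heappush(frontier, (-optimistic_f1(col, i + 1), next(tie), (i,), col))

    found = []  # (actual score, combination as index tuple)
    while frontier:
        neg_bound, _, combo, col = heapq.heappop(frontier)
        best = heapq.nlargest(return_top_k, found)
        if len(best) == return_top_k and best[-1][0] >= -neg_bound:
            break  # nothing left in the frontier can beat the current top-k
        found.append((f1(*tp_fp(col)), combo))
        if len(combo) < max_rules:
            # Only add higher-indexed rules so each subset is visited once.
            for j in range(combo[-1] + 1, len(cols)):
                child = [a or b for a, b in zip(col, cols[j])]
                entry = (-optimistic_f1(child, j + 1), next(tie), combo + (j,), child)
                heapq.heappush(frontier, entry)

    ranked = heapq.nlargest(return_top_k, found)
    return [(" | ".join(f"({names[i]})" for i in combo), score) for score, combo in ranked]


y = [True, True, True, False, False]
rules = {
    "rule_A": [True, True, False, False, False],
    "rule_B": [False, False, True, False, False],
    "rule_C": [True, True, True, True, True],
}
best = a_star_combine(rules, y, return_top_k=1)
print(best)  # [('(rule_A) | (rule_B)', 1.0)]
```

In the cost terms used above, the heap priority plays the role of f(n): it is the negated optimistic F1, so the most promising node is always expanded first and the search can stop as soon as the best bound left in the frontier no longer exceeds the scores already found.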