Rule Generation

Functions

extract_rule_by_max_gain

iguanas.rule_generation.extract_rule_by_max_gain(tree_X: pandas.DataFrame) → str

Extract the rule path to the leaf with maximum gain using a bottom-to-top approach.

Finds the leaf node with the highest gain value and traces back to the root node, rebuilding the condition at each parent split along the way.

Parameters:

tree_X (pd.DataFrame) – Output from estimator._Booster.trees_to_dataframe() filtered for a single tree. Required columns: Tree, Node, ID, Feature, Split, Yes, No, Missing, Gain, Cover.

Returns:

Rule string in the format (X["feat1"] >= Split1) & (X["feat2"] < Split2). Returns an empty string if the tree is empty or has no valid leaves.

Return type:

str
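The bottom-to-top traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: plain dictionaries stand in for the rows of trees_to_dataframe() so the sketch has no dependencies, rule_by_max_gain is a hypothetical name, leaves are assumed to be marked with Feature == "Leaf" and to carry their value in the Gain column, conditions are assumed to be joined root-first, and the Missing branch is ignored.

```python
def rule_by_max_gain(rows):
    """rows: list of dicts with keys ID, Feature, Split, Yes, No, Gain."""
    leaves = [r for r in rows if r["Feature"] == "Leaf"]
    if not leaves:
        return ""

    # Map every child ID to its parent row and the comparison that reaches it.
    parent_of = {}
    for r in rows:
        if r["Feature"] != "Leaf":
            parent_of[r["Yes"]] = (r, "<")    # "Yes" branch: feature < split
            parent_of[r["No"]] = (r, ">=")    # "No" branch: feature >= split

    # Start at the best leaf and climb until a node has no parent (the root).
    node_id = max(leaves, key=lambda r: r["Gain"])["ID"]
    conditions = []
    while node_id in parent_of:
        parent, op = parent_of[node_id]
        conditions.append(f'(X["{parent["Feature"]}"] {op} {parent["Split"]})')
        node_id = parent["ID"]

    # Conditions were collected leaf-first; reverse them to read root-first.
    return " & ".join(reversed(conditions))


# Hypothetical two-split tree used only for illustration.
rows = [
    {"ID": "0-0", "Feature": "amount", "Split": 100.0, "Yes": "0-1", "No": "0-2", "Gain": 12.5},
    {"ID": "0-1", "Feature": "Leaf", "Split": None, "Yes": None, "No": None, "Gain": -0.4},
    {"ID": "0-2", "Feature": "age", "Split": 30.0, "Yes": "0-3", "No": "0-4", "Gain": 5.0},
    {"ID": "0-3", "Feature": "Leaf", "Split": None, "Yes": None, "No": None, "Gain": 0.9},
    {"ID": "0-4", "Feature": "Leaf", "Split": None, "Yes": None, "No": None, "Gain": 0.1},
]
print(rule_by_max_gain(rows))  # (X["amount"] >= 100.0) & (X["age"] < 30.0)
```

Building the child-to-parent map once keeps the climb linear in the depth of the tree instead of rescanning the table at every step.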

extract_rule_with_monotone_constraints

iguanas.rule_generation.extract_rule_with_monotone_constraints(tree_X: pandas.DataFrame, monotone_constraints: dict[str, int]) → str

Extract the rule path that follows the monotone constraints using a top-to-bottom approach.

Starts at the root and, at each split, follows the branch dictated by that feature's monotone constraint. Note: this is only applicable if every feature has a monotone constraint of -1 or +1. A feature with constraint 0, or with no constraint defined, raises a ValueError.

Parameters:
  • tree_X (pd.DataFrame) – Output from estimator._Booster.trees_to_dataframe() filtered for a single tree. Required columns: Tree, Node, ID, Feature, Split, Yes, No, Missing.

  • monotone_constraints (dict[str, int]) –

    Dictionary mapping feature names to constraint values:

    • +1 (positive): follow “No” branch (feature >= threshold)

    • -1 (negative): follow “Yes” branch (feature < threshold)

    • 0 (none): raises ValueError - not supported

Returns:

Rule string in the format (X["feat1"] >= Split1) & (X["feat2"] < Split2). Returns an empty string if the tree is empty or starts with a leaf.

Return type:

str

Raises:

ValueError – If a feature has no constraint defined or has constraint 0.
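The top-to-bottom walk can be sketched in the same spirit. This is an illustration under assumptions rather than the library's code: plain dictionaries stand in for the DataFrame rows, rule_by_constraints is a hypothetical name, the root is assumed to be the first row, and leaves are assumed to be marked with Feature == "Leaf".

```python
def rule_by_constraints(rows, monotone_constraints):
    """Walk from the root, choosing the branch implied by each feature's constraint."""
    if not rows or rows[0]["Feature"] == "Leaf":
        return ""

    by_id = {r["ID"]: r for r in rows}
    node = rows[0]  # assumption: the root is the first row
    conditions = []
    while node["Feature"] != "Leaf":
        feature = node["Feature"]
        constraint = monotone_constraints.get(feature)
        if constraint not in (1, -1):
            raise ValueError(f"Feature {feature!r} needs a constraint of +1 or -1")
        if constraint == 1:
            # Positive constraint: follow the "No" branch (feature >= threshold).
            conditions.append(f'(X["{feature}"] >= {node["Split"]})')
            node = by_id[node["No"]]
        else:
            # Negative constraint: follow the "Yes" branch (feature < threshold).
            conditions.append(f'(X["{feature}"] < {node["Split"]})')
            node = by_id[node["Yes"]]
    return " & ".join(conditions)


# Hypothetical two-split tree used only for illustration.
rows = [
    {"ID": "0-0", "Feature": "amount", "Split": 100.0, "Yes": "0-1", "No": "0-2"},
    {"ID": "0-1", "Feature": "Leaf", "Split": None, "Yes": None, "No": None},
    {"ID": "0-2", "Feature": "age", "Split": 30.0, "Yes": "0-3", "No": "0-4"},
    {"ID": "0-3", "Feature": "Leaf", "Split": None, "Yes": None, "No": None},
    {"ID": "0-4", "Feature": "Leaf", "Split": None, "Yes": None, "No": None},
]
print(rule_by_constraints(rows, {"amount": 1, "age": -1}))
# (X["amount"] >= 100.0) & (X["age"] < 30.0)
```

Because the branch is fixed by the constraint sign, this walk never needs the Gain column, which matches the shorter list of required columns for this function.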

extract_rules

iguanas.rule_generation.extract_rules(estimator: xgboost.sklearn.XGBClassifier, all_features_constrained: bool, **kwargs) → pandas.DataFrame

Generate metrics for rules extracted from XGBoost trees.

Parameters:
  • estimator (XGBClassifier) – Fitted XGBoost classifier from which to extract rules.

  • all_features_constrained (bool) – If True, uses monotone constraint-based extraction (top-to-bottom). If False, uses max gain-based extraction (bottom-to-top).

  • **kwargs (dict) – Additional parameters for rule extraction and metric calculation (e.g., transformation name, scale_pos_weight value).

Returns:

DataFrame containing:

  • rule: Extracted rule as a string

  • tree: Tree number from which the rule was extracted

  • scale_pos_weight: scale_pos_weight value used for this tree

Return type:

pd.DataFrame
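The rule column holds strings written as expressions over a variable named X, which suggests they are meant to be evaluated against the feature data. The sketch below is an assumption about how such a string could be checked, not a documented part of the library: rule_fires is a hypothetical helper, and a plain dictionary stands in for one record so the example has no dependencies.

```python
def rule_fires(rule, record):
    """Evaluate a rule string against one record exposed under the name X.

    eval() is used only because the rule strings come from trusted code;
    it should never be applied to strings from an untrusted source.
    """
    return bool(eval(rule, {"__builtins__": {}}, {"X": record}))


rule = '(X["amount"] >= 100.0) & (X["age"] < 30.0)'
print(rule_fires(rule, {"amount": 150.0, "age": 25}))  # True
print(rule_fires(rule, {"amount": 50.0, "age": 25}))   # False
```

With a pandas DataFrame bound to X, the same expression would produce a boolean Series with one entry per row, since the comparisons and the & operator are applied element-wise.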

rule_grid_search_sequential

iguanas.rule_generation.rule_grid_search_sequential(estimator: xgboost.sklearn.XGBClassifier, X_train: polars.DataFrame | pandas.DataFrame, y_train: polars.Series | pandas.Series, scale_pos_weights: list[float] | numpy.ndarray, sample_weights_df: polars.DataFrame | pandas.DataFrame | None = None, verbose: int = 0) → polars.DataFrame

Sequential (single-process) variant of rule_grid_search.

Identical behaviour to rule_grid_search() but runs in a single process without joblib parallelism. Useful for debugging, environments where multiprocessing is unavailable, or small workloads where process-spawn overhead outweighs the benefit of parallelism.

Parameters:
  • estimator (XGBClassifier) – Base XGBoost classifier to use as a template for rule extraction.

  • X_train (pl.DataFrame | pd.DataFrame) – Training feature matrix.

  • y_train (pl.Series | pd.Series) – Training target values.

  • scale_pos_weights (list | np.ndarray) – Array of scale_pos_weight values to try.

  • sample_weights_df (pl.DataFrame | pd.DataFrame | None, default=None) – DataFrame mapping transformation names to sample weight arrays. If None, uses baseline weights of 1.0 for all samples.

  • verbose (int, default=0) – Controls verbosity. 0 = silent, 1 = summary.

Returns:

Same schema as rule_grid_search(): columns rule, tree, scale_pos_weight, transformation.

Return type:

pl.DataFrame
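The shape of the sequential search can be sketched as a plain nested loop. This is a sketch under assumptions, not the library's implementation: grid_search_sequential and fit_and_extract are hypothetical names, fit_and_extract stands in for "fit a copy of the estimator and extract one rule per tree", the transformation name "baseline" for the default weights is assumed, and a list of dictionaries stands in for the returned Polars DataFrame.

```python
from itertools import product


def grid_search_sequential(fit_and_extract, n_samples, scale_pos_weights, sample_weights=None):
    """Try every (transformation, scale_pos_weight) pair in a single process."""
    if sample_weights is None:
        # Default described in the parameter list: weight 1.0 for every sample.
        sample_weights = {"baseline": [1.0] * n_samples}

    records = []
    for (name, weights), spw in product(sample_weights.items(), scale_pos_weights):
        # fit_and_extract returns one rule string per tree of the fitted model.
        for tree, rule in enumerate(fit_and_extract(weights, spw)):
            records.append(
                {"rule": rule, "tree": tree, "scale_pos_weight": spw, "transformation": name}
            )
    return records


def fake_fit_and_extract(weights, spw):
    """Stand-in for model fitting; returns a single made-up rule."""
    return [f'(X["amount"] >= {spw})']


records = grid_search_sequential(fake_fit_and_extract, n_samples=4, scale_pos_weights=[1.0, 10.0])
print(len(records))  # 2
```

Each record carries the four columns named in the return schema, so the list could be passed to a DataFrame constructor unchanged.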

rule_grid_search_parallel_weights

iguanas.rule_generation.rule_grid_search_parallel_weights(estimator: xgboost.sklearn.XGBClassifier, X_train: polars.DataFrame | pandas.DataFrame, y_train: polars.Series | pandas.Series, scale_pos_weights: list[float] | numpy.ndarray, sample_weights_df: polars.DataFrame | pandas.DataFrame | None = None, n_jobs: int = -1, verbose: int = 0) → polars.DataFrame

Perform grid search over sample weight transformations and scale_pos_weight values to find optimal rules.

This function systematically trains XGBoost models with different combinations of:

  • sample weights

  • scale_pos_weight values

For each combination, it extracts rules from the fitted models and returns them as a Polars DataFrame. The weight transformations loop is parallelized using joblib for improved performance.

Parameters:
  • estimator (XGBClassifier) – Base XGBoost classifier to use as a template for rule extraction.

  • X_train (pl.DataFrame | pd.DataFrame) – Training feature matrix.

  • y_train (pl.Series | pd.Series) – Training target values.

  • scale_pos_weights (list | np.ndarray) – Array of scale_pos_weight values to try.

  • sample_weights_df (pl.DataFrame | pd.DataFrame | None, default=None) – DataFrame mapping transformation names to sample weight arrays. Transformations are parallelised across workers. If None, uses baseline weights of 1.0 for all samples.

  • n_jobs (int, default=-1) – Number of parallel jobs to run. -1 means using all processors.

  • verbose (int, default=0) –

    Controls the verbosity level:

    • 0: silent (no output)

    • 1: progress information (start/end summary)

    • >=2: detailed progress with live updates from joblib Parallel backend

Returns:

Same schema as rule_grid_search(): columns rule, tree, scale_pos_weight, transformation.

Return type:

pl.DataFrame

Examples

>>> import numpy as np
>>> weights_train = generate_sample_weight_transformations(X_train["amount"])
>>> scale_pos_weights = np.logspace(0, np.log10(imbalance_ratio * 2), 20)
>>> results = rule_grid_search(
...     estimator, X_train, y_train,
...     scale_pos_weights, weights_train, n_jobs=-1, verbose=1
... )
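The np.logspace call in the example builds 20 logarithmically spaced scale_pos_weight values running from 1 up to twice the imbalance ratio. The sketch below reproduces that grid with the standard library only, to make the arithmetic explicit. The log_spaced helper and the toy labels are hypothetical, and defining imbalance_ratio as negatives divided by positives is an assumption, since the example does not define it.

```python
import math


def log_spaced(start, stop, num):
    """Standard-library equivalent of np.logspace(log10(start), log10(stop), num)."""
    lo, hi = math.log10(start), math.log10(stop)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


# Toy labels: 95 negatives and 5 positives.
y_train = [0] * 95 + [1] * 5
imbalance_ratio = y_train.count(0) / y_train.count(1)  # assumed definition: 95 / 5

scale_pos_weights = log_spaced(1.0, imbalance_ratio * 2, 20)
print(len(scale_pos_weights))  # 20
```

Logarithmic spacing places more candidates near 1, where small changes in scale_pos_weight have a proportionally larger effect, than near the upper end of the range.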

rule_grid_search_parallel_scales

iguanas.rule_generation.rule_grid_search_parallel_scales(estimator: xgboost.sklearn.XGBClassifier, X_train: polars.DataFrame | pandas.DataFrame, y_train: polars.Series | pandas.Series, scale_pos_weights: list[float] | numpy.ndarray, sample_weights_df: polars.DataFrame | pandas.DataFrame | None = None, n_jobs: int = -1, verbose: int = 0) → polars.DataFrame

Perform grid search parallelised over scale_pos_weight values.

This function systematically trains XGBoost models with different combinations of:

  • sample weights

  • scale_pos_weight values

For each combination, it extracts rules from the fitted models and returns them as a Polars DataFrame. The loop over scale_pos_weight values is parallelised using joblib for improved performance.

Parameters:
  • estimator (XGBClassifier) – Base XGBoost classifier to use as a template for rule extraction.

  • X_train (pl.DataFrame | pd.DataFrame) – Training feature matrix.

  • y_train (pl.Series | pd.Series) – Training target values.

  • scale_pos_weights (list | np.ndarray) – Array of scale_pos_weight values to try. Parallelised across workers.

  • sample_weights_df (pl.DataFrame | pd.DataFrame | None, default=None) – DataFrame mapping transformation names to sample weight arrays. If None, uses baseline weights of 1.0 for all samples.

  • n_jobs (int, default=-1) – Number of parallel jobs to run. -1 means using all processors.

  • verbose (int, default=0) –

    Controls the verbosity level:

    • 0: silent (no output)

    • 1: progress information (start/end summary)

    • >=2: detailed progress with live updates from joblib Parallel backend

Returns:

Same schema as rule_grid_search(): columns rule, tree, scale_pos_weight, transformation.

Return type:

pl.DataFrame
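The difference from the weight-parallel variant is which loop is handed to the workers. The sketch below illustrates the partitioning by scale_pos_weight under stated assumptions: it swaps joblib for the standard library's concurrent.futures so it runs without extra dependencies, fit_and_extract is a hypothetical stand-in for fitting the estimator and extracting one rule per tree, the transformation names are made up, and a list of dictionaries stands in for the Polars DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor


def rules_for_scale(spw, sample_weights, fit_and_extract):
    """One worker task: a single scale_pos_weight, looped over all transformations."""
    return [
        {"rule": rule, "tree": tree, "scale_pos_weight": spw, "transformation": name}
        for name, weights in sample_weights.items()
        for tree, rule in enumerate(fit_and_extract(weights, spw))
    ]


def grid_search_parallel_scales(fit_and_extract, sample_weights, scale_pos_weights, n_workers=2):
    """Distribute scale_pos_weight values across workers and flatten the results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in input order, so the output is deterministic.
        chunks = pool.map(
            lambda spw: rules_for_scale(spw, sample_weights, fit_and_extract),
            scale_pos_weights,
        )
        return [record for chunk in chunks for record in chunk]


def fake_fit_and_extract(weights, spw):
    """Stand-in for model fitting; returns a single made-up rule."""
    return [f'(X["amount"] >= {spw})']


sample_weights = {"baseline": [1.0, 1.0, 1.0], "log_amount": [0.5, 1.0, 2.0]}
records = grid_search_parallel_scales(fake_fit_and_extract, sample_weights, [1.0, 5.0])
print(len(records))  # 4
```

Splitting on scale_pos_weight tends to balance the work better when there are many scale values and few transformations, while splitting on transformations suits the opposite case.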