gators.encoders package
Module contents
- class gators.encoders.BinaryEncoder
Bases: _BaseEncoder
Encodes categorical values using binary representation.
Each category is first encoded as an integer, then converted to binary, with each binary digit becoming a separate column. This is more compact than one-hot encoding for high cardinality features.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (Union[int, float], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__binary_enc_{bit_index}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use BinaryEncoder.
Example with drop_columns=True and subset=None:
>>> import polars as pl
>>> from gators.encoders import BinaryEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "C", "D", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6]
... })
>>> encoder = BinaryEncoder(min_count=1, inplace=False, drop_columns=True)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (6, 3)
┌───────┬────────────────────────┬────────────────────────┐
│ value ┆ category__binary_enc_0 ┆ category__binary_enc_1 │
│ ---   ┆ ---                    ┆ ---                    │
│ i64   ┆ f64                    ┆ f64                    │
╞═══════╪════════════════════════╪════════════════════════╡
│ 1     ┆ 1.0                    ┆ 1.0                    │
│ 2     ┆ 1.0                    ┆ 0.0                    │
│ 3     ┆ 0.0                    ┆ 0.0                    │
│ 4     ┆ 0.0                    ┆ 1.0                    │
│ 5     ┆ 1.0                    ┆ 1.0                    │
│ 6     ┆ 1.0                    ┆ 0.0                    │
└───────┴────────────────────────┴────────────────────────┘
Example with drop_columns=False:
>>> X = pl.DataFrame({
...     "category": ["A", "B", "C", "D", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6]
... })
>>> encoder = BinaryEncoder(subset=["category"], inplace=False, drop_columns=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (6, 4)
┌──────────┬───────┬────────────────────────┬────────────────────────┐
│ category ┆ value ┆ category__binary_enc_0 ┆ category__binary_enc_1 │
│ ---      ┆ ---   ┆ ---                    ┆ ---                    │
│ str      ┆ i64   ┆ f64                    ┆ f64                    │
╞══════════╪═══════╪════════════════════════╪════════════════════════╡
│ A        ┆ 1     ┆ 0.0                    ┆ 0.0                    │
│ B        ┆ 2     ┆ 1.0                    ┆ 0.0                    │
│ C        ┆ 3     ┆ 0.0                    ┆ 1.0                    │
│ D        ┆ 4     ┆ 1.0                    ┆ 1.0                    │
│ A        ┆ 5     ┆ 0.0                    ┆ 0.0                    │
│ B        ┆ 6     ┆ 1.0                    ┆ 0.0                    │
└──────────┴───────┴────────────────────────┴────────────────────────┘
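The integer-then-binary scheme described for BinaryEncoder can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `binary_encode`, the first-appearance ordering of the integer codes, and the choice of bit 0 as the least significant bit are all assumptions.

```python
def binary_encode(values):
    """Return one row of bits per value; bit 0 is the least significant bit."""
    # Assumption: integer codes are assigned in order of first appearance.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    # Bits needed for the largest code, with at least one output column.
    n_bits = max(1, (len(codes) - 1).bit_length())
    return [[(codes[value] >> bit) & 1 for bit in range(n_bits)] for value in values]


data = ["A", "B", "C", "D", "A", "B"]
rows = binary_encode(data)
for value, bits in zip(data, rows):
    print(value, bits)
```

Four categories fit in two bit columns, where one-hot encoding would need four. Five categories would need three bit columns instead of five.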
- class gators.encoders.CatBoostEncoder
Bases: _BaseEncoder
Encodes categorical values using CatBoost target encoding with ordered statistics.
This encoder implements the CatBoost algorithm’s approach to target encoding, which uses ordered target statistics to prevent target leakage and overfitting. For each category, it calculates the cumulative mean of the target up to (but not including) the current row.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (Union[int, float], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
smoothing (float, default=1.0) – Smoothing parameter for regularization toward the global mean. Higher values increase regularization.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_catboost’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use CatBoostEncoder.
>>> import polars as pl
>>> from gators.encoders import CatBoostEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "A", "B", "C"],
...     "value": [1, 2, 3, 4, 5, 6, 7]
... })
>>> y = pl.Series("target", [1, 0, 1, 0, 0, 1, 1])
>>> encoder = CatBoostEncoder(subset=["category"], smoothing=1.0, inplace=False, drop_columns=True)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬────────────────────────────┬───────┐
│ target┆ category__encode_catboost  ┆ value │
│ ---   ┆ ---                        ┆ ---   │
│ i64   ┆ f64                        ┆ i64   │
╞═══════╪════════════════════════════╪═══════╡
│ 1     ┆ 0.571429                   ┆ 1     │
│ 0     ┆ 0.571429                   ┆ 2     │
│ 1     ┆ 0.666667                   ┆ 3     │
│ 0     ┆ 0.571429                   ┆ 4     │
│ 0     ┆ 0.600000                   ┆ 5     │
│ 1     ┆ 0.428571                   ┆ 6     │
│ 1     ┆ 0.428571                   ┆ 7     │
└───────┴────────────────────────────┴───────┘
- fit(X, y=None)
Fit the transformer by computing CatBoost ordered target statistics.
- Parameters:
X (DataFrame) – Input DataFrame with categorical columns.
y (Series | None) – Target series (binary or continuous). Required for CatBoostEncoder.
- Returns:
The fitted transformer instance.
- Return type:
CatBoostEncoder
- Raises:
ValueError – If y is None.
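The ordered statistic described above (a per-category running mean of the target that excludes the current row) can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `ordered_target_stats` and the smoothing formula `(sum_before + smoothing * prior) / (count_before + smoothing)` are assumptions based on a common formulation, so the values need not match the printed example.

```python
def ordered_target_stats(categories, targets, smoothing=1.0):
    """Encode each row from the targets of earlier rows in the same category."""
    prior = sum(targets) / len(targets)  # global target mean
    sums, counts = {}, {}
    encoded = []
    for category, y in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        denominator = seen_count + smoothing
        if denominator == 0:
            # No history and no smoothing: fall back to the global mean.
            encoded.append(prior)
        else:
            encoded.append((seen_sum + smoothing * prior) / denominator)
        # Update the running statistics only after encoding the current row,
        # so a row's own target never contributes to its encoding.
        sums[category] = seen_sum + y
        counts[category] = seen_count + 1
    return encoded


encoded = ordered_target_stats(["A", "B", "A"], [1, 0, 1], smoothing=1.0)
print(encoded)
```

The first occurrence of every category receives the global mean, because no earlier rows exist for it.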
- class gators.encoders.CountEncoder
Bases: _BaseEncoder
Encodes categorical values with their occurrence counts.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (Union[int, float], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__count_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use CountEncoder.
Example with drop_columns=True and subset=None:
>>> import polars as pl
>>> from gators.encoders import CountEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(min_count=1, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬─────────────────────┬──────────────────┐
│ value ┆ category__count_enc ┆ other__count_enc │
│ ---   ┆ ---                 ┆ ---              │
│ i64   ┆ f64                 ┆ f64              │
╞═══════╪═════════════════════╪══════════════════╡
│ 1     ┆ 3.0                 ┆ 1.0              │
│ 2     ┆ 2.0                 ┆ 1.0              │
│ 3     ┆ 3.0                 ┆ 1.0              │
│ 4     ┆ 2.0                 ┆ 1.0              │
│ 5     ┆ 2.0                 ┆ 1.0              │
│ 6     ┆ 3.0                 ┆ 1.0              │
│ 7     ┆ 2.0                 ┆ 1.0              │
└───────┴─────────────────────┴──────────────────┘
Example with drop_columns=True and a subset of columns:
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(subset=["category"], min_count=1, drop_columns=True, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬────────┬────────────────────────┐
│ value ┆ other  ┆ category__encode_count │
│ ---   ┆ ---    ┆ ---                    │
│ i64   ┆ str    ┆ f64                    │
╞═══════╪════════╪════════════════════════╡
│ 1     ┆ foo    ┆ 3.0                    │
│ 2     ┆ bar    ┆ 2.0                    │
│ 3     ┆ baz    ┆ 3.0                    │
│ 4     ┆ qux    ┆ 2.0                    │
│ 5     ┆ quux   ┆ 2.0                    │
│ 6     ┆ corge  ┆ 3.0                    │
│ 7     ┆ grault ┆ 2.0                    │
└───────┴────────┴────────────────────────┘
Example with drop_columns=False and subset=None:
>>> import polars as pl
>>> from gators.encoders import CountEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(min_count=1, drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 5)
┌──────────┬───────┬────────┬────────────────────────┬─────────────────────┐
│ category ┆ value ┆ other  ┆ category__encode_count ┆ other__encode_count │
│ ---      ┆ ---   ┆ ---    ┆ ---                    ┆ ---                 │
│ str      ┆ i64   ┆ str    ┆ f64                    ┆ f64                 │
╞══════════╪═══════╪════════╪════════════════════════╪═════════════════════╡
│ A        ┆ 1     ┆ foo    ┆ 3.0                    ┆ 1.0                 │
│ B        ┆ 2     ┆ bar    ┆ 2.0                    ┆ 1.0                 │
│ A        ┆ 3     ┆ baz    ┆ 3.0                    ┆ 1.0                 │
│ C        ┆ 4     ┆ qux    ┆ 2.0                    ┆ 1.0                 │
│ C        ┆ 5     ┆ quux   ┆ 2.0                    ┆ 1.0                 │
│ A        ┆ 6     ┆ corge  ┆ 3.0                    ┆ 1.0                 │
│ B        ┆ 7     ┆ grault ┆ 2.0                    ┆ 1.0                 │
└──────────┴───────┴────────┴────────────────────────┴─────────────────────┘
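The idea behind count encoding is small enough to show in plain Python. The helper `count_encode` below is hypothetical and ignores `min_count` handling; it only illustrates replacing each value by the number of times it occurs.

```python
from collections import Counter


def count_encode(values):
    """Replace each value by its occurrence count in the column."""
    counts = Counter(values)
    # Floats are used to mirror the f64 dtype shown in the examples above.
    return [float(counts[value]) for value in values]


encoded = count_encode(["A", "B", "A", "C", "C", "A", "B"])
print(encoded)
```

For this input the result matches the encoded category column printed in the examples above.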
- class gators.encoders.LeaveOneOutEncoder
Bases: _BaseEncoder
Encodes categorical values using leave-one-out target encoding.
For each row, this encoder calculates the mean of the target variable for the category, excluding the current row. This reduces overfitting compared to standard target encoding by preventing the target value from influencing its own encoding.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (Union[int, float], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
smoothing (float, default=0.0) – Smoothing parameter for regularization toward the global mean. Higher values increase regularization. Use 0 for no smoothing.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_loo’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use LeaveOneOutEncoder.
>>> import polars as pl
>>> from gators.encoders import LeaveOneOutEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "A", "B", "C"],
...     "target": [1, 0, 1, 0, 0, 1, 1],
...     "value": [1, 2, 3, 4, 5, 6, 7]
... })
>>> encoder = LeaveOneOutEncoder(subset=["category"], smoothing=1.0)
>>> _ = encoder.fit(X, y=X["target"])
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬──────────────────────────┬───────┐
│ target┆ category__encode_loo     ┆ value │
│ ---   ┆ ---                      ┆ ---   │
│ i64   ┆ f64                      ┆ i64   │
╞═══════╪══════════════════════════╪═══════╡
│ 1     ┆ 0.571429                 ┆ 1     │
│ 0     ┆ 0.571429                 ┆ 2     │
│ 1     ┆ 0.571429                 ┆ 3     │
│ 0     ┆ 0.571429                 ┆ 4     │
│ 0     ┆ 0.666667                 ┆ 5     │
│ 1     ┆ 0.571429                 ┆ 6     │
│ 1     ┆ 0.571429                 ┆ 7     │
└───────┴──────────────────────────┴───────┘
Example with no smoothing:
>>> X = pl.DataFrame({
...     "category": ["A", "A", "A", "B", "B"],
...     "target": [1, 0, 1, 0, 1],
...     "value": [1, 2, 3, 4, 5]
... })
>>> encoder = LeaveOneOutEncoder(subset=["category"], smoothing=0.0)
>>> _ = encoder.fit(X, y=X["target"])
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌───────┬─────────────────────────┬───────┐
│ target┆ category__encode_loo    ┆ value │
│ ---   ┆ ---                     ┆ ---   │
│ i64   ┆ f64                     ┆ i64   │
╞═══════╪═════════════════════════╪═══════╡
│ 1     ┆ 0.666667                ┆ 1     │
│ 0     ┆ 0.666667                ┆ 2     │
│ 1     ┆ 0.500000                ┆ 3     │
│ 0     ┆ 0.500000                ┆ 4     │
│ 1     ┆ 0.500000                ┆ 5     │
└───────┴─────────────────────────┴───────┘
- fit(X, y)
Fit the transformer by computing leave-one-out target statistics.
- Parameters:
X (DataFrame) – Input DataFrame with categorical columns.
y (Series) – Target series (binary or continuous). Required for LeaveOneOutEncoder.
- Returns:
The fitted transformer instance.
- Return type:
LeaveOneOutEncoder
- Raises:
ValueError – If y is None.
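The leave-one-out statistic described above (the category's target mean with the current row excluded) can be sketched as below. The helper name `leave_one_out_encode` and the way smoothing is blended with the global mean are assumptions; the sketch shows the training-time statistic only and does not claim to reproduce the library's transform output.

```python
def leave_one_out_encode(categories, targets, smoothing=0.0):
    """Per-row category mean of the target, excluding the row itself."""
    prior = sum(targets) / len(targets)  # global target mean
    sums, counts = {}, {}
    for category, y in zip(categories, targets):
        sums[category] = sums.get(category, 0.0) + y
        counts[category] = counts.get(category, 0) + 1

    encoded = []
    for category, y in zip(categories, targets):
        other_sum = sums[category] - y        # remove the row's own target
        other_count = counts[category] - 1    # and its own count
        denominator = other_count + smoothing
        if denominator == 0:
            # A singleton category with no smoothing has nothing left to average.
            encoded.append(prior)
        else:
            encoded.append((other_sum + smoothing * prior) / denominator)
    return encoded


encoded = leave_one_out_encode(["A", "A", "A", "B", "B"], [1, 0, 1, 0, 1])
print(encoded)
```

Rows in the same category can receive different values, because each row removes a different target from the category total.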
- class gators.encoders.OneHotEncoder
Bases: BaseModel, BaseEstimator, TransformerMixin
One-hot encodes categorical values.
- Parameters:
subset (Optional[List[str]], default=None) – List of string columns to encode. If None, all string columns are selected.
categories (Optional[Dict[str, List[str]]], default=None) – Pre-defined categories for each column. If None, categories are inferred from data during fit.
min_count (Union[PositiveInt, PositiveFloat], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
drop_columns (bool, default=True) – Whether to drop the original columns after encoding.
Examples
Basic usage:
>>> from gators.encoders import OneHotEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": ["one", "one", "two", "two", "one"],
... })
>>> encoder = OneHotEncoder()
>>> encoder.fit(X)
OneHotEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 5)
┌───────┬───────┬───────┬───────┬───────┐
│ A|foo │ A|bar │ A|baz │ B|one │ B|two │
│ f64   │ f64   │ f64   │ f64   │ f64   │
╞═══════╪═══════╪═══════╪═══════╪═══════╡
│ 1.0   │ 0.0   │ 0.0   │ 1.0   │ 0.0   │
│ 0.0   │ 1.0   │ 0.0   │ 1.0   │ 0.0   │
│ 1.0   │ 0.0   │ 0.0   │ 0.0   │ 1.0   │
│ 0.0   │ 1.0   │ 0.0   │ 0.0   │ 1.0   │
│ 0.0   │ 0.0   │ 1.0   │ 1.0   │ 0.0   │
└───────┴───────┴───────┴───────┴───────┘
Drop columns:
>>> encoder = OneHotEncoder(drop_columns=True)
>>> encoder.fit(X)
OneHotEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 5)
┌────────┬────────┬────────┬────────┬────────┐
│ A__foo │ A__bar │ A__baz │ B__one │ B__two │
│ f64    │ f64    │ f64    │ f64    │ f64    │
╞════════╪════════╪════════╪════════╪════════╡
│ 1.0    │ 0.0    │ 0.0    │ 1.0    │ 0.0    │
│ 0.0    │ 1.0    │ 0.0    │ 1.0    │ 0.0    │
│ 1.0    │ 0.0    │ 0.0    │ 0.0    │ 1.0    │
│ 0.0    │ 1.0    │ 0.0    │ 0.0    │ 1.0    │
│ 0.0    │ 0.0    │ 1.0    │ 1.0    │ 0.0    │
└────────┴────────┴────────┴────────┴────────┘
Subset of columns:
>>> encoder = OneHotEncoder(subset=["A"])
>>> encoder.fit(X)
OneHotEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌────────┬────────┬────────┐
│ A__foo │ A__bar │ A__baz │
│ f64    │ f64    │ f64    │
╞════════╪════════╪════════╡
│ 1.0    │ 0.0    │ 0.0    │
│ 0.0    │ 1.0    │ 0.0    │
│ 1.0    │ 0.0    │ 0.0    │
│ 0.0    │ 1.0    │ 0.0    │
│ 0.0    │ 0.0    │ 1.0    │
└────────┴────────┴────────┘
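As a rough illustration of what one-hot encoding produces, the sketch below builds one indicator column per category in plain Python. The helper `one_hot_encode`, its `sep` argument, and the first-appearance column order are assumptions made for the example, not part of the library's API.

```python
def one_hot_encode(column, values, sep="__"):
    """Build one 0.0/1.0 indicator list per distinct value of a column."""
    # dict.fromkeys keeps the first-appearance order of the categories.
    categories = list(dict.fromkeys(values))
    return {
        f"{column}{sep}{category}": [1.0 if v == category else 0.0 for v in values]
        for category in categories
    }


columns = one_hot_encode("A", ["foo", "bar", "foo", "bar", "baz"])
for name, indicator in columns.items():
    print(name, indicator)
```

Each row has exactly one indicator set to 1.0, so the number of output columns grows with the number of categories.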
- class gators.encoders.OrdinalEncoder
Bases: _BaseEncoder
Encodes categorical values as ordinal.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (Union[int, float], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__ordinal_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Basic usage:
>>> from gators.encoders import OrdinalEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": [True, False, True, True, False],
... })
>>> encoder = OrdinalEncoder(inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────┬───────────────┐
│ A__ordinal_enc│ B__ordinal_enc│
│ f64           │ f64           │
╞═══════════════╪═══════════════╡
│ 3.0           │ 2.0           │
│ 2.0           │ 1.0           │
│ 3.0           │ 2.0           │
│ 2.0           │ 2.0           │
│ 1.0           │ 1.0           │
└───────────────┴───────────────┘
Keep the original columns (drop_columns=False):
>>> encoder = OrdinalEncoder(drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ A            │ B            │A__ordinal_enc│B__ordinal_enc│
│ str          │ bool         │f64           │f64           │
╞══════════════╪══════════════╪══════════════╪══════════════╡
│ foo          │ true         │3.0           │2.0           │
│ bar          │ false        │2.0           │1.0           │
│ foo          │ true         │3.0           │2.0           │
│ bar          │ true         │2.0           │2.0           │
│ baz          │ false        │1.0           │1.0           │
└──────────────┴──────────────┴──────────────┴──────────────┘
Subset of columns:
>>> encoder = OrdinalEncoder(subset=["A"], inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 1)
┌───────────────┐
│ A__ordinal_enc│
│ f64           │
╞═══════════════╡
│ 3.0           │
│ 2.0           │
│ 3.0           │
│ 2.0           │
│ 1.0           │
└───────────────┘
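A frequency-based ordinal mapping can be sketched in plain Python. The helper `ordinal_encode` is hypothetical, and both the ascending-frequency ranking and the tie-break on the category's string form are assumptions; they happen to reproduce the values printed in the examples above, which does not confirm that the library uses the same rule.

```python
from collections import Counter


def ordinal_encode(values):
    """Rank categories from least to most frequent, starting at 1.0."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken by the category's string form.
    ordered = sorted(counts, key=lambda category: (counts[category], str(category)))
    ranks = {category: float(rank) for rank, category in enumerate(ordered, start=1)}
    return [ranks[value] for value in values]


encoded_a = ordinal_encode(["foo", "bar", "foo", "bar", "baz"])
encoded_b = ordinal_encode([True, False, True, True, False])
print(encoded_a)
print(encoded_b)
```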
- fit(X, y=None)
Fit the transformer by computing ordinal mappings based on category frequency.
- Parameters:
X (DataFrame) – Input DataFrame with categorical columns.
y (Series | None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
OrdinalEncoder
- class gators.encoders.RareCategoryEncoder
Bases: BaseModel, BaseEstimator, TransformerMixin
Encodes rare categories.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
default (str, default="RARE") – Value to replace rare categories with.
min_count (Union[PositiveInt, PositiveFloat], default=2) – Minimum count threshold for categories. Categories below this threshold are replaced with default. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_rare’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.encoders import RareCategoryEncoder
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x'],
...     'target': [1, 0, 1, 1, 0]
... })
>>> encoder = RareCategoryEncoder(inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────────┬───────────────────┐
│ A__encode_rare    │ B__encode_rare    │
│ ---               │ ---               │
│ str               │ str               │
├───────────────────┼───────────────────┤
│ cat               │ x                 │
│ dog               │ x                 │
│ cat               │ RARE              │
│ dog               │ RARE              │
│ cat               │ x                 │
└───────────────────┴───────────────────┘
>>> encoder = RareCategoryEncoder(drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 5)
┌─────┬─────┬────────┬───────────────────┬───────────────────┐
│ A   │ B   │ target │ A__encode_rare    │ B__encode_rare    │
│ --- │ --- │ ---    │ ---               │ ---               │
│ str │ str │ i64    │ str               │ str               │
├─────┼─────┼────────┼───────────────────┼───────────────────┤
│ cat │ x   │ 1      │ cat               │ x                 │
│ dog │ x   │ 0      │ dog               │ x                 │
│ cat │ y   │ 1      │ cat               │ RARE              │
│ dog │ y   │ 1      │ dog               │ RARE              │
│ cat │ x   │ 0      │ cat               │ x                 │
└─────┴─────┴────────┴───────────────────┴───────────────────┘
>>> encoder = RareCategoryEncoder(subset=['A'], inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬────────┬───────────────────┐
│ A   │ B   │ target │ A__encode_rare    │
│ --- │ --- │ ---    │ ---               │
│ str │ str │ i64    │ str               │
├─────┼─────┼────────┼───────────────────┤
│ cat │ x   │ 1      │ cat               │
│ dog │ x   │ 0      │ dog               │
│ cat │ y   │ 1      │ cat               │
│ dog │ y   │ 1      │ dog               │
│ cat │ x   │ 0      │ cat               │
└─────┴─────┴────────┴───────────────────┘
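The `min_count` rule shared by these encoders (values of 1 or more are absolute counts, values below 1 are frequencies) can be illustrated with a small rare-category sketch. The helper `encode_rare` is hypothetical, and treating a category as rare only when its count is strictly below the threshold is an assumption drawn from the parameter description.

```python
from collections import Counter


def encode_rare(values, min_count=2, default="RARE"):
    """Replace categories seen fewer than the threshold with a default label."""
    counts = Counter(values)
    # min_count >= 1 is an absolute count; min_count < 1 is a share of the rows.
    threshold = min_count if min_count >= 1 else min_count * len(values)
    return [value if counts[value] >= threshold else default for value in values]


pets = ["cat", "dog", "cat", "dog", "cat", "bird"]
by_count = encode_rare(pets, min_count=2)
by_frequency = encode_rare(pets, min_count=0.4)
print(by_count)
print(by_frequency)
```

With six rows, `min_count=0.4` corresponds to a threshold of 2.4 occurrences, so both `dog` and `bird` are replaced in the second call.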
- class gators.encoders.TargetEncoder
Bases: _BaseEncoder
Target-encodes categorical values.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (Union[int, float], default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__target_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Basic usage:
>>> from gators.encoders import TargetEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": [True, False, True, True, False],
... })
>>> target = pl.Series("target", [1, 0, 1, 1, 0])
>>> encoder = TargetEncoder(inplace=False, drop_columns=True)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────┬───────────────┐
│ B__target_enc ┆ A__target_enc │
│ ---           ┆ ---           │
│ f64           ┆ f64           │
╞═══════════════╪═══════════════╡
│ 1.0           ┆ 1.0           │
│ 0.0           ┆ 0.5           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 0.5           │
│ 0.0           ┆ 0.0           │
└───────────────┴───────────────┘
Keep the original columns (drop_columns=False):
>>> encoder = TargetEncoder(drop_columns=False, inplace=False)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────────────┬─────────────┬───────────────┬───────────────┐
│ A           │ B           │ A__target_enc │ B__target_enc │
│ str         │ bool        │ f64           │ f64           │
╞═════════════╪═════════════╪═══════════════╪═══════════════╡
│ foo         │ true        │ 1.0           │ 1.0           │
│ bar         │ false       │ 0.5           │ 0.0           │
│ foo         │ true        │ 1.0           │ 1.0           │
│ bar         │ true        │ 0.5           │ 1.0           │
│ baz         │ false       │ 0.0           │ 0.0           │
└─────────────┴─────────────┴───────────────┴───────────────┘
Subset of columns:
>>> encoder = TargetEncoder(subset=["A"], inplace=False, drop_columns=True)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 1)
┌───────────────┐
│ A__target_enc │
│ f64           │
╞═══════════════╡
│ 1.0           │
│ 0.5           │
│ 1.0           │
│ 0.5           │
│ 0.0           │
└───────────────┘
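Plain target encoding replaces each category with the mean of the target over that category. The sketch below uses a hypothetical helper, `target_encode`, and assumes no smoothing or `min_count` filtering; for the basic-usage data it gives the same values as the A__target_enc column printed there.

```python
def target_encode(categories, targets):
    """Replace each category by the mean target observed for it."""
    sums, counts = {}, {}
    for category, y in zip(categories, targets):
        sums[category] = sums.get(category, 0.0) + y
        counts[category] = counts.get(category, 0) + 1
    return [sums[category] / counts[category] for category in categories]


encoded = target_encode(["foo", "bar", "foo", "bar", "baz"], [1, 0, 1, 1, 0])
print(encoded)
```

Because every row contributes its own target to the category mean, this variant can leak the target on small categories, which is the motivation for the leave-one-out and ordered variants documented above.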
- class gators.encoders.WOEEncoder
Bases: _BaseEncoder
Weight of Evidence (WOE) encodes categorical variables.
- Parameters:
subset (Optional[List[str]], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
regularization (Optional[float], default=0.01) – Regularization term (0.0-1.0) to prevent division by zero in WOE calculation.
default (float, default=0.0) – Default WOE value for categories with insufficient counts or unseen categories.
min_count (Union[PositiveInt, PositiveFloat], default=1) – Minimum count threshold for categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_woe’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.encoders import WOEEncoder
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x']
... })
>>> y = pl.Series('target', [1, 0, 1, 1, 0])
>>> encoder = WOEEncoder(inplace=False, drop_columns=True)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌────────────────┬────────────────┐
│ A__encode_woe  │ B__encode_woe  │
│ ---            │ ---            │
│ f64            │ f64            │
├────────────────┼────────────────┤
│ 0.287682       │ -1.090344      │
│ -0.402159      │ -1.090344      │
│ 0.287682       │ 4.901146       │
│ -0.402159      │ 4.901146       │
│ 0.287682       │ -1.090344      │
└────────────────┴────────────────┘
>>> # Encoding with drop_columns=False
>>> encoder = WOEEncoder(inplace=False, drop_columns=False)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬────────────────┬────────────────┐
│ A   │ B   │ A__encode_woe  │ B__encode_woe  │
│ --- │ --- │ ---            │ ---            │
│ str │ str │ f64            │ f64            │
├─────┼─────┼────────────────┼────────────────┤
│ cat │ x   │ 0.287682       │ 0.287682       │
│ dog │ x   │ -1.203973      │ 0.287682       │
│ cat │ y   │ 0.287682       │ -1.203973      │
│ dog │ y   │ -1.203973      │ -1.203973      │
│ cat │ x   │ 0.287682       │ 0.287682       │
└─────┴─────┴────────────────┴────────────────┘
>>> # Encoding a subset of columns
>>> encoder = WOEEncoder(subset=['A'], inplace=False, drop_columns=False)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌─────┬─────┬────────────────┐
│ A   │ B   │ A__encode_woe  │
│ --- │ --- │ ---            │
│ str │ str │ f64            │
├─────┼─────┼────────────────┤
│ cat │ x   │ 0.287682       │
│ dog │ x   │ -1.203973      │
│ cat │ y   │ 0.287682       │
│ dog │ y   │ -1.203973      │
│ cat │ x   │ 0.287682       │
└─────┴─────┴────────────────┘
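Weight of Evidence compares, for each category, its share of positive rows with its share of negative rows on a log scale. The sketch below is an illustration under stated assumptions: the helper `woe_encode` is hypothetical, the target is binary (0/1), and the regularization term is simply added to both shares, which may differ from how the library applies it.

```python
import math


def woe_encode(categories, targets, regularization=0.01):
    """Per-category log ratio of the positive share to the negative share."""
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("both target classes must be present")

    pos, neg = {}, {}
    for category, y in zip(categories, targets):
        pos[category] = pos.get(category, 0) + y
        neg[category] = neg.get(category, 0) + (1 - y)

    # A positive regularization keeps the ratio finite when a category has
    # no positive or no negative rows.
    woe = {
        category: math.log(
            (pos[category] / total_pos + regularization)
            / (neg[category] / total_neg + regularization)
        )
        for category in pos
    }
    return [woe[category] for category in categories]


encoded = woe_encode(["cat", "dog", "cat", "dog", "cat"], [1, 0, 1, 1, 0])
print(encoded)
```

A positive value means the category is over-represented among positive rows, and a negative value means it is over-represented among negative rows.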