gators.encoders package
Module contents
- class gators.encoders.BinaryEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical values using binary representation.
Each category is first encoded as an integer, then converted to binary, with each binary digit becoming a separate column. This is more compact than one-hot encoding for high cardinality features.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__binary_enc_{bit_index}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use BinaryEncoder.
Example with drop_columns=True and subset=None:
>>> import polars as pl
>>> from gators.encoders import BinaryEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "C", "D", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6]
... })
>>> encoder = BinaryEncoder(min_count=1, inplace=False, drop_columns=True)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (6, 3)
┌───────┬────────────────────────┬────────────────────────┐
│ value ┆ category__binary_enc_0 ┆ category__binary_enc_1 │
│ ---   ┆ ---                    ┆ ---                    │
│ i64   ┆ f64                    ┆ f64                    │
╞═══════╪════════════════════════╪════════════════════════╡
│ 1     ┆ 1.0                    ┆ 1.0                    │
│ 2     ┆ 1.0                    ┆ 0.0                    │
│ 3     ┆ 0.0                    ┆ 0.0                    │
│ 4     ┆ 0.0                    ┆ 1.0                    │
│ 5     ┆ 1.0                    ┆ 1.0                    │
│ 6     ┆ 1.0                    ┆ 0.0                    │
└───────┴────────────────────────┴────────────────────────┘
Example with drop_columns=False:
>>> X = pl.DataFrame({
...     "category": ["A", "B", "C", "D", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6]
... })
>>> encoder = BinaryEncoder(subset=["category"], inplace=False, drop_columns=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (6, 4)
┌──────────┬───────┬────────────────────────┬────────────────────────┐
│ category ┆ value ┆ category__binary_enc_0 ┆ category__binary_enc_1 │
│ ---      ┆ ---   ┆ ---                    ┆ ---                    │
│ str      ┆ i64   ┆ f64                    ┆ f64                    │
╞══════════╪═══════╪════════════════════════╪════════════════════════╡
│ A        ┆ 1     ┆ 0.0                    ┆ 0.0                    │
│ B        ┆ 2     ┆ 1.0                    ┆ 0.0                    │
│ C        ┆ 3     ┆ 0.0                    ┆ 1.0                    │
│ D        ┆ 4     ┆ 1.0                    ┆ 1.0                    │
│ A        ┆ 5     ┆ 0.0                    ┆ 0.0                    │
│ B        ┆ 6     ┆ 1.0                    ┆ 0.0                    │
└──────────┴───────┴────────────────────────┴────────────────────────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.binary_encoder.BinaryEncoder
Fit the transformer by computing binary encoding mappings.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
BinaryEncoder
- transform(X: polars.DataFrame) → polars.DataFrame
Transform the input DataFrame by applying binary encoding to categorical columns.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
- Returns:
DataFrame with binary encoded columns (each bit as a separate column).
- Return type:
pl.DataFrame
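The integer-then-bits scheme described for BinaryEncoder can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `binary_encode`, the alphabetical ordering of categories, and the choice of putting the least significant bit in the first column are all assumptions.

```python
def binary_encode(values):
    """Return one list of bits per input value (hypothetical helper)."""
    categories = sorted(set(values))  # assumption: alphabetical category order
    index = {cat: i for i, cat in enumerate(categories)}
    # Number of bits needed for the largest index; always at least one column.
    n_bits = max(1, (len(categories) - 1).bit_length())
    # Assumption: bit 0 (least significant) becomes the first output column.
    return [[(index[v] >> bit) & 1 for bit in range(n_bits)] for v in values]


bits = binary_encode(["A", "B", "C", "D", "A", "B"])
for row in bits:
    print(row)
```

Four categories fit in two bit columns here, whereas one-hot encoding would need four columns.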
- class gators.encoders.CatBoostEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical values using CatBoost target encoding with ordered statistics.
This encoder implements the CatBoost algorithm’s approach to target encoding, which uses ordered target statistics to prevent target leakage and overfitting. For each category, it calculates the cumulative mean of the target up to (but not including) the current row.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
smoothing (float, default=1.0) – Smoothing parameter for regularization toward the global mean. Higher values increase regularization.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_catboost’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use CatBoostEncoder.
>>> import polars as pl
>>> from gators.encoders import CatBoostEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "A", "B", "C"],
...     "value": [1, 2, 3, 4, 5, 6, 7]
... })
>>> y = pl.Series("target", [1, 0, 1, 0, 0, 1, 1])
>>> encoder = CatBoostEncoder(subset=["category"], smoothing=1.0, inplace=False, drop_columns=True)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌────────┬───────────────────────────┬───────┐
│ target ┆ category__encode_catboost ┆ value │
│ ---    ┆ ---                       ┆ ---   │
│ i64    ┆ f64                       ┆ i64   │
╞════════╪═══════════════════════════╪═══════╡
│ 1      ┆ 0.571429                  ┆ 1     │
│ 0      ┆ 0.571429                  ┆ 2     │
│ 1      ┆ 0.666667                  ┆ 3     │
│ 0      ┆ 0.571429                  ┆ 4     │
│ 0      ┆ 0.600000                  ┆ 5     │
│ 1      ┆ 0.428571                  ┆ 6     │
│ 1      ┆ 0.428571                  ┆ 7     │
└────────┴───────────────────────────┴───────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.catboost_encoder.CatBoostEncoder
Fit the transformer by computing CatBoost ordered target statistics.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (binary or continuous). Required for CatBoostEncoder.
- Returns:
The fitted transformer instance.
- Return type:
CatBoostEncoder
- Raises:
ValueError – If y is None.
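The idea of an ordered target statistic (each row sees only the targets of earlier rows in the same category) can be sketched as follows. This is an illustration under stated assumptions, not the CatBoostEncoder implementation: the helper name `ordered_target_stats` is hypothetical, and the smoothing form, which blends the running sum with the global target mean, is an assumed formula that may differ from what the library computes.

```python
from collections import defaultdict


def ordered_target_stats(categories, targets, smoothing=1.0):
    """Encode each row from the targets of earlier rows in its category."""
    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        denom = counts[cat] + smoothing
        # Assumed smoothing form: (running_sum + smoothing * prior) / (n + smoothing).
        # With nothing seen yet and no smoothing, fall back to the prior.
        encoded.append((sums[cat] + smoothing * prior) / denom if denom else prior)
        # Update the running statistics only after encoding the current row,
        # so a row's own target never leaks into its encoding.
        sums[cat] += y
        counts[cat] += 1
    return encoded


encoded = ordered_target_stats(["A", "B", "A"], [1, 0, 1], smoothing=1.0)
print(encoded)
```

The first occurrence of every category receives the global mean, because no earlier rows exist for it.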
- class gators.encoders.CountEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical values with their occurrence counts.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__count_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use CountEncoder.
Example with drop_columns=True and subset=None:
>>> import polars as pl
>>> from gators.encoders import CountEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(min_count=1, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬─────────────────────┬──────────────────┐
│ value ┆ category__count_enc ┆ other__count_enc │
│ ---   ┆ ---                 ┆ ---              │
│ i64   ┆ f64                 ┆ f64              │
╞═══════╪═════════════════════╪══════════════════╡
│ 1     ┆ 3.0                 ┆ 1.0              │
│ 2     ┆ 2.0                 ┆ 1.0              │
│ 3     ┆ 3.0                 ┆ 1.0              │
│ 4     ┆ 2.0                 ┆ 1.0              │
│ 5     ┆ 2.0                 ┆ 1.0              │
│ 6     ┆ 3.0                 ┆ 1.0              │
│ 7     ┆ 2.0                 ┆ 1.0              │
└───────┴─────────────────────┴──────────────────┘
Example with drop_columns=True and a subset of columns:
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(subset=["category"], min_count=1, drop_columns=True, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬────────┬────────────────────────┐
│ value ┆ other  ┆ category__encode_count │
│ ---   ┆ ---    ┆ ---                    │
│ i64   ┆ str    ┆ f64                    │
╞═══════╪════════╪════════════════════════╡
│ 1     ┆ foo    ┆ 3.0                    │
│ 2     ┆ bar    ┆ 2.0                    │
│ 3     ┆ baz    ┆ 3.0                    │
│ 4     ┆ qux    ┆ 2.0                    │
│ 5     ┆ quux   ┆ 2.0                    │
│ 6     ┆ corge  ┆ 3.0                    │
│ 7     ┆ grault ┆ 2.0                    │
└───────┴────────┴────────────────────────┘
Example with drop_columns=False and subset=None:
>>> import polars as pl
>>> from gators.encoders import CountEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(min_count=1, drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 5)
┌──────────┬───────┬────────┬────────────────────────┬─────────────────────┐
│ category ┆ value ┆ other  ┆ category__encode_count ┆ other__encode_count │
│ ---      ┆ ---   ┆ ---    ┆ ---                    ┆ ---                 │
│ str      ┆ i64   ┆ str    ┆ f64                    ┆ f64                 │
╞══════════╪═══════╪════════╪════════════════════════╪═════════════════════╡
│ A        ┆ 1     ┆ foo    ┆ 3.0                    ┆ 1.0                 │
│ B        ┆ 2     ┆ bar    ┆ 2.0                    ┆ 1.0                 │
│ A        ┆ 3     ┆ baz    ┆ 3.0                    ┆ 1.0                 │
│ C        ┆ 4     ┆ qux    ┆ 2.0                    ┆ 1.0                 │
│ C        ┆ 5     ┆ quux   ┆ 2.0                    ┆ 1.0                 │
│ A        ┆ 6     ┆ corge  ┆ 3.0                    ┆ 1.0                 │
│ B        ┆ 7     ┆ grault ┆ 2.0                    ┆ 1.0                 │
└──────────┴───────┴────────┴────────────────────────┴─────────────────────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.count_encoder.CountEncoder
Fit the transformer by computing count statistics for each category.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
CountEncoder
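Count encoding itself is small enough to sketch with the standard library. The helper name `count_encode` is hypothetical, and the sketch ignores `min_count`, since the source does not say how categories below the threshold are encoded.

```python
from collections import Counter


def count_encode(values):
    """Replace each value by the number of times it occurs (as a float)."""
    counts = Counter(values)  # the "fit" step: one pass to count categories
    return [float(counts[v]) for v in values]  # the "transform" step


encoded = count_encode(["A", "B", "A", "C", "C", "A", "B"])
print(encoded)
```

The values agree with the `category` counts shown in the CountEncoder examples: A occurs three times, B and C twice each.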
- class gators.encoders.HashEncoder
Bases: gators.transformer._base_transformer._BaseTransformer
Encode categorical features via feature hashing (hashing trick).
Maps each category value to an integer bucket in [0, n_features) using a deterministic hash function. Because no vocabulary is learnt during fit(), the encoder handles unknown categories at inference time naturally: any unseen value simply hashes to some bucket without raising an error.
Each input column produces one output column containing integers in [0, n_features). The output column name is {col}__hash.
- Parameters:
n_features (int, default=16) – Number of hash buckets. Controls the trade-off between collision rate (lower → more collisions) and the number of distinct encodings (higher → fewer collisions but larger range). Must be ≥ 2.
subset (list[str] or None, default=None) – Categorical (String, Categorical, Enum, Boolean) columns to encode. If None, all such columns are selected automatically.
inplace (bool, default=True) – If True, the original columns are overwritten with hash values (cast to Float64 for consistency with other encoders). If False, new columns suffixed __hash are added alongside the originals (subject to drop_columns).
drop_columns (bool, default=True) – When inplace=False, whether to drop the original columns after adding the hashed columns. Ignored when inplace=True.
Notes
The hash is computed with polars.Expr.hash() using seed=0 for full determinism across processes (unlike Python’s built-in hash(), which is randomised by PYTHONHASHSEED). Boolean columns are cast to String before hashing so that True/False map to stable buckets.
Examples
>>> import polars as pl
>>> from gators.encoders import HashEncoder
>>> X = pl.DataFrame({
...     "color": ["red", "blue", "green", "red", "blue"],
...     "size": ["S", "M", "L", "XL", "S"],
...     "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
... })
>>> encoder = HashEncoder(n_features=8, inplace=False)
>>> _ = encoder.fit(X)
>>> X_enc = encoder.transform(X)
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.hash_encoder.HashEncoder
Detect the subset of categorical columns (no statistics are learnt).
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Not used; present for sklearn compatibility.
- Returns:
The fitted transformer instance.
- Return type:
HashEncoder
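The hashing trick described for HashEncoder can be illustrated without Polars. This sketch swaps in `zlib.crc32` as the deterministic hash, so its bucket numbers will not match those produced by `polars.Expr.hash()`; the helper name `hash_bucket` is hypothetical.

```python
import zlib


def hash_bucket(value, n_features=16):
    """Return a stable bucket index in [0, n_features) for one category value."""
    if n_features < 2:
        raise ValueError("n_features must be >= 2")
    # Convert to text first so booleans hash as "True"/"False".
    data = str(value).encode("utf-8")
    # crc32 is deterministic across processes, unlike the built-in hash().
    return zlib.crc32(data) % n_features


colors = ["red", "blue", "green", "red", "blue"]
buckets = [hash_bucket(c, n_features=8) for c in colors]
unseen = hash_bucket("purple", n_features=8)  # no vocabulary, so no error
print(buckets, unseen)
```

Equal inputs always land in the same bucket, and a value never seen before still maps to a valid bucket, at the cost of possible collisions between distinct categories.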
- class gators.encoders.LeaveOneOutEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical values using leave-one-out target encoding.
For each row, this encoder calculates the mean of the target variable for the category, excluding the current row. This reduces overfitting compared to standard target encoding by preventing the target value from influencing its own encoding.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
smoothing (float, default=0.0) – Smoothing parameter for regularization toward the global mean. Higher values increase regularization. Use 0 for no smoothing.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_loo’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Initialize and use LeaveOneOutEncoder.
>>> import polars as pl
>>> from gators.encoders import LeaveOneOutEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "A", "B", "C"],
...     "target": [1, 0, 1, 0, 0, 1, 1],
...     "value": [1, 2, 3, 4, 5, 6, 7]
... })
>>> encoder = LeaveOneOutEncoder(subset=["category"], smoothing=1.0)
>>> _ = encoder.fit(X, y=X["target"])
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌────────┬──────────────────────┬───────┐
│ target ┆ category__encode_loo ┆ value │
│ ---    ┆ ---                  ┆ ---   │
│ i64    ┆ f64                  ┆ i64   │
╞════════╪══════════════════════╪═══════╡
│ 1      ┆ 0.571429             ┆ 1     │
│ 0      ┆ 0.571429             ┆ 2     │
│ 1      ┆ 0.571429             ┆ 3     │
│ 0      ┆ 0.571429             ┆ 4     │
│ 0      ┆ 0.666667             ┆ 5     │
│ 1      ┆ 0.571429             ┆ 6     │
│ 1      ┆ 0.571429             ┆ 7     │
└────────┴──────────────────────┴───────┘
Example with no smoothing:
>>> X = pl.DataFrame({
...     "category": ["A", "A", "A", "B", "B"],
...     "target": [1, 0, 1, 0, 1],
...     "value": [1, 2, 3, 4, 5]
... })
>>> encoder = LeaveOneOutEncoder(subset=["category"], smoothing=0.0)
>>> _ = encoder.fit(X, y=X["target"])
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌────────┬──────────────────────┬───────┐
│ target ┆ category__encode_loo ┆ value │
│ ---    ┆ ---                  ┆ ---   │
│ i64    ┆ f64                  ┆ i64   │
╞════════╪══════════════════════╪═══════╡
│ 1      ┆ 0.666667             ┆ 1     │
│ 0      ┆ 0.666667             ┆ 2     │
│ 1      ┆ 0.500000             ┆ 3     │
│ 0      ┆ 0.500000             ┆ 4     │
│ 1      ┆ 0.500000             ┆ 5     │
└────────┴──────────────────────┴───────┘
- fit(X: polars.DataFrame, y: polars.Series) → gators.encoders.leave_one_out_encoder.LeaveOneOutEncoder
Fit the transformer by computing leave-one-out target statistics.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series) – Target series (binary or continuous). Required for LeaveOneOutEncoder.
- Returns:
The fitted transformer instance.
- Return type:
LeaveOneOutEncoder
- Raises:
ValueError – If y is None.
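The leave-one-out statistic described for LeaveOneOutEncoder (the category's target mean with the current row excluded) can be sketched without smoothing. The helper name `leave_one_out` is hypothetical, the fallback to the global mean for a category with a single row is an assumption, and the sketch covers only the training-time statistic, not how the library encodes rows in transform().

```python
from collections import defaultdict


def leave_one_out(categories, targets):
    """Mean target of each row's category, excluding the row itself."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, y in zip(categories, targets):
        n_others = counts[cat] - 1
        if n_others == 0:
            # Assumption: a category seen once has no other rows to average.
            encoded.append(global_mean)
        else:
            # Subtract the row's own target before averaging.
            encoded.append((sums[cat] - y) / n_others)
    return encoded


encoded = leave_one_out(["x", "x", "y", "y", "y"], [1, 0, 1, 1, 0])
print(encoded)
```

Two rows of the same category can receive different encodings, because each one leaves out a different target value.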
- class gators.encoders.OneHotEncoder
Bases: gators.transformer._base_transformer._BaseTransformer
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.onehot_encoder.OneHotEncoder
Fit the transformer by identifying categories for one-hot encoding.
- Parameters:
X (pl.DataFrame) – Input DataFrame with string columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
OneHotEncoder
- transform(X: polars.DataFrame) → polars.DataFrame
Transform the input DataFrame by applying one-hot encoding to categorical columns.
- Parameters:
X (pl.DataFrame) – Input DataFrame with string columns.
- Returns:
DataFrame with one-hot encoded columns (one binary column per category).
- Return type:
pl.DataFrame
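The fit/transform split described for OneHotEncoder (identify categories first, then emit one binary column per category) can be sketched in plain Python. The function names are hypothetical, and encoding an unseen value as all zeros is an assumption; the source does not document how the library treats unseen categories or how it names output columns.

```python
def fit_categories(values):
    """The "fit" step: remember the distinct categories in a stable order."""
    return sorted(set(values))


def one_hot(values, categories):
    """The "transform" step: one 0/1 column per category learned at fit time."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


categories = fit_categories(["red", "blue", "green", "red"])
encoded = one_hot(["blue", "red", "purple"], categories)
for name, column in encoded.items():
    print(name, column)
```

The number of output columns equals the number of categories seen during fitting, which is why the BinaryEncoder description calls binary encoding more compact for high-cardinality features.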
- class gators.encoders.OrdinalEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical values as ordinal.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__ordinal_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Basic usage:
>>> from gators.encoders import OrdinalEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": [True, False, True, True, False],
... })
>>> encoder = OrdinalEncoder(inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌────────────────┬────────────────┐
│ A__ordinal_enc ┆ B__ordinal_enc │
│ f64            ┆ f64            │
╞════════════════╪════════════════╡
│ 3.0            ┆ 2.0            │
│ 2.0            ┆ 1.0            │
│ 3.0            ┆ 2.0            │
│ 2.0            ┆ 2.0            │
│ 1.0            ┆ 1.0            │
└────────────────┴────────────────┘
Keep the original columns (drop_columns=False):
>>> encoder = OrdinalEncoder(drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬───────┬────────────────┬────────────────┐
│ A   ┆ B     ┆ A__ordinal_enc ┆ B__ordinal_enc │
│ str ┆ bool  ┆ f64            ┆ f64            │
╞═════╪═══════╪════════════════╪════════════════╡
│ foo ┆ true  ┆ 3.0            ┆ 2.0            │
│ bar ┆ false ┆ 2.0            ┆ 1.0            │
│ foo ┆ true  ┆ 3.0            ┆ 2.0            │
│ bar ┆ true  ┆ 2.0            ┆ 2.0            │
│ baz ┆ false ┆ 1.0            ┆ 1.0            │
└─────┴───────┴────────────────┴────────────────┘
Subset of columns:
>>> encoder = OrdinalEncoder(subset=["A"], inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 1)
┌────────────────┐
│ A__ordinal_enc │
│ f64            │
╞════════════════╡
│ 3.0            │
│ 2.0            │
│ 3.0            │
│ 2.0            │
│ 1.0            │
└────────────────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.ordinal_encoder.OrdinalEncoder
Fit the transformer by computing ordinal mappings based on category frequency.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
OrdinalEncoder
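A frequency-based ordinal mapping of the kind OrdinalEncoder.fit describes can be sketched as below. The helper name `fit_ordinal` is hypothetical. Ranking from least to most frequent starting at 1 follows the example output above, while breaking ties alphabetically is an assumption that happens to agree with that output.

```python
from collections import Counter


def fit_ordinal(values):
    """Map each category to a rank: 1.0 for the rarest, increasing with count."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken by the value's text form.
    ordered = sorted(counts, key=lambda v: (counts[v], str(v)))
    return {v: float(rank) for rank, v in enumerate(ordered, start=1)}


column = ["foo", "bar", "foo", "bar", "baz"]
mapping = fit_ordinal(column)
encoded = [mapping[v] for v in column]
print(encoded)
```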
- class gators.encoders.RareCategoryEncoder
Bases: gators.transformer._base_transformer._BaseTransformer
Encodes rare categories.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
default (str, default="RARE") – Value to replace rare categories with.
min_count (PositiveInt | PositiveFloat, default=2) – Minimum count threshold for categories. Categories below this threshold are replaced with default. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_rare’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.encoders import RareCategoryEncoder
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x'],
...     'target': [1, 0, 1, 1, 0]
... })
>>> encoder = RareCategoryEncoder(inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌────────────────┬────────────────┐
│ A__encode_rare ┆ B__encode_rare │
│ ---            ┆ ---            │
│ str            ┆ str            │
╞════════════════╪════════════════╡
│ cat            ┆ x              │
│ dog            ┆ x              │
│ cat            ┆ RARE           │
│ dog            ┆ RARE           │
│ cat            ┆ x              │
└────────────────┴────────────────┘
>>> encoder = RareCategoryEncoder(drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 5)
┌─────┬─────┬────────┬────────────────┬────────────────┐
│ A   ┆ B   ┆ target ┆ A__encode_rare ┆ B__encode_rare │
│ --- ┆ --- ┆ ---    ┆ ---            ┆ ---            │
│ str ┆ str ┆ i64    ┆ str            ┆ str            │
╞═════╪═════╪════════╪════════════════╪════════════════╡
│ cat ┆ x   ┆ 1      ┆ cat            ┆ x              │
│ dog ┆ x   ┆ 0      ┆ dog            ┆ x              │
│ cat ┆ y   ┆ 1      ┆ cat            ┆ RARE           │
│ dog ┆ y   ┆ 1      ┆ dog            ┆ RARE           │
│ cat ┆ x   ┆ 0      ┆ cat            ┆ x              │
└─────┴─────┴────────┴────────────────┴────────────────┘
>>> encoder = RareCategoryEncoder(subset=['A'], inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬────────┬────────────────┐
│ A   ┆ B   ┆ target ┆ A__encode_rare │
│ --- ┆ --- ┆ ---    ┆ ---            │
│ str ┆ str ┆ i64    ┆ str            │
╞═════╪═════╪════════╪════════════════╡
│ cat ┆ x   ┆ 1      ┆ cat            │
│ dog ┆ x   ┆ 0      ┆ dog            │
│ cat ┆ y   ┆ 1      ┆ cat            │
│ dog ┆ y   ┆ 1      ┆ dog            │
│ cat ┆ x   ┆ 0      ┆ cat            │
└─────┴─────┴────────┴────────────────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None)
Fit the transformer by identifying rare categories.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
RareCategoryEncoder
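The two readings of `min_count` (absolute count when >= 1, frequency when < 1) and the replacement of rare categories can be sketched as follows. The helper name `encode_rare` is hypothetical, and treating a category exactly at the threshold as not rare follows the parameter description ("below this threshold") rather than any verified library behaviour.

```python
from collections import Counter


def encode_rare(values, min_count=2, default="RARE"):
    """Replace categories whose count falls below the threshold with `default`."""
    counts = Counter(values)
    # min_count >= 1 is an absolute count; min_count < 1 is a share of the rows.
    threshold = min_count if min_count >= 1 else min_count * len(values)
    return [v if counts[v] >= threshold else default for v in values]


column = ["x", "x", "y", "y", "x"]
by_count = encode_rare(column, min_count=3)    # "y" occurs 2 times, below 3
by_freq = encode_rare(column, min_count=0.5)   # threshold is 0.5 * 5 = 2.5 rows
print(by_count, by_freq)
```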
- class gators.encoders.TargetEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical values with the mean of the target for each category.
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__target_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
Basic usage:
>>> from gators.encoders import TargetEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": [True, False, True, True, False],
... })
>>> target = pl.Series("target", [1, 0, 1, 1, 0])
>>> encoder = TargetEncoder(inplace=False, drop_columns=True)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────┬───────────────┐
│ B__target_enc ┆ A__target_enc │
│ ---           ┆ ---           │
│ f64           ┆ f64           │
╞═══════════════╪═══════════════╡
│ 1.0           ┆ 1.0           │
│ 0.0           ┆ 0.5           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 0.5           │
│ 0.0           ┆ 0.0           │
└───────────────┴───────────────┘
Keep the original columns (drop_columns=False):
>>> encoder = TargetEncoder(drop_columns=False, inplace=False)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬───────┬───────────────┬───────────────┐
│ A   ┆ B     ┆ A__target_enc ┆ B__target_enc │
│ str ┆ bool  ┆ f64           ┆ f64           │
╞═════╪═══════╪═══════════════╪═══════════════╡
│ foo ┆ true  ┆ 1.0           ┆ 1.0           │
│ bar ┆ false ┆ 0.5           ┆ 0.0           │
│ foo ┆ true  ┆ 1.0           ┆ 1.0           │
│ bar ┆ true  ┆ 0.5           ┆ 1.0           │
│ baz ┆ false ┆ 0.0           ┆ 0.0           │
└─────┴───────┴───────────────┴───────────────┘
Subset of columns:
>>> encoder = TargetEncoder(subset=["A"], inplace=False, drop_columns=True)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 1)
┌───────────────┐
│ A__target_enc │
│ f64           │
╞═══════════════╡
│ 1.0           │
│ 0.5           │
│ 1.0           │
│ 0.5           │
│ 0.0           │
└───────────────┘
- fit(X: polars.DataFrame, y: polars.Series) → gators.encoders.target_encoder.TargetEncoder
Fit the transformer by computing target mean for each category.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series) – Target series (binary or continuous).
- Returns:
The fitted transformer instance.
- Return type:
TargetEncoder
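The per-category target mean that TargetEncoder.fit computes can be reproduced by hand for column A of the example data. This is a plain-Python sketch with a hypothetical helper name, not the library's code, and it ignores `min_count`.

```python
from collections import defaultdict


def target_mean_mapping(categories, targets):
    """Return {category: mean of the targets observed for that category}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


column_a = ["foo", "bar", "foo", "bar", "baz"]
target = [1, 0, 1, 1, 0]
mapping = target_mean_mapping(column_a, target)
encoded = [mapping[v] for v in column_a]
print(encoded)
```

Here `bar` appears with targets 0 and 1, so it encodes to 0.5, while `foo` (targets 1 and 1) encodes to 1.0 and `baz` (target 0) to 0.0.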
- class gators.encoders.WOEEncoder
Bases: gators.encoders._base_encoder._BaseEncoder
Encodes categorical variables using Weight of Evidence (WOE).
- Parameters:
subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
regularization (float, default=0.01) – Regularization term (0.0-1.0) to prevent division by zero in WOE calculation.
default (float, default=0.0) – Default WOE value for categories with insufficient counts or unseen categories.
min_count (PositiveInt | PositiveFloat, default=1) – Minimum count threshold for categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_woe’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.encoders import WOEEncoder
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x']
... })
>>> y = pl.Series('target', [1, 0, 1, 1, 0])
>>> encoder = WOEEncoder(inplace=False, drop_columns=True)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────┬───────────────┐
│ A__encode_woe ┆ B__encode_woe │
│ ---           ┆ ---           │
│ f64           ┆ f64           │
╞═══════════════╪═══════════════╡
│ 0.287682      ┆ -1.090344     │
│ -0.402159     ┆ -1.090344     │
│ 0.287682      ┆ 4.901146      │
│ -0.402159     ┆ 4.901146      │
│ 0.287682      ┆ -1.090344     │
└───────────────┴───────────────┘
>>> # Encoding with drop_columns=False
>>> encoder = WOEEncoder(inplace=False, drop_columns=False)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬───────────────┬───────────────┐
│ A   ┆ B   ┆ A__encode_woe ┆ B__encode_woe │
│ --- ┆ --- ┆ ---           ┆ ---           │
│ str ┆ str ┆ f64           ┆ f64           │
╞═════╪═════╪═══════════════╪═══════════════╡
│ cat ┆ x   ┆ 0.287682      ┆ 0.287682      │
│ dog ┆ x   ┆ -1.203973     ┆ 0.287682      │
│ cat ┆ y   ┆ 0.287682      ┆ -1.203973     │
│ dog ┆ y   ┆ -1.203973     ┆ -1.203973     │
│ cat ┆ x   ┆ 0.287682      ┆ 0.287682      │
└─────┴─────┴───────────────┴───────────────┘
>>> # Encoding with a subset of columns
>>> encoder = WOEEncoder(subset=['A'], inplace=False, drop_columns=False)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X.shape)
(5, 3)
- fit(X: polars.DataFrame, y: polars.Series) → gators.encoders.woe_encoder.WOEEncoder
Fit the transformer by computing Weight of Evidence values for each category.
- Parameters:
X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series) – Binary target series (must contain 0s and 1s).
- Returns:
The fitted transformer instance.
- Return type:
WOEEncoder
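A Weight of Evidence value compares how a category is distributed among positive and negative targets. The sketch below uses the common definition, the log of (share of positives in the category) over (share of negatives in the category). The helper name `woe_mapping` is hypothetical, and adding `regularization` to both shares is an assumed way of avoiding division by zero; the source does not state the exact formula WOEEncoder uses.

```python
from math import log


def woe_mapping(categories, targets, regularization=0.01):
    """Return {category: WOE} for a binary 0/1 target (both classes present)."""
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos
    mapping = {}
    for cat in set(categories):
        pos = sum(1 for c, y in zip(categories, targets) if c == cat and y == 1)
        neg = sum(1 for c, y in zip(categories, targets) if c == cat and y == 0)
        # Assumed regularization: added to both shares so neither can be zero.
        pos_share = pos / total_pos + regularization
        neg_share = neg / total_neg + regularization
        mapping[cat] = log(pos_share / neg_share)
    return mapping


column_a = ["cat", "dog", "cat", "dog", "cat"]
target = [1, 0, 1, 1, 0]
woe = woe_mapping(column_a, target, regularization=0.0)
print(woe)
```

With no regularization, `cat` holds 2 of the 3 positives and 1 of the 2 negatives, so its WOE is ln((2/3) / (1/2)) = ln(4/3) ≈ 0.287682, the value printed for `cat` in the examples above.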