gators.encoders package

Module contents

class gators.encoders.BinaryEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical values using binary representation.

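Each category is first mapped to an integer, and that integer is then written in binary, with each binary digit becoming a separate column. A feature with n categories therefore needs roughly log2(n) columns instead of the n columns of one-hot encoding, which makes this encoding more compact for high-cardinality features.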

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__binary_enc_{bit_index}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

Initialize and use BinaryEncoder.

Example with drop_columns=True and subset=None:

>>> import polars as pl
>>> from gators.encoders import BinaryEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "C", "D", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6]
... })
>>> encoder = BinaryEncoder(min_count=1, inplace=False, drop_columns=True)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (6, 3)
┌───────┬────────────────────────┬────────────────────────┐
│ value ┆ category__binary_enc_0 ┆ category__binary_enc_1 │
│ ---   ┆ ---                    ┆ ---                    │
│ i64   ┆ f64                    ┆ f64                    │
╞═══════╪════════════════════════╪════════════════════════╡
│ 1     ┆ 1.0                    ┆ 1.0                    │
│ 2     ┆ 1.0                    ┆ 0.0                    │
│ 3     ┆ 0.0                    ┆ 0.0                    │
│ 4     ┆ 0.0                    ┆ 1.0                    │
│ 5     ┆ 1.0                    ┆ 1.0                    │
│ 6     ┆ 1.0                    ┆ 0.0                    │
└───────┴────────────────────────┴────────────────────────┘

Example with drop_columns=False:

>>> X = pl.DataFrame({
...     "category": ["A", "B", "C", "D", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6]
... })
>>> encoder = BinaryEncoder(subset=["category"], inplace=False, drop_columns=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (6, 4)
┌──────────┬───────┬────────────────────────┬────────────────────────┐
│ category ┆ value ┆ category__binary_enc_0 ┆ category__binary_enc_1 │
│ ---      ┆ ---   ┆ ---                    ┆ ---                    │
│ str      ┆ i64   ┆ f64                    ┆ f64                    │
╞══════════╪═══════╪════════════════════════╪════════════════════════╡
│ A        ┆ 1     ┆ 0.0                    ┆ 0.0                    │
│ B        ┆ 2     ┆ 1.0                    ┆ 0.0                    │
│ C        ┆ 3     ┆ 0.0                    ┆ 1.0                    │
│ D        ┆ 4     ┆ 1.0                    ┆ 1.0                    │
│ A        ┆ 5     ┆ 0.0                    ┆ 0.0                    │
│ B        ┆ 6     ┆ 1.0                    ┆ 0.0                    │
└──────────┴───────┴────────────────────────┴────────────────────────┘
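
The integer-then-bits idea described above can be sketched in plain Python. This is only an illustration, not the gators implementation: the helper name binary_encode is hypothetical, and it assumes integer codes are assigned in order of first appearance (starting at 0) with bit 0 as the least significant bit.

```python
def binary_encode(values):
    """Return one list of bit values (as floats) per input value."""
    # Assumption: integer codes follow order of first appearance, starting at 0.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    # Bits needed to represent the largest code; always keep at least one column.
    n_bits = max(1, (len(codes) - 1).bit_length())
    # Bit 0 is the least significant bit of the integer code.
    return [[float((codes[v] >> bit) & 1) for bit in range(n_bits)] for v in values]


rows = binary_encode(["A", "B", "C", "D", "A", "B"])
for row in rows:
    print(row)
```

Under these assumptions four categories fit in two bit columns, and a fifth category would require a third column.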

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.binary_encoder.BinaryEncoder

Fit the transformer by computing binary encoding mappings.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

BinaryEncoder

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by applying binary encoding to categorical columns.

Parameters: X (pl.DataFrame) – Input DataFrame with categorical columns.
Returns: DataFrame with binary encoded columns (each bit as a separate column).
Return type: pl.DataFrame

class gators.encoders.CatBoostEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical values using CatBoost target encoding with ordered statistics.

This encoder implements the CatBoost algorithm’s approach to target encoding, which uses ordered target statistics to prevent target leakage and overfitting. For each category, it calculates the cumulative mean of the target up to (but not including) the current row.

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
smoothing (float, default=1.0) – Smoothing parameter for regularization toward the global mean. Higher values increase regularization.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_catboost’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

Initialize and use CatBoostEncoder.

>>> import polars as pl
>>> from gators.encoders import CatBoostEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "A", "B", "C"],
...     "target": [1, 0, 1, 0, 0, 1, 1],
...     "value": [1, 2, 3, 4, 5, 6, 7]
... })
>>> y = X["target"]
>>> encoder = CatBoostEncoder(subset=["category"], smoothing=1.0, inplace=False, drop_columns=True)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬────────────────────────────┬───────┐
│ target┆ category__encode_catboost  ┆ value │
│ ---   ┆ ---                        ┆ ---   │
│ i64   ┆ f64                        ┆ i64   │
╞═══════╪════════════════════════════╪═══════╡
│ 1     ┆ 0.571429                   ┆ 1     │
│ 0     ┆ 0.571429                   ┆ 2     │
│ 1     ┆ 0.666667                   ┆ 3     │
│ 0     ┆ 0.571429                   ┆ 4     │
│ 0     ┆ 0.600000                   ┆ 5     │
│ 1     ┆ 0.428571                   ┆ 6     │
│ 1     ┆ 0.428571                   ┆ 7     │
└───────┴────────────────────────────┴───────┘
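
The ordered statistic described above (the mean of the target over earlier rows of the same category, pulled toward the global mean) can be sketched in plain Python. The formula below is a common textbook form and an assumption here; the helper name ordered_target_stats is hypothetical, and the values it produces are not guaranteed to match gators numerically.

```python
def ordered_target_stats(categories, targets, smoothing=1.0):
    """Encode each row from the targets of earlier rows in the same category."""
    prior = sum(targets) / len(targets)  # global mean used for regularization
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        denom = seen_count + smoothing
        # With no history and no smoothing, fall back to the global mean.
        if denom == 0:
            encoded.append(prior)
        else:
            encoded.append((seen_sum + smoothing * prior) / denom)
        # Update running statistics only after the row has been encoded,
        # so a row never sees its own target.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


encoded = ordered_target_stats(["a", "a", "b", "a"], [1, 0, 1, 1], smoothing=1.0)
print(encoded)
```

The first occurrence of every category receives the global mean, which is what prevents the row's own target from leaking into its encoding.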

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.catboost_encoder.CatBoostEncoder

Fit the transformer by computing CatBoost ordered target statistics.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (binary or continuous). Required for CatBoostEncoder.

Returns:

The fitted transformer instance.

Return type:

CatBoostEncoder

Raises:

ValueError – If y is None.

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame using CatBoost encoding.

Parameters: X (pl.DataFrame) – Input DataFrame to transform.
Returns: Transformed DataFrame with CatBoost encoded columns.
Return type: pl.DataFrame

class gators.encoders.CountEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical values with their occurrence counts.

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__count_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

Initialize and use CountEncoder.

Example with drop_columns=True and subset=None:

>>> import polars as pl
>>> from gators.encoders import CountEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(min_count=1, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬─────────────────────┬──────────────────┐
│ value ┆ category__count_enc ┆ other__count_enc │
│ ---   ┆ ---                 ┆ ---              │
│ i64   ┆ f64                 ┆ f64              │
╞═══════╪═════════════════════╪══════════════════╡
│ 1     ┆ 3.0                 ┆ 1.0              │
│ 2     ┆ 2.0                 ┆ 1.0              │
│ 3     ┆ 3.0                 ┆ 1.0              │
│ 4     ┆ 2.0                 ┆ 1.0              │
│ 5     ┆ 2.0                 ┆ 1.0              │
│ 6     ┆ 3.0                 ┆ 1.0              │
│ 7     ┆ 2.0                 ┆ 1.0              │
└───────┴─────────────────────┴──────────────────┘

Example with drop_columns=True and a subset of columns:

>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(subset=["category"], min_count=1, drop_columns=True, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬────────┬─────────────────────┐
│ value ┆ other  ┆ category__count_enc │
│ ---   ┆ ---    ┆ ---                 │
│ i64   ┆ str    ┆ f64                 │
╞═══════╪════════╪═════════════════════╡
│ 1     ┆ foo    ┆ 3.0                 │
│ 2     ┆ bar    ┆ 2.0                 │
│ 3     ┆ baz    ┆ 3.0                 │
│ 4     ┆ qux    ┆ 2.0                 │
│ 5     ┆ quux   ┆ 2.0                 │
│ 6     ┆ corge  ┆ 3.0                 │
│ 7     ┆ grault ┆ 2.0                 │
└───────┴────────┴─────────────────────┘

Example with drop_columns=False and subset=None:

>>> import polars as pl
>>> from gators.encoders import CountEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "C", "A", "B"],
...     "value": [1, 2, 3, 4, 5, 6, 7],
...     "other": ["foo", "bar", "baz", "qux", "quux", "corge", "grault"]
... })
>>> encoder = CountEncoder(min_count=1, drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 5)
┌──────────┬───────┬────────┬─────────────────────┬──────────────────┐
│ category ┆ value ┆ other  ┆ category__count_enc ┆ other__count_enc │
│ ---      ┆ ---   ┆ ---    ┆ ---                 ┆ ---              │
│ str      ┆ i64   ┆ str    ┆ f64                 ┆ f64              │
╞══════════╪═══════╪════════╪═════════════════════╪══════════════════╡
│ A        ┆ 1     ┆ foo    ┆ 3.0                 ┆ 1.0              │
│ B        ┆ 2     ┆ bar    ┆ 2.0                 ┆ 1.0              │
│ A        ┆ 3     ┆ baz    ┆ 3.0                 ┆ 1.0              │
│ C        ┆ 4     ┆ qux    ┆ 2.0                 ┆ 1.0              │
│ C        ┆ 5     ┆ quux   ┆ 2.0                 ┆ 1.0              │
│ A        ┆ 6     ┆ corge  ┆ 3.0                 ┆ 1.0              │
│ B        ┆ 7     ┆ grault ┆ 2.0                 ┆ 1.0              │
└──────────┴───────┴────────┴─────────────────────┴──────────────────┘
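
As a rough illustration of how min_count can be read as either an absolute count or a frequency, here is a plain-Python sketch of count encoding. The helper name count_encode is hypothetical, and returning None for categories under the threshold is an assumption; the documentation above does not say what gators emits in that case.

```python
from collections import Counter


def count_encode(values, min_count=1):
    """Replace each value with the number of times it occurs."""
    counts = Counter(values)
    # A threshold below 1 is interpreted as a fraction of the number of rows.
    threshold = min_count if min_count >= 1 else min_count * len(values)
    # Assumption: categories under the threshold are encoded as None.
    mapping = {c: float(n) if n >= threshold else None for c, n in counts.items()}
    return [mapping[v] for v in values]


category = ["A", "B", "A", "C", "C", "A", "B"]
print(count_encode(category))                 # every category kept
print(count_encode(category, min_count=3))    # absolute threshold
print(count_encode(category, min_count=0.4))  # frequency threshold (0.4 * 7 rows)
```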

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.count_encoder.CountEncoder

Fit the transformer by computing count statistics for each category.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

CountEncoder

class gators.encoders.HashEncoder

Bases: gators.transformer._base_transformer._BaseTransformer

Encode categorical features via feature hashing (hashing trick).

Maps each category value to an integer bucket in [0, n_features) using a deterministic hash function. Because no vocabulary is learnt during fit(), the encoder handles unknown categories at inference time naturally — any unseen value simply hashes to some bucket without raising an error.

Each input column produces one output column containing integers in [0, n_features). The output column name is {col}__hash.

Parameters:

n_features (int, default=16) – Number of hash buckets. Controls the trade-off between collision rate (lower → more collisions) and the number of distinct encodings (higher → fewer collisions but larger range). Must be ≥ 2.
subset (list[str] or None, default=None) – Categorical (String, Categorical, Enum, Boolean) columns to encode. If None, all such columns are selected automatically.
inplace (bool, default=True) – If True the original columns are overwritten with hash values (cast to Float64 for consistency with other encoders). If False new columns suffixed __hash are added alongside the originals (subject to drop_columns).
drop_columns (bool, default=True) – When inplace=False, whether to drop the original columns after adding the hashed columns. Ignored when inplace=True.

Notes

The hash is computed with polars.Expr.hash() using seed=0 for full determinism across processes (unlike Python’s built-in hash() which is randomised by PYTHONHASHSEED). Boolean columns are cast to String before hashing so that True / False map to stable buckets.

Examples

>>> import polars as pl
>>> from gators.encoders import HashEncoder

>>> X = pl.DataFrame({
...     "color":  ["red", "blue", "green", "red", "blue"],
...     "size":   ["S", "M", "L", "XL", "S"],
...     "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
... })
>>> encoder = HashEncoder(n_features=8, inplace=False)
>>> _ = encoder.fit(X)
>>> X_enc = encoder.transform(X)
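
To make the hashing trick concrete, the sketch below maps values to buckets in [0, n_features) with a deterministic hash. It deliberately swaps in hashlib.sha256 from the standard library rather than polars.Expr.hash(), so bucket numbers will differ from those produced by HashEncoder; the helper name hash_bucket is hypothetical.

```python
import hashlib


def hash_bucket(value, n_features=16):
    """Map a value to a stable bucket index in [0, n_features)."""
    # sha256 is deterministic across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into the bucket range.
    return int.from_bytes(digest[:8], "big") % n_features


colors = ["red", "blue", "green", "red", "blue"]
buckets = [hash_bucket(c, n_features=8) for c in colors]
print(buckets)
# A value never seen before still lands in a valid bucket, with no vocabulary needed.
print(hash_bucket("purple", n_features=8))
```

Because nothing is learnt, two distinct categories may share a bucket; raising n_features lowers the chance of such collisions.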

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.hash_encoder.HashEncoder

Detect the subset of categorical columns (no statistics are learnt).

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Not used; present for sklearn compatibility.

Returns:

The fitted transformer instance.

Return type:

HashEncoder

transform(X: polars.DataFrame) → polars.DataFrame

Encode categorical columns using the hashing trick.

Parameters: X (pl.DataFrame) – Input DataFrame to transform.
Returns: DataFrame with hash-encoded columns.
Return type: pl.DataFrame

class gators.encoders.LeaveOneOutEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical values using leave-one-out target encoding.

For each row, this encoder calculates the mean of the target variable for the category, excluding the current row. This reduces overfitting compared to standard target encoding by preventing the target value from influencing its own encoding.

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
smoothing (float, default=0.0) – Smoothing parameter for regularization toward the global mean. Higher values increase regularization. Use 0 for no smoothing.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_loo’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

Initialize and use LeaveOneOutEncoder.

>>> import polars as pl
>>> from gators.encoders import LeaveOneOutEncoder
>>> X = pl.DataFrame({
...     "category": ["A", "B", "A", "C", "A", "B", "C"],
...     "target": [1, 0, 1, 0, 0, 1, 1],
...     "value": [1, 2, 3, 4, 5, 6, 7]
... })
>>> encoder = LeaveOneOutEncoder(subset=["category"], smoothing=1.0, inplace=False)
>>> _ = encoder.fit(X, y=X["target"])
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (7, 3)
┌───────┬──────────────────────────┬───────┐
│ target┆ category__encode_loo     ┆ value │
│ ---   ┆ ---                      ┆ ---   │
│ i64   ┆ f64                      ┆ i64   │
╞═══════╪══════════════════════════╪═══════╡
│ 1     ┆ 0.571429                 ┆ 1     │
│ 0     ┆ 0.571429                 ┆ 2     │
│ 1     ┆ 0.571429                 ┆ 3     │
│ 0     ┆ 0.571429                 ┆ 4     │
│ 0     ┆ 0.666667                 ┆ 5     │
│ 1     ┆ 0.571429                 ┆ 6     │
│ 1     ┆ 0.571429                 ┆ 7     │
└───────┴──────────────────────────┴───────┘

Example with no smoothing:

>>> X = pl.DataFrame({
...     "category": ["A", "A", "A", "B", "B"],
...     "target": [1, 0, 1, 0, 1],
...     "value": [1, 2, 3, 4, 5]
... })
>>> encoder = LeaveOneOutEncoder(subset=["category"], smoothing=0.0, inplace=False)
>>> _ = encoder.fit(X, y=X["target"])
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌───────┬─────────────────────────┬───────┐
│ target┆ category__encode_loo    ┆ value │
│ ---   ┆ ---                     ┆ ---   │
│ i64   ┆ f64                     ┆ i64   │
╞═══════╪═════════════════════════╪═══════╡
│ 1     ┆ 0.666667                ┆ 1     │
│ 0     ┆ 0.666667                ┆ 2     │
│ 1     ┆ 0.500000                ┆ 3     │
│ 0     ┆ 0.500000                ┆ 4     │
│ 1     ┆ 0.500000                ┆ 5     │
└───────┴─────────────────────────┴───────┘
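
The leave-one-out idea (the category mean of the target with the current row removed) can be sketched in plain Python as follows. The smoothing formula and the fallback to the global mean for single-row categories are assumptions, the helper name leave_one_out is hypothetical, and the numbers are not guaranteed to match what gators returns.

```python
def leave_one_out(categories, targets, smoothing=0.0):
    """Encode each row with its category's target mean, excluding the row itself."""
    prior = sum(targets) / len(targets)  # global mean
    sums, counts = {}, {}
    for cat, target in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + target
        counts[cat] = counts.get(cat, 0) + 1
    encoded = []
    for cat, target in zip(categories, targets):
        # Remove the current row from its category's statistics.
        denom = (counts[cat] - 1) + smoothing
        if denom == 0:
            # Assumption: a category seen only once falls back to the global mean.
            encoded.append(prior)
        else:
            encoded.append((sums[cat] - target + smoothing * prior) / denom)
    return encoded


encoded = leave_one_out(["x", "x", "y", "x"], [1, 0, 1, 1])
print(encoded)
```

Rows in the same category can receive different values, because each row excludes a different target from the mean.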

fit(X: polars.DataFrame, y: polars.Series) → gators.encoders.leave_one_out_encoder.LeaveOneOutEncoder

Fit the transformer by computing leave-one-out target statistics.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series) – Target series (binary or continuous). Required for LeaveOneOutEncoder.

Returns:

The fitted transformer instance.

Return type:

LeaveOneOutEncoder

Raises:

ValueError – If y is None.

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame using leave-one-out encoding.

Parameters: X (pl.DataFrame) – Input DataFrame to transform.
Returns: Transformed DataFrame with leave-one-out encoded columns.
Return type: pl.DataFrame

class gators.encoders.OneHotEncoder

Bases: gators.transformer._base_transformer._BaseTransformer

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.onehot_encoder.OneHotEncoder

Fit the transformer by identifying categories for one-hot encoding.

Parameters:

X (pl.DataFrame) – Input DataFrame with string columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

OneHotEncoder

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by applying one-hot encoding to categorical columns.

Parameters: X (pl.DataFrame) – Input DataFrame with string columns.
Returns: DataFrame with one-hot encoded columns (one binary column per category).
Return type: pl.DataFrame
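
Since no example is given for this class, the following plain-Python sketch shows what "one binary column per category" means. It does not use the OneHotEncoder API, whose constructor parameters are not documented above; the helper name one_hot and the choice to key columns by the sorted category value are assumptions.

```python
def one_hot(values):
    """Return a dict mapping each category to its 0/1 indicator column."""
    # Assumption: output columns are ordered by sorted category value.
    categories = sorted(set(values))
    return {c: [1.0 if v == c else 0.0 for v in values] for c in categories}


columns = one_hot(["red", "blue", "red"])
for name, column in columns.items():
    print(name, column)
```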

class gators.encoders.OrdinalEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical values as ordinal numbers ranked by category frequency.

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__ordinal_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

Basic usage:

>>> from gators.encoders import OrdinalEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": [True, False, True, True, False],
... })
>>> encoder = OrdinalEncoder(inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────┬───────────────┐
│ A__ordinal_enc│ B__ordinal_enc│
│ f64           │ f64           │
╞═══════════════╪═══════════════╡
│ 3.0           │ 2.0           │
│ 2.0           │ 1.0           │
│ 3.0           │ 2.0           │
│ 2.0           │ 2.0           │
│ 1.0           │ 1.0           │
└───────────────┴───────────────┘

Keep the original columns (drop_columns=False):

>>> encoder = OrdinalEncoder(drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬───────┬────────────────┬────────────────┐
│ A   │ B     │ A__ordinal_enc │ B__ordinal_enc │
│ str │ bool  │ f64            │ f64            │
╞═════╪═══════╪════════════════╪════════════════╡
│ foo │ true  │ 3.0            │ 2.0            │
│ bar │ false │ 2.0            │ 1.0            │
│ foo │ true  │ 3.0            │ 2.0            │
│ bar │ true  │ 2.0            │ 2.0            │
│ baz │ false │ 1.0            │ 1.0            │
└─────┴───────┴────────────────┴────────────────┘

Subset of columns:

>>> encoder = OrdinalEncoder(subset=["A"], inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 1)
┌───────────────┐
│ A__ordinal_enc│
│ f64           │
╞═══════════════╡
│ 3.0           │
│ 2.0           │
│ 3.0           │
│ 2.0           │
│ 1.0           │
└───────────────┘
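
A frequency-based ordinal mapping can be sketched in plain Python as below. The helper name ordinal_by_frequency is hypothetical, ranks are assumed to start at 1 for the least frequent category, and ties are broken alphabetically, which is an assumption rather than documented gators behaviour.

```python
from collections import Counter


def ordinal_by_frequency(values):
    """Rank categories from least to most frequent, starting at 1."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken by the category's string form.
    ordered = sorted(counts, key=lambda c: (counts[c], str(c)))
    mapping = {c: float(rank) for rank, c in enumerate(ordered, start=1)}
    return [mapping[v] for v in values]


col_a = ordinal_by_frequency(["foo", "bar", "foo", "bar", "baz"])
col_b = ordinal_by_frequency([True, False, True, True, False])
print(col_a)
print(col_b)
```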

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.encoders.ordinal_encoder.OrdinalEncoder

Fit the transformer by computing ordinal mappings based on category frequency.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

OrdinalEncoder

class gators.encoders.RareCategoryEncoder

Bases: gators.transformer._base_transformer._BaseTransformer

Replaces rare categories with a single default value.

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
default (str, default="RARE") – Value to replace rare categories with.
min_count (PositiveInt | PositiveFloat, default=2) – Minimum count threshold for categories. Categories below this threshold are replaced with default. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_rare’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

>>> import polars as pl
>>> from gators.encoders import RareCategoryEncoder

>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x'],
...     'target': [1, 0, 1, 1, 0]
... })

>>> encoder = RareCategoryEncoder(inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────────┬───────────────────┐
│ A__encode_rare    │ B__encode_rare    │
│ ---               │ ---               │
│ str               │ str               │
├───────────────────┼───────────────────┤
│ cat               │ x                 │
│ dog               │ x                 │
│ cat               │ RARE              │
│ dog               │ RARE              │
│ cat               │ x                 │
└───────────────────┴───────────────────┘

>>> encoder = RareCategoryEncoder(drop_columns=False, inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 5)
┌─────┬─────┬────────┬───────────────────┬───────────────────┐
│ A   │ B   │ target │ A__encode_rare    │ B__encode_rare    │
│ --- │ --- │ ---    │ ---               │ ---               │
│ str │ str │ i64    │ str               │ str               │
├─────┼─────┼────────┼───────────────────┼───────────────────┤
│ cat │ x   │ 1      │ cat               │ x                 │
│ dog │ x   │ 0      │ dog               │ x                 │
│ cat │ y   │ 1      │ cat               │ RARE              │
│ dog │ y   │ 1      │ dog               │ RARE              │
│ cat │ x   │ 0      │ cat               │ x                 │
└─────┴─────┴────────┴───────────────────┴───────────────────┘

>>> encoder = RareCategoryEncoder(subset=['A'], inplace=False)
>>> _ = encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬────────┬───────────────────┐
│ A   │ B   │ target │ A__encode_rare    │
│ --- │ --- │ ---    │ ---               │
│ str │ str │ i64    │ str               │
├─────┼─────┼────────┼───────────────────┤
│ cat │ x   │ 1      │ cat               │
│ dog │ x   │ 0      │ dog               │
│ cat │ y   │ 1      │ cat               │
│ dog │ y   │ 1      │ dog               │
│ cat │ x   │ 0      │ cat               │
└─────┴─────┴────────┴───────────────────┘
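
The thresholding rule described for min_count (an absolute count when >= 1, a frequency when < 1) can be sketched in plain Python. The helper name replace_rare is hypothetical; the sketch follows the parameter description, where only categories strictly below the threshold are replaced.

```python
from collections import Counter


def replace_rare(values, min_count=2, default="RARE"):
    """Replace categories whose count is below the threshold with `default`."""
    counts = Counter(values)
    # A threshold below 1 is interpreted as a fraction of the number of rows.
    threshold = min_count if min_count >= 1 else min_count * len(values)
    return [v if counts[v] >= threshold else default for v in values]


animals = ["cat", "dog", "cat", "dog", "cat"]
print(replace_rare(animals, min_count=2))    # both categories meet the threshold
print(replace_rare(animals, min_count=3))    # "dog" (2 rows) falls below 3
print(replace_rare(animals, min_count=0.5))  # 0.5 * 5 rows = 2.5, so "dog" is rare
```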

fit(X: polars.DataFrame, y: polars.Series | None = None)

Fit the transformer by identifying rare categories.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

RareCategoryEncoder

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by replacing rare categories with the default value.

Parameters: X (pl.DataFrame) – Input DataFrame with categorical columns.
Returns: DataFrame with rare categories replaced.
Return type: pl.DataFrame

class gators.encoders.TargetEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical values with the mean of the target for each category.

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
min_count (int | float, default=1) – Minimum count threshold for encoding categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__target_enc’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

Basic usage:

>>> from gators.encoders import TargetEncoder
>>> import polars as pl
>>> X = pl.DataFrame({
...     "A": ["foo", "bar", "foo", "bar", "baz"],
...     "B": [True, False, True, True, False],
... })
>>> target = pl.Series("target", [1, 0, 1, 1, 0])
>>> encoder = TargetEncoder(inplace=False, drop_columns=True)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌───────────────┬───────────────┐
│ B__target_enc ┆ A__target_enc │
│ ---           ┆ ---           │
│ f64           ┆ f64           │
╞═══════════════╪═══════════════╡
│ 1.0           ┆ 1.0           │
│ 0.0           ┆ 0.5           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 0.5           │
│ 0.0           ┆ 0.0           │
└───────────────┴───────────────┘

Keep the original columns (drop_columns=False):

>>> encoder = TargetEncoder(drop_columns=False, inplace=False)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────────────┬─────────────┬───────────────┬───────────────┐
│ A           │ B           │ A__target_enc │ B__target_enc │
│ str         │ bool        │ f64           │ f64           │
╞═════════════╪═════════════╪═══════════════╪═══════════════╡
│ foo         │ true        │ 1.0           │ 1.0           │
│ bar         │ false       │ 0.5           │ 0.0           │
│ foo         │ true        │ 1.0           │ 1.0           │
│ bar         │ true        │ 0.5           │ 1.0           │
│ baz         │ false       │ 0.0           │ 0.0           │
└─────────────┴─────────────┴───────────────┴───────────────┘

Subset of columns:

>>> encoder = TargetEncoder(subset=["A"], inplace=False, drop_columns=True)
>>> encoder.fit(X, target)
TargetEncoder(...)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 1)
┌───────────────┐
│ A__target_enc │
│ f64           │
╞═══════════════╡
│ 1.0           │
│ 0.5           │
│ 1.0           │
│ 0.5           │
│ 0.0           │
└───────────────┘
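
The per-category target mean can be checked by hand with a few lines of plain Python. This sketch is only an illustration of the arithmetic (the helper name target_mean_encode is hypothetical) and ignores min_count.

```python
from collections import defaultdict


def target_mean_encode(categories, targets):
    """Replace each category with the mean of the target over its rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]


target = [1, 0, 1, 1, 0]
enc_a = target_mean_encode(["foo", "bar", "foo", "bar", "baz"], target)
enc_b = target_mean_encode([True, False, True, True, False], target)
print(enc_a)
print(enc_b)
```

Here "bar" appears with targets 0 and 1, so its encoded value is 0.5, while "foo" (targets 1 and 1) encodes to 1.0.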

fit(X: polars.DataFrame, y: polars.Series) → gators.encoders.target_encoder.TargetEncoder

Fit the transformer by computing target mean for each category.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series) – Target series (binary or continuous).

Returns:

The fitted transformer instance.

Return type:

TargetEncoder

class gators.encoders.WOEEncoder

Bases: gators.encoders._base_encoder._BaseEncoder

Encodes categorical variables using Weight of Evidence (WOE).

Parameters:

subset (list[str], default=None) – List of categorical columns to encode. If None, all string, boolean, and categorical columns are selected.
regularization (float, default=0.01) – Regularization term (0.0-1.0) to prevent division by zero in WOE calculation.
default (float, default=0.0) – Default WOE value for categories with insufficient counts or unseen categories.
min_count (PositiveInt | PositiveFloat, default=1) – Minimum count threshold for categories. If >= 1, treated as absolute count; if < 1, treated as frequency.
inplace (bool, default=True) – If True, replace original columns with encoded values. If False, create new columns with suffix ‘__encode_woe’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after encoding. Ignored when inplace=True.

Examples

>>> import polars as pl
>>> from gators.encoders import WOEEncoder

>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x']
... })
>>> y = pl.Series('target', [1, 0, 1, 1, 0])

>>> encoder = WOEEncoder(inplace=False, drop_columns=True)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌────────────────┬────────────────┐
│ A__encode_woe  │ B__encode_woe  │
│ ---            │ ---            │
│ f64            │ f64            │
├────────────────┼────────────────┤
│ 0.287682       │ -1.090344      │
│ -0.402159      │ -1.090344      │
│ 0.287682       │ 4.901146       │
│ -0.402159      │ 4.901146       │
│ 0.287682       │ -1.090344      │
└────────────────┴────────────────┘

>>> # Encoding with drop_columns=False
>>> encoder = WOEEncoder(inplace=False, drop_columns=False)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬────────────────┬────────────────┐
│ A   │ B   │ A__encode_woe  │ B__encode_woe  │
│ --- │ --- │ ---            │ ---            │
│ str │ str │ f64            │ f64            │
├─────┼─────┼────────────────┼────────────────┤
│ cat │ x   │ 0.287682       │ 0.287682       │
│ dog │ x   │ -1.203973      │ 0.287682       │
│ cat │ y   │ 0.287682       │ -1.203973      │
│ dog │ y   │ -1.203973      │ -1.203973      │
│ cat │ x   │ 0.287682       │ 0.287682       │
└─────┴─────┴────────────────┴────────────────┘

>>> # Encoding a subset of columns
>>> encoder = WOEEncoder(subset=['A'], inplace=False, drop_columns=False)
>>> _ = encoder.fit(X, y)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌─────┬─────┬────────────────┐
│ A   │ B   │ A__encode_woe  │
│ --- │ --- │ ---            │
│ str │ str │ f64            │
├─────┼─────┼────────────────┤
│ cat │ x   │ 0.287682       │
│ dog │ x   │ -0.402159      │
│ cat │ y   │ 0.287682       │
│ dog │ y   │ -0.402159      │
│ cat │ x   │ 0.287682       │
└─────┴─────┴────────────────┘
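
For readers unfamiliar with Weight of Evidence, the sketch below computes it in plain Python as the log of the ratio between a category's share of positives and its share of negatives. The additive regularization shown is one common form and an assumption here; the documentation above does not spell out the exact formula gators uses, and the helper name woe_encode is hypothetical.

```python
from math import log


def woe_encode(categories, targets, regularization=0.01):
    """Return a dict of Weight of Evidence values per category (binary targets)."""
    pos_total = sum(targets)
    neg_total = len(targets) - pos_total
    pos, neg = {}, {}
    for cat, target in zip(categories, targets):
        pos[cat] = pos.get(cat, 0) + target
        neg[cat] = neg.get(cat, 0) + (1 - target)
    woe = {}
    for cat in pos:
        # Assumption: the regularization term is added to every count so that
        # a category with no positives or no negatives stays finite.
        pos_rate = (pos[cat] + regularization) / (pos_total + 2 * regularization)
        neg_rate = (neg[cat] + regularization) / (neg_total + 2 * regularization)
        woe[cat] = log(pos_rate / neg_rate)
    return woe


woe_b = woe_encode(["x", "x", "y", "y", "x"], [1, 0, 1, 1, 0])
print(woe_b)
```

A positive value means the category is over-represented among positives and a negative value means the opposite. The regularization term must be greater than zero for categories that contain only one class.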

fit(X: polars.DataFrame, y: polars.Series) → gators.encoders.woe_encoder.WOEEncoder

Fit the transformer by computing Weight of Evidence values for each category.

Parameters:

X (pl.DataFrame) – Input DataFrame with categorical columns.
y (pl.Series) – Binary target series (must contain 0s and 1s).

Returns:

The fitted transformer instance.

Return type:

WOEEncoder