gators.imputers package

Module contents

class gators.imputers.BooleanImputer

Bases: gators.transformer._base_transformer._BaseTransformer

Imputes missing values in boolean columns using a specified strategy.

Parameters:

strategy (Literal["constant", "most_frequent"]) –
Strategy to use for imputing missing values.
- "constant": Fill with a constant value specified by value
- "most_frequent": Fill with the mode (most frequent value)
subset (list[str], default=None) – List of boolean columns to impute. If None, all boolean columns are selected.
value (bool, default=None) – Value to use when strategy is 'constant'. Must be True or False. Required when strategy='constant', ignored otherwise.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_{strategy}'.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
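
How subset, inplace and drop_columns combine is easiest to see from the resulting column names. The helper below, output_columns, is hypothetical and not part of gators. It only mirrors the layout visible in the examples on this page (untouched columns first, suffixed columns last), and that ordering is an assumption rather than a documented guarantee.

```python
def output_columns(columns, subset, strategy, inplace=True, drop_columns=True):
    """Return the column names expected after transform (illustrative only)."""
    if inplace:
        # Values are replaced in the original columns; names are unchanged.
        return list(columns)
    # Originals are kept unless they were imputed and drop_columns is True.
    kept = [c for c in columns if not (drop_columns and c in subset)]
    # One new column per imputed column, suffixed with the strategy name.
    new = [f"{c}__impute_{strategy}" for c in subset]
    return kept + new


print(output_columns(["A", "B"], ["A", "B"], "most_frequent", inplace=False))
# ['A__impute_most_frequent', 'B__impute_most_frequent']
print(output_columns(["A", "B"], ["B"], "constant", inplace=False, drop_columns=False))
# ['A', 'B', 'B__impute_constant']
```

With inplace=True the drop_columns flag has no effect, which matches the parameter description above.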

Examples

>>> import polars as pl
>>> from gators.imputers import BooleanImputer

>>> # Sample data
>>> X = pl.DataFrame({
...     'A': [True, False, None, True, None],
...     'B': [False, None, True, None, False]
... })

>>> # Impute with 'most_frequent' strategy
>>> imputer = BooleanImputer(strategy="most_frequent", inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌─────────────────────────┬─────────────────────────┐
│ A__impute_most_frequent ┆ B__impute_most_frequent │
│ ---                     ┆ ---                     │
│ bool                    ┆ bool                    │
╞═════════════════════════╪═════════════════════════╡
│ true                    ┆ false                   │
│ false                   ┆ false                   │
│ true                    ┆ true                    │
│ true                    ┆ false                   │
│ true                    ┆ false                   │
└─────────────────────────┴─────────────────────────┘

>>> # Impute with 'constant' strategy
>>> imputer = BooleanImputer(strategy="constant", value=False, drop_columns=False, inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌───────┬───────┬────────────────────┬────────────────────┐
│ A     ┆ B     ┆ A__impute_constant ┆ B__impute_constant │
│ ---   ┆ ---   ┆ ---                ┆ ---                │
│ bool  ┆ bool  ┆ bool               ┆ bool               │
╞═══════╪═══════╪════════════════════╪════════════════════╡
│ true  ┆ false ┆ true               ┆ false              │
│ false ┆ null  ┆ false              ┆ false              │
│ null  ┆ true  ┆ false              ┆ true               │
│ true  ┆ null  ┆ true               ┆ false              │
│ null  ┆ false ┆ false              ┆ false              │
└───────┴───────┴────────────────────┴────────────────────┘

>>> # Impute with columns specified
>>> imputer = BooleanImputer(strategy="constant", value=True, subset=['B'], drop_columns=False, inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌───────┬───────┬────────────────────┐
│ A     ┆ B     ┆ B__impute_constant │
│ ---   ┆ ---   ┆ ---                │
│ bool  ┆ bool  ┆ bool               │
╞═══════╪═══════╪════════════════════╡
│ true  ┆ false ┆ false              │
│ false ┆ null  ┆ true               │
│ null  ┆ true  ┆ true               │
│ true  ┆ null  ┆ true               │
│ null  ┆ false ┆ false              │
└───────┴───────┴────────────────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.imputers.boolean_imputer.BooleanImputer

Fit the transformer by computing imputation statistics.

Parameters:

X (pl.DataFrame) – Input DataFrame with boolean columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

BooleanImputer

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by imputing missing values in boolean columns.

Parameters: X (pl.DataFrame) – Input DataFrame with boolean columns containing null values.
Returns: DataFrame with imputed boolean columns.
Return type: pl.DataFrame

class gators.imputers.GroupByImputer

Bases: gators.transformer._base_transformer._BaseTransformer

Impute missing values in numeric columns by grouping with a categorical column and filling with the median or mean of each group.

Parameters:

group_by_column (str) – The categorical column to group by for computing group-wise statistics.
strategy (Literal['median', 'mean']) –
Strategy to use for imputing missing values within each group.
- 'median': Fill with the median of each group
- 'mean': Fill with the mean of each group
subset (list[str], default=None) – List of numeric columns to impute. If None, all numeric columns (except group_by_column) are selected.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_groupby_{strategy}'.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
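
The group-wise logic can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's Polars implementation: groupby_impute is a hypothetical helper, and it assumes that a group with no observed values has no statistic to fall back on, so its nulls remain null (as happens for district C in the examples on this page).

```python
from statistics import mean, median


def groupby_impute(groups, values, strategy="median"):
    """Fill None entries with the median or mean of their group (pure-Python sketch)."""
    agg = median if strategy == "median" else mean
    # "Fit": collect the non-null values of each group and reduce them.
    observed = {}
    for g, v in zip(groups, values):
        observed.setdefault(g, [])
        if v is not None:
            observed[g].append(v)
    stats = {g: (agg(vs) if vs else None) for g, vs in observed.items()}
    # "Transform": replace nulls by the statistic of the row's group.
    return [stats.get(g) if v is None else v for g, v in zip(groups, values)]


district = ["A", "A", "B", "B", "C"]
print(groupby_impute(district, [1.0, None, 3.0, 4.0, None]))
# [1.0, 1.0, 3.0, 4.0, None]
print(groupby_impute(district, [10.0, 20.0, None, 40.0, 50.0], strategy="mean"))
# [10.0, 20.0, 40.0, 40.0, 50.0]
```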

Examples

>>> import polars as pl
>>> from gators.imputers import GroupByImputer

>>> # Sample DataFrame
>>> X = pl.DataFrame({
...     'district': ['A', 'A', 'B', 'B', 'C'],
...     'value1': [1.0, None, 3.0, 4.0, None],
...     'value2': [10.0, 20.0, None, 40.0, 50.0]
... })

>>> # Impute using group median
>>> imputer = GroupByImputer(group_by_column='district', strategy='median', inplace=False)
>>> imputer.fit(X)
GroupByImputer(group_by_column='district', strategy='median', subset=['value1', 'value2'], drop_columns=True, inplace=False)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────┬───────────────────────────────┬───────────────────────────────┐
│ district ┆ value1__impute_groupby_median ┆ value2__impute_groupby_median │
│ ---      ┆ ---                           ┆ ---                           │
│ str      ┆ f64                           ┆ f64                           │
╞══════════╪═══════════════════════════════╪═══════════════════════════════╡
│ A        ┆ 1.0                           ┆ 10.0                          │
│ A        ┆ 1.0                           ┆ 20.0                          │
│ B        ┆ 3.0                           ┆ 40.0                          │
│ B        ┆ 4.0                           ┆ 40.0                          │
│ C        ┆ null                          ┆ 50.0                          │
└──────────┴───────────────────────────────┴───────────────────────────────┘

>>> # Impute using group mean with inplace=True
>>> imputer_inplace = GroupByImputer(
...     group_by_column='district',
...     strategy='mean',
...     inplace=True
... )
>>> imputer_inplace.fit(X)
GroupByImputer(group_by_column='district', strategy='mean', subset=['value1', 'value2'], drop_columns=True, inplace=True)
>>> transformed_X_inplace = imputer_inplace.transform(X)
>>> print(transformed_X_inplace)
shape: (5, 3)
┌──────────┬────────┬────────┐
│ district ┆ value1 ┆ value2 │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ f64    ┆ f64    │
╞══════════╪════════╪════════╡
│ A        ┆ 1.0    ┆ 10.0   │
│ A        ┆ 1.0    ┆ 20.0   │
│ B        ┆ 3.0    ┆ 40.0   │
│ B        ┆ 4.0    ┆ 40.0   │
│ C        ┆ null   ┆ 50.0   │
└──────────┴────────┴────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.imputers.groupby_imputer.GroupByImputer

Fit the transformer by computing group-wise imputation statistics.

Parameters:

X (pl.DataFrame) – Input DataFrame with numeric columns and a grouping column.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

GroupByImputer

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by imputing missing values using group-wise statistics.

Parameters: X (pl.DataFrame) – Input DataFrame with numeric columns containing null values and a grouping column.
Returns: DataFrame with imputed numeric columns based on group statistics.
Return type: pl.DataFrame

class gators.imputers.KNNImputer

Bases: gators.transformer._base_transformer._BaseTransformer

Impute missing values using K-Nearest Neighbors.

For each row with missing values, finds the n_neighbors closest rows in the training set (using Euclidean distance over the non-null numeric columns) and fills the missing values with the mean (weights='uniform') or inverse-distance-weighted mean (weights='distance') of those neighbors’ values.

The algorithm is implemented entirely in Polars:

- Fit: retain all training rows that are complete in both subset and the feature columns; store them in _train_df_ as the reference pool. Compute per-column medians as a global fallback.
- Transform: rows with no nulls pass through unchanged. Rows with nulls are grouped by their missing pattern (which columns are null) so that each group can share a single cross-join + distance computation with the reference pool.
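
Grouping by missing pattern can be pictured with a small pure-Python sketch. The function group_by_missing_pattern and the sample rows are hypothetical; the real implementation works on Polars expressions, so this only shows the bookkeeping idea of batching rows that share the same set of null columns.

```python
from collections import defaultdict


def group_by_missing_pattern(rows, subset):
    """Map each missing pattern (tuple of null column names) to row indices."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        pattern = tuple(c for c in subset if row[c] is None)
        if pattern:  # complete rows pass through and need no neighbors
            groups[pattern].append(i)
    return dict(groups)


rows = [
    {"age": 25.0, "height": 170.0, "salary": 50.0},
    {"age": None, "height": 180.0, "salary": 60.0},
    {"age": None, "height": None, "salary": 70.0},
    {"age": None, "height": 165.0, "salary": 80.0},
]
print(group_by_missing_pattern(rows, ["age", "height"]))
# {('age',): [1, 3], ('age', 'height'): [2]}
```

Rows 1 and 3 share a pattern, so in this picture they would share one distance computation against the reference pool.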

Note

KNN distance is scale-sensitive. Consider adding a StandardScaler (or similar) before this imputer when columns have very different magnitudes.
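
A small worked example shows why scaling matters. The numbers below are made up for illustration, and the standardization is done by hand with the standard library rather than with any gators scaler. On raw values the large-magnitude income column dominates the Euclidean distance; after standardizing both columns the nearest neighbor changes.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical reference pool: (age, income) pairs with very different magnitudes.
pool = [(25.0, 50_000.0), (60.0, 50_500.0), (40.0, 52_000.0)]
query = (26.0, 50_400.0)


def nearest(points, q):
    """Index of the point closest to q in Euclidean distance."""
    return min(range(len(points)), key=lambda i: dist(points[i], q))


def standardize(points, extra):
    """Scale every column to zero mean / unit variance using the pool statistics."""
    cols = list(zip(*points))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]

    def scale(p):
        return tuple((x - m) / s for x, m, s in zip(p, mu, sd))

    return [scale(p) for p in points], scale(extra)


raw_idx = nearest(pool, query)
scaled_pool, scaled_query = standardize(pool, query)
scaled_idx = nearest(scaled_pool, scaled_query)
print(raw_idx, scaled_idx)  # 1 0
```

On the raw data the 60-year-old is picked because the incomes are close; after scaling, the 25-year-old is the nearest row.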

Parameters:

n_neighbors (int, default=5) – Number of nearest neighbors to use for imputation.
subset (list[str] or None, default=None) – Numeric columns to impute. If None, all numeric columns that contain at least one null in the training data are selected automatically.
weights ({'uniform', 'distance'}, default='uniform') –
Weight function applied when aggregating neighbor values.
- 'uniform': all neighbors contribute equally (plain mean).
- 'distance': neighbors are weighted by 1 / (d + ε) so that closer neighbors have a stronger influence.
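
The two weighting schemes can be compared with a short sketch. The function aggregate is hypothetical, and the value of eps is an assumption (the description above only names ε without giving its value).

```python
def aggregate(values, distances, weights="uniform", eps=1e-9):
    """Combine neighbor values into a single imputed value."""
    if weights == "uniform":
        return sum(values) / len(values)
    # Inverse-distance weights; eps avoids division by zero for exact matches.
    w = [1.0 / (d + eps) for d in distances]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


ages = [35.0, 40.0]   # values of the two nearest neighbors
dists = [5.0, 15.0]   # their distances to the row being imputed
print(aggregate(ages, dists, "uniform"))             # 37.5
print(round(aggregate(ages, dists, "distance"), 2))  # 36.25
```

With weights='distance' the closer neighbor (age 35) pulls the result below the plain mean.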

_train_df_

Complete training rows used as the KNN reference pool.

Type: pl.DataFrame

_global_stats_

Per-column median fallback values, used when no feature columns are available to compute distances.

Type: dict[str, float]

_feature_cols_

Numeric columns (not in subset) used to compute pairwise distances.

Type: list[str]

Examples

>>> import polars as pl
>>> from gators.imputers import KNNImputer

>>> X = pl.DataFrame({
...     "age":    [25.0, 30.0, 35.0, 40.0, None],
...     "salary": [50.0, 60.0, 70.0, 80.0, 75.0],
... })
>>> imputer = KNNImputer(n_neighbors=2, subset=["age"])
>>> imputer.fit(X)
KNNImputer(n_neighbors=2, subset=['age'], weights='uniform')
>>> imputer.transform(X)
shape: (5, 2)
┌──────┬────────┐
│ age  ┆ salary │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 25.0 ┆ 50.0   │
│ 30.0 ┆ 60.0   │
│ 35.0 ┆ 70.0   │
│ 40.0 ┆ 80.0   │
│ 37.5 ┆ 75.0   │  ← mean of the 2 nearest neighbors (age=35, age=40)
└──────┴────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.imputers.knn_imputer.KNNImputer

Fit by storing the complete training rows as the KNN reference pool.

Parameters:

X (pl.DataFrame) – Training DataFrame with numeric columns.
y (pl.Series or None, default=None) – Ignored. Present for sklearn API compatibility.

Returns:

The fitted transformer instance.

Return type:

KNNImputer

transform(X: polars.DataFrame) → polars.DataFrame

Impute missing values in subset columns using KNN.

Parameters: X (pl.DataFrame) – DataFrame to transform. May contain nulls in subset columns.
Returns: DataFrame with nulls in subset columns filled.
Return type: pl.DataFrame

class gators.imputers.NumericImputer

Bases: gators.transformer._base_transformer._BaseTransformer

Impute missing values in numeric columns using various strategies.

Parameters:

strategy (Literal['constant', 'most_frequent', 'median', 'mean', 'min', 'max', 'forward', 'backward', 'zero', 'one']) –
Strategy to use for imputing missing values.
- 'constant': Fill missing values with value
- 'most_frequent': Fill with the most frequent value in each column
- 'median': Fill with the median of each column
- 'mean': Fill with the mean of each column
- 'min': Fill with the minimum value in each column
- 'max': Fill with the maximum value in each column
- 'forward': Fill with the previous non-null value (forward fill)
- 'backward': Fill with the next non-null value (backward fill)
- 'zero': Fill missing values with 0
- 'one': Fill missing values with 1
subset (list[str], default=None) – List of numeric columns to impute. If None, all numeric columns are selected.
value (int | float | None, default=None) – Value to use when strategy is 'constant'. Required when strategy='constant', ignored otherwise.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_{strategy}'.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
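
The 'forward' and 'backward' strategies are not covered by the examples on this page, so here is a minimal pure-Python sketch of what the two fills mean. The helpers forward_fill and backward_fill are hypothetical. In this sketch a leading null (forward) or trailing null (backward) has no neighbor to copy from and stays null; whether NumericImputer treats such edge cases the same way is not stated here.

```python
def forward_fill(values):
    """Replace each None with the closest previous non-null value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def backward_fill(values):
    """Replace each None with the closest following non-null value."""
    return forward_fill(values[::-1])[::-1]


col = [None, 2.0, None, None, 5.0, None]
print(forward_fill(col))   # [None, 2.0, 2.0, 2.0, 5.0, 5.0]
print(backward_fill(col))  # [2.0, 2.0, 5.0, 5.0, 5.0, None]
```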

Examples

>>> import polars as pl
>>> from gators.imputers import NumericImputer

>>> # Sample DataFrame
>>> X = pl.DataFrame({
...     'A': [1.0, 2.0, None, 4.0],
...     'B': [5.0, None, 7.0, 8.0],
...     'C': [None, 2.0, 3.0, 4.0]
... })

>>> # Impute using the mean strategy
>>> imputer = NumericImputer(strategy='mean', inplace=False)
>>> imputer.fit(X)
NumericImputer(strategy='mean', subset=['A', 'B', 'C'], value=None, drop_columns=True, inplace=False)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌────────────────┬────────────────┬────────────────┐
│ A__impute_mean ┆ B__impute_mean ┆ C__impute_mean │
│ ---            ┆ ---            ┆ ---            │
│ f64            ┆ f64            ┆ f64            │
╞════════════════╪════════════════╪════════════════╡
│ 1.0            ┆ 5.0            ┆ 3.0            │
│ 2.0            ┆ 6.666667       ┆ 2.0            │
│ 2.333333       ┆ 7.0            ┆ 3.0            │
│ 4.0            ┆ 8.0            ┆ 4.0            │
└────────────────┴────────────────┴────────────────┘

>>> # Impute using a constant value
>>> imputer_constant = NumericImputer(strategy='constant', value=0, inplace=False)
>>> imputer_constant.fit(X)
NumericImputer(strategy='constant', subset=['A', 'B', 'C'], value=0, drop_columns=True, inplace=False)
>>> transformed_X_constant = imputer_constant.transform(X)
>>> print(transformed_X_constant)
shape: (4, 3)
┌────────────────────┬────────────────────┬────────────────────┐
│ A__impute_constant ┆ B__impute_constant ┆ C__impute_constant │
│ ---                ┆ ---                ┆ ---                │
│ f64                ┆ f64                ┆ f64                │
╞════════════════════╪════════════════════╪════════════════════╡
│ 1.0                ┆ 5.0                ┆ 0.0                │
│ 2.0                ┆ 0.0                ┆ 2.0                │
│ 0.0                ┆ 7.0                ┆ 3.0                │
│ 4.0                ┆ 8.0                ┆ 4.0                │
└────────────────────┴────────────────────┴────────────────────┘

>>> # Impute with drop_columns=False
>>> imputer_no_drop = NumericImputer(strategy='mean', drop_columns=False, inplace=False)
>>> imputer_no_drop.fit(X)
NumericImputer(strategy='mean', subset=['A', 'B', 'C'], value=None, drop_columns=False, inplace=False)
>>> transformed_X_no_drop = imputer_no_drop.transform(X)
>>> print(transformed_X_no_drop)
shape: (4, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ A    ┆ B    ┆ C    ┆ A__impute_mean ┆ B__impute_mean ┆ C__impute_mean │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ f64  ┆ f64  ┆ f64  ┆ f64            ┆ f64            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ 1.0  ┆ 5.0  ┆ null ┆ 1.0            ┆ 5.0            ┆ 3.0            │
│ 2.0  ┆ null ┆ 2.0  ┆ 2.0            ┆ 6.666667       ┆ 2.0            │
│ null ┆ 7.0  ┆ 3.0  ┆ 2.333333       ┆ 7.0            ┆ 3.0            │
│ 4.0  ┆ 8.0  ┆ 4.0  ┆ 4.0            ┆ 8.0            ┆ 4.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘

>>> # Impute with a subset of columns
>>> imputer_subset = NumericImputer(strategy='mean', subset=['A'], inplace=False)
>>> imputer_subset.fit(X)
NumericImputer(strategy='mean', subset=['A'], value=None, drop_columns=True, inplace=False)
>>> transformed_X_subset = imputer_subset.transform(X)
>>> print(transformed_X_subset)
shape: (4, 3)
┌──────┬──────┬────────────────┐
│ B    ┆ C    ┆ A__impute_mean │
│ ---  ┆ ---  ┆ ---            │
│ f64  ┆ f64  ┆ f64            │
╞══════╪══════╪════════════════╡
│ 5.0  ┆ null ┆ 1.0            │
│ null ┆ 2.0  ┆ 2.0            │
│ 7.0  ┆ 3.0  ┆ 2.333333       │
│ 8.0  ┆ 4.0  ┆ 4.0            │
└──────┴──────┴────────────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.imputers.numeric_imputer.NumericImputer

Fit the transformer by computing imputation statistics.

Parameters:

X (pl.DataFrame) – Input DataFrame with numeric columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

NumericImputer

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by imputing missing values in numeric columns.

Parameters: X (pl.DataFrame) – Input DataFrame with numeric columns containing null values.
Returns: DataFrame with imputed numeric columns.
Return type: pl.DataFrame

class gators.imputers.StringImputer

Bases: gators.transformer._base_transformer._BaseTransformer

Impute missing values in string columns of a Polars DataFrame.

Parameters:

strategy (Literal['constant', 'most_frequent']) –
Strategy to use for imputing missing values.
- 'constant': Fill with a constant value specified by value
- 'most_frequent': Fill with the mode (most frequent value)
subset (list[str], default=None) – List of string columns to impute. If None, all string columns are selected.
value (str, default=None) – Value to use when strategy is 'constant'. Required when strategy='constant', ignored otherwise.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_{strategy}'.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
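
A point that is easy to miss is that only null values are imputed: an empty string is a regular value and is left alone, as col3 in the examples on this page shows. The sketch below illustrates this with a hypothetical helper, impute_most_frequent. Ties are broken by first occurrence in this sketch, which is an assumption; the library's tie-breaking may differ.

```python
from collections import Counter


def impute_most_frequent(values):
    """Fill None with the most frequent non-null value; '' is left untouched."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return list(values)  # nothing observed, nothing to fill with
    # Ties are broken by first occurrence here; the library may differ.
    mode = counts.most_common(1)[0][0]
    return [mode if v is None else v for v in values]


print(impute_most_frequent(["cat", "dog", "cat", None]))
# ['cat', 'dog', 'cat', 'cat']
print(impute_most_frequent(["Paris", "London", "Berlin", ""]))
# ['Paris', 'London', 'Berlin', '']
```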

Examples

>>> import polars as pl
>>> from gators.imputers import StringImputer
>>> X = pl.DataFrame({
...     'col1': ['a', None, 'b', None],
...     'col2': ['cat', 'dog', 'mouse', None],
...     'col3': ['Paris', 'London', 'Berlin', '']
... })
>>> imputer = StringImputer(strategy='most_frequent', inplace=False)
>>> imputer.fit(X)
StringImputer(strategy='most_frequent', subset=['col1', 'col2', 'col3'], value=None, drop_columns=True, inplace=False)
>>> X_imputed = imputer.transform(X)
>>> print(X_imputed)
shape: (4, 3)
┌────────────────────────────┬────────────────────────────┬────────────────────────────┐
│ col1__impute_most_frequent ┆ col2__impute_most_frequent ┆ col3__impute_most_frequent │
│ ---                        ┆ ---                        ┆ ---                        │
│ str                        ┆ str                        ┆ str                        │
╞════════════════════════════╪════════════════════════════╪════════════════════════════╡
│ a                          ┆ cat                        ┆ Paris                      │
│ a                          ┆ dog                        ┆ London                     │
│ b                          ┆ mouse                      ┆ Berlin                     │
│ a                          ┆ cat                        ┆                            │
└────────────────────────────┴────────────────────────────┴────────────────────────────┘

>>> imputer = StringImputer(strategy='most_frequent', drop_columns=False, inplace=False)
>>> X_imputed = imputer.fit_transform(X)
>>> print(X_imputed)
shape: (4, 6)
┌──────┬───────┬────────┬──────────────────┬─────────────────┬─────────────────┐
│ col1 ┆ col2  ┆ col3   ┆ col1__impute_mos ┆ col2__impute_mo ┆ col3__impute_mo │
│ ---  ┆ ---   ┆ ---    ┆ t_frequent       ┆ st_frequent     ┆ st_frequent     │
│ str  ┆ str   ┆ str    ┆ ---              ┆ ---             ┆ ---             │
│      ┆       ┆        ┆ str              ┆ str             ┆ str             │
╞══════╪═══════╪════════╪══════════════════╪═════════════════╪═════════════════╡
│ a    ┆ cat   ┆ Paris  ┆ a                ┆ cat             ┆ Paris           │
│ null ┆ dog   ┆ London ┆ a                ┆ dog             ┆ London          │
│ b    ┆ mouse ┆ Berlin ┆ b                ┆ mouse           ┆ Berlin          │
│ null ┆ null  ┆        ┆ a                ┆ cat             ┆                 │
└──────┴───────┴────────┴──────────────────┴─────────────────┴─────────────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.imputers.string_imputer.StringImputer

Fit the transformer by computing imputation statistics.

Parameters:

X (pl.DataFrame) – Input DataFrame with string columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

StringImputer

transform(X: polars.DataFrame) → polars.DataFrame

Transform the input DataFrame by imputing missing values in string columns.

Parameters: X (pl.DataFrame) – Input DataFrame with string columns containing null values.
Returns: DataFrame with imputed string columns.
Return type: pl.DataFrame