gators.imputers package

Module contents

class gators.imputers.BooleanImputer

Bases: BaseModel, BaseEstimator, TransformerMixin

Impute missing values in boolean columns using a specified strategy.

Parameters:
  • strategy (Literal["constant", "most_frequent"]) –

    Strategy to use for imputing missing values.

    • "constant": Fill with the constant specified by value

    • "most_frequent": Fill with the mode (most frequent value)

  • subset (Optional[List[str]], default=None) – List of boolean columns to impute. If None, all boolean columns are selected.

  • value (Optional[bool], default=None) – Value to use when strategy is 'constant'. Must be True or False. Required when strategy='constant', ignored otherwise.

  • inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_{strategy}'.

  • drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
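As a rough illustration of the two strategies described above, the following standalone sketch fills None entries in a plain Python list. It does not use gators or polars; `impute_bool` is a hypothetical helper, and its tie-breaking (first value seen wins) is a property of this sketch, not a documented behaviour of BooleanImputer.

```python
from collections import Counter
from typing import List, Optional


def impute_bool(values: List[Optional[bool]], strategy: str,
                value: Optional[bool] = None) -> List[bool]:
    """Fill None entries in a list of booleans (hypothetical helper)."""
    if strategy == "constant":
        if value is None:
            raise ValueError("value is required when strategy='constant'")
        fill = value
    elif strategy == "most_frequent":
        observed = [v for v in values if v is not None]
        # most_common(1) yields the mode; on ties the first value seen wins.
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


a = [True, False, None, True, None]
print(impute_bool(a, "most_frequent"))          # [True, False, True, True, True]
print(impute_bool(a, "constant", value=False))  # [True, False, False, True, False]
```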

Examples

>>> import polars as pl
>>> from gators.imputers import BooleanImputer
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': [True, False, None, True, None],
...     'B': [False, None, True, None, False]
... })
>>> # Impute with 'most_frequent' strategy
>>> imputer = BooleanImputer(strategy="most_frequent", inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌─────────────────────────┬─────────────────────────┐
│ A__impute_most_frequent ┆ B__impute_most_frequent │
│ ---                     ┆ ---                     │
│ bool                    ┆ bool                    │
╞═════════════════════════╪═════════════════════════╡
│ true                    ┆ false                   │
│ false                   ┆ false                   │
│ true                    ┆ true                    │
│ true                    ┆ false                   │
│ true                    ┆ false                   │
└─────────────────────────┴─────────────────────────┘
>>> # Impute with 'constant' strategy
>>> imputer = BooleanImputer(strategy="constant", value=False, drop_columns=False, inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌───────┬───────┬────────────────────┬────────────────────┐
│ A     ┆ B     ┆ A__impute_constant ┆ B__impute_constant │
│ ---   ┆ ---   ┆ ---                ┆ ---                │
│ bool  ┆ bool  ┆ bool               ┆ bool               │
╞═══════╪═══════╪════════════════════╪════════════════════╡
│ true  ┆ false ┆ true               ┆ false              │
│ false ┆ null  ┆ false              ┆ false              │
│ null  ┆ true  ┆ false              ┆ true               │
│ true  ┆ null  ┆ true               ┆ false              │
│ null  ┆ false ┆ false              ┆ false              │
└───────┴───────┴────────────────────┴────────────────────┘
>>> # Impute with columns specified
>>> imputer = BooleanImputer(strategy="constant", value=True, subset=['B'], drop_columns=False, inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌───────┬───────┬────────────────────┐
│ A     ┆ B     ┆ B__impute_constant │
│ ---   ┆ ---   ┆ ---                │
│ bool  ┆ bool  ┆ bool               │
╞═══════╪═══════╪════════════════════╡
│ true  ┆ false ┆ false              │
│ false ┆ null  ┆ true               │
│ null  ┆ true  ┆ true               │
│ true  ┆ null  ┆ true               │
│ null  ┆ false ┆ false              │
└───────┴───────┴────────────────────┘
fit(X, y=None)

Fit the transformer by computing imputation statistics.

Parameters:
  • X (DataFrame) – Input DataFrame with boolean columns.

  • y (Series | None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

BooleanImputer

transform(X)

Transform the input DataFrame by imputing missing values in boolean columns.

Parameters:

X (DataFrame) – Input DataFrame with boolean columns containing null values.

Returns:

DataFrame with imputed boolean columns.

Return type:

DataFrame
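The split between fit (compute statistics) and transform (apply them) means the fill value comes from the data seen during fit, not from the data being transformed. The sketch below illustrates that idea with plain dicts of lists; `ModeImputerSketch` is a hypothetical stand-in and makes no claim about how BooleanImputer is implemented.

```python
from collections import Counter
from typing import Dict, List, Optional

Column = List[Optional[bool]]


class ModeImputerSketch:
    """Hypothetical stand-in: learn each column's mode in fit, reuse it in transform."""

    def fit(self, X: Dict[str, Column], y=None) -> "ModeImputerSketch":
        # Store one fill value per column, computed from non-null entries only.
        self.fills_ = {
            name: Counter(v for v in col if v is not None).most_common(1)[0][0]
            for name, col in X.items()
        }
        return self

    def transform(self, X: Dict[str, Column]) -> Dict[str, List[bool]]:
        # Nulls are replaced with the value learned in fit, whatever X contains.
        return {
            name: [self.fills_[name] if v is None else v for v in col]
            for name, col in X.items()
        }


train = {"A": [True, False, None, True, None]}
test = {"A": [None, False, None]}

imputer = ModeImputerSketch().fit(train)
print(imputer.transform(test))  # {'A': [True, False, True]}
```

Here the only observed value in `test` is False, yet the nulls are filled with True because that was the mode of the training column.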

class gators.imputers.GroupByImputer

Bases: BaseModel, BaseEstimator, TransformerMixin

Impute missing values in numeric columns by grouping with a categorical column and filling with the median or mean of each group.

Parameters:
  • group_by_column (str) – The categorical column to group by for computing group-wise statistics.

  • strategy (Literal['median', 'mean']) –

    Strategy to use for imputing missing values within each group.

    • 'median': Fill with the median of each group

    • 'mean': Fill with the mean of each group

  • subset (Optional[List[str]], default=None) – List of numeric columns to impute. If None, all numeric columns (except group_by_column) are selected.

  • inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_groupby_{strategy}'.

  • drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.

Examples

>>> import polars as pl
>>> from gators.imputers import GroupByImputer
>>> # Sample DataFrame
>>> X = pl.DataFrame({
...     'district': ['A', 'A', 'B', 'B', 'C'],
...     'value1': [1.0, None, 3.0, 4.0, None],
...     'value2': [10.0, 20.0, None, 40.0, 50.0]
... })
>>> # Impute using group median
>>> imputer = GroupByImputer(group_by_column='district', strategy='median', inplace=False)
>>> imputer.fit(X)
GroupByImputer(group_by_column='district', strategy='median', subset=['value1', 'value2'], drop_columns=True, inplace=False)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────┬───────────────────────────────┬───────────────────────────────┐
│ district ┆ value1__impute_groupby_median ┆ value2__impute_groupby_median │
│ ---      ┆ ---                           ┆ ---                           │
│ str      ┆ f64                           ┆ f64                           │
╞══════════╪═══════════════════════════════╪═══════════════════════════════╡
│ A        ┆ 1.0                           ┆ 10.0                          │
│ A        ┆ 1.0                           ┆ 20.0                          │
│ B        ┆ 3.0                           ┆ 40.0                          │
│ B        ┆ 4.0                           ┆ 40.0                          │
│ C        ┆ null                          ┆ 50.0                          │
└──────────┴───────────────────────────────┴───────────────────────────────┘
>>> # Impute using group mean with inplace=True
>>> imputer_inplace = GroupByImputer(
...     group_by_column='district',
...     strategy='mean',
...     inplace=True
... )
>>> imputer_inplace.fit(X)
GroupByImputer(group_by_column='district', strategy='mean', subset=['value1', 'value2'], drop_columns=True, inplace=True)
>>> transformed_X_inplace = imputer_inplace.transform(X)
>>> print(transformed_X_inplace)
shape: (5, 3)
┌──────────┬────────┬────────┐
│ district ┆ value1 ┆ value2 │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ f64    ┆ f64    │
╞══════════╪════════╪════════╡
│ A        ┆ 1.0    ┆ 10.0   │
│ A        ┆ 1.0    ┆ 20.0   │
│ B        ┆ 3.0    ┆ 40.0   │
│ B        ┆ 4.0    ┆ 40.0   │
│ C        ┆ null   ┆ 50.0   │
└──────────┴────────┴────────┘
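In both examples above, the value1 entry for district C stays null: the group has no observed value, so there is no group statistic to fill with. The standalone sketch below reproduces that behaviour on plain lists; `impute_by_group` is a hypothetical helper written for illustration and is not part of gators.

```python
from collections import defaultdict
from statistics import median
from typing import Callable, List, Optional


def impute_by_group(groups: List[str], values: List[Optional[float]],
                    stat: Callable = median) -> List[Optional[float]]:
    """Fill None with the statistic of the row's group (hypothetical helper)."""
    observed = defaultdict(list)
    for g, v in zip(groups, values):
        if v is not None:
            observed[g].append(v)
    # Only groups with at least one observed value get a fill value.
    fills = {g: stat(vs) for g, vs in observed.items()}
    # fills.get(g) is None for a group with no observed values, so it stays null.
    return [fills.get(g) if v is None else v for g, v in zip(groups, values)]


district = ["A", "A", "B", "B", "C"]
value1 = [1.0, None, 3.0, 4.0, None]
value2 = [10.0, 20.0, None, 40.0, 50.0]

print(impute_by_group(district, value1))  # [1.0, 1.0, 3.0, 4.0, None]
print(impute_by_group(district, value2))  # [10.0, 20.0, 40.0, 40.0, 50.0]
```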
fit(X, y=None)

Fit the transformer by computing group-wise imputation statistics.

Parameters:
  • X (DataFrame) – Input DataFrame with numeric columns and a grouping column.

  • y (Series | None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

GroupByImputer

transform(X)

Transform the input DataFrame by imputing missing values using group-wise statistics.

Parameters:

X (DataFrame) – Input DataFrame with numeric columns containing null values and a grouping column.

Returns:

DataFrame with imputed numeric columns based on group statistics.

Return type:

DataFrame

class gators.imputers.NumericImputer

Bases: BaseModel, BaseEstimator, TransformerMixin

Impute missing values in numeric columns using various strategies.

Parameters:
  • strategy (Literal['constant', 'most_frequent', 'median', 'mean', 'min', 'max', 'forward', 'backward', 'zero', 'one']) –

    Strategy to use for imputing missing values.

    • 'constant': Fill missing values with value

    • 'most_frequent': Fill with the most frequent value in each column

    • 'median': Fill with the median of each column

    • 'mean': Fill with the mean of each column

    • 'min': Fill with the minimum value in each column

    • 'max': Fill with the maximum value in each column

    • 'forward': Fill with the previous non-null value (forward fill)

    • 'backward': Fill with the next non-null value (backward fill)

    • 'zero': Fill missing values with 0

    • 'one': Fill missing values with 1

  • subset (Optional[List[str]], default=None) – List of numeric columns to impute. If None, all numeric columns are selected.

  • value (Optional[Union[int, float]], default=None) – Value to use when strategy is 'constant'. Required when strategy='constant', ignored otherwise.

  • inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_{strategy}'.

  • drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
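The strategies listed above fall into two families: those that fill every null with one column-level statistic (mean, median, min, max) and those that fill from a neighbouring row (forward, backward). A minimal sketch on plain lists follows; the helper names are hypothetical. In this sketch a leading null has no previous value and is left as None under forward fill; how NumericImputer treats that case is not stated here.

```python
from statistics import mean
from typing import Callable, List, Optional

Column = List[Optional[float]]


def fill_stat(values: Column, stat: Callable) -> Column:
    """Replace every None with one statistic of the observed values."""
    fill = stat([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def fill_forward(values: Column) -> Column:
    """Replace each None with the most recent non-null value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def fill_backward(values: Column) -> Column:
    """Replace each None with the next non-null value after it."""
    return fill_forward(values[::-1])[::-1]


a = [1.0, 2.0, None, 4.0]
c = [None, 2.0, 3.0, 4.0]

print(fill_stat(c, mean))  # [3.0, 2.0, 3.0, 4.0]
print(fill_stat(a, max))   # [1.0, 2.0, 4.0, 4.0]
print(fill_forward(a))     # [1.0, 2.0, 2.0, 4.0]
print(fill_backward(a))    # [1.0, 2.0, 4.0, 4.0]
print(fill_forward(c))     # [None, 2.0, 3.0, 4.0]
```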

Examples

>>> import polars as pl
>>> from gators.imputers import NumericImputer
>>> # Sample DataFrame
>>> X = pl.DataFrame({
...     'A': [1.0, 2.0, None, 4.0],
...     'B': [5.0, None, 7.0, 8.0],
...     'C': [None, 2.0, 3.0, 4.0]
... })
>>> # Impute using the mean strategy
>>> imputer = NumericImputer(strategy='mean', inplace=False)
>>> imputer.fit(X)
NumericImputer(strategy='mean', subset=['A', 'B', 'C'], value=None, drop_columns=True, inplace=False)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌────────────────┬────────────────┬────────────────┐
│ A__impute_mean ┆ B__impute_mean ┆ C__impute_mean │
│ ---            ┆ ---            ┆ ---            │
│ f64            ┆ f64            ┆ f64            │
╞════════════════╪════════════════╪════════════════╡
│ 1.0            ┆ 5.0            ┆ 3.0            │
│ 2.0            ┆ 6.666667       ┆ 2.0            │
│ 2.333333       ┆ 7.0            ┆ 3.0            │
│ 4.0            ┆ 8.0            ┆ 4.0            │
└────────────────┴────────────────┴────────────────┘
>>> # Impute using a constant value
>>> imputer_constant = NumericImputer(strategy='constant', value=0, inplace=False)
>>> imputer_constant.fit(X)
NumericImputer(strategy='constant', subset=['A', 'B', 'C'], value=0, drop_columns=True, inplace=False)
>>> transformed_X_constant = imputer_constant.transform(X)
>>> print(transformed_X_constant)
shape: (4, 3)
┌────────────────────┬────────────────────┬────────────────────┐
│ A__impute_constant ┆ B__impute_constant ┆ C__impute_constant │
│ ---                ┆ ---                ┆ ---                │
│ f64                ┆ f64                ┆ f64                │
╞════════════════════╪════════════════════╪════════════════════╡
│ 1.0                ┆ 5.0                ┆ 0.0                │
│ 2.0                ┆ 0.0                ┆ 2.0                │
│ 0.0                ┆ 7.0                ┆ 3.0                │
│ 4.0                ┆ 8.0                ┆ 4.0                │
└────────────────────┴────────────────────┴────────────────────┘
>>> # Impute with drop_columns=False
>>> imputer_no_drop = NumericImputer(strategy='mean', drop_columns=False, inplace=False)
>>> imputer_no_drop.fit(X)
NumericImputer(strategy='mean', subset=['A', 'B', 'C'], value=None, drop_columns=False, inplace=False)
>>> transformed_X_no_drop = imputer_no_drop.transform(X)
>>> print(transformed_X_no_drop)
shape: (4, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ A    ┆ B    ┆ C    ┆ A__impute_mean ┆ B__impute_mean ┆ C__impute_mean │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ f64  ┆ f64  ┆ f64  ┆ f64            ┆ f64            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ 1.0  ┆ 5.0  ┆ null ┆ 1.0            ┆ 5.0            ┆ 3.0            │
│ 2.0  ┆ null ┆ 2.0  ┆ 2.0            ┆ 6.666667       ┆ 2.0            │
│ null ┆ 7.0  ┆ 3.0  ┆ 2.333333       ┆ 7.0            ┆ 3.0            │
│ 4.0  ┆ 8.0  ┆ 4.0  ┆ 4.0            ┆ 8.0            ┆ 4.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘
>>> # Impute with a subset of columns
>>> imputer_subset = NumericImputer(strategy='mean', subset=['A'], inplace=False)
>>> imputer_subset.fit(X)
NumericImputer(strategy='mean', subset=['A'], value=None, drop_columns=True, inplace=False)
>>> transformed_X_subset = imputer_subset.transform(X)
>>> print(transformed_X_subset)
shape: (4, 3)
┌──────┬──────┬────────────────┐
│ B    ┆ C    ┆ A__impute_mean │
│ ---  ┆ ---  ┆ ---            │
│ f64  ┆ f64  ┆ f64            │
╞══════╪══════╪════════════════╡
│ 5.0  ┆ null ┆ 1.0            │
│ null ┆ 2.0  ┆ 2.0            │
│ 7.0  ┆ 3.0  ┆ 2.333333       │
│ 8.0  ┆ 4.0  ┆ 4.0            │
└──────┴──────┴────────────────┘
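The column layouts in the examples above follow a simple pattern: with inplace=False, new columns named '{column}__impute_{strategy}' are appended after the remaining columns, and the originals in the subset are dropped only when drop_columns=True. The sketch below encodes that pattern as inferred from the printed outputs; `output_columns` is a hypothetical helper, not a gators function.

```python
from typing import List


def output_columns(columns: List[str], subset: List[str], strategy: str,
                   inplace: bool = True, drop_columns: bool = True) -> List[str]:
    """Return the column names a transform would produce (hypothetical helper)."""
    if inplace:
        # Imputed values overwrite the originals, so the names do not change.
        return list(columns)
    new = [f"{c}__impute_{strategy}" for c in subset]
    # Originals in the subset are kept only when drop_columns is False.
    kept = [c for c in columns if not (drop_columns and c in subset)]
    return kept + new


cols = ["A", "B", "C"]
print(output_columns(cols, ["A"], "mean", inplace=False))
# ['B', 'C', 'A__impute_mean']
print(output_columns(cols, cols, "mean", inplace=False, drop_columns=False))
# ['A', 'B', 'C', 'A__impute_mean', 'B__impute_mean', 'C__impute_mean']
```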
fit(X, y=None)

Fit the transformer by computing imputation statistics.

Parameters:
  • X (DataFrame) – Input DataFrame with numeric columns.

  • y (Series | None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

NumericImputer

transform(X)

Transform the input DataFrame by imputing missing values in numeric columns.

Parameters:

X (DataFrame) – Input DataFrame with numeric columns containing null values.

Returns:

DataFrame with imputed numeric columns.

Return type:

DataFrame

class gators.imputers.StringImputer

Bases: BaseModel, BaseEstimator, TransformerMixin

Impute missing values in string columns of a Polars DataFrame.

Parameters:
  • strategy (Literal['constant', 'most_frequent']) –

    Strategy to use for imputing missing values.

    • "constant": Fill with the constant specified by value

    • "most_frequent": Fill with the mode (most frequent value)

  • subset (Optional[List[str]], default=None) – List of string columns to impute. If None, all string columns are selected.

  • value (Optional[str], default=None) – Value to use when strategy is 'constant'. Required when strategy='constant', ignored otherwise.

  • inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix '__impute_{strategy}'.

  • drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.

Examples

>>> import polars as pl
>>> from gators.imputers import StringImputer
>>> X = pl.DataFrame({
...     'col1': ['a', None, 'b', None],
...     'col2': ['cat', 'dog', 'mouse', None],
...     'col3': ['Paris', 'London', 'Berlin', '']
... })
>>> imputer = StringImputer(strategy='most_frequent', inplace=False)
>>> imputer.fit(X)
StringImputer(strategy='most_frequent', subset=['col1', 'col2', 'col3'], value=None, drop_columns=True, inplace=False)
>>> X_imputed = imputer.transform(X)
>>> print(X_imputed)
shape: (4, 3)
┌────────────────────────────┬────────────────────────────┬────────────────────────────┐
│ col1__impute_most_frequent ┆ col2__impute_most_frequent ┆ col3__impute_most_frequent │
│ ---                        ┆ ---                        ┆ ---                        │
│ str                        ┆ str                        ┆ str                        │
╞════════════════════════════╪════════════════════════════╪════════════════════════════╡
│ a                          ┆ cat                        ┆ Paris                      │
│ a                          ┆ dog                        ┆ London                     │
│ b                          ┆ mouse                      ┆ Berlin                     │
│ a                          ┆ cat                        ┆                            │
└────────────────────────────┴────────────────────────────┴────────────────────────────┘
>>> imputer = StringImputer(strategy='most_frequent', drop_columns=False, inplace=False)
>>> X_imputed = imputer.fit_transform(X)
>>> print(X_imputed)
shape: (4, 6)
┌──────┬───────┬────────┬──────────────────┬─────────────────┬─────────────────┐
│ col1 ┆ col2  ┆ col3   ┆ col1__impute_mos ┆ col2__impute_mo ┆ col3__impute_mo │
│ ---  ┆ ---   ┆ ---    ┆ t_frequent       ┆ st_frequent     ┆ st_frequent     │
│ str  ┆ str   ┆ str    ┆ ---              ┆ ---             ┆ ---             │
│      ┆       ┆        ┆ str              ┆ str             ┆ str             │
╞══════╪═══════╪════════╪══════════════════╪═════════════════╪═════════════════╡
│ a    ┆ cat   ┆ Paris  ┆ a                ┆ cat             ┆ Paris           │
│ null ┆ dog   ┆ London ┆ a                ┆ dog             ┆ London          │
│ b    ┆ mouse ┆ Berlin ┆ b                ┆ mouse           ┆ Berlin          │
│ null ┆ null  ┆        ┆ a                ┆ cat             ┆                 │
└──────┴───────┴────────┴──────────────────┴─────────────────┴─────────────────┘
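Note that in the examples above the empty string in col3 is left untouched: an empty string is a value, not a null, so only None entries are imputed. The standalone sketch below shows the distinction on plain lists, together with one possible pre-processing step for treating empty strings as missing. Both helpers are hypothetical and not part of gators; the tie-breaking (first value seen wins) belongs to this sketch.

```python
from collections import Counter
from typing import List, Optional


def impute_most_frequent(values: List[Optional[str]]) -> List[str]:
    """Replace None with the most frequent observed string (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    # On ties, most_common returns the value encountered first.
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def empty_to_null(values: List[Optional[str]]) -> List[Optional[str]]:
    """Optional pre-step: treat empty strings as missing."""
    return [None if v == "" else v for v in values]


col1 = ["a", None, "b", None]
col3 = ["Paris", "London", "Berlin", ""]

print(impute_most_frequent(col1))                 # ['a', 'a', 'b', 'a']
print(impute_most_frequent(col3))                 # ['Paris', 'London', 'Berlin', '']
print(impute_most_frequent(empty_to_null(col3)))  # ['Paris', 'London', 'Berlin', 'Paris']
```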
fit(X, y=None)

Fit the transformer by computing imputation statistics.

Parameters:
  • X (DataFrame) – Input DataFrame with string columns.

  • y (Series | None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

StringImputer

transform(X)

Transform the input DataFrame by imputing missing values in string columns.

Parameters:

X (DataFrame) – Input DataFrame with string columns containing null values.

Returns:

DataFrame with imputed string columns.

Return type:

DataFrame