gators.imputers package
Module contents
- class gators.imputers.BooleanImputer
Bases: BaseModel, BaseEstimator, TransformerMixin
Imputes missing values in boolean columns using a specified strategy.
- Parameters:
strategy (Literal["constant", "most_frequent"]) – Strategy to use for imputing missing values.
"constant": Fill with a constant value specified by value.
"most_frequent": Fill with the mode (most frequent value).
subset (Optional[List[str]], default=None) – List of boolean columns to impute. If None, all boolean columns are selected.
value (Optional[bool], default=None) – Value to use when strategy is 'constant'. Must be True or False. Required when strategy='constant', ignored otherwise.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix ‘__impute_{strategy}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.imputers import BooleanImputer

>>> # Sample data
>>> X = pl.DataFrame({
...     'A': [True, False, None, True, None],
...     'B': [False, None, True, None, False]
... })

>>> # Impute with 'most_frequent' strategy
>>> imputer = BooleanImputer(strategy="most_frequent", inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 2)
┌─────────────────────────┬─────────────────────────┐
│ A__impute_most_frequent ┆ B__impute_most_frequent │
│ ---                     ┆ ---                     │
│ bool                    ┆ bool                    │
╞═════════════════════════╪═════════════════════════╡
│ true                    ┆ false                   │
│ false                   ┆ false                   │
│ true                    ┆ true                    │
│ true                    ┆ false                   │
│ true                    ┆ false                   │
└─────────────────────────┴─────────────────────────┘

>>> # Impute with 'constant' strategy
>>> from gators.imputers import BooleanImputer
>>> imputer = BooleanImputer(strategy="constant", value=False, drop_columns=False, inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌───────┬───────┬────────────────────┬────────────────────┐
│ A     ┆ B     ┆ A__impute_constant ┆ B__impute_constant │
│ ---   ┆ ---   ┆ ---                ┆ ---                │
│ bool  ┆ bool  ┆ bool               ┆ bool               │
╞═══════╪═══════╪════════════════════╪════════════════════╡
│ true  ┆ false ┆ true               ┆ false              │
│ false ┆ null  ┆ false              ┆ false              │
│ null  ┆ true  ┆ false              ┆ true               │
│ true  ┆ null  ┆ true               ┆ false              │
│ null  ┆ false ┆ false              ┆ false              │
└───────┴───────┴────────────────────┴────────────────────┘

>>> # Impute with columns specified
>>> from gators.imputers import BooleanImputer
>>> imputer = BooleanImputer(strategy="constant", value=True, subset=['B'], drop_columns=False, inplace=False)
>>> _ = imputer.fit(X)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌───────┬───────┬────────────────────┐
│ A     ┆ B     ┆ B__impute_constant │
│ ---   ┆ ---   ┆ ---                │
│ bool  ┆ bool  ┆ bool               │
╞═══════╪═══════╪════════════════════╡
│ true  ┆ false ┆ false              │
│ false ┆ null  ┆ true               │
│ null  ┆ true  ┆ true               │
│ true  ┆ null  ┆ true               │
│ null  ┆ false ┆ false              │
└───────┴───────┴────────────────────┘
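The two BooleanImputer strategies can be sketched in plain Python on a single column. This is only an illustration of the fill semantics described in the parameter list, not the gators implementation: the helper name impute_bool is hypothetical, and the tie-breaking rule (first value encountered) is an assumption, since this page does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Optional


def impute_bool(values: List[Optional[bool]], strategy: str,
                value: Optional[bool] = None) -> List[Optional[bool]]:
    """Fill None entries of one boolean column (hypothetical helper, not part of gators)."""
    if strategy == "constant":
        if value is None:
            raise ValueError("value is required when strategy='constant'")
        fill = value
    elif strategy == "most_frequent":
        observed = [v for v in values if v is not None]
        if not observed:
            return list(values)  # nothing to learn from an all-null column
        # Assumption: ties resolve to the first value encountered.
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


col_a = [True, False, None, True, None]
col_b = [False, None, True, None, False]
print(impute_bool(col_a, "most_frequent"))          # [True, False, True, True, True]
print(impute_bool(col_b, "constant", value=True))   # [False, True, True, True, False]
```

Both printed lists agree with the A__impute_most_frequent and B__impute_constant columns in the examples for this class.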
- class gators.imputers.GroupByImputer
Bases: BaseModel, BaseEstimator, TransformerMixin
Impute missing values in numeric columns by grouping with a categorical column and filling with the median or mean of each group.
- Parameters:
group_by_column (str) – The categorical column to group by for computing group-wise statistics.
strategy (Literal['median', 'mean']) – Strategy to use for imputing missing values within each group.
'median': Fill with the median of each group.
'mean': Fill with the mean of each group.
subset (Optional[List[str]], default=None) – List of numeric columns to impute. If None, all numeric columns (except group_by_column) are selected.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix ‘__impute_groupby_{strategy}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.imputers import GroupByImputer

>>> # Sample DataFrame
>>> X = pl.DataFrame({
...     'district': ['A', 'A', 'B', 'B', 'C'],
...     'value1': [1.0, None, 3.0, 4.0, None],
...     'value2': [10.0, 20.0, None, 40.0, 50.0]
... })

>>> # Impute using group median
>>> imputer = GroupByImputer(group_by_column='district', strategy='median', inplace=False)
>>> imputer.fit(X)
GroupByImputer(group_by_column='district', strategy='median', subset=['value1', 'value2'], drop_columns=True, inplace=False)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────┬───────────────────────────────┬───────────────────────────────┐
│ district ┆ value1__impute_groupby_median ┆ value2__impute_groupby_median │
│ ---      ┆ ---                           ┆ ---                           │
│ str      ┆ f64                           ┆ f64                           │
╞══════════╪═══════════════════════════════╪═══════════════════════════════╡
│ A        ┆ 1.0                           ┆ 10.0                          │
│ A        ┆ 1.0                           ┆ 20.0                          │
│ B        ┆ 3.0                           ┆ 40.0                          │
│ B        ┆ 4.0                           ┆ 40.0                          │
│ C        ┆ null                          ┆ 50.0                          │
└──────────┴───────────────────────────────┴───────────────────────────────┘

>>> # Impute using group mean with inplace=True
>>> imputer_inplace = GroupByImputer(
...     group_by_column='district',
...     strategy='mean',
...     inplace=True
... )
>>> imputer_inplace.fit(X)
GroupByImputer(group_by_column='district', strategy='mean', subset=['value1', 'value2'], drop_columns=True, inplace=True)
>>> transformed_X_inplace = imputer_inplace.transform(X)
>>> print(transformed_X_inplace)
shape: (5, 3)
┌──────────┬────────┬────────┐
│ district ┆ value1 ┆ value2 │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ f64    ┆ f64    │
╞══════════╪════════╪════════╡
│ A        ┆ 1.0    ┆ 10.0   │
│ A        ┆ 1.0    ┆ 20.0   │
│ B        ┆ 3.0    ┆ 40.0   │
│ B        ┆ 4.0    ┆ 40.0   │
│ C        ┆ null   ┆ 50.0   │
└──────────┴────────┴────────┘
- fit(X, y=None)
Fit the transformer by computing group-wise imputation statistics.
- Parameters:
X (DataFrame) – Input DataFrame with numeric columns and a grouping column.
y (Series | None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
GroupByImputer
- transform(X)
Transform the input DataFrame by imputing missing values using group-wise statistics.
- Parameters:
X (DataFrame) – Input DataFrame with numeric columns containing null values and a grouping column.
- Returns:
DataFrame with imputed numeric columns based on group statistics.
- Return type:
DataFrame
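The split between fit (compute group-wise statistics) and transform (fill nulls from them) can be sketched in plain Python for a single value column. The helper names fit_group_stats and transform_with_stats are hypothetical and the code is only a sketch of the described behaviour, not the gators implementation. It assumes nulls are ignored when computing each statistic and that a group with no observed values keeps its nulls, which is what the district C row in the examples shows.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Hashable, List, Optional


def fit_group_stats(groups: List[Hashable], values: List[Optional[float]],
                    strategy: str = "median") -> Dict[Hashable, Optional[float]]:
    """Compute one fill value per group, ignoring nulls (hypothetical helper)."""
    aggregate = {"median": median, "mean": mean}[strategy]
    buckets: Dict[Hashable, List[float]] = defaultdict(list)
    for group, v in zip(groups, values):
        bucket = buckets[group]  # registers the group even if all its values are null
        if v is not None:
            bucket.append(v)
    # A group without any observed value has no statistic to fill with.
    return {g: (aggregate(vs) if vs else None) for g, vs in buckets.items()}


def transform_with_stats(groups: List[Hashable], values: List[Optional[float]],
                         stats: Dict[Hashable, Optional[float]]) -> List[Optional[float]]:
    """Replace each null with the statistic learned for its group."""
    return [stats.get(g) if v is None else v for g, v in zip(groups, values)]


district = ["A", "A", "B", "B", "C"]
value1 = [1.0, None, 3.0, 4.0, None]
stats = fit_group_stats(district, value1, strategy="median")
print(stats)                                          # {'A': 1.0, 'B': 3.5, 'C': None}
print(transform_with_stats(district, value1, stats))  # [1.0, 1.0, 3.0, 4.0, None]
```

The filled column agrees with value1__impute_groupby_median in the examples above: the null in district A takes that group's median, while district C stays null because it has no observed value.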
- class gators.imputers.NumericImputer
Bases: BaseModel, BaseEstimator, TransformerMixin
Impute missing values in numeric columns using various strategies.
- Parameters:
strategy (Literal['constant', 'most_frequent', 'median', 'mean', 'min', 'max', 'forward', 'backward', 'zero', 'one']) – Strategy to use for imputing missing values.
'constant': Fill missing values with value.
'most_frequent': Fill with the most frequent value in each column.
'median': Fill with the median of each column.
'mean': Fill with the mean of each column.
'min': Fill with the minimum value in each column.
'max': Fill with the maximum value in each column.
'forward': Fill with the previous non-null value (forward fill).
'backward': Fill with the next non-null value (backward fill).
'zero': Fill missing values with 0.
'one': Fill missing values with 1.
subset (Optional[List[str]], default=None) – List of numeric columns to impute. If None, all numeric columns are selected.
value (Optional[Union[int, float]], default=None) – Value to use when strategy is 'constant'. Required when strategy='constant', ignored otherwise.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix ‘__impute_{strategy}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
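As a rough guide to what each strategy value computes, the sketch below applies them to one column held as a Python list. The function impute_numeric is hypothetical and is not the gators implementation. Two details are assumptions rather than documented behaviour: ties under 'most_frequent' go to the first value encountered, and 'forward' / 'backward' leave a null in place when there is no earlier / later non-null value.

```python
from collections import Counter
from statistics import mean, median
from typing import Callable, Dict, List, Optional, Union

Number = Union[int, float]


def impute_numeric(values: List[Optional[Number]], strategy: str,
                   value: Optional[Number] = None) -> List[Optional[Number]]:
    """Fill None entries of one numeric column (illustrative sketch only)."""
    if strategy in ("forward", "backward"):
        # Walk the column in the fill direction, carrying the last non-null value.
        seq = values if strategy == "forward" else values[::-1]
        filled, last = [], None
        for v in seq:
            last = v if v is not None else last
            filled.append(last)
        return filled if strategy == "forward" else filled[::-1]

    if strategy == "constant" and value is None:
        raise ValueError("value is required when strategy='constant'")

    observed = [v for v in values if v is not None]
    # Lambdas keep each statistic lazy, so only the requested one is computed.
    fills: Dict[str, Callable[[], Number]] = {
        "constant": lambda: value,
        "zero": lambda: 0,
        "one": lambda: 1,
        "mean": lambda: mean(observed),
        "median": lambda: median(observed),
        "min": lambda: min(observed),
        "max": lambda: max(observed),
        # Assumption: ties go to the first value encountered.
        "most_frequent": lambda: Counter(observed).most_common(1)[0][0],
    }
    fill = fills[strategy]()
    return [fill if v is None else v for v in values]


a = [1.0, 2.0, None, 4.0]
print(impute_numeric(a, "median"))                          # [1.0, 2.0, 2.0, 4.0]
print(impute_numeric(a, "max"))                             # [1.0, 2.0, 4.0, 4.0]
print(impute_numeric([None, 2.0, 3.0, 4.0], "backward"))    # [2.0, 2.0, 3.0, 4.0]
print(impute_numeric([None, 2.0, 3.0, 4.0], "forward"))     # [None, 2.0, 3.0, 4.0]
```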
Examples
>>> import polars as pl
>>> from gators.imputers import NumericImputer

>>> # Sample DataFrame
>>> X = pl.DataFrame({
...     'A': [1.0, 2.0, None, 4.0],
...     'B': [5.0, None, 7.0, 8.0],
...     'C': [None, 2.0, 3.0, 4.0]
... })

>>> # Impute using the mean strategy
>>> imputer = NumericImputer(strategy='mean', inplace=False)
>>> imputer.fit(X)
NumericImputer(strategy='mean', subset=['A', 'B', 'C'], value=None, drop_columns=True, inplace=False)
>>> transformed_X = imputer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌────────────────┬────────────────┬────────────────┐
│ A__impute_mean ┆ B__impute_mean ┆ C__impute_mean │
│ ---            ┆ ---            ┆ ---            │
│ f64            ┆ f64            ┆ f64            │
╞════════════════╪════════════════╪════════════════╡
│ 1.0            ┆ 5.0            ┆ 3.0            │
│ 2.0            ┆ 6.666667       ┆ 2.0            │
│ 2.333333       ┆ 7.0            ┆ 3.0            │
│ 4.0            ┆ 8.0            ┆ 4.0            │
└────────────────┴────────────────┴────────────────┘

>>> # Impute using a constant value
>>> imputer_constant = NumericImputer(strategy='constant', value=0, inplace=False)
>>> imputer_constant.fit(X)
NumericImputer(strategy='constant', subset=['A', 'B', 'C'], value=0, drop_columns=True, inplace=False)
>>> transformed_X_constant = imputer_constant.transform(X)
>>> print(transformed_X_constant)
shape: (4, 3)
┌────────────────────┬────────────────────┬────────────────────┐
│ A__impute_constant ┆ B__impute_constant ┆ C__impute_constant │
│ ---                ┆ ---                ┆ ---                │
│ f64                ┆ f64                ┆ f64                │
╞════════════════════╪════════════════════╪════════════════════╡
│ 1.0                ┆ 5.0                ┆ 0.0                │
│ 2.0                ┆ 0.0                ┆ 2.0                │
│ 0.0                ┆ 7.0                ┆ 3.0                │
│ 4.0                ┆ 8.0                ┆ 4.0                │
└────────────────────┴────────────────────┴────────────────────┘

>>> # Impute with drop_columns=False
>>> imputer_no_drop = NumericImputer(strategy='mean', drop_columns=False, inplace=False)
>>> imputer_no_drop.fit(X)
NumericImputer(strategy='mean', subset=['A', 'B', 'C'], value=None, drop_columns=False, inplace=False)
>>> transformed_X_no_drop = imputer_no_drop.transform(X)
>>> print(transformed_X_no_drop)
shape: (4, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ A    ┆ B    ┆ C    ┆ A__impute_mean ┆ B__impute_mean ┆ C__impute_mean │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ f64  ┆ f64  ┆ f64  ┆ f64            ┆ f64            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ 1.0  ┆ 5.0  ┆ null ┆ 1.0            ┆ 5.0            ┆ 3.0            │
│ 2.0  ┆ null ┆ 2.0  ┆ 2.0            ┆ 6.666667       ┆ 2.0            │
│ null ┆ 7.0  ┆ 3.0  ┆ 2.333333       ┆ 7.0            ┆ 3.0            │
│ 4.0  ┆ 8.0  ┆ 4.0  ┆ 4.0            ┆ 8.0            ┆ 4.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘

>>> # Impute with a subset of columns
>>> imputer_subset = NumericImputer(strategy='mean', subset=['A'], inplace=False)
>>> imputer_subset.fit(X)
NumericImputer(strategy='mean', subset=['A'], value=None, drop_columns=True, inplace=False)
>>> transformed_X_subset = imputer_subset.transform(X)
>>> print(transformed_X_subset)
shape: (4, 3)
┌──────┬──────┬────────────────┐
│ B    ┆ C    ┆ A__impute_mean │
│ ---  ┆ ---  ┆ ---            │
│ f64  ┆ f64  ┆ f64            │
╞══════╪══════╪════════════════╡
│ 5.0  ┆ null ┆ 1.0            │
│ null ┆ 2.0  ┆ 2.0            │
│ 7.0  ┆ 3.0  ┆ 2.333333       │
│ 8.0  ┆ 4.0  ┆ 4.0            │
└──────┴──────┴────────────────┘
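How inplace, drop_columns and subset combine to decide the output columns can be summarised with a small sketch. The function output_columns is hypothetical and only reproduces the column layouts visible in the NumericImputer examples above. It assumes every input column is numeric, so that subset=None selects all of them, and it uses the '__impute_{strategy}' suffix (GroupByImputer documents '__impute_groupby_{strategy}' instead).

```python
from typing import List, Optional


def output_columns(columns: List[str], subset: Optional[List[str]], strategy: str,
                   inplace: bool = True, drop_columns: bool = True) -> List[str]:
    """Predict the output column names (hypothetical helper, not a gators API)."""
    targets = list(columns) if subset is None else list(subset)
    if inplace:
        # Values are overwritten in place, so the schema does not change.
        return list(columns)
    new_cols = [f"{c}__impute_{strategy}" for c in targets]
    # With drop_columns=True the imputed originals are removed; other columns stay.
    kept = [c for c in columns if not (drop_columns and c in targets)]
    return kept + new_cols


cols = ["A", "B", "C"]
print(output_columns(cols, None, "mean", inplace=False))
# ['A__impute_mean', 'B__impute_mean', 'C__impute_mean']
print(output_columns(cols, ["A"], "mean", inplace=False))
# ['B', 'C', 'A__impute_mean']
print(output_columns(cols, None, "mean", inplace=False, drop_columns=False))
# ['A', 'B', 'C', 'A__impute_mean', 'B__impute_mean', 'C__impute_mean']
```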
- class gators.imputers.StringImputer
Bases: BaseModel, BaseEstimator, TransformerMixin
Impute missing values in string columns of a Polars DataFrame.
- Parameters:
strategy (Literal['constant', 'most_frequent']) – Strategy to use for imputing missing values.
'constant': Fill with a constant value specified by value.
'most_frequent': Fill with the mode (most frequent value).
subset (Optional[List[str]], default=None) – List of string columns to impute. If None, all string columns are selected.
value (Optional[str], default=None) – Value to use when strategy is 'constant'. Required when strategy='constant', ignored otherwise.
inplace (bool, default=True) – If True, impute values in the original columns. If False, create new columns with suffix ‘__impute_{strategy}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after imputation. Ignored when inplace=True.
Examples
>>> import polars as pl
>>> from gators.imputers import StringImputer
>>> X = pl.DataFrame({
...     'col1': ['a', None, 'b', None],
...     'col2': ['cat', 'dog', 'mouse', None],
...     'col3': ['Paris', 'London', 'Berlin', '']
... })
>>> imputer = StringImputer(strategy='most_frequent', inplace=False)
>>> imputer.fit(X)
StringImputer(strategy='most_frequent', subset=['col1', 'col2', 'col3'], value=None, drop_columns=True, inplace=False)
>>> X_imputed = imputer.transform(X)
>>> print(X_imputed)
shape: (4, 3)
┌────────────────────────────┬────────────────────────────┬────────────────────────────┐
│ col1__impute_most_frequent ┆ col2__impute_most_frequent ┆ col3__impute_most_frequent │
│ ---                        ┆ ---                        ┆ ---                        │
│ str                        ┆ str                        ┆ str                        │
╞════════════════════════════╪════════════════════════════╪════════════════════════════╡
│ a                          ┆ cat                        ┆ Paris                      │
│ a                          ┆ dog                        ┆ London                     │
│ b                          ┆ mouse                      ┆ Berlin                     │
│ a                          ┆ cat                        ┆                            │
└────────────────────────────┴────────────────────────────┴────────────────────────────┘

>>> imputer = StringImputer(strategy='most_frequent', drop_columns=False, inplace=False)
>>> X_imputed = imputer.fit_transform(X)
>>> print(X_imputed)
shape: (4, 6)
┌──────┬───────┬────────┬──────────────────┬─────────────────┬─────────────────┐
│ col1 ┆ col2  ┆ col3   ┆ col1__impute_mos ┆ col2__impute_mo ┆ col3__impute_mo │
│ ---  ┆ ---   ┆ ---    ┆ t_frequent       ┆ st_frequent     ┆ st_frequent     │
│ str  ┆ str   ┆ str    ┆ ---              ┆ ---             ┆ ---             │
│      ┆       ┆        ┆ str              ┆ str             ┆ str             │
╞══════╪═══════╪════════╪══════════════════╪═════════════════╪═════════════════╡
│ a    ┆ cat   ┆ Paris  ┆ a                ┆ cat             ┆ Paris           │
│ null ┆ dog   ┆ London ┆ a                ┆ dog             ┆ London          │
│ b    ┆ mouse ┆ Berlin ┆ b                ┆ mouse           ┆ Berlin          │
│ null ┆ null  ┆        ┆ a                ┆ cat             ┆                 │
└──────┴───────┴────────┴──────────────────┴─────────────────┴─────────────────┘
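One detail visible in the output above is that the empty string in col3 is left untouched: only nulls are treated as missing. The sketch below mimics that behaviour with a small fit/transform class over a dict of columns. ModeStringFiller is a hypothetical stand-in, not part of gators. It assumes the mode is learned in fit and reused in transform, as is conventional for scikit-learn style transformers, and it breaks ties by taking the first value encountered, which this page does not specify.

```python
from collections import Counter
from typing import Dict, List, Optional

Column = List[Optional[str]]


class ModeStringFiller:
    """Hypothetical stand-in illustrating the 'most_frequent' strategy."""

    def fit(self, data: Dict[str, Column]) -> "ModeStringFiller":
        self.fill_values_: Dict[str, str] = {}
        for name, values in data.items():
            # Only None is missing; the empty string '' counts as a real value.
            observed = [v for v in values if v is not None]
            if observed:
                # Assumption: ties go to the first value encountered.
                self.fill_values_[name] = Counter(observed).most_common(1)[0][0]
        return self

    def transform(self, data: Dict[str, Column]) -> Dict[str, Column]:
        # Columns with no learned mode keep their nulls.
        return {
            name: [self.fill_values_.get(name) if v is None else v for v in values]
            for name, values in data.items()
        }


X = {
    "col1": ["a", None, "b", None],
    "col2": ["cat", "dog", "mouse", None],
    "col3": ["Paris", "London", "Berlin", ""],
}
filled = ModeStringFiller().fit(X).transform(X)
print(filled["col1"])  # ['a', 'a', 'b', 'a']
print(filled["col2"])  # ['cat', 'dog', 'mouse', 'cat']
print(filled["col3"])  # ['Paris', 'London', 'Berlin', '']
```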