gators.data_cleaning package#

Module contents#

class gators.data_cleaning.RenameColumns[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Renames columns based on a provided mapping.

Parameters:: column_mapping (dict[str, str]) – Dictionary mapping original column names to new column names.

Examples

Example when renaming all columns:

>>> import polars as pl
>>> from gators.data_cleaning import RenameColumns
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> transformer = RenameColumns(column_mapping={"col1": "column1", "col2": "column2", "col3": "column3"})
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌────────┬────────┬────────┐
│ column1│ column2│ column3│
│ str    │ str    │ i64    │
├────────┼────────┼────────┤
│ a      │  x     │  1     │
│ a      │  x     │  2     │
│ b      │  x     │  3     │
│ c      │  y     │  4     │
└────────┴────────┴────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.rename_columns.RenameColumns[source]#

Fit the transformer by storing the column mapping.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

RenameColumns

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the input DataFrame by extracting specified components.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame.
Return type:: pl.DataFrame

class gators.data_cleaning.CastColumns[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Casts specified columns to a given data type.

Parameters:

subset (list[str], default=None) – List of column names to cast. If None, all columns will be cast.
dtype (type) – Target Polars data type (e.g., pl.Float64, pl.String, pl.Int64, pl.Datetime, pl.Date).
inplace (bool, default=True) – If True, cast values in the original columns. If False, create new columns with suffix ‘__cast_{dtype}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after casting. Ignored when inplace=True.

Examples

Example 1: Cast columns with inplace=False and keep originals

>>> import polars as pl
>>> from gators.data_cleaning import CastColumns
>>> X = pl.DataFrame({
...     "col1": ["10", "20", "30"],
...     "col2": ["1.1", "2.2", "3.3"],
...     "col3": [True, False, True]
... })
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=False,
...     drop_columns=False
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 5)
┌──────┬──────┬────────────────────┬────────────────────┬───────┐
│ col1 │ col2 │ col1__cast_float64 │ col2__cast_float64 │ col3  │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 10   │ 1.1  │ 10.0               │ 1.1                │ True  │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 20   │ 2.2  │ 20.0               │ 2.2                │ False │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 30   │ 3.3  │ 30.0               │ 3.3                │ True  │
└──────┴──────┴────────────────────┴────────────────────┴───────┘

Example 2: Cast columns with inplace=False and drop originals

>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=False,
...     drop_columns=True
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 3)
┌────────────────────┬────────────────────┬───────┐
│ col1__cast_float64 │ col2__cast_float64 │ col3  │
├────────────────────┼────────────────────┼───────┤
│ 10.0               │ 1.1                │ True  │
├────────────────────┼────────────────────┼───────┤
│ 20.0               │ 2.2                │ False │
├────────────────────┼────────────────────┼───────┤
│ 30.0               │ 3.3                │ True  │
└────────────────────┴────────────────────┴───────┘

Example 3: Cast columns in place

>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=True
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 3)
┌──────┬──────┬───────┐
│ col1 │ col2 │ col3  │
├──────┼──────┼───────┤
│ 10.0 │ 1.1  │ True  │
├──────┼──────┼───────┤
│ 20.0 │ 2.2  │ False │
├──────┼──────┼───────┤
│ 30.0 │ 3.3  │ True  │
└──────┴──────┴───────┘

Notes

When casting to Datetime or Date from String, the transformer handles format parsing automatically
If subset=None, all columns in the DataFrame will be cast to the specified dtype
When inplace=True, the drop_columns parameter is ignored as original columns are replaced

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.cast_columns.CastColumns[source]#

Fit the transformer by identifying columns to cast.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

CastColumns

transform(X: polars.DataFrame, y: polars.Series | None = None) → polars.DataFrame[source]#

Transform the DataFrame by casting columns to the target type.

Parameters:

X (pl.DataFrame) – Input DataFrame to transform.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

DataFrame with cast columns.

Return type:

pl.DataFrame

class gators.data_cleaning.DropColumns[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Drops specified columns from a DataFrame.

Parameters:: subset (list[str]) – List of column names to drop.

Examples

Create an instance of the DropColumns class:

>>> import polars as pl
>>> from gators.data_cleaning import DropColumns
>>> drop_columns = DropColumns(subset=["col1", "col2"])

Fit the transformer:

>>> drop_columns.fit(X)

Transform the DataFrame:

>>> X = pl.DataFrame({"col1": [1, 2, 3],
...                    "col2": ["A", "B", "C"],
...                    "col3": [True, False, True]})
>>> transformed_X = drop_columns.transform(X)
>>> print(transformed_X)
shape: (3, 1)
┌───────┐
│ col3  │
├───────┤
│ True  │
├───────┤
│ False │
├───────┤
│ True  │
└───────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_columns.DropColumns[source]#

Fit the transformer (no-op for DropColumns).

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

DropColumns

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the input DataFrame by extracting specified components.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame.
Return type:: pl.DataFrame

class gators.data_cleaning.DropHighNaNRatio[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Drops columns with a high ratio of NaN values.

Parameters:

max_ratio (float) – Maximum allowed ratio of NaN values (0.0-1.0). Columns with NaN ratio >= max_ratio will be dropped.
subset (list[str], default=None) – List of columns to check for high NaN ratio. If None, all columns are checked.

Examples

Initializing and using DropHighNaNRatio transformer.

Example when drop_columns is True and columns is None:

>>> import polars as pl
>>> from gators.data_cleaning import DropHighNaNRatio
>>> X = pl.DataFrame({
...     "col1": ["a", None, "b", "c"],
...     "col2": ["x", "x", "x", None],
...     "col3": [1, 2, None, None]
... })
>>> transformer = DropHighNaNRatio(max_ratio=0.5)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 1)
┌─────┐
│ col1│
│ str │
├─────┤
│  a  │
│ None│
│  b  │
│  c  │
└─────┘

Example when drop_columns is True and columns is a subset:

>>> transformer = DropHighNaNRatio(max_ratio=0.5, subset=['col2', 'col3'])
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌─────┬─────┐
│ col1│ col2│
│ str │ str │
├─────┼─────┤
│  a  │  x  │
│ None│  x  │
│  b  │  x  │
│  c  │ None│
└─────┴─────┘

Example when drop_columns is False and columns is None:

>>> transformer = DropHighNaNRatio(max_ratio=0.5)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────┬─────┬──────┐
│ col1│ col2│ col3 │
│ str │ str │ i64  │
├─────┼─────┼──────┤
│  a  │  x  │   1  │
│ None│  x  │   2  │
│  b  │  x  │ None │
│  c  │ None│ None │
└─────┴─────┴──────┘

Example when drop_columns is False and columns is a subset:

>>> transformer = DropHighNaNRatio(max_ratio=0.5, subset=['col2', 'col3'])
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────┬─────┬──────┐
│ col1│ col2│ col3 │
│ str │ str │ i64  │
├─────┼─────┼──────┤
│  a  │  x  │   1  │
│ None│  x  │   2  │
│  b  │  x  │ None │
│  c  │ None│ None │
└─────┴─────┴──────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_high_nan_ratio.DropHighNaNRatio[source]#

Fit the transformer by identifying columns with high NaN ratios.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

DropHighNaNRatio

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by dropping columns with high NaN ratios.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: DataFrame with high-NaN columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.DropLowCardinality[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Drops columns with low cardinality.

Parameters:

min_count (int) – Minimum number of unique values for a column to be retained. Must be >= 1. Columns with unique count < min_count will be dropped.
subset (list[str], default=None) – List of columns to check for low cardinality. If None, all string, boolean, and categorical columns are checked.

Examples

Initializing and using DropLowCardinality transformer.

Example when drop_columns is True and columns is None:

>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> transformer = DropLowCardinality(min_count=2, columns=None, drop_columns=True)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 1)
┌─────┐
│ col3│
│ i64 │
├─────┤
│  1  │
│  2  │
│  3  │
│  4  │
└─────┘

Example when drop_columns is True and columns is a subset:

>>> transformer = DropLowCardinality(min_count=2, subset=['col1'], drop_columns=True)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌─────┬─────┐
│ col2│ col3│
│ str │ i64 │
├─────┼─────┤
│  x  │  1  │
│  x  │  2  │
│  x  │  3  │
│  y  │  4  │
└─────┴─────┘

Example when drop_columns is False and columns is None:

>>> transformer = DropLowCardinality(min_count=2, columns=None, drop_columns=False)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────┬─────┬─────┐
│ col1│ col2│ col3│
│ str │ str │ i64 │
├─────┼─────┼─────┤
│  a  │  x  │  1  │
│  a  │  x  │  2  │
│  b  │  x  │  3  │
│  c  │  y  │  4  │
└─────┴─────┴─────┘

Example when drop_columns is False and columns is a subset:

>>> transformer = DropLowCardinality(min_count=2, subset=['col1'], drop_columns=False)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────┬─────┬─────┐
│ col1│ col2│ col3│
│ str │ str │ i64 │
├─────┼─────┼─────┤
│  a  │  x  │  1  │
│  a  │  x  │  2  │
│  b  │  x  │  3  │
│  c  │  y  │  4  │
└─────┴─────┴─────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_low_cardinality.DropLowCardinality[source]#

Fit the transformer by identifying columns with low cardinality.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

DropLowCardinality

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by dropping low cardinality columns.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: DataFrame with low-cardinality columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.VarianceFilter[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Removes numerical columns with a low variance.

Parameters:

subset (list[str], default=None) – List of numeric columns to check for variance. If None, all numeric columns are checked.
min_var (float) – Minimum variance threshold. Columns with variance <= min_var will be dropped. Must be >= 0.0.

Examples

Initialize and use VarianceFilter.

Example with all numeric columns:

>>> import polars as pl
>>> from gators.data_cleaning import VarianceFilter
>>> X = pl.DataFrame({
...     "feature1": [1, 2, 3, 4],
...     "feature2": [0.5, 0.5, 0.5, 0.5],  # Low variance
...     "feature3": [5, 6, 7, 8],
...     "label": [0, 1, 0, 1]
... })
>>> transformer = VarianceFilter(min_var=0.1)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────────┬─────────┬───────┐
│ feature1 │feature3 │ label │
│     i64  │    i64  │  i64  │
├──────────┼────────–┼──────⁠–┤
│       1  │      5  │    0  │
│       2  │      6  │    1  │
│       3  │      7  │    0  │
│       4  │      8  │    1  │
└──────────┴────────━┴─────–─┘

Example with specific columns:

>>> X = pl.DataFrame({
...     "feature1": [1, 2, 3, 4],
...     "feature2": [0.5, 0.5, 0.5, 0.5],
...     "feature3": [5, 6, 7, 8],
...     "label": [0, 1, 0, 1]
... })
>>> transformer = VarianceFilter(subset=['feature1’, ‘feature3'], min_var=0.1)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 4)
┌──────────┬─────────┬──────────┬───────┐
│ feature1 │feature3 │ feature2 │ label │
│     i64  │    i64  │    i64   │  i64  │
├──────────┼────────–┼─────────⁠–┼──────⁠–┤
│       1  │      5  │      0.5 │   0   │
│       2  │      6  │      0.5 │   1   │
│       3  │      7  │      0.5 │   0   │
│       4  │      8  │      0.5 │   1   │
└──────────┴─────────┴──────────┴──────–┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.variance_filter.VarianceFilter[source]#

Fit the transformer by identifying low-variance columns.

Parameters:

X (pl.DataFrame) – Input DataFrame with numeric columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

VarianceFilter

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by dropping low-variance columns.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: DataFrame with low-variance columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.Replace[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Replaces values in specified columns.

Parameters:

to_replace (dict[str, dict[str, any]]) – Nested dictionary specifying replacement mappings. Outer keys are column names, inner dictionaries map old values to new values.
inplace (bool, default=True) – If True, replace values in the original columns. If False, create new columns with suffix ‘__replace’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after replacement. Ignored when inplace=True.

Examples

Initializing and using Replace transformer.

Example with drop_columns=True and columns=None:

>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo"},
...     "col2": {"x": "x-ray", "y": "yankee"}
... }
>>> transformer = Replace(to_replace=replace_map, drop_columns=True)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌───────────────┬───────────────┐
│ col1__replace │ col2__replace │
│ str           │   str         │
├───────────────┬───────────────┤
│ alpha         │  x-ray        │
│ alpha         │ x-ray         │
│ bravo         │ x-ray         │
│ charlie       │ yankee        │
└───────────────┴───────────────┘

Example with drop_columns=True and columns as a subset:

>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo"}
... }
>>> transformer = Replace(to_replace=replace_map, drop_columns=True)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌───────────────────┬─────────────────────┬────────────────────┐
│           col1    │           col2      │          col3      │
│ str               │          str        │           i64      │
├───────────────────┬─────────────────────┬────────────────────┤
│ alpha             │         x           │           1        │
│ alpha             │         x           │           2        │
│ bravo             │         x           │           3        │
│ charlie           │         y           │           4        │
└───────────────────┴─────────────────────┴────────────────────┘

Example with drop_columns=False and columns=None:

>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo"},
...     "col2": {"x": "x-ray", "y": "yankee"}
... }
>>> transformer = Replace(to_replace=replace_map, drop_columns=False)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 4)
┌─────────────────┬──────────────────────┬─────────────────────┐─────────────────────────┐
│         col1    │           col2       │           col3      │           col1__replace │
│        str      │          str         │           i64       │           str           │
│─────────────────┬──────────────────────┬─────────────────────┬─────────────────────────┤
│            alpha│                x-ray │           1         │          alpha          │
│            alpha│                x-ray │           2         │          alpha          │
│            bravo│                x-ray │           3         │          bravo          │
│        charlie  │               yankee │           4         │          charlie        │
└─────────────────┴──────────────────────┴─────────────────────┴─────────────────────────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.replace.Replace[source]#

Fit the transformer.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Name of the target column (if needed).

Returns:

The fitted transformer instance.

Return type:

Self

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the input DataFrame by extracting specified components.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame.
Return type:: pl.DataFrame

class gators.data_cleaning.CorrelationFilter[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Filters out highly correlated numeric columns.

Identifies groups of highly correlated columns and removes all but one from each group, helping to reduce multicollinearity in the dataset.

Parameters:

subset (list[str], default=None) – List of numeric columns to consider for correlation filtering. If None, all numeric columns are used.
max_corr (float) – Maximum allowed absolute correlation between columns. Must be > 0 and <= 1. Columns with correlation >= max_corr are considered highly correlated.

Examples

>>> from correlation_filter import CorrelationFilter
>>> import polars as pl

>>> X ={'A': [1, 2, 3, 4],
...         'B': [4, 3, 2, 1],
...         'C': [1, 2, 1, 2],
...         'y': [1, 1, 0, 0]}
>>> X = pl.DataFrame(X)
>>> # Example 1
>>> corr_filter = CorrelationFilter(max_corr=0.9)
>>> _ = corr_filter.fit(X, y)
>>> result = corr_filter.transform(X)
>>> result
shape: (4, 2)
┌─────┬─────┐
│  C  │  y  │
│ i64 │ i64 │
├─────┼─────┤
│  1  │  1  │
│  2  │  1  │
│  1  │  0  │
│  2  │  0  │
└─────┴─────┘

>>> # Example 2
>>> corr_filter = CorrelationFilter(subset=['A', 'B'], max_corr=1)
>>> _ = corr_filter.fit(X)
>>> result = corr_filter.transform(X)
>>> result
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│  A  │  B  │  C  │  y  │
│ i64 │ i64 │ i64 │ i64 │
├─────┼─────┼─────┼─────┤
│  1  │  4  │  1  │  1  │
│  2  │  3  │  2  │  1  │
│  3  │  2  │  1  │  0  │
│  4  │  1  │  2  │  0  │
└─────┴─────┴─────┴─────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.correlation_filter.CorrelationFilter[source]#

Fit the transformer by identifying highly correlated columns to drop.

Parameters:

X (pl.DataFrame) – Input DataFrame with numeric columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).

Returns:

The fitted transformer instance.

Return type:

CorrelationFilter

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by dropping highly correlated columns.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: DataFrame with highly correlated columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.DropDuplicateColumns[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Removes duplicate columns from the DataFrame.

Identifies and removes columns that have identical values across all rows. This is useful for reducing dimensionality and removing redundant features that don’t add predictive value.

Parameters:

keep (str, default='first') –

Strategy for keeping duplicate columns:

’first’: Keep first occurrence of duplicate columns
’last’: Keep last occurrence of duplicate columns

Examples

Example 1: Remove duplicate columns (keep first)

>>> from gators.data_cleaning import DropDuplicateColumns
>>> import polars as pl
>>> X = pl.DataFrame({
...     'A': [1, 2, 3, 4],
...     'B': [5, 6, 7, 8],
...     'C': [1, 2, 3, 4],  # Duplicate of A
...     'D': [9, 10, 11, 12],
...     'E': [5, 6, 7, 8]   # Duplicate of B
... })
>>> remover = DropDuplicateColumns(keep='first')
>>> remover.fit(X)
>>> result = remover.transform(X)
>>> print(result)
shape: (4, 3)
┌─────┬─────┬──────┐
│ A   ┆ B   ┆ D    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64  │
├─────┼─────┼──────┤
│ 1   ┆ 5   ┆ 9    │
│ 2   ┆ 6   ┆ 10   │
│ 3   ┆ 7   ┆ 11   │
│ 4   ┆ 8   ┆ 12   │
└─────┴─────┴──────┘

Example 2: Remove duplicate columns (keep last)

>>> X = pl.DataFrame({
...     'feature_1': [1.0, 2.0, 3.0],
...     'feature_2': [4.0, 5.0, 6.0],
...     'feature_3': [1.0, 2.0, 3.0],  # Duplicate of feature_1
...     'target': [0, 1, 0]
... })
>>> remover = DropDuplicateColumns(keep='last')
>>> remover.fit(X)
>>> print(f"Columns to drop: {remover.columns_to_drop_}")
Columns to drop: ['feature_1']
>>> print(f"Column groups: {remover.column_groups_}")
Column groups: {'feature_3': ['feature_1']}
>>> result = remover.transform(X)
>>> print(result)
shape: (3, 3)
┌───────────┬───────────┬────────┐
│ feature_2 | feature_3 ┆ target │
│ ---       | ---       ┆ ---    │
│ f64       | f64       ┆ i64    │
├───────────┼───────────┼────────┤
│ 4.0       | 1.0       ┆ 0      │
│ 5.0       | 2.0       ┆ 1      │
│ 6.0       | 3.0       ┆ 0      │
└───────────┴───────────┴────────┘

Example 3: Check duplicate groups

>>> X = pl.DataFrame({
...     'a': [1, 2, 3],
...     'b': [1, 2, 3],
...     'c': [1, 2, 3],
...     'd': [4, 5, 6]
... })
>>> remover = DropDuplicateColumns()
>>> remover.fit(X)
>>> print(f"Kept column groups: {remover.column_groups_}")
Kept column groups: {'a': ['b', 'c']}
>>> result = remover.transform(X)
>>> print(result.columns)
['a', 'd']

Example 4: No duplicates

>>> X = pl.DataFrame({
...     'x': [1, 2, 3],
...     'y': [4, 5, 6],
...     'z': [7, 8, 9]
... })
>>> remover = DropDuplicateColumns()
>>> remover.fit(X)
>>> print(f"Columns to drop: {remover.columns_to_drop_}")
Columns to drop: []
>>> result = remover.transform(X)
>>> print(result.shape)
(3, 3)

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_duplicate_columns.DropDuplicateColumns[source]#

Fit the transformer by identifying duplicate columns.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

DropDuplicateColumns

Raises:

ValueError – If keep parameter is not ‘first’ or ‘last’.

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by removing duplicate columns.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame with duplicate columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.DropDuplicateRows[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Removes duplicate rows from the DataFrame.

Identifies and removes duplicate rows based on all columns or a subset of columns. Critical for preventing data leakage and ensuring data quality.

Parameters:

subset (list[str], default=None) – List of columns to consider for identifying duplicates. If None, all columns are used.
keep (str, default='first') –
Strategy for keeping duplicates:
- ’first’: Keep first occurrence, drop subsequent duplicates
- ’last’: Keep last occurrence, drop previous duplicates
- ’none’: Drop all duplicates (keep no occurrences)

Examples

Example 1: Remove full duplicate rows (keep first)

>>> from gators.data_cleaning import DropDuplicateRows
>>> import polars as pl
>>> X = pl.DataFrame({
...     'id': [1, 2, 2, 3, 4, 4],
...     'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],
...     'age': [25, 30, 30, 35, 40, 40]
... })
>>> remover = DropDuplicateRows(keep='first')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (4, 3)
┌─────┬─────────┬─────┐
│ id  ┆ name    ┆ age │
│ --- ┆ ---     ┆ --- │
│ i64 ┆ str     ┆ i64 │
├─────┼─────────┼─────┤
│ 1   ┆ Alice   ┆ 25  │
│ 2   ┆ Bob     ┆ 30  │
│ 3   ┆ Charlie ┆ 35  │
│ 4   ┆ David   ┆ 40  │
└─────┴─────────┴─────┘

Example 2: Remove duplicates based on subset (keep last)

>>> X = pl.DataFrame({
...     'id': [1, 2, 3, 4],
...     'name': ['Alice', 'Bob', 'Alice', 'Bob'],
...     'score': [85, 90, 88, 92]
... })
>>> remover = DropDuplicateRows(subset=['name'], keep='last')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (2, 3)
┌─────┬───────┬───────┐
│ id  ┆ name  ┆ score │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ i64   │
├─────┼───────┼───────┤
│ 3   ┆ Alice ┆ 88    │
│ 4   ┆ Bob   ┆ 92    │
└─────┴───────┴───────┘

Example 3: Drop all duplicate occurrences (keep none)

>>> X = pl.DataFrame({
...     'user_id': [1, 2, 2, 3, 4, 4, 5],
...     'action': ['login', 'view', 'view', 'click', 'buy', 'buy', 'logout']
... })
>>> remover = DropDuplicateRows(subset=['user_id'], keep='none')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 2)
┌─────────┬────────┐
│ user_id ┆ action │
│ ---     ┆ ---    │
│ i64     ┆ str    │
├─────────┼────────┤
│ 1       ┆ login  │
│ 3       ┆ click  │
│ 5       ┆ logout │
└─────────┴────────┘

Example 4: Check for duplicates without subset

>>> X = pl.DataFrame({
...     'a': [1, 1, 2],
...     'b': [10, 10, 20],
...     'c': [100, 100, 200]
... })
>>> remover = DropDuplicateRows()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
├─────┼─────┼─────┤
│ 1   ┆ 10  ┆ 100 │
│ 2   ┆ 20  ┆ 200 │
└─────┴─────┴─────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_duplicate_rows.DropDuplicateRows[source]#

Fit the transformer by validating parameters.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

DropDuplicateRows

Raises:

ValueError – If subset columns are specified but not found in DataFrame.

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by removing duplicate rows.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame with duplicates removed.
Return type:: pl.DataFrame

class gators.data_cleaning.DropConstantColumns[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Removes columns that have only a single unique value (constant columns).

Identifies and removes columns with zero information content. More specific than VarianceFilter (which only works on numerics) and faster than variance calculation. Handles both numeric and categorical constant columns.

Parameters:

subset (list[str], default=None) – List of columns to check for constant values. If None, all columns are checked.
include_na (bool, default=True) – Whether to count NaN/null as a unique value. If True, a column with all NaN is considered constant. If False, NaN values are ignored when counting unique values.

Examples

Example 1: Remove constant numeric column

>>> from gators.data_cleaning import DropConstantColumns
>>> import polars as pl
>>> X = pl.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'constant_num': [42, 42, 42, 42, 42],
...     'varying': [10, 20, 30, 40, 50]
... })
>>> remover = DropConstantColumns()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (5, 2)
┌─────┬─────────┐
│ id  ┆ varying │
│ --- ┆ ---     │
│ i64 ┆ i64     │
├─────┼─────────┤
│ 1   ┆ 10      │
│ 2   ┆ 20      │
│ 3   ┆ 30      │
│ 4   ┆ 40      │
│ 5   ┆ 50      │
└─────┴─────────┘

Example 2: Remove constant categorical column

>>> X = pl.DataFrame({
...     'country': ['USA', 'USA', 'USA', 'USA'],
...     'city': ['NYC', 'LA', 'Chicago', 'Boston'],
...     'status': ['active', 'active', 'active', 'active']
... })
>>> remover = DropConstantColumns()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (4, 1)
┌─────────┐
│ city    │
│ ---     │
│ str     │
├─────────┤
│ NYC     │
│ LA      │
│ Chicago │
│ Boston  │
└─────────┘

Example 3: Handle NaN values (with include_na=True)

>>> X = pl.DataFrame({
...     'all_null': [None, None, None],
...     'mixed': [1, None, 1],
...     'varying': [1, 2, 3]
... })
>>> remover = DropConstantColumns(include_na=True)
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 2)
┌───────┬─────────┐
│ mixed ┆ varying │
│ ---   ┆ ---     │
│ i64   ┆ i64     │
├───────┼─────────┤
│ 1     ┆ 1       │
│ null  ┆ 2       │
│ 1     ┆ 3       │
└───────┴─────────┘

Example 4: Handle NaN values (with include_na=False)

>>> remover = DropConstantColumns(include_na=False)
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 1)
┌─────────┐
│ varying │
│ ---     │
│ i64     │
├─────────┤
│ 1       │
│ 2       │
│ 3       │
└─────────┘

Example 5: Subset of columns

>>> X = pl.DataFrame({
...     'col1': [1, 1, 1],
...     'col2': [5, 5, 5],
...     'col3': [10, 20, 30]
... })
>>> remover = DropConstantColumns(subset=['col1', 'col2'])
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 1)
┌──────┐
│ col3 │
│ ---  │
│ i64  │
├──────┤
│ 10   │
│ 20   │
│ 30   │
└──────┘

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_constant_columns.DropConstantColumns[source]#

Fit the transformer by identifying constant columns.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

DropConstantColumns

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by removing constant columns.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame with constant columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.HighCardinalityFilter[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Removes columns with too many unique values (high cardinality).

Identifies and removes columns with excessive cardinality, which can cause issues for tree-based models (memory, overfitting) and create sparse encodings. Common use case: remove ID-like columns, timestamps, or free-text fields.

Opposite of DropLowCardinality. Can filter by absolute count threshold or by ratio of unique values to total rows.

Parameters:

subset (list[str], default=None) – List of columns to check for high cardinality. If None, all columns are checked.
max_unique (int, default=None) – Maximum number of unique values allowed. Columns with more unique values will be removed. If None, no absolute threshold is applied.
max_ratio (float, default=None) – Maximum ratio of unique values to total rows. Must be between 0 and 1. For example, 0.9 means columns where >90% of rows are unique will be removed. If None, no ratio threshold is applied.
ignore_na (bool, default=True) – Whether to ignore NaN/null values when counting unique values. If True, NaN is not counted as a unique value.

Examples

Example 1: Remove by absolute count

>>> from gators.data_cleaning import HighCardinalityFilter
>>> import polars as pl
>>> X = pl.DataFrame({
...     'user_id': range(1000),
...     'country': ['USA'] * 500 + ['UK'] * 500,
...     'transaction_id': [f'tx_{i}' for i in range(1000)]
... })
>>> filter = HighCardinalityFilter(max_unique=100)
>>> result = filter.fit_transform(X)
>>> print(result)
shape: (1000, 1)
┌─────────┐
│ country │
│ ---     │
│ str     │
├─────────┤
│ USA     │
│ USA     │
│ ...     │
│ UK      │
│ UK      │
└─────────┘

Example 2: Remove by ratio

>>> X = pl.DataFrame({
...     'id': range(100),
...     'category': ['A', 'B', 'C'] * 33 + ['A'],
...     'subcategory': ['X', 'Y'] * 50
... })
>>> filter = HighCardinalityFilter(max_ratio=0.95)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['category', 'subcategory']

Example 3: Combined thresholds

>>> X = pl.DataFrame({
...     'col1': range(50),  # 50 unique, ratio=1.0
...     'col2': list(range(25)) * 2,  # 25 unique, ratio=0.5
...     'col3': ['A', 'B'] * 25  # 2 unique, ratio=0.04
... })
>>> filter = HighCardinalityFilter(max_unique=30, max_ratio=0.8)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['col2', 'col3']

Example 4: Handling NaN

>>> X = pl.DataFrame({
...     'col1': [1, 2, 3, None, None] * 20,  # 3 unique + NaN
...     'col2': list(range(90)) + [None] * 10  # 90 unique + NaN
... })
>>> filter = HighCardinalityFilter(max_unique=50, ignore_na=True)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['col1']

Example 5: Subset of columns

>>> X = pl.DataFrame({
...     'id1': range(100),
...     'id2': range(100),
...     'feature': ['A', 'B'] * 50
... })
>>> filter = HighCardinalityFilter(subset=['id1', 'id2'], max_unique=50)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['feature']

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.high_cardinality_filter.HighCardinalityFilter[source]#

Fit the transformer by identifying high-cardinality columns.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

HighCardinalityFilter

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Transform the DataFrame by removing high-cardinality columns.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: Transformed DataFrame with high-cardinality columns removed.
Return type:: pl.DataFrame

class gators.data_cleaning.RoundSignificantDigits[source]#

Bases: gators.transformer._base_transformer._BaseTransformer

Round selected numeric columns to a given number of significant figures.

Significant-figure rounding preserves the most meaningful digits of a value regardless of its magnitude (e.g., with n_digits=3: 0.001234 → 0.00123, 1234.0 → 1230.0, -9876.5 → -9880.0).

Parameters:

n_digits (int) – Number of significant figures to keep. Must be >= 1.
subset (list[str], default=None) – Columns to round. When None, all numeric columns in the DataFrame are rounded automatically.
inplace (bool, default=True) – If True, the original columns are replaced in-place. If False, new columns named {col}__round_{n_digits}sig are added alongside the originals.
drop_columns (bool, default=True) – Relevant only when inplace=False. If True, the original columns are dropped after the new rounded columns are added. Ignored when inplace=True.

Examples

Example 1: Round all numeric columns in-place (default)

>>> import polars as pl
>>> from gators.data_cleaning import RoundSignificantDigits
>>> X = pl.DataFrame({
...     "a": [0.001234, 1234.0, -9876.5],
...     "b": [3.14159, 0.0, 9.9999],
...     "label": ["x", "y", "z"],
... })
>>> transformer = RoundSignificantDigits(n_digits=3)
>>> transformer.fit(X)
RoundSignificantDigits(n_digits=3, subset=None, inplace=True, drop_columns=True)
>>> print(transformer.transform(X))
shape: (3, 3)
┌──────────┬───────┬───────┐
│ a        ┆ b     ┆ label │
│ ---      ┆ ---   ┆ ---   │
│ f64      ┆ f64   ┆ str   │
╞══════════╪═══════╪═══════╡
│ 0.00123  ┆ 3.14  ┆ x     │
│ 1230.0   ┆ 0.0   ┆ y     │
│ -9880.0  ┆ 10.0  ┆ z     │
└──────────┴───────┴───────┘

Example 2: Add rounded columns without dropping originals

>>> transformer = RoundSignificantDigits(
...     n_digits=2, subset=["a"], inplace=False, drop_columns=False
... )
>>> transformer.fit(X)
RoundSignificantDigits(n_digits=2, subset=['a'], inplace=False, drop_columns=False)
>>> print(transformer.transform(X))
shape: (3, 4)
┌──────────┬───────┬───────┬───────────────┐
│ a        ┆ b     ┆ label ┆ a__round_2sig │
│ ---      ┆ ---   ┆ ---   ┆ ---           │
│ f64      ┆ f64   ┆ str   ┆ f64           │
╞══════════╪═══════╪═══════╪═══════════════╡
│ 0.001234 ┆ 3.14  ┆ x     ┆ 0.0012        │
│ 1234.0   ┆ 0.0   ┆ y     ┆ 1200.0        │
│ -9876.5  ┆ 10.0  ┆ z     ┆ -9900.0       │
└──────────┴───────┴───────┴───────────────┘

Example 3: Add rounded columns and drop originals

>>> transformer = RoundSignificantDigits(
...     n_digits=2, subset=["a"], inplace=False, drop_columns=True
... )
>>> transformer.fit(X)
RoundSignificantDigits(n_digits=2, subset=['a'], inplace=False, drop_columns=True)
>>> print(transformer.transform(X))
shape: (3, 3)
┌───────┬───────┬───────────────┐
│ b     ┆ label ┆ a__round_2sig │
│ ---   ┆ ---   ┆ ---           │
│ f64   ┆ str   ┆ f64           │
╞═══════╪═══════╪═══════════════╡
│ 3.14  ┆ x     ┆ 0.0012        │
│ 0.0   ┆ y     ┆ 1200.0        │
│ 10.0  ┆ z     ┆ -9900.0       │
└───────┴───────┴───────────────┘

Notes

All numeric columns (including integers) are cast to Float64 during the rounding computation; the output columns therefore have dtype Float64.
Zero values are returned as 0.0 (log10(0) is undefined).
Null values propagate unchanged.

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.round_significant_digits.RoundSignificantDigits[source]#

Fit the transformer by recording which columns to round.

Parameters:

X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Ignored; present for sklearn compatibility.

Returns:

Fitted transformer instance.

Return type:

RoundSignificantDigits

transform(X: polars.DataFrame) → polars.DataFrame[source]#

Round the selected columns to the configured number of significant figures.

Parameters:: X (pl.DataFrame) – Input DataFrame to transform.
Returns:: DataFrame with rounded columns.
Return type:: pl.DataFrame