gators.data_cleaning package
Module contents
- class gators.data_cleaning.RenameColumns
Bases: gators.transformer._base_transformer._BaseTransformer
Renames columns based on a provided mapping.
- Parameters:
column_mapping (dict[str, str]) – Dictionary mapping original column names to new column names.
Examples
Example when renaming all columns:
>>> import polars as pl
>>> from gators.data_cleaning import RenameColumns
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> transformer = RenameColumns(column_mapping={"col1": "column1", "col2": "column2", "col3": "column3"})
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────────┬─────────┬─────────┐
│ column1 │ column2 │ column3 │
│ str     │ str     │ i64     │
├─────────┼─────────┼─────────┤
│ a       │ x       │ 1       │
│ a       │ x       │ 2       │
│ b       │ x       │ 3       │
│ c       │ y       │ 4       │
└─────────┴─────────┴─────────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.rename_columns.RenameColumns
Fit the transformer by storing the column mapping.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
RenameColumns
- class gators.data_cleaning.CastColumns
Bases: gators.transformer._base_transformer._BaseTransformer
Casts specified columns to a given data type.
- Parameters:
subset (list[str], default=None) – List of column names to cast. If None, all columns will be cast.
dtype (type) – Target Polars data type (e.g., pl.Float64, pl.String, pl.Int64, pl.Datetime, pl.Date).
inplace (bool, default=True) – If True, cast values in the original columns. If False, create new columns with suffix ‘__cast_{dtype}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after casting. Ignored when inplace=True.
Examples
Example 1: Cast columns with inplace=False and keep originals
>>> import polars as pl
>>> from gators.data_cleaning import CastColumns
>>> X = pl.DataFrame({
...     "col1": ["10", "20", "30"],
...     "col2": ["1.1", "2.2", "3.3"],
...     "col3": [True, False, True]
... })
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=False,
...     drop_columns=False
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 5)
┌──────┬──────┬────────────────────┬────────────────────┬───────┐
│ col1 │ col2 │ col1__cast_float64 │ col2__cast_float64 │ col3  │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 10   │ 1.1  │ 10.0               │ 1.1                │ True  │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 20   │ 2.2  │ 20.0               │ 2.2                │ False │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 30   │ 3.3  │ 30.0               │ 3.3                │ True  │
└──────┴──────┴────────────────────┴────────────────────┴───────┘
Example 2: Cast columns with inplace=False and drop originals
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=False,
...     drop_columns=True
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 3)
┌────────────────────┬────────────────────┬───────┐
│ col1__cast_float64 │ col2__cast_float64 │ col3  │
├────────────────────┼────────────────────┼───────┤
│ 10.0               │ 1.1                │ True  │
├────────────────────┼────────────────────┼───────┤
│ 20.0               │ 2.2                │ False │
├────────────────────┼────────────────────┼───────┤
│ 30.0               │ 3.3                │ True  │
└────────────────────┴────────────────────┴───────┘
Example 3: Cast columns in place
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=True
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 3)
┌──────┬──────┬───────┐
│ col1 │ col2 │ col3  │
├──────┼──────┼───────┤
│ 10.0 │ 1.1  │ True  │
├──────┼──────┼───────┤
│ 20.0 │ 2.2  │ False │
├──────┼──────┼───────┤
│ 30.0 │ 3.3  │ True  │
└──────┴──────┴───────┘
Notes
When casting to Datetime or Date from String, the transformer handles format parsing automatically.
If subset=None, all columns in the DataFrame will be cast to the specified dtype.
When inplace=True, the drop_columns parameter is ignored because the original columns are replaced.
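The interaction between inplace and drop_columns described above can be sketched in plain Python. This is only an illustration on a dict of lists, not the gators implementation: the helper name cast_columns_sketch, the fixed suffix, and the resulting column order are all assumptions made for the example.

```python
def cast_columns_sketch(table, subset, cast, inplace=True, drop_columns=True,
                        suffix="__cast_float64"):
    """Illustrative only: cast chosen columns of a dict-of-lists table."""
    out = dict(table)  # shallow copy so the input table is left untouched
    for col in subset:
        values = [cast(v) for v in table[col]]
        if inplace:
            # Original column is overwritten; drop_columns has no effect.
            out[col] = values
        else:
            # New suffixed column is added next to the original.
            out[col + suffix] = values
            if drop_columns:
                del out[col]
    return out


table = {"col1": ["10", "20"], "col3": [True, False]}
kept = cast_columns_sketch(table, ["col1"], float, inplace=False, drop_columns=False)
dropped = cast_columns_sketch(table, ["col1"], float, inplace=False, drop_columns=True)
replaced = cast_columns_sketch(table, ["col1"], float, inplace=True)
print(list(kept), list(dropped), list(replaced))
```

The same three-way behaviour (replace in place, add and keep, add and drop) is documented for the other transformers on this page that expose inplace and drop_columns.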
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.cast_columns.CastColumns
Fit the transformer by identifying columns to cast.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
CastColumns
- transform(X: polars.DataFrame, y: polars.Series | None = None) → polars.DataFrame
Transform the DataFrame by casting columns to the target type.
- Parameters:
X (pl.DataFrame) – Input DataFrame to transform.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
DataFrame with cast columns.
- Return type:
pl.DataFrame
- class gators.data_cleaning.DropColumns
Bases: gators.transformer._base_transformer._BaseTransformer
Drops specified columns from a DataFrame.
- Parameters:
subset (list[str]) – List of column names to drop.
Examples
Create an instance of the DropColumns class:
>>> import polars as pl
>>> from gators.data_cleaning import DropColumns
>>> drop_columns = DropColumns(subset=["col1", "col2"])
Fit the transformer:
>>> X = pl.DataFrame({"col1": [1, 2, 3],
...                   "col2": ["A", "B", "C"],
...                   "col3": [True, False, True]})
>>> drop_columns.fit(X)
Transform the DataFrame:
>>> transformed_X = drop_columns.transform(X)
>>> print(transformed_X)
shape: (3, 1)
┌───────┐
│ col3  │
├───────┤
│ True  │
├───────┤
│ False │
├───────┤
│ True  │
└───────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_columns.DropColumns
Fit the transformer (no-op for DropColumns).
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
DropColumns
- class gators.data_cleaning.DropHighNaNRatio
Bases: gators.transformer._base_transformer._BaseTransformer
Drops columns with a high ratio of NaN values.
- Parameters:
max_ratio (float) – Maximum allowed ratio of NaN values (0.0-1.0). Columns with NaN ratio >= max_ratio will be dropped.
subset (list[str], default=None) – List of columns to check for high NaN ratio. If None, all columns are checked.
Examples
Initializing and using DropHighNaNRatio transformer.
Example checking all columns (subset=None):
>>> import polars as pl
>>> from gators.data_cleaning import DropHighNaNRatio
>>> X = pl.DataFrame({
...     "col1": ["a", None, "b", "c"],
...     "col2": ["x", "x", "x", None],
...     "col3": [1, 2, None, None]
... })
>>> transformer = DropHighNaNRatio(max_ratio=0.5)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌──────┬──────┐
│ col1 │ col2 │
│ str  │ str  │
├──────┼──────┤
│ a    │ x    │
│ None │ x    │
│ b    │ x    │
│ c    │ None │
└──────┴──────┘
Example checking only a subset of columns:
>>> transformer = DropHighNaNRatio(max_ratio=0.5, subset=['col2', 'col3'])
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌──────┬──────┐
│ col1 │ col2 │
│ str  │ str  │
├──────┼──────┤
│ a    │ x    │
│ None │ x    │
│ b    │ x    │
│ c    │ None │
└──────┴──────┘
Example with a threshold that no column reaches (subset=None):
>>> transformer = DropHighNaNRatio(max_ratio=0.75)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ None │ x    │ 2    │
│ b    │ x    │ None │
│ c    │ None │ None │
└──────┴──────┴──────┘
Example with a threshold that no column reaches, checking a subset:
>>> transformer = DropHighNaNRatio(max_ratio=0.75, subset=['col2', 'col3'])
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ None │ x    │ 2    │
│ b    │ x    │ None │
│ c    │ None │ None │
└──────┴──────┴──────┘
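The selection rule stated in the parameters (drop a column when its missing-value ratio is >= max_ratio) can be sketched without any DataFrame library. The helper below is hypothetical and works on a dict of lists, treating None as the missing value; it is not the gators implementation and assumes non-empty columns.

```python
def high_null_columns(table, max_ratio, subset=None):
    """Illustrative only: names of columns whose null ratio is >= max_ratio."""
    columns = subset if subset is not None else list(table)
    to_drop = []
    for col in columns:
        values = table[col]
        ratio = sum(v is None for v in values) / len(values)
        if ratio >= max_ratio:
            to_drop.append(col)
    return to_drop


table = {
    "col1": ["a", None, "b", "c"],   # ratio 0.25
    "col2": ["x", "x", "x", None],   # ratio 0.25
    "col3": [1, 2, None, None],      # ratio 0.50
}
print(high_null_columns(table, max_ratio=0.5))
print(high_null_columns(table, max_ratio=0.5, subset=["col1", "col2"]))
```

Because the comparison is inclusive, a column whose ratio equals max_ratio exactly is selected for removal in this sketch.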
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_high_nan_ratio.DropHighNaNRatio
Fit the transformer by identifying columns with high NaN ratios.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
DropHighNaNRatio
- class gators.data_cleaning.DropLowCardinality
Bases: gators.transformer._base_transformer._BaseTransformer
Drops columns with low cardinality.
- Parameters:
min_count (int) – Minimum number of unique values for a column to be retained. Must be >= 1. Columns with unique count < min_count will be dropped.
subset (list[str], default=None) – List of columns to check for low cardinality. If None, all string, boolean, and categorical columns are checked.
Examples
Initializing and using DropLowCardinality transformer.
Example where low-cardinality columns are dropped (subset=None):
>>> import polars as pl
>>> from gators.data_cleaning import DropLowCardinality
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> transformer = DropLowCardinality(min_count=4)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 1)
┌──────┐
│ col3 │
│ i64  │
├──────┤
│ 1    │
│ 2    │
│ 3    │
│ 4    │
└──────┘
Example where low-cardinality columns are dropped, checking a subset:
>>> transformer = DropLowCardinality(min_count=4, subset=['col1'])
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌──────┬──────┐
│ col2 │ col3 │
│ str  │ i64  │
├──────┼──────┤
│ x    │ 1    │
│ x    │ 2    │
│ x    │ 3    │
│ y    │ 4    │
└──────┴──────┘
Example where every checked column meets min_count (subset=None):
>>> transformer = DropLowCardinality(min_count=2)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ a    │ x    │ 2    │
│ b    │ x    │ 3    │
│ c    │ y    │ 4    │
└──────┴──────┴──────┘
Example where every checked column meets min_count, checking a subset:
>>> transformer = DropLowCardinality(min_count=2, subset=['col1'])
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ a    │ x    │ 2    │
│ b    │ x    │ 3    │
│ c    │ y    │ 4    │
└──────┴──────┴──────┘
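The documented rule (drop a checked column when its number of unique values is < min_count) can be sketched as follows. The helper is hypothetical and operates on a dict of lists; the way it picks default columns (string or boolean values only) is an assumption standing in for the library's dtype-based selection.

```python
def low_cardinality_columns(table, min_count, subset=None):
    """Illustrative only: names of checked columns with fewer than min_count unique values."""
    if subset is None:
        # Assumption: only string/boolean columns are checked by default.
        subset = [
            name for name, values in table.items()
            if all(isinstance(v, (str, bool)) for v in values)
        ]
    return [name for name in subset if len(set(table[name])) < min_count]


table = {
    "col1": ["a", "a", "b", "c"],  # 3 unique values
    "col2": ["x", "x", "x", "y"],  # 2 unique values
    "col3": [1, 2, 3, 4],          # numeric, not checked by default
}
print(low_cardinality_columns(table, min_count=2))
print(low_cardinality_columns(table, min_count=3))
print(low_cardinality_columns(table, min_count=4))
```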
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_low_cardinality.DropLowCardinality
Fit the transformer by identifying columns with low cardinality.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
DropLowCardinality
- class gators.data_cleaning.VarianceFilter
Bases: gators.transformer._base_transformer._BaseTransformer
Removes numerical columns with a low variance.
- Parameters:
subset (list[str], default=None) – List of numeric columns to check for variance. If None, all numeric columns are checked.
min_var (float) – Minimum variance threshold. Columns with variance <= min_var will be dropped. Must be >= 0.0.
Examples
Initialize and use VarianceFilter.
Example with all numeric columns:
>>> import polars as pl
>>> from gators.data_cleaning import VarianceFilter
>>> X = pl.DataFrame({
...     "feature1": [1, 2, 3, 4],
...     "feature2": [0.5, 0.5, 0.5, 0.5],  # Low variance
...     "feature3": [5, 6, 7, 8],
...     "label": [0, 1, 0, 1]
... })
>>> transformer = VarianceFilter(min_var=0.1)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────────┬──────────┬───────┐
│ feature1 │ feature3 │ label │
│ i64      │ i64      │ i64   │
├──────────┼──────────┼───────┤
│ 1        │ 5        │ 0     │
│ 2        │ 6        │ 1     │
│ 3        │ 7        │ 0     │
│ 4        │ 8        │ 1     │
└──────────┴──────────┴───────┘
Example with specific columns:
>>> X = pl.DataFrame({
...     "feature1": [1, 2, 3, 4],
...     "feature2": [0.5, 0.5, 0.5, 0.5],
...     "feature3": [5, 6, 7, 8],
...     "label": [0, 1, 0, 1]
... })
>>> transformer = VarianceFilter(subset=['feature1', 'feature3'], min_var=0.1)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 4)
┌──────────┬──────────┬──────────┬───────┐
│ feature1 │ feature3 │ feature2 │ label │
│ i64      │ i64      │ f64      │ i64   │
├──────────┼──────────┼──────────┼───────┤
│ 1        │ 5        │ 0.5      │ 0     │
│ 2        │ 6        │ 0.5      │ 1     │
│ 3        │ 7        │ 0.5      │ 0     │
│ 4        │ 8        │ 0.5      │ 1     │
└──────────┴──────────┴──────────┴───────┘
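The threshold rule (drop a numeric column when its variance is <= min_var) can be sketched with the standard library. The helper below is hypothetical and uses the sample variance from the statistics module; whether gators uses the sample or the population variance is not stated here, so treat that choice as an assumption.

```python
from statistics import variance


def low_variance_columns(table, min_var, subset=None):
    """Illustrative only: names of numeric columns with sample variance <= min_var."""
    if subset is None:
        # Assumption: a column is numeric if every value is an int or float (not bool).
        subset = [
            name for name, values in table.items()
            if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
        ]
    return [name for name in subset if variance(table[name]) <= min_var]


table = {
    "feature1": [1, 2, 3, 4],
    "feature2": [0.5, 0.5, 0.5, 0.5],  # variance is exactly 0
    "feature3": [5, 6, 7, 8],
    "label": [0, 1, 0, 1],
}
print(low_variance_columns(table, min_var=0.1))
print(low_variance_columns(table, min_var=0.1, subset=["feature1", "feature3"]))
```

With subset restricted to feature1 and feature3, the constant feature2 column is never inspected, which is why it survives in the second documented example.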
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.variance_filter.VarianceFilter
Fit the transformer by identifying low-variance columns.
- Parameters:
X (pl.DataFrame) – Input DataFrame with numeric columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
VarianceFilter
- class gators.data_cleaning.Replace
Bases: gators.transformer._base_transformer._BaseTransformer
Replaces values in specified columns.
- Parameters:
to_replace (dict[str, dict[str, any]]) – Nested dictionary specifying replacement mappings. Outer keys are column names, inner dictionaries map old values to new values.
inplace (bool, default=True) – If True, replace values in the original columns. If False, create new columns with suffix ‘__replace’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after replacement. Ignored when inplace=True.
Examples
Initializing and using Replace transformer.
Example with inplace=False and drop_columns=True:
>>> import polars as pl
>>> from gators.data_cleaning import Replace
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo", "c": "charlie"},
...     "col2": {"x": "x-ray", "y": "yankee"}
... }
>>> transformer = Replace(to_replace=replace_map, inplace=False, drop_columns=True)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬───────────────┬───────────────┐
│ col3 │ col1__replace │ col2__replace │
│ i64  │ str           │ str           │
├──────┼───────────────┼───────────────┤
│ 1    │ alpha         │ x-ray         │
│ 2    │ alpha         │ x-ray         │
│ 3    │ bravo         │ x-ray         │
│ 4    │ charlie       │ yankee        │
└──────┴───────────────┴───────────────┘
Example with inplace=True (the default, so drop_columns is ignored), replacing values in a single column:
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo", "c": "charlie"}
... }
>>> transformer = Replace(to_replace=replace_map, drop_columns=True)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────────┬──────┬──────┐
│ col1    │ col2 │ col3 │
│ str     │ str  │ i64  │
├─────────┼──────┼──────┤
│ alpha   │ x    │ 1    │
│ alpha   │ x    │ 2    │
│ bravo   │ x    │ 3    │
│ charlie │ y    │ 4    │
└─────────┴──────┴──────┘
Example with inplace=False and drop_columns=False:
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo", "c": "charlie"},
...     "col2": {"x": "x-ray", "y": "yankee"}
... }
>>> transformer = Replace(to_replace=replace_map, inplace=False, drop_columns=False)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 5)
┌──────┬──────┬──────┬───────────────┬───────────────┐
│ col1 │ col2 │ col3 │ col1__replace │ col2__replace │
│ str  │ str  │ i64  │ str           │ str           │
├──────┼──────┼──────┼───────────────┼───────────────┤
│ a    │ x    │ 1    │ alpha         │ x-ray         │
│ a    │ x    │ 2    │ alpha         │ x-ray         │
│ b    │ x    │ 3    │ bravo         │ x-ray         │
│ c    │ y    │ 4    │ charlie       │ yankee        │
└──────┴──────┴──────┴───────────────┴───────────────┘
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.replace.Replace
Fit the transformer.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
Self
- class gators.data_cleaning.CorrelationFilter
Bases: gators.transformer._base_transformer._BaseTransformer
Filters out highly correlated numeric columns.
Identifies groups of highly correlated columns and removes all but one from each group, helping to reduce multicollinearity in the dataset.
- Parameters:
subset (list[str], default=None) – List of numeric columns to consider for correlation filtering. If None, all numeric columns are used.
max_corr (float) – Maximum allowed absolute correlation between columns. Must be > 0 and <= 1. Columns with correlation >= max_corr are considered highly correlated.
Examples
>>> from gators.data_cleaning import CorrelationFilter
>>> import polars as pl
>>> X = {'A': [1, 2, 3, 4],
...      'B': [4, 3, 2, 1],
...      'C': [1, 2, 1, 2],
...      'y': [1, 1, 0, 0]}
>>> X = pl.DataFrame(X)
>>> # Example 1
>>> corr_filter = CorrelationFilter(max_corr=0.9)
>>> _ = corr_filter.fit(X)
>>> result = corr_filter.transform(X)
>>> result
shape: (4, 2)
┌─────┬─────┐
│ C   │ y   │
│ i64 │ i64 │
├─────┼─────┤
│ 1   │ 1   │
│ 2   │ 1   │
│ 1   │ 0   │
│ 2   │ 0   │
└─────┴─────┘
>>> # Example 2
>>> corr_filter = CorrelationFilter(subset=['A', 'B'], max_corr=1)
>>> _ = corr_filter.fit(X)
>>> result = corr_filter.transform(X)
>>> result
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A   │ B   │ C   │ y   │
│ i64 │ i64 │ i64 │ i64 │
├─────┼─────┼─────┼─────┤
│ 1   │ 4   │ 1   │ 1   │
│ 2   │ 3   │ 2   │ 1   │
│ 3   │ 2   │ 1   │ 0   │
│ 4   │ 1   │ 2   │ 0   │
└─────┴─────┴─────┴─────┘
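One common way to realise "keep one column per correlated group" is a greedy pass that keeps a column unless its absolute Pearson correlation with an already kept column is >= max_corr. The sketch below is pure Python on a dict of lists; the greedy keep-first strategy and the helper names are assumptions, so the columns it selects may differ from what CorrelationFilter actually drops.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (non-constant inputs assumed)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_columns_to_drop(table, max_corr):
    """Illustrative only: greedily drop columns too correlated with an earlier kept column."""
    kept, dropped = [], []
    for name in table:
        if any(abs(pearson(table[name], table[k])) >= max_corr for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return dropped


table = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 2, 1, 2], "y": [1, 1, 0, 0]}
print(correlated_columns_to_drop(table, max_corr=0.9))
```

Here A and B are perfectly anti-correlated, so the sketch keeps A and drops B; the correlation between A and y is about -0.894, just under the 0.9 threshold.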
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.correlation_filter.CorrelationFilter
Fit the transformer by identifying highly correlated columns to drop.
- Parameters:
X (pl.DataFrame) – Input DataFrame with numeric columns.
y (pl.Series, default=None) – Target series (not used, present for sklearn compatibility).
- Returns:
The fitted transformer instance.
- Return type:
CorrelationFilter
- class gators.data_cleaning.DropDuplicateColumns
Bases: gators.transformer._base_transformer._BaseTransformer
Removes duplicate columns from the DataFrame.
Identifies and removes columns that have identical values across all rows. This is useful for reducing dimensionality and removing redundant features that don’t add predictive value.
- Parameters:
keep (str, default='first') – Strategy for keeping duplicate columns:
'first': Keep first occurrence of duplicate columns
'last': Keep last occurrence of duplicate columns
Examples
Example 1: Remove duplicate columns (keep first)
>>> from gators.data_cleaning import DropDuplicateColumns
>>> import polars as pl
>>> X = pl.DataFrame({
...     'A': [1, 2, 3, 4],
...     'B': [5, 6, 7, 8],
...     'C': [1, 2, 3, 4],  # Duplicate of A
...     'D': [9, 10, 11, 12],
...     'E': [5, 6, 7, 8]   # Duplicate of B
... })
>>> remover = DropDuplicateColumns(keep='first')
>>> remover.fit(X)
>>> result = remover.transform(X)
>>> print(result)
shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ D   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
├─────┼─────┼─────┤
│ 1   ┆ 5   ┆ 9   │
│ 2   ┆ 6   ┆ 10  │
│ 3   ┆ 7   ┆ 11  │
│ 4   ┆ 8   ┆ 12  │
└─────┴─────┴─────┘
Example 2: Remove duplicate columns (keep last)
>>> X = pl.DataFrame({
...     'feature_1': [1.0, 2.0, 3.0],
...     'feature_2': [4.0, 5.0, 6.0],
...     'feature_3': [1.0, 2.0, 3.0],  # Duplicate of feature_1
...     'target': [0, 1, 0]
... })
>>> remover = DropDuplicateColumns(keep='last')
>>> remover.fit(X)
>>> print(f"Columns to drop: {remover.columns_to_drop_}")
Columns to drop: ['feature_1']
>>> print(f"Column groups: {remover.column_groups_}")
Column groups: {'feature_3': ['feature_1']}
>>> result = remover.transform(X)
>>> print(result)
shape: (3, 3)
┌───────────┬───────────┬────────┐
│ feature_2 ┆ feature_3 ┆ target │
│ ---       ┆ ---       ┆ ---    │
│ f64       ┆ f64       ┆ i64    │
├───────────┼───────────┼────────┤
│ 4.0       ┆ 1.0       ┆ 0      │
│ 5.0       ┆ 2.0       ┆ 1      │
│ 6.0       ┆ 3.0       ┆ 0      │
└───────────┴───────────┴────────┘
Example 3: Check duplicate groups
>>> X = pl.DataFrame({
...     'a': [1, 2, 3],
...     'b': [1, 2, 3],
...     'c': [1, 2, 3],
...     'd': [4, 5, 6]
... })
>>> remover = DropDuplicateColumns()
>>> remover.fit(X)
>>> print(f"Kept column groups: {remover.column_groups_}")
Kept column groups: {'a': ['b', 'c']}
>>> result = remover.transform(X)
>>> print(result.columns)
['a', 'd']
Example 4: No duplicates
>>> X = pl.DataFrame({
...     'x': [1, 2, 3],
...     'y': [4, 5, 6],
...     'z': [7, 8, 9]
... })
>>> remover = DropDuplicateColumns()
>>> remover.fit(X)
>>> print(f"Columns to drop: {remover.columns_to_drop_}")
Columns to drop: []
>>> result = remover.transform(X)
>>> print(result.shape)
(3, 3)
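The grouping reported by column_groups_ in the examples above can be reproduced with a small dictionary-based sketch: columns are keyed by their full tuple of values, and later columns with an already seen key are attached to the column that is kept. The helper name is hypothetical and the code is only an illustration of the idea, not the gators implementation.

```python
def duplicate_column_groups(table, keep="first"):
    """Illustrative only: map each kept column to the duplicate columns it replaces."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    names = list(table)
    if keep == "last":
        names = names[::-1]  # scan from the right so the last occurrence is kept
    kept_by_values = {}  # tuple of column values -> name of the kept column
    groups = {}
    for name in names:
        key = tuple(table[name])
        if key in kept_by_values:
            groups.setdefault(kept_by_values[key], []).append(name)
        else:
            kept_by_values[key] = name
    return groups


first = duplicate_column_groups(
    {"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3], "d": [4, 5, 6]}
)
last = duplicate_column_groups(
    {
        "feature_1": [1.0, 2.0, 3.0],
        "feature_2": [4.0, 5.0, 6.0],
        "feature_3": [1.0, 2.0, 3.0],
        "target": [0, 1, 0],
    },
    keep="last",
)
print(first, last)
```

The columns to drop are then simply every name that appears in one of the group lists.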
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_duplicate_columns.DropDuplicateColumns
Fit the transformer by identifying duplicate columns.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.
- Returns:
Fitted transformer instance.
- Return type:
DropDuplicateColumns
- Raises:
ValueError – If keep parameter is not ‘first’ or ‘last’.
- class gators.data_cleaning.DropDuplicateRows
Bases: gators.transformer._base_transformer._BaseTransformer
Removes duplicate rows from the DataFrame.
Identifies and removes duplicate rows based on all columns or a subset of columns. Critical for preventing data leakage and ensuring data quality.
- Parameters:
subset (list[str], default=None) – List of columns to consider for identifying duplicates. If None, all columns are used.
keep (str, default='first') – Strategy for keeping duplicates:
'first': Keep first occurrence, drop subsequent duplicates
'last': Keep last occurrence, drop previous duplicates
'none': Drop all duplicates (keep no occurrences)
Examples
Example 1: Remove full duplicate rows (keep first)
>>> from gators.data_cleaning import DropDuplicateRows
>>> import polars as pl
>>> X = pl.DataFrame({
...     'id': [1, 2, 2, 3, 4, 4],
...     'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],
...     'age': [25, 30, 30, 35, 40, 40]
... })
>>> remover = DropDuplicateRows(keep='first')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (4, 3)
┌─────┬─────────┬─────┐
│ id  ┆ name    ┆ age │
│ --- ┆ ---     ┆ --- │
│ i64 ┆ str     ┆ i64 │
├─────┼─────────┼─────┤
│ 1   ┆ Alice   ┆ 25  │
│ 2   ┆ Bob     ┆ 30  │
│ 3   ┆ Charlie ┆ 35  │
│ 4   ┆ David   ┆ 40  │
└─────┴─────────┴─────┘
Example 2: Remove duplicates based on subset (keep last)
>>> X = pl.DataFrame({
...     'id': [1, 2, 3, 4],
...     'name': ['Alice', 'Bob', 'Alice', 'Bob'],
...     'score': [85, 90, 88, 92]
... })
>>> remover = DropDuplicateRows(subset=['name'], keep='last')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (2, 3)
┌─────┬───────┬───────┐
│ id  ┆ name  ┆ score │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ i64   │
├─────┼───────┼───────┤
│ 3   ┆ Alice ┆ 88    │
│ 4   ┆ Bob   ┆ 92    │
└─────┴───────┴───────┘
Example 3: Drop all duplicate occurrences (keep none)
>>> X = pl.DataFrame({
...     'user_id': [1, 2, 2, 3, 4, 4, 5],
...     'action': ['login', 'view', 'view', 'click', 'buy', 'buy', 'logout']
... })
>>> remover = DropDuplicateRows(subset=['user_id'], keep='none')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 2)
┌─────────┬────────┐
│ user_id ┆ action │
│ ---     ┆ ---    │
│ i64     ┆ str    │
├─────────┼────────┤
│ 1       ┆ login  │
│ 3       ┆ click  │
│ 5       ┆ logout │
└─────────┴────────┘
Example 4: Check for duplicates without subset
>>> X = pl.DataFrame({
...     'a': [1, 1, 2],
...     'b': [10, 10, 20],
...     'c': [100, 100, 200]
... })
>>> remover = DropDuplicateRows()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
├─────┼─────┼─────┤
│ 1   ┆ 10  ┆ 100 │
│ 2   ┆ 20  ┆ 200 │
└─────┴─────┴─────┘
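The three keep strategies can be compared side by side with a small sketch that works on a list of row dictionaries. The helper name and the row representation are assumptions made for illustration; the real transformer operates on a polars DataFrame.

```python
from collections import Counter


def drop_duplicate_rows(rows, subset=None, keep="first"):
    """Illustrative only: deduplicate a list of dict rows, preserving row order."""
    def key(row):
        columns = subset if subset is not None else list(row)
        return tuple(row[c] for c in columns)

    if keep == "none":
        # Keep only rows whose key occurs exactly once.
        counts = Counter(key(r) for r in rows)
        return [r for r in rows if counts[key(r)] == 1]

    ordered = rows if keep == "first" else rows[::-1]
    seen, kept = set(), []
    for row in ordered:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept if keep == "first" else kept[::-1]


rows = [
    {"id": 1, "name": "Alice", "score": 85},
    {"id": 2, "name": "Bob", "score": 90},
    {"id": 3, "name": "Alice", "score": 88},
    {"id": 4, "name": "Bob", "score": 92},
]
for strategy in ("first", "last", "none"):
    print(strategy, [r["id"] for r in drop_duplicate_rows(rows, subset=["name"], keep=strategy)])
```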
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_duplicate_rows.DropDuplicateRows
Fit the transformer by validating parameters.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.
- Returns:
Fitted transformer instance.
- Return type:
DropDuplicateRows
- Raises:
ValueError – If subset columns are specified but not found in DataFrame.
- class gators.data_cleaning.DropConstantColumns
Bases: gators.transformer._base_transformer._BaseTransformer
Removes columns that have only a single unique value (constant columns).
Identifies and removes columns with zero information content. More specific than VarianceFilter (which only works on numerics) and faster than variance calculation. Handles both numeric and categorical constant columns.
- Parameters:
subset (list[str], default=None) – List of columns to check for constant values. If None, all columns are checked.
include_na (bool, default=True) – Whether to count NaN/null as a unique value. If True, a column with all NaN is considered constant. If False, NaN values are ignored when counting unique values.
Examples
Example 1: Remove constant numeric column
>>> from gators.data_cleaning import DropConstantColumns
>>> import polars as pl
>>> X = pl.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'constant_num': [42, 42, 42, 42, 42],
...     'varying': [10, 20, 30, 40, 50]
... })
>>> remover = DropConstantColumns()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (5, 2)
┌─────┬─────────┐
│ id  ┆ varying │
│ --- ┆ ---     │
│ i64 ┆ i64     │
├─────┼─────────┤
│ 1   ┆ 10      │
│ 2   ┆ 20      │
│ 3   ┆ 30      │
│ 4   ┆ 40      │
│ 5   ┆ 50      │
└─────┴─────────┘
Example 2: Remove constant categorical column
>>> X = pl.DataFrame({
...     'country': ['USA', 'USA', 'USA', 'USA'],
...     'city': ['NYC', 'LA', 'Chicago', 'Boston'],
...     'status': ['active', 'active', 'active', 'active']
... })
>>> remover = DropConstantColumns()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (4, 1)
┌─────────┐
│ city    │
│ ---     │
│ str     │
├─────────┤
│ NYC     │
│ LA      │
│ Chicago │
│ Boston  │
└─────────┘
Example 3: Handle NaN values (with include_na=True)
>>> X = pl.DataFrame({
...     'all_null': [None, None, None],
...     'mixed': [1, None, 1],
...     'varying': [1, 2, 3]
... })
>>> remover = DropConstantColumns(include_na=True)
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 2)
┌───────┬─────────┐
│ mixed ┆ varying │
│ ---   ┆ ---     │
│ i64   ┆ i64     │
├───────┼─────────┤
│ 1     ┆ 1       │
│ null  ┆ 2       │
│ 1     ┆ 3       │
└───────┴─────────┘
Example 4: Handle NaN values (with include_na=False)
>>> remover = DropConstantColumns(include_na=False)
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 1)
┌─────────┐
│ varying │
│ ---     │
│ i64     │
├─────────┤
│ 1       │
│ 2       │
│ 3       │
└─────────┘
Example 5: Subset of columns
>>> X = pl.DataFrame({
...     'col1': [1, 1, 1],
...     'col2': [5, 5, 5],
...     'col3': [10, 20, 30]
... })
>>> remover = DropConstantColumns(subset=['col1', 'col2'])
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 1)
┌──────┐
│ col3 │
│ ---  │
│ i64  │
├──────┤
│ 10   │
│ 20   │
│ 30   │
└──────┘
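The effect of include_na in Examples 3 and 4 can be sketched by counting unique values with and without nulls. The helper is hypothetical, works on a dict of lists with None as the null value, and assumes a column counts as constant when it has at most one unique value after the optional null filtering.

```python
def constant_columns(table, include_na=True, subset=None):
    """Illustrative only: names of columns with at most one unique value."""
    columns = subset if subset is not None else list(table)
    constant = []
    for name in columns:
        values = table[name]
        if not include_na:
            # Nulls are ignored, so [1, None, 1] collapses to a single value.
            values = [v for v in values if v is not None]
        if len(set(values)) <= 1:
            constant.append(name)
    return constant


table = {"all_null": [None, None, None], "mixed": [1, None, 1], "varying": [1, 2, 3]}
print(constant_columns(table, include_na=True))
print(constant_columns(table, include_na=False))
```

With include_na=True the mixed column holds two distinct values (1 and null) and is kept; with include_na=False only the value 1 remains, so it is treated as constant.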
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.drop_constant_columns.DropConstantColumns
Fit the transformer by identifying constant columns.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.
- Returns:
Fitted transformer instance.
- Return type:
DropConstantColumns
- class gators.data_cleaning.HighCardinalityFilter
Bases: gators.transformer._base_transformer._BaseTransformer
Removes columns with too many unique values (high cardinality).
Identifies and removes columns with excessive cardinality, which can cause issues for tree-based models (memory, overfitting) and create sparse encodings. Common use case: remove ID-like columns, timestamps, or free-text fields.
Opposite of DropLowCardinality. Can filter by absolute count threshold or by ratio of unique values to total rows.
- Parameters:
subset (list[str], default=None) – List of columns to check for high cardinality. If None, all columns are checked.
max_unique (int, default=None) – Maximum number of unique values allowed. Columns with more unique values will be removed. If None, no absolute threshold is applied.
max_ratio (float, default=None) – Maximum ratio of unique values to total rows. Must be between 0 and 1. For example, 0.9 means columns where >90% of rows are unique will be removed. If None, no ratio threshold is applied.
ignore_na (bool, default=True) – Whether to ignore NaN/null values when counting unique values. If True, NaN is not counted as a unique value.
Examples
Example 1: Remove by absolute count
>>> from gators.data_cleaning import HighCardinalityFilter
>>> import polars as pl
>>> X = pl.DataFrame({
...     'user_id': range(1000),
...     'country': ['USA'] * 500 + ['UK'] * 500,
...     'transaction_id': [f'tx_{i}' for i in range(1000)]
... })
>>> filter = HighCardinalityFilter(max_unique=100)
>>> result = filter.fit_transform(X)
>>> print(result)
shape: (1000, 1)
┌─────────┐
│ country │
│ ---     │
│ str     │
├─────────┤
│ USA     │
│ USA     │
│ ...     │
│ UK      │
│ UK      │
└─────────┘
Example 2: Remove by ratio
>>> X = pl.DataFrame({
...     'id': range(100),
...     'category': ['A', 'B', 'C'] * 33 + ['A'],
...     'subcategory': ['X', 'Y'] * 50
... })
>>> filter = HighCardinalityFilter(max_ratio=0.95)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['category', 'subcategory']
Example 3: Combined thresholds
>>> X = pl.DataFrame({
...     'col1': range(50),             # 50 unique, ratio=1.0
...     'col2': list(range(25)) * 2,   # 25 unique, ratio=0.5
...     'col3': ['A', 'B'] * 25        # 2 unique, ratio=0.04
... })
>>> filter = HighCardinalityFilter(max_unique=30, max_ratio=0.8)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['col2', 'col3']
Example 4: Handling NaN
>>> X = pl.DataFrame({
...     'col1': [1, 2, 3, None, None] * 20,      # 3 unique + NaN
...     'col2': list(range(90)) + [None] * 10    # 90 unique + NaN
... })
>>> filter = HighCardinalityFilter(max_unique=50, ignore_na=True)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['col1']
Example 5: Subset of columns
>>> X = pl.DataFrame({
...     'id1': range(100),
...     'id2': range(100),
...     'feature': ['A', 'B'] * 50
... })
>>> filter = HighCardinalityFilter(subset=['id1', 'id2'], max_unique=50)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['feature']
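The two thresholds can be sketched together as follows. The helper is hypothetical and works on a dict of lists; it assumes that a column is removed as soon as either threshold is exceeded and that the ratio is computed against the total row count, neither of which is confirmed by the text above.

```python
def high_cardinality_columns(table, max_unique=None, max_ratio=None,
                             ignore_na=True, subset=None):
    """Illustrative only: names of columns exceeding either cardinality threshold."""
    columns = subset if subset is not None else list(table)
    to_drop = []
    for name in columns:
        values = table[name]
        uniques = set(values)
        if ignore_na:
            uniques.discard(None)  # null is not counted as a unique value
        n_unique = len(uniques)
        ratio = n_unique / len(values)
        over_count = max_unique is not None and n_unique > max_unique
        over_ratio = max_ratio is not None and ratio > max_ratio
        if over_count or over_ratio:
            to_drop.append(name)
    return to_drop


combined = {
    "col1": list(range(50)),       # 50 unique, ratio 1.0
    "col2": list(range(25)) * 2,   # 25 unique, ratio 0.5
    "col3": ["A", "B"] * 25,       # 2 unique, ratio 0.04
}
with_nulls = {
    "col1": [1, 2, 3, None, None] * 20,
    "col2": list(range(90)) + [None] * 10,
}
print(high_cardinality_columns(combined, max_unique=30, max_ratio=0.8))
print(high_cardinality_columns(with_nulls, max_unique=50))
```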
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.high_cardinality_filter.HighCardinalityFilter
Fit the transformer by identifying high-cardinality columns.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Target variable. Not used, present here for compatibility.
- Returns:
Fitted transformer instance.
- Return type:
HighCardinalityFilter
- class gators.data_cleaning.RoundSignificantDigits
Bases: gators.transformer._base_transformer._BaseTransformer
Rounds selected numeric columns to a given number of significant figures.
Significant-figure rounding preserves the most meaningful digits of a value regardless of its magnitude (e.g., with n_digits=3: 0.001234 → 0.00123, 1234.0 → 1230.0, -9876.5 → -9880.0).
- Parameters:
n_digits (int) – Number of significant figures to keep. Must be >= 1.
subset (list[str], default=None) – Columns to round. When None, all numeric columns in the DataFrame are rounded automatically.
inplace (bool, default=True) – If True, the original columns are replaced in-place. If False, new columns named {col}__round_{n_digits}sig are added alongside the originals.
drop_columns (bool, default=True) – Relevant only when inplace=False. If True, the original columns are dropped after the new rounded columns are added. Ignored when inplace=True.
Examples
Example 1: Round all numeric columns in-place (default)
>>> import polars as pl
>>> from gators.data_cleaning import RoundSignificantDigits
>>> X = pl.DataFrame({
...     "a": [0.001234, 1234.0, -9876.5],
...     "b": [3.14159, 0.0, 9.9999],
...     "label": ["x", "y", "z"],
... })
>>> transformer = RoundSignificantDigits(n_digits=3)
>>> transformer.fit(X)
RoundSignificantDigits(n_digits=3, subset=None, inplace=True, drop_columns=True)
>>> print(transformer.transform(X))
shape: (3, 3)
┌─────────┬──────┬───────┐
│ a       ┆ b    ┆ label │
│ ---     ┆ ---  ┆ ---   │
│ f64     ┆ f64  ┆ str   │
╞═════════╪══════╪═══════╡
│ 0.00123 ┆ 3.14 ┆ x     │
│ 1230.0  ┆ 0.0  ┆ y     │
│ -9880.0 ┆ 10.0 ┆ z     │
└─────────┴──────┴───────┘
Example 2: Add rounded columns without dropping originals
>>> transformer = RoundSignificantDigits(
...     n_digits=2, subset=["a"], inplace=False, drop_columns=False
... )
>>> transformer.fit(X)
RoundSignificantDigits(n_digits=2, subset=['a'], inplace=False, drop_columns=False)
>>> print(transformer.transform(X))
shape: (3, 4)
┌──────────┬─────────┬───────┬───────────────┐
│ a        ┆ b       ┆ label ┆ a__round_2sig │
│ ---      ┆ ---     ┆ ---   ┆ ---           │
│ f64      ┆ f64     ┆ str   ┆ f64           │
╞══════════╪═════════╪═══════╪═══════════════╡
│ 0.001234 ┆ 3.14159 ┆ x     ┆ 0.0012        │
│ 1234.0   ┆ 0.0     ┆ y     ┆ 1200.0        │
│ -9876.5  ┆ 9.9999  ┆ z     ┆ -9900.0       │
└──────────┴─────────┴───────┴───────────────┘
Example 3: Add rounded columns and drop originals
>>> transformer = RoundSignificantDigits(
...     n_digits=2, subset=["a"], inplace=False, drop_columns=True
... )
>>> transformer.fit(X)
RoundSignificantDigits(n_digits=2, subset=['a'], inplace=False, drop_columns=True)
>>> print(transformer.transform(X))
shape: (3, 3)
┌─────────┬───────┬───────────────┐
│ b       ┆ label ┆ a__round_2sig │
│ ---     ┆ ---   ┆ ---           │
│ f64     ┆ str   ┆ f64           │
╞═════════╪═══════╪═══════════════╡
│ 3.14159 ┆ x     ┆ 0.0012        │
│ 0.0     ┆ y     ┆ 1200.0        │
│ 9.9999  ┆ z     ┆ -9900.0       │
└─────────┴───────┴───────────────┘
Notes
All numeric columns (including integers) are cast to Float64 during the rounding computation; the output columns therefore have dtype Float64.
Zero values are returned as 0.0 (log10(0) is undefined).
Null values propagate unchanged.
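The notes above hint at a log10-based computation. A scalar sketch of significant-figure rounding along those lines is shown below; the function name round_sig is hypothetical and this is only one standard way to do it, not necessarily how the transformer is implemented internally.

```python
import math


def round_sig(value, n_digits):
    """Illustrative only: round a number to n_digits significant figures."""
    if value is None:
        return None   # nulls propagate unchanged
    if value == 0:
        return 0.0    # log10(0) is undefined, so zero is handled separately
    # Position of the leading digit: 3 for 1234.0, -3 for 0.001234.
    exponent = math.floor(math.log10(abs(value)))
    return round(float(value), n_digits - 1 - exponent)


for x in [0.001234, 1234.0, -9876.5, 3.14159, 0.0, 9.9999, None]:
    print(x, round_sig(x, 3))
```

A negative second argument to round() rounds to the left of the decimal point, which is how 1234.0 becomes 1230.0 with three significant figures.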
- fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.data_cleaning.round_significant_digits.RoundSignificantDigits
Fit the transformer by recording which columns to round.
- Parameters:
X (pl.DataFrame) – Input DataFrame.
y (pl.Series, default=None) – Ignored; present for sklearn compatibility.
- Returns:
Fitted transformer instance.
- Return type:
RoundSignificantDigits