gators.data_cleaning package#
Module contents#
- class gators.data_cleaning.RenameColumns[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Renames columns based on a provided mapping.
- Parameters:
column_mapping (Dict[str, str]) – Dictionary mapping original column names to new column names.
Examples
Example when renaming all columns:
>>> import polars as pl
>>> from gators.data_cleaning import RenameColumns
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> transformer = RenameColumns(column_mapping={"col1": "column1", "col2": "column2", "col3": "column3"})
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌─────────┬─────────┬─────────┐
│ column1 │ column2 │ column3 │
│ str     │ str     │ i64     │
├─────────┼─────────┼─────────┤
│ a       │ x       │ 1       │
│ a       │ x       │ 2       │
│ b       │ x       │ 3       │
│ c       │ y       │ 4       │
└─────────┴─────────┴─────────┘
- class gators.data_cleaning.CastColumns[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Casts specified columns to a given data type.
- Parameters:
subset (Optional[List[str]], default=None) – List of column names to cast. If None, all columns will be cast.
dtype (type) – Target Polars data type (e.g., pl.Float64, pl.String, pl.Int64, pl.Datetime, pl.Date).
inplace (bool, default=True) – If True, cast values in the original columns. If False, create new columns with suffix ‘__cast_{dtype}’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after casting. Ignored when inplace=True.
Examples
Example 1: Cast columns with inplace=False and keep originals
>>> import polars as pl
>>> from gators.data_cleaning import CastColumns
>>> X = pl.DataFrame({
...     "col1": ["10", "20", "30"],
...     "col2": ["1.1", "2.2", "3.3"],
...     "col3": [True, False, True]
... })
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=False,
...     drop_columns=False
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 5)
┌──────┬──────┬────────────────────┬────────────────────┬───────┐
│ col1 │ col2 │ col1__cast_float64 │ col2__cast_float64 │ col3  │
├──────┼──────┼────────────────────┼────────────────────┼───────┤
│ 10   │ 1.1  │ 10.0               │ 1.1                │ True  │
│ 20   │ 2.2  │ 20.0               │ 2.2                │ False │
│ 30   │ 3.3  │ 30.0               │ 3.3                │ True  │
└──────┴──────┴────────────────────┴────────────────────┴───────┘
Example 2: Cast columns with inplace=False and drop originals
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=False,
...     drop_columns=True
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 3)
┌────────────────────┬────────────────────┬───────┐
│ col1__cast_float64 │ col2__cast_float64 │ col3  │
├────────────────────┼────────────────────┼───────┤
│ 10.0               │ 1.1                │ True  │
│ 20.0               │ 2.2                │ False │
│ 30.0               │ 3.3                │ True  │
└────────────────────┴────────────────────┴───────┘
Example 3: Cast columns in place
>>> cast_columns = CastColumns(
...     subset=["col1", "col2"],
...     dtype=pl.Float64,
...     inplace=True
... )
>>> cast_columns.fit(X)
>>> transformed_X = cast_columns.transform(X)
>>> print(transformed_X)
shape: (3, 3)
┌──────┬──────┬───────┐
│ col1 │ col2 │ col3  │
├──────┼──────┼───────┤
│ 10.0 │ 1.1  │ True  │
│ 20.0 │ 2.2  │ False │
│ 30.0 │ 3.3  │ True  │
└──────┴──────┴───────┘
Notes
When casting to Datetime or Date from String, the transformer handles format parsing automatically
If subset=None, all columns in the DataFrame will be cast to the specified dtype
When inplace=True, the drop_columns parameter is ignored as original columns are replaced
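The naming rules above can be summarised with a small, library-agnostic sketch. The helper `planned_columns` is hypothetical (it is not part of gators) and only mirrors the documented inplace/drop_columns behaviour; the exact ordering of the output columns is an assumption.

```python
# Hypothetical helper mirroring CastColumns' documented naming rules:
# inplace=True keeps the original names; inplace=False adds
# "<col>__cast_<dtype>" columns and optionally drops the originals.
def planned_columns(columns, subset, dtype_name, inplace, drop_columns):
    if inplace:
        return list(columns)
    kept = [c for c in columns if not (drop_columns and c in subset)]
    return kept + [f"{c}__cast_{dtype_name.lower()}" for c in subset]
```

For the examples above, `planned_columns(["col1", "col2", "col3"], ["col1", "col2"], "Float64", inplace=False, drop_columns=True)` yields `col3` plus the two `__cast_float64` columns.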
- class gators.data_cleaning.DropColumns[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Drops specified columns from a DataFrame.
- Parameters:
subset (List[str]) – List of column names to drop.
Examples
Create the DataFrame and an instance of the DropColumns class:
>>> import polars as pl
>>> from gators.data_cleaning import DropColumns
>>> X = pl.DataFrame({"col1": [1, 2, 3],
...                   "col2": ["A", "B", "C"],
...                   "col3": [True, False, True]})
>>> drop_columns = DropColumns(subset=["col1", "col2"])
Fit the transformer:
>>> drop_columns.fit(X)
Transform the DataFrame:
>>> transformed_X = drop_columns.transform(X)
>>> print(transformed_X)
shape: (3, 1)
┌───────┐
│ col3  │
├───────┤
│ True  │
│ False │
│ True  │
└───────┘
- class gators.data_cleaning.DropHighNaNRatio[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Drops columns with a high ratio of NaN values.
- Parameters:
max_ratio (float) – Maximum allowed ratio of NaN/null values. Columns whose NaN ratio reaches max_ratio are flagged for removal.
subset (Optional[List[str]], default=None) – List of columns to check. If None, all columns are checked.
drop_columns (bool, default=True) – If True, drop the flagged columns during transform. If False, only record them and return the DataFrame unchanged.
Examples
Initializing and using the DropHighNaNRatio transformer.
Example when drop_columns is True and subset is None (only col3, with a NaN ratio of 0.5, reaches max_ratio):
>>> import polars as pl
>>> from gators.data_cleaning import DropHighNaNRatio
>>> X = pl.DataFrame({
...     "col1": ["a", None, "b", "c"],
...     "col2": ["x", "x", "x", None],
...     "col3": [1, 2, None, None]
... })
>>> transformer = DropHighNaNRatio(max_ratio=0.5, drop_columns=True)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌──────┬──────┐
│ col1 │ col2 │
│ str  │ str  │
├──────┼──────┤
│ a    │ x    │
│ null │ x    │
│ b    │ x    │
│ c    │ null │
└──────┴──────┘
Example when drop_columns is True and subset is given:
>>> transformer = DropHighNaNRatio(max_ratio=0.5, subset=['col2', 'col3'], drop_columns=True)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌──────┬──────┐
│ col1 │ col2 │
│ str  │ str  │
├──────┼──────┤
│ a    │ x    │
│ null │ x    │
│ b    │ x    │
│ c    │ null │
└──────┴──────┘
Example when drop_columns is False and subset is None:
>>> transformer = DropHighNaNRatio(max_ratio=0.5, drop_columns=False)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ null │ x    │ 2    │
│ b    │ x    │ null │
│ c    │ null │ null │
└──────┴──────┴──────┘
Example when drop_columns is False and subset is given:
>>> transformer = DropHighNaNRatio(max_ratio=0.5, subset=['col2', 'col3'], drop_columns=False)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ null │ x    │ 2    │
│ b    │ x    │ null │
│ c    │ null │ null │
└──────┴──────┴──────┘
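The ratio rule can be illustrated with a minimal, library-agnostic sketch on plain Python columns (a dict of lists rather than a Polars DataFrame). The helper name is hypothetical, and the `ratio >= max_ratio` comparison is an assumption consistent with the examples above.

```python
# Hypothetical sketch of the NaN-ratio rule: flag a column when the share
# of missing values reaches max_ratio.
def high_nan_columns(data, max_ratio, subset=None):
    cols = subset if subset is not None else list(data)
    return [c for c in cols
            if sum(v is None for v in data[c]) / len(data[c]) >= max_ratio]
```

With the DataFrame above, only `col3` (2 of 4 values missing) is flagged at `max_ratio=0.5`.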
- class gators.data_cleaning.DropLowCardinality[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Drops columns with low cardinality.
- Parameters:
min_count (int) – Minimum number of occurrences required for every category in a column. Must be >= 1. Columns containing a category observed fewer than min_count times are dropped.
subset (Optional[List[str]], default=None) – List of columns to check for low cardinality. If None, all string, boolean, and categorical columns are checked.
drop_columns (bool, default=True) – If True, drop the flagged columns during transform. If False, only record them and return the DataFrame unchanged.
Examples
Initializing and using the DropLowCardinality transformer.
Example when drop_columns is True and subset is None:
>>> import polars as pl
>>> from gators.data_cleaning import DropLowCardinality
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> transformer = DropLowCardinality(min_count=2, subset=None, drop_columns=True)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 1)
┌──────┐
│ col3 │
│ i64  │
├──────┤
│ 1    │
│ 2    │
│ 3    │
│ 4    │
└──────┘
Example when drop_columns is True and subset is given:
>>> transformer = DropLowCardinality(min_count=2, subset=['col1'], drop_columns=True)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 2)
┌──────┬──────┐
│ col2 │ col3 │
│ str  │ i64  │
├──────┼──────┤
│ x    │ 1    │
│ x    │ 2    │
│ x    │ 3    │
│ y    │ 4    │
└──────┴──────┘
Example when drop_columns is False and subset is None:
>>> transformer = DropLowCardinality(min_count=2, subset=None, drop_columns=False)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ a    │ x    │ 2    │
│ b    │ x    │ 3    │
│ c    │ y    │ 4    │
└──────┴──────┴──────┘
Example when drop_columns is False and subset is given:
>>> transformer = DropLowCardinality(min_count=2, subset=['col1'], drop_columns=False)
>>> transformer.fit(X)
...
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 │ col2 │ col3 │
│ str  │ str  │ i64  │
├──────┼──────┼──────┤
│ a    │ x    │ 1    │
│ a    │ x    │ 2    │
│ b    │ x    │ 3    │
│ c    │ y    │ 4    │
└──────┴──────┴──────┘
- class gators.data_cleaning.VarianceFilter[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Removes numerical columns with low variance.
- Parameters:
min_var (float) – Minimum variance required for a numerical column to be retained. Columns with variance below min_var are dropped.
subset (Optional[List[str]], default=None) – List of numerical columns to check. If None, all numerical columns are checked.
Examples
Initialize and use VarianceFilter.
Example with all numeric columns:
>>> import polars as pl
>>> from gators.data_cleaning import VarianceFilter
>>> X = pl.DataFrame({
...     "feature1": [1, 2, 3, 4],
...     "feature2": [0.5, 0.5, 0.5, 0.5],  # Low variance
...     "feature3": [5, 6, 7, 8],
...     "label": [0, 1, 0, 1]
... })
>>> transformer = VarianceFilter(min_var=0.1)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌──────────┬──────────┬───────┐
│ feature1 │ feature3 │ label │
│ i64      │ i64      │ i64   │
├──────────┼──────────┼───────┤
│ 1        │ 5        │ 0     │
│ 2        │ 6        │ 1     │
│ 3        │ 7        │ 0     │
│ 4        │ 8        │ 1     │
└──────────┴──────────┴───────┘
Example with specific columns (feature2 is outside the subset, so it is kept despite its zero variance):
>>> X = pl.DataFrame({
...     "feature1": [1, 2, 3, 4],
...     "feature2": [0.5, 0.5, 0.5, 0.5],
...     "feature3": [5, 6, 7, 8],
...     "label": [0, 1, 0, 1]
... })
>>> transformer = VarianceFilter(subset=['feature1', 'feature3'], min_var=0.1)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 4)
┌──────────┬──────────┬──────────┬───────┐
│ feature1 │ feature2 │ feature3 │ label │
│ i64      │ f64      │ i64      │ i64   │
├──────────┼──────────┼──────────┼───────┤
│ 1        │ 0.5      │ 5        │ 0     │
│ 2        │ 0.5      │ 6        │ 1     │
│ 3        │ 0.5      │ 7        │ 0     │
│ 4        │ 0.5      │ 8        │ 1     │
└──────────┴──────────┴──────────┴───────┘
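The variance rule can be sketched library-agnostically. Whether gators uses sample or population variance is an assumption; `statistics.variance` below is the sample variance.

```python
import statistics

# Hypothetical sketch of the variance rule: flag numeric columns whose
# variance falls below min_var.
def low_variance_columns(data, min_var):
    return [c for c, vals in data.items()
            if statistics.variance(vals) < min_var]
```

For the data above, only `feature2` falls below `min_var=0.1` (the binary `label` column has sample variance 1/3 and is kept).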
- class gators.data_cleaning.Replace[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Replaces values in specified columns.
- Parameters:
to_replace (Dict[str, Dict[str, Any]]) – Nested dictionary specifying replacement mappings. Outer keys are column names, inner dictionaries map old values to new values.
inplace (bool, default=True) – If True, replace values in the original columns. If False, create new columns with suffix ‘__replace’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after replacement. Ignored when inplace=True.
Examples
Initializing and using the Replace transformer.
Example with inplace=False and drop_columns=True (values without an entry in the mapping, such as "c", pass through unchanged):
>>> import polars as pl
>>> from gators.data_cleaning import Replace
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo"},
...     "col2": {"x": "x-ray", "y": "yankee"}
... }
>>> transformer = Replace(to_replace=replace_map, inplace=False, drop_columns=True)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌───────────────┬───────────────┬──────┐
│ col1__replace │ col2__replace │ col3 │
│ str           │ str           │ i64  │
├───────────────┼───────────────┼──────┤
│ alpha         │ x-ray         │ 1    │
│ alpha         │ x-ray         │ 2    │
│ bravo         │ x-ray         │ 3    │
│ c             │ yankee        │ 4    │
└───────────────┴───────────────┴──────┘
Example with inplace=True (the default), mapping only a subset of columns:
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo"}
... }
>>> transformer = Replace(to_replace=replace_map)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 3)
┌───────┬──────┬──────┐
│ col1  │ col2 │ col3 │
│ str   │ str  │ i64  │
├───────┼──────┼──────┤
│ alpha │ x    │ 1    │
│ alpha │ x    │ 2    │
│ bravo │ x    │ 3    │
│ c     │ y    │ 4    │
└───────┴──────┴──────┘
Example with inplace=False and drop_columns=False:
>>> X = pl.DataFrame({
...     "col1": ["a", "a", "b", "c"],
...     "col2": ["x", "x", "x", "y"],
...     "col3": [1, 2, 3, 4]
... })
>>> replace_map = {
...     "col1": {"a": "alpha", "b": "bravo"},
...     "col2": {"x": "x-ray", "y": "yankee"}
... }
>>> transformer = Replace(to_replace=replace_map, inplace=False, drop_columns=False)
>>> transformer.fit(X)
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (4, 5)
┌──────┬──────┬──────┬───────────────┬───────────────┐
│ col1 │ col2 │ col3 │ col1__replace │ col2__replace │
│ str  │ str  │ i64  │ str           │ str           │
├──────┼──────┼──────┼───────────────┼───────────────┤
│ a    │ x    │ 1    │ alpha         │ x-ray         │
│ a    │ x    │ 2    │ alpha         │ x-ray         │
│ b    │ x    │ 3    │ bravo         │ x-ray         │
│ c    │ y    │ 4    │ c             │ yankee        │
└──────┴──────┴──────┴───────────────┴───────────────┘
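The per-column replacement rule amounts to a dictionary lookup with fall-through. The helper below is a hypothetical, library-agnostic sketch of that rule.

```python
# Hypothetical sketch: values with an entry in the mapping are swapped;
# everything else passes through unchanged.
def replace_values(column, mapping):
    return [mapping.get(v, v) for v in column]
```

For example, `replace_values(["a", "a", "b", "c"], {"a": "alpha", "b": "bravo"})` returns `["alpha", "alpha", "bravo", "c"]`.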
- class gators.data_cleaning.CorrelationFilter[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Filters out highly correlated numeric columns.
Identifies groups of highly correlated columns and removes all but one from each group, helping to reduce multicollinearity in the dataset.
- Parameters:
subset (Optional[List[str]], default=None) – List of numeric columns to consider for correlation filtering. If None, all numeric columns are used.
max_corr (float) – Maximum allowed absolute correlation between columns. Must be > 0 and <= 1. Columns whose pairwise absolute correlation exceeds max_corr are considered highly correlated.
Examples
>>> import polars as pl
>>> from gators.data_cleaning import CorrelationFilter
>>> X = pl.DataFrame({
...     'A': [1, 2, 3, 4],
...     'B': [4, 3, 2, 1],
...     'C': [1, 2, 1, 2],
...     'y': [1, 1, 0, 0]
... })
>>> # Example 1
>>> corr_filter = CorrelationFilter(max_corr=0.9)
>>> _ = corr_filter.fit(X)
>>> result = corr_filter.transform(X)
>>> result
shape: (4, 2)
┌─────┬─────┐
│ C   │ y   │
│ i64 │ i64 │
├─────┼─────┤
│ 1   │ 1   │
│ 2   │ 1   │
│ 1   │ 0   │
│ 2   │ 0   │
└─────┴─────┘
>>> # Example 2
>>> corr_filter = CorrelationFilter(subset=['A', 'B'], max_corr=1)
>>> _ = corr_filter.fit(X)
>>> result = corr_filter.transform(X)
>>> result
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A   │ B   │ C   │ y   │
│ i64 │ i64 │ i64 │ i64 │
├─────┼─────┼─────┼─────┤
│ 1   │ 4   │ 1   │ 1   │
│ 2   │ 3   │ 2   │ 1   │
│ 3   │ 2   │ 1   │ 0   │
│ 4   │ 1   │ 2   │ 0   │
└─────┴─────┴─────┴─────┘
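The quantity being thresholded against max_corr is the pairwise Pearson correlation. A self-contained sketch (whether gators computes it exactly this way is an assumption):

```python
from statistics import mean

# Pearson correlation coefficient between two equal-length sequences.
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

For the data above, `pearson([1, 2, 3, 4], [4, 3, 2, 1])` is exactly -1.0, so columns A and B exceed `max_corr=0.9` in absolute value, while A and C (|r| ≈ 0.45) do not.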
- class gators.data_cleaning.OutlierFilter[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Removes or caps outliers in numerical columns using various methods.
Detects outliers using IQR, Z-score, or percentile methods and either removes rows or caps values. Essential for tree-based models to prevent splits dominated by extreme values.
Supports class-aware outlier detection for imbalanced datasets to avoid removing minority class examples that appear as statistical outliers.
- Parameters:
subset (Optional[List[str]], default=None) – List of numeric columns to check for outliers. If None, all numeric columns are checked.
method (str, default='iqr') –
Method for outlier detection:
‘iqr’: Interquartile Range method (values outside [Q1 - k*IQR, Q3 + k*IQR])
‘zscore’: Z-score method (values with absolute z-score > threshold)
‘percentile’: Percentile method (values outside specified percentiles)
threshold (float, default=1.5) –
Threshold parameter for outlier detection:
For ‘iqr’: multiplier for the IQR (typically 1.5 or 3.0)
For ‘zscore’: z-score threshold (typically 3.0)
Not used for the ‘percentile’ method
lower_percentile (float, default=0.01) – Lower percentile for outlier detection (only for ‘percentile’ method). Values below this percentile are considered outliers.
upper_percentile (float, default=0.99) – Upper percentile for outlier detection (only for ‘percentile’ method). Values above this percentile are considered outliers.
action (str, default='remove') –
Action to take on outliers:
‘remove’: Remove rows containing outliers
‘cap’: Cap outliers to boundary values
class_aware (bool, default=False) – Whether to detect outliers separately within each class. Prevents removing minority class examples that appear as outliers when considering all data together. Requires passing target column name to fit(). Recommended for imbalanced classification tasks.
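As a sketch of the ‘iqr’ bounds described above (the quantile interpolation gators uses is an assumption; `statistics.quantiles` with its default ‘exclusive’ method is used here):

```python
import statistics

# Compute the IQR outlier bounds [Q1 - k*IQR, Q3 + k*IQR]; values
# outside these bounds are treated as outliers.
def iqr_bounds(values, k=1.5):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

For the ages used in Example 1 below, 200 falls above the upper bound while every other value stays inside.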
Examples
Example 1: IQR method with row removal
>>> from gators.data_cleaning import OutlierFilter
>>> import polars as pl
>>> X = pl.DataFrame({
...     'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 200],  # 200 is an outlier
...     'income': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
... })
>>> filter_iqr = OutlierFilter(
...     subset=['age'],
...     method='iqr',
...     threshold=1.5,
...     action='remove'
... )
>>> filter_iqr.fit(X)
>>> result = filter_iqr.transform(X)
>>> print(result)
shape: (9, 2)
┌─────┬────────┐
│ age ┆ income │
│ --- ┆ ---    │
│ i64 ┆ i64    │
├─────┼────────┤
│ 25  ┆ 30000  │
│ 30  ┆ 35000  │
│ ... ┆ ...    │
│ 65  ┆ 70000  │
└─────┴────────┘
Example 2: Z-score method with capping
>>> filter_zscore = OutlierFilter(
...     subset=['age'],
...     method='zscore',
...     threshold=3.0,
...     action='cap'
... )
>>> filter_zscore.fit(X)
>>> result = filter_zscore.transform(X)
Example 3: Percentile method
>>> filter_percentile = OutlierFilter(
...     subset=['income'],
...     method='percentile',
...     lower_percentile=0.05,
...     upper_percentile=0.95,
...     action='remove'
... )
>>> filter_percentile.fit(X)
>>> result = filter_percentile.transform(X)
Example 4: Class-aware mode for imbalanced datasets
>>> X = pl.DataFrame({
...     'transaction_amount': [100, 120, 110, 105, 115, 5000, 4800, 4900],
...     'is_fraud': [0, 0, 0, 0, 0, 1, 1, 1]
... })
>>> filter_basic = OutlierFilter(
...     subset=['transaction_amount'],
...     method='iqr',
...     action='remove',
...     class_aware=False
... )
>>> filter_basic.fit(X)
>>> result_basic = filter_basic.transform(X)
>>> print(len(result_basic))  # May remove fraud examples!
5
>>> filter_aware = OutlierFilter(
...     subset=['transaction_amount'],
...     method='iqr',
...     action='remove',
...     class_aware=True
... )
>>> filter_aware.fit(X, y='is_fraud')
>>> result_aware = filter_aware.transform(X)
>>> print(len(result_aware))  # Preserves the minority class!
8
- fit(X, y=None)[source]#
Fit the transformer by computing outlier bounds.
- Parameters:
X (DataFrame) – Input DataFrame.
y (str | None) – Name of the target column in X. Required when class_aware=True.
- Returns:
Fitted transformer instance.
- Return type:
OutlierFilter
- Raises:
ValueError – If class_aware=True and y is None. If y is provided but not found in X columns.
- class gators.data_cleaning.DropDuplicateColumns[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Removes duplicate columns from the DataFrame.
Identifies and removes columns that have identical values across all rows. This is useful for reducing dimensionality and removing redundant features that don’t add predictive value.
- Parameters:
keep (str, default='first') –
Strategy for keeping duplicate columns:
‘first’: Keep first occurrence of duplicate columns
‘last’: Keep last occurrence of duplicate columns
Examples
Example 1: Remove duplicate columns (keep first)
>>> from gators.data_cleaning import DropDuplicateColumns
>>> import polars as pl
>>> X = pl.DataFrame({
...     'A': [1, 2, 3, 4],
...     'B': [5, 6, 7, 8],
...     'C': [1, 2, 3, 4],    # Duplicate of A
...     'D': [9, 10, 11, 12],
...     'E': [5, 6, 7, 8]     # Duplicate of B
... })
>>> remover = DropDuplicateColumns(keep='first')
>>> remover.fit(X)
>>> result = remover.transform(X)
>>> print(result)
shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ D   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
├─────┼─────┼─────┤
│ 1   ┆ 5   ┆ 9   │
│ 2   ┆ 6   ┆ 10  │
│ 3   ┆ 7   ┆ 11  │
│ 4   ┆ 8   ┆ 12  │
└─────┴─────┴─────┘
Example 2: Remove duplicate columns (keep last)
>>> X = pl.DataFrame({
...     'feature_1': [1.0, 2.0, 3.0],
...     'feature_2': [4.0, 5.0, 6.0],
...     'feature_3': [1.0, 2.0, 3.0],    # Duplicate of feature_1
...     'target': [0, 1, 0]
... })
>>> remover = DropDuplicateColumns(keep='last')
>>> remover.fit(X)
>>> print(f"Columns to drop: {remover.columns_to_drop_}")
Columns to drop: ['feature_1']
>>> print(f"Column groups: {remover.column_groups_}")
Column groups: {'feature_3': ['feature_1']}
>>> result = remover.transform(X)
>>> print(result)
shape: (3, 3)
┌───────────┬───────────┬────────┐
│ feature_2 ┆ feature_3 ┆ target │
│ ---       ┆ ---       ┆ ---    │
│ f64       ┆ f64       ┆ i64    │
├───────────┼───────────┼────────┤
│ 4.0       ┆ 1.0       ┆ 0      │
│ 5.0       ┆ 2.0       ┆ 1      │
│ 6.0       ┆ 3.0       ┆ 0      │
└───────────┴───────────┴────────┘
Example 3: Check duplicate groups
>>> X = pl.DataFrame({
...     'a': [1, 2, 3],
...     'b': [1, 2, 3],
...     'c': [1, 2, 3],
...     'd': [4, 5, 6]
... })
>>> remover = DropDuplicateColumns()
>>> remover.fit(X)
>>> print(f"Kept column groups: {remover.column_groups_}")
Kept column groups: {'a': ['b', 'c']}
>>> result = remover.transform(X)
>>> print(result.columns)
['a', 'd']
Example 4: No duplicates
>>> X = pl.DataFrame({
...     'x': [1, 2, 3],
...     'y': [4, 5, 6],
...     'z': [7, 8, 9]
... })
>>> remover = DropDuplicateColumns()
>>> remover.fit(X)
>>> print(f"Columns to drop: {remover.columns_to_drop_}")
Columns to drop: []
>>> result = remover.transform(X)
>>> print(result.shape)
(3, 3)
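Duplicate-column detection amounts to grouping columns with identical values. The helper below is a hypothetical, library-agnostic sketch of the keep='first' case; keep='last' would retain the last name in each group instead.

```python
# Hypothetical sketch: group columns with identical values, keeping the
# first column of each group and listing its duplicates.
def duplicate_groups(data):
    groups = {}
    for name, values in data.items():
        groups.setdefault(tuple(values), []).append(name)
    return {cols[0]: cols[1:] for cols in groups.values() if len(cols) > 1}
```

Mirroring Example 3, `duplicate_groups({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [1, 2, 3], 'd': [4, 5, 6]})` returns `{'a': ['b', 'c']}`.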
- fit(X, y=None)[source]#
Fit the transformer by identifying duplicate columns.
- Parameters:
X (DataFrame) – Input DataFrame.
y (Series | None) – Target variable. Not used; present for API compatibility.
- Returns:
Fitted transformer instance.
- Return type:
DropDuplicateColumns
- Raises:
ValueError – If keep parameter is not ‘first’ or ‘last’.
- class gators.data_cleaning.DropDuplicateRows[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Removes duplicate rows from the DataFrame.
Identifies and removes duplicate rows based on all columns or a subset of columns. Critical for preventing data leakage and ensuring data quality.
- Parameters:
subset (Optional[List[str]], default=None) – List of columns to consider for identifying duplicates. If None, all columns are used.
keep (str, default='first') –
Strategy for keeping duplicates:
‘first’: Keep first occurrence, drop subsequent duplicates
‘last’: Keep last occurrence, drop previous duplicates
‘none’: Drop all duplicates (keep no occurrences)
Examples
Example 1: Remove full duplicate rows (keep first)
>>> from gators.data_cleaning import DropDuplicateRows
>>> import polars as pl
>>> X = pl.DataFrame({
...     'id': [1, 2, 2, 3, 4, 4],
...     'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],
...     'age': [25, 30, 30, 35, 40, 40]
... })
>>> remover = DropDuplicateRows(keep='first')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (4, 3)
┌─────┬─────────┬─────┐
│ id  ┆ name    ┆ age │
│ --- ┆ ---     ┆ --- │
│ i64 ┆ str     ┆ i64 │
├─────┼─────────┼─────┤
│ 1   ┆ Alice   ┆ 25  │
│ 2   ┆ Bob     ┆ 30  │
│ 3   ┆ Charlie ┆ 35  │
│ 4   ┆ David   ┆ 40  │
└─────┴─────────┴─────┘
Example 2: Remove duplicates based on subset (keep last)
>>> X = pl.DataFrame({
...     'id': [1, 2, 3, 4],
...     'name': ['Alice', 'Bob', 'Alice', 'Bob'],
...     'score': [85, 90, 88, 92]
... })
>>> remover = DropDuplicateRows(subset=['name'], keep='last')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (2, 3)
┌─────┬───────┬───────┐
│ id  ┆ name  ┆ score │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ i64   │
├─────┼───────┼───────┤
│ 3   ┆ Alice ┆ 88    │
│ 4   ┆ Bob   ┆ 92    │
└─────┴───────┴───────┘
Example 3: Drop all duplicate occurrences (keep none)
>>> X = pl.DataFrame({
...     'user_id': [1, 2, 2, 3, 4, 4, 5],
...     'action': ['login', 'view', 'view', 'click', 'buy', 'buy', 'logout']
... })
>>> remover = DropDuplicateRows(subset=['user_id'], keep='none')
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 2)
┌─────────┬────────┐
│ user_id ┆ action │
│ ---     ┆ ---    │
│ i64     ┆ str    │
├─────────┼────────┤
│ 1       ┆ login  │
│ 3       ┆ click  │
│ 5       ┆ logout │
└─────────┴────────┘
Example 4: Check for duplicates without subset
>>> X = pl.DataFrame({
...     'a': [1, 1, 2],
...     'b': [10, 10, 20],
...     'c': [100, 100, 200]
... })
>>> remover = DropDuplicateRows()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
├─────┼─────┼─────┤
│ 1   ┆ 10  ┆ 100 │
│ 2   ┆ 20  ┆ 200 │
└─────┴─────┴─────┘
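The three keep strategies can be sketched on plain Python rows. The helper below is hypothetical (not part of gators); `key` extracts the subset of fields used to identify duplicates.

```python
from collections import Counter

# Hypothetical sketch of the 'first'/'last'/'none' keep strategies.
def drop_duplicate_rows(rows, key, keep="first"):
    counts = Counter(key(r) for r in rows)
    if keep == "none":
        # Keep only rows whose key occurs exactly once.
        return [r for r in rows if counts[key(r)] == 1]
    ordered = rows if keep == "first" else rows[::-1]
    seen, out = set(), []
    for r in ordered:
        k = key(r)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out if keep == "first" else out[::-1]
```

Mirroring Example 3, dropping on `user_id` with keep="none" leaves only the rows for users 1, 3, and 5.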
- fit(X, y=None)[source]#
Fit the transformer by validating parameters.
- Parameters:
X (DataFrame) – Input DataFrame.
y (Series | None) – Target variable. Not used; present for API compatibility.
- Returns:
Fitted transformer instance.
- Return type:
DropDuplicateRows
- Raises:
ValueError – If subset columns are specified but not found in DataFrame.
- class gators.data_cleaning.DropConstantColumns[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Removes columns that have only a single unique value (constant columns).
Identifies and removes columns with zero information content. More specific than VarianceFilter (which only works on numerics) and faster than variance calculation. Handles both numeric and categorical constant columns.
- Parameters:
subset (Optional[List[str]], default=None) – List of columns to check for constant values. If None, all columns are checked.
include_na (bool, default=True) – Whether to count NaN/null as a unique value. If True, a column with all NaN is considered constant. If False, NaN values are ignored when counting unique values.
Examples
Example 1: Remove constant numeric column
>>> from gators.data_cleaning import DropConstantColumns
>>> import polars as pl
>>> X = pl.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'constant_num': [42, 42, 42, 42, 42],
...     'varying': [10, 20, 30, 40, 50]
... })
>>> remover = DropConstantColumns()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (5, 2)
┌─────┬─────────┐
│ id  ┆ varying │
│ --- ┆ ---     │
│ i64 ┆ i64     │
├─────┼─────────┤
│ 1   ┆ 10      │
│ 2   ┆ 20      │
│ 3   ┆ 30      │
│ 4   ┆ 40      │
│ 5   ┆ 50      │
└─────┴─────────┘
Example 2: Remove constant categorical column
>>> X = pl.DataFrame({
...     'country': ['USA', 'USA', 'USA', 'USA'],
...     'city': ['NYC', 'LA', 'Chicago', 'Boston'],
...     'status': ['active', 'active', 'active', 'active']
... })
>>> remover = DropConstantColumns()
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (4, 1)
┌─────────┐
│ city    │
│ ---     │
│ str     │
├─────────┤
│ NYC     │
│ LA      │
│ Chicago │
│ Boston  │
└─────────┘
Example 3: Handle NaN values (with include_na=True)
>>> X = pl.DataFrame({
...     'all_null': [None, None, None],
...     'mixed': [1, None, 1],
...     'varying': [1, 2, 3]
... })
>>> remover = DropConstantColumns(include_na=True)
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 2)
┌───────┬─────────┐
│ mixed ┆ varying │
│ ---   ┆ ---     │
│ i64   ┆ i64     │
├───────┼─────────┤
│ 1     ┆ 1       │
│ null  ┆ 2       │
│ 1     ┆ 3       │
└───────┴─────────┘
Example 4: Handle NaN values (with include_na=False)
>>> remover = DropConstantColumns(include_na=False)
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 1)
┌─────────┐
│ varying │
│ ---     │
│ i64     │
├─────────┤
│ 1       │
│ 2       │
│ 3       │
└─────────┘
Example 5: Subset of columns
>>> X = pl.DataFrame({
...     'col1': [1, 1, 1],
...     'col2': [5, 5, 5],
...     'col3': [10, 20, 30]
... })
>>> remover = DropConstantColumns(subset=['col1', 'col2'])
>>> result = remover.fit_transform(X)
>>> print(result)
shape: (3, 1)
┌──────┐
│ col3 │
│ ---  │
│ i64  │
├──────┤
│ 10   │
│ 20   │
│ 30   │
└──────┘
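The constant-column rule, including the include_na switch, can be sketched library-agnostically (the helper name is hypothetical and plain lists stand in for Polars columns):

```python
# Hypothetical sketch of the constant-column rule: a column is constant
# when it has at most one distinct value (optionally ignoring None).
def constant_columns(data, include_na=True):
    flagged = []
    for name, values in data.items():
        vals = values if include_na else [v for v in values if v is not None]
        if len(set(vals)) <= 1:  # 0 or 1 distinct values -> constant
            flagged.append(name)
    return flagged
```

Note that with include_na=False an all-null column ends up empty and is still treated as constant, matching Examples 3 and 4.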
- class gators.data_cleaning.HighCardinalityFilter[source]#
Bases: BaseModel, BaseEstimator, TransformerMixin
Removes columns with too many unique values (high cardinality).
Identifies and removes columns with excessive cardinality, which can cause issues for tree-based models (memory, overfitting) and create sparse encodings. Common use case: remove ID-like columns, timestamps, or free-text fields.
Opposite of DropLowCardinality. Can filter by absolute count threshold or by ratio of unique values to total rows.
- Parameters:
subset (Optional[List[str]], default=None) – List of columns to check for high cardinality. If None, all columns are checked.
max_unique (Optional[int], default=None) – Maximum number of unique values allowed. Columns with more unique values will be removed. If None, no absolute threshold is applied.
max_ratio (Optional[float], default=None) – Maximum ratio of unique values to total rows. Must be between 0 and 1. For example, 0.9 means columns where >90% of rows are unique will be removed. If None, no ratio threshold is applied.
ignore_na (bool, default=True) – Whether to ignore NaN/null values when counting unique values. If True, NaN is not counted as a unique value.
Examples
Example 1: Remove by absolute count
>>> from gators.data_cleaning import HighCardinalityFilter
>>> import polars as pl
>>> X = pl.DataFrame({
...     'user_id': range(1000),
...     'country': ['USA'] * 500 + ['UK'] * 500,
...     'transaction_id': [f'tx_{i}' for i in range(1000)]
... })
>>> filter = HighCardinalityFilter(max_unique=100)
>>> result = filter.fit_transform(X)
>>> print(result)
shape: (1000, 1)
┌─────────┐
│ country │
│ ---     │
│ str     │
├─────────┤
│ USA     │
│ USA     │
│ ...     │
│ UK      │
│ UK      │
└─────────┘
Example 2: Remove by ratio
>>> X = pl.DataFrame({
...     'id': range(100),
...     'category': ['A', 'B', 'C'] * 33 + ['A'],
...     'subcategory': ['X', 'Y'] * 50
... })
>>> filter = HighCardinalityFilter(max_ratio=0.95)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['category', 'subcategory']
Example 3: Combined thresholds
>>> X = pl.DataFrame({
...     'col1': range(50),              # 50 unique, ratio=1.0
...     'col2': list(range(25)) * 2,    # 25 unique, ratio=0.5
...     'col3': ['A', 'B'] * 25         # 2 unique, ratio=0.04
... })
>>> filter = HighCardinalityFilter(max_unique=30, max_ratio=0.8)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['col2', 'col3']
Example 4: Handling NaN
>>> X = pl.DataFrame({
...     'col1': [1, 2, 3, None, None] * 20,    # 3 unique + NaN
...     'col2': list(range(90)) + [None] * 10  # 90 unique + NaN
... })
>>> filter = HighCardinalityFilter(max_unique=50, ignore_na=True)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['col1']
Example 5: Subset of columns
>>> X = pl.DataFrame({
...     'id1': range(100),
...     'id2': range(100),
...     'feature': ['A', 'B'] * 50
... })
>>> filter = HighCardinalityFilter(subset=['id1', 'id2'], max_unique=50)
>>> result = filter.fit_transform(X)
>>> print(result.columns)
['feature']
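Both thresholds described above can be combined in one library-agnostic sketch (the helper name is hypothetical, and plain lists stand in for Polars columns):

```python
# Hypothetical sketch: flag a column when it exceeds max_unique or
# max_ratio, optionally ignoring None when counting unique values.
def high_cardinality_columns(data, max_unique=None, max_ratio=None, ignore_na=True):
    flagged = []
    for name, values in data.items():
        vals = [v for v in values if v is not None] if ignore_na else values
        n = len(set(vals))
        if max_unique is not None and n > max_unique:
            flagged.append(name)
        elif max_ratio is not None and n / len(values) > max_ratio:
            flagged.append(name)
    return flagged
```

Mirroring Example 3: `col1` is flagged by the absolute threshold (50 > 30), while `col2` (25 unique, ratio 0.5) and `col3` (2 unique, ratio 0.04) pass both checks.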