gators.clippers package
Module contents
Clipping transformers for outlier handling.
- class gators.clippers.CustomClipper
Bases: BaseModel, BaseEstimator, TransformerMixin
Clip column values using custom lower and upper bounds.
This transformer allows you to specify custom clipping bounds for each column independently. You can specify only lower bounds, only upper bounds, or both for different columns. Columns not specified in either dictionary are left unchanged.
- Parameters:
lower_bounds (dict of str to float, optional) – Dictionary mapping column names to their lower bounds. Values below the lower bound will be clipped to the lower bound. Default is None (no lower bounds).
upper_bounds (dict of str to float, optional) – Dictionary mapping column names to their upper bounds. Values above the upper bound will be clipped to the upper bound. Default is None (no upper bounds).
inplace (bool, default=True) – If True, clip values in the original columns. If False, create new columns with the suffix ‘__clip_custom’.
drop_columns (bool, default=True) – If True and inplace=False, drop the original columns after clipping. If False and inplace=False, keep both original and clipped columns. Ignored if inplace=True.
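The interaction between inplace and drop_columns can be sketched in plain Python. The helper below is hypothetical (output_columns is not part of gators) and only models which column names come out of a transform; the order of the returned names is a choice of this sketch, not documented behavior.

```python
def output_columns(all_columns, clipped, suffix, inplace=True, drop_columns=True):
    """Return the column names a clipper would emit (illustrative only)."""
    if inplace:
        # Clipped values overwrite the originals, so the schema is unchanged.
        return list(all_columns)
    new_columns = [f"{name}{suffix}" for name in clipped]
    if drop_columns:
        # Clipped originals are removed; columns that were never clipped stay.
        kept = [name for name in all_columns if name not in clipped]
    else:
        # Originals and their clipped copies are both kept.
        kept = list(all_columns)
    return kept + new_columns


print(output_columns(["age", "salary"], ["age"], "__clip_custom"))
# ['age', 'salary']
print(output_columns(["age", "salary"], ["age"], "__clip_custom", inplace=False))
# ['salary', 'age__clip_custom']
print(output_columns(["age", "salary"], ["age"], "__clip_custom",
                     inplace=False, drop_columns=False))
# ['age', 'salary', 'age__clip_custom']
```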
- _columns
List of columns that will be clipped (union of lower_bounds and upper_bounds keys).
- _bounds_map
Mapping of column names to (lower_bound, upper_bound) tuples. None values indicate no bound on that side.
- Type:
dict of str to tuple
Examples
>>> import polars as pl
>>> from gators.clippers import CustomClipper
Clip with both lower and upper bounds:
>>> X = pl.DataFrame({
...     "age": [-5, 25, 150],
...     "salary": [-1000, 50000, 2000000]
... })
>>> clipper = CustomClipper(
...     lower_bounds={"age": 0, "salary": 0},
...     upper_bounds={"age": 120, "salary": 1000000}
... )
>>> clipper.fit_transform(X)
shape: (3, 2)
┌───────┬───────────┐
│ age   ┆ salary    │
│ ---   ┆ ---       │
│ f64   ┆ f64       │
╞═══════╪═══════════╡
│ 0.0   ┆ 0.0       │
│ 25.0  ┆ 50000.0   │
│ 120.0 ┆ 1000000.0 │
└───────┴───────────┘
Clip with only lower bounds:
>>> clipper = CustomClipper(lower_bounds={"age": 0})
>>> clipper.fit_transform(X)
shape: (3, 2)
┌───────┬───────────┐
│ age   ┆ salary    │
│ ---   ┆ ---       │
│ f64   ┆ f64       │
╞═══════╪═══════════╡
│ 0.0   ┆ -1000.0   │
│ 25.0  ┆ 50000.0   │
│ 150.0 ┆ 2000000.0 │
└───────┴───────────┘
Create new columns instead of modifying in place:
>>> clipper = CustomClipper(
...     lower_bounds={"age": 0},
...     upper_bounds={"age": 120},
...     inplace=False
... )
>>> clipper.fit_transform(X)
shape: (3, 2)
┌──────────────────┬───────────┐
│ age__clip_custom ┆ salary    │
│ ---              ┆ ---       │
│ f64              ┆ f64       │
╞══════════════════╪═══════════╡
│ 0.0              ┆ -1000.0   │
│ 25.0             ┆ 50000.0   │
│ 120.0            ┆ 2000000.0 │
└──────────────────┴───────────┘
Notes
Non-numeric columns are automatically ignored.
Columns not specified in either bounds dictionary are left unchanged.
You can specify bounds for only some columns while leaving others untouched.
If a column appears in both dictionaries, both bounds are applied.
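The notes above can be illustrated without polars. The sketch below uses hypothetical helpers (build_bounds_map, clip_value, clip_table) to show how two optional dictionaries might be merged into a (lower, upper) map like the _bounds_map attribute, with None meaning "no bound on that side". It is an assumption about the logic, not the library's implementation.

```python
def build_bounds_map(lower_bounds=None, upper_bounds=None):
    """Merge two optional dicts into {column: (lower, upper)}."""
    lower_bounds = lower_bounds or {}
    upper_bounds = upper_bounds or {}
    columns = sorted(set(lower_bounds) | set(upper_bounds))
    # A missing key yields None, i.e. no bound on that side.
    return {c: (lower_bounds.get(c), upper_bounds.get(c)) for c in columns}


def clip_value(value, lower, upper):
    """Clip one value, skipping any side whose bound is None."""
    if lower is not None and value < lower:
        return lower
    if upper is not None and value > upper:
        return upper
    return value


def clip_table(table, bounds_map):
    """Clip listed columns of a dict-of-lists; other columns are copied as-is."""
    return {
        name: [clip_value(v, *bounds_map[name]) for v in values]
        if name in bounds_map else list(values)
        for name, values in table.items()
    }


table = {"age": [-5, 25, 150], "salary": [-1000, 50000, 2000000]}
bounds = build_bounds_map(lower_bounds={"age": 0},
                          upper_bounds={"age": 120, "salary": 1000000})
print(bounds)
# {'age': (0, 120), 'salary': (None, 1000000)}
print(clip_table(table, bounds))
# {'age': [0, 25, 120], 'salary': [-1000, 50000, 1000000]}
```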
See also
GaussianClipper – Clip values based on mean and standard deviation.
QuantileClipper – Clip values based on quantiles.
MADClipper – Clip values based on median absolute deviation.
IQRClipper – Clip values based on interquartile range.
- fit(X, y=None)
Fit the clipper by identifying columns to clip.
- Parameters:
X (DataFrame) – Input DataFrame.
y (Series | None) – Target values (ignored, present for sklearn compatibility).
- Returns:
self – Fitted clipper.
- Return type:
CustomClipper
ValueError – If no bounds are specified or if specified columns don’t exist in X.
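The two error conditions listed for fit can be sketched as a standalone check. validate_bounds is a hypothetical helper and the error messages are invented for illustration; only the conditions themselves (no bounds given, or a bounded column missing from X) come from the text above.

```python
def validate_bounds(available_columns, lower_bounds=None, upper_bounds=None):
    """Return the sorted columns to clip, or raise ValueError."""
    requested = set(lower_bounds or {}) | set(upper_bounds or {})
    if not requested:
        # Neither dictionary names a column, so there is nothing to clip.
        raise ValueError("Specify at least one of lower_bounds or upper_bounds.")
    missing = sorted(requested - set(available_columns))
    if missing:
        raise ValueError(f"Columns not found in X: {missing}")
    return sorted(requested)


print(validate_bounds(["age", "salary"], lower_bounds={"age": 0}))
# ['age']
```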
- class gators.clippers.GaussianClipper
Bases: BaseModel, BaseEstimator, TransformerMixin
Clip numeric values to mean ± n standard deviations.
This transformer caps values that are smaller than mean - n*std or larger than mean + n*std, where n is the number of standard deviations (n_sigmas). Values outside this range are clipped to the boundary values.
- Parameters:
n_sigmas (int, default=3) – Number of standard deviations to use for clipping bounds. Must be a positive integer.
subset (Optional[List[str]], default=None) – List of numeric columns to clip. If None, all numeric columns are selected.
inplace (bool, default=True) – If True, clip values in the original columns. If False, create new columns with suffix ‘__clip_gaussian’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after clipping. Ignored when inplace=True.
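As a rough illustration of the mean ± n_sigmas*std rule, the sketch below computes the bounds for one column with the standard library. gaussian_bounds and clip are hypothetical helpers. The use of the sample standard deviation (ddof=1) is an assumption; the library may compute it differently.

```python
import statistics


def gaussian_bounds(values, n_sigmas=3):
    """Return (mean - n_sigmas*std, mean + n_sigmas*std) for one column."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return mean - n_sigmas * std, mean + n_sigmas * std


def clip(values, lower, upper):
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


values = [1.0] * 8 + [10.0]          # mean is 2.0, sample std is 3.0
lower, upper = gaussian_bounds(values, n_sigmas=2)
print((lower, upper))                # (-4.0, 8.0)
print(clip(values, lower, upper)[-1])  # 8.0
```

With n_sigmas=3 the same data gives bounds of (-7.0, 11.0), so nothing is clipped. A single extreme value inflates the standard deviation it is measured against, which is why the robust IQR and MAD variants exist.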
Examples
>>> import polars as pl
>>> from gators.clippers import GaussianClipper
>>> # Sample DataFrame with outliers
>>> X = pl.DataFrame({
...     'A': [1.0, 2.0, 3.0, 4.0, 100.0],  # 100.0 is an outlier
...     'B': [-50.0, 5.0, 6.0, 7.0, 8.0],  # -50.0 is an outlier
...     'C': [10.0, 20.0, 30.0, 40.0, 50.0]
... })
>>> # Clip using 3 standard deviations (default)
>>> clipper = GaussianClipper(inplace=False)
>>> clipper.fit(X)
GaussianClipper(n_sigmas=3, subset=['A', 'B', 'C'], drop_columns=True, inplace=False)
>>> transformed_X = clipper.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────────────┬──────────────────┬──────────────────┐
│ A__clip_gaussian ┆ B__clip_gaussian ┆ C__clip_gaussian │
│ ---              ┆ ---              ┆ ---              │
│ f64              ┆ f64              ┆ f64              │
╞══════════════════╪══════════════════╪══════════════════╡
│ 1.0              ┆ -24.8            ┆ 10.0             │
│ 2.0              ┆ 5.0              ┆ 20.0             │
│ 3.0              ┆ 6.0              ┆ 30.0             │
│ 4.0              ┆ 7.0              ┆ 40.0             │
│ 42.8             ┆ 8.0              ┆ 50.0             │
└──────────────────┴──────────────────┴──────────────────┘
>>> # Clip using 2 standard deviations (more aggressive)
>>> clipper_2sigma = GaussianClipper(n_sigmas=2, inplace=False)
>>> clipper_2sigma.fit(X)
GaussianClipper(n_sigmas=2, subset=['A', 'B', 'C'], drop_columns=True, inplace=False)
>>> transformed_X_2sigma = clipper_2sigma.transform(X)
>>> print(transformed_X_2sigma)
shape: (5, 3)
┌──────────────────┬──────────────────┬──────────────────┐
│ A__clip_gaussian ┆ B__clip_gaussian ┆ C__clip_gaussian │
│ ---              ┆ ---              ┆ ---              │
│ f64              ┆ f64              ┆ f64              │
╞══════════════════╪══════════════════╪══════════════════╡
│ 1.0              ┆ -16.5            ┆ 10.0             │
│ 2.0              ┆ 5.0              ┆ 20.0             │
│ 3.0              ┆ 6.0              ┆ 30.0             │
│ 4.0              ┆ 7.0              ┆ 40.0             │
│ 28.5             ┆ 8.0              ┆ 50.0             │
└──────────────────┴──────────────────┴──────────────────┘
>>> # Clip with drop_columns=False to keep original columns
>>> clipper_no_drop = GaussianClipper(n_sigmas=3, drop_columns=False, inplace=False)
>>> clipper_no_drop.fit(X)
GaussianClipper(n_sigmas=3, subset=['A', 'B', 'C'], drop_columns=False, inplace=False)
>>> transformed_X_no_drop = clipper_no_drop.transform(X)
>>> print(transformed_X_no_drop)
shape: (5, 6)
┌───────┬───────┬──────┬──────────────────┬──────────────────┬──────────────────┐
│ A     ┆ B     ┆ C    ┆ A__clip_gaussian ┆ B__clip_gaussian ┆ C__clip_gaussian │
│ ---   ┆ ---   ┆ ---  ┆ ---              ┆ ---              ┆ ---              │
│ f64   ┆ f64   ┆ f64  ┆ f64              ┆ f64              ┆ f64              │
╞═══════╪═══════╪══════╪══════════════════╪══════════════════╪══════════════════╡
│ 1.0   ┆ -50.0 ┆ 10.0 ┆ 1.0              ┆ -24.8            ┆ 10.0             │
│ 2.0   ┆ 5.0   ┆ 20.0 ┆ 2.0              ┆ 5.0              ┆ 20.0             │
│ 3.0   ┆ 6.0   ┆ 30.0 ┆ 3.0              ┆ 6.0              ┆ 30.0             │
│ 4.0   ┆ 7.0   ┆ 40.0 ┆ 4.0              ┆ 7.0              ┆ 40.0             │
│ 100.0 ┆ 8.0   ┆ 50.0 ┆ 42.8             ┆ 8.0              ┆ 50.0             │
└───────┴───────┴──────┴──────────────────┴──────────────────┴──────────────────┘
>>> # Clip only a subset of columns
>>> clipper_subset = GaussianClipper(n_sigmas=3, subset=['A'], inplace=False)
>>> clipper_subset.fit(X)
GaussianClipper(n_sigmas=3, subset=['A'], drop_columns=True, inplace=False)
>>> transformed_X_subset = clipper_subset.transform(X)
>>> print(transformed_X_subset)
shape: (5, 3)
┌───────┬──────┬──────────────────┐
│ B     ┆ C    ┆ A__clip_gaussian │
│ ---   ┆ ---  ┆ ---              │
│ f64   ┆ f64  ┆ f64              │
╞═══════╪══════╪══════════════════╡
│ -50.0 ┆ 10.0 ┆ 1.0              │
│ 5.0   ┆ 20.0 ┆ 2.0              │
│ 6.0   ┆ 30.0 ┆ 3.0              │
│ 7.0   ┆ 40.0 ┆ 4.0              │
│ 8.0   ┆ 50.0 ┆ 42.8             │
└───────┴──────┴──────────────────┘
>>> # Clip inplace (modifies original columns)
>>> clipper_inplace = GaussianClipper(n_sigmas=3, inplace=True)
>>> clipper_inplace.fit(X)
GaussianClipper(n_sigmas=3, subset=['A', 'B', 'C'], drop_columns=True, inplace=True)
>>> transformed_X_inplace = clipper_inplace.transform(X)
>>> print(transformed_X_inplace)
shape: (5, 3)
┌──────┬───────┬──────┐
│ A    ┆ B     ┆ C    │
│ ---  ┆ ---   ┆ ---  │
│ f64  ┆ f64   ┆ f64  │
╞══════╪═══════╪══════╡
│ 1.0  ┆ -24.8 ┆ 10.0 │
│ 2.0  ┆ 5.0   ┆ 20.0 │
│ 3.0  ┆ 6.0   ┆ 30.0 │
│ 4.0  ┆ 7.0   ┆ 40.0 │
│ 42.8 ┆ 8.0   ┆ 50.0 │
└──────┴───────┴──────┘
- class gators.clippers.IQRClipper
Bases: BaseModel, BaseEstimator, TransformerMixin
Clip numeric values based on Interquartile Range (IQR).
This transformer caps values that fall outside the range [Q1 - n_iqrs*IQR, Q3 + n_iqrs*IQR], where Q1 and Q3 are the first and third quartiles, and IQR = Q3 - Q1. This is a robust method commonly used for outlier detection (n_iqrs=1.5 is the standard for box plots).
- Parameters:
n_iqrs (float, default=1.5) – Number of IQRs beyond Q1/Q3 to use for clipping bounds. Must be a positive number. Common values:
- 1.5: Standard box plot outlier threshold
- 3.0: Extreme outlier threshold
subset (Optional[List[str]], default=None) – List of numeric columns to clip. If None, all numeric columns are selected.
inplace (bool, default=True) – If True, clip values in the original columns. If False, create new columns with suffix ‘__clip_iqr’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after clipping. Ignored when inplace=True.
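The range [Q1 - n_iqrs*IQR, Q3 + n_iqrs*IQR] can be worked through on a small column with the standard library. iqr_bounds is a hypothetical helper. The quartile method (linear interpolation, via statistics.quantiles with method="inclusive") is an assumption; a different interpolation rule would shift the bounds slightly.

```python
import statistics


def iqr_bounds(values, n_iqrs=1.5):
    """Return (Q1 - n_iqrs*IQR, Q3 + n_iqrs*IQR) for one column."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - n_iqrs * iqr, q3 + n_iqrs * iqr


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0]
lower, upper = iqr_bounds(values)   # Q1 = 3.0, Q3 = 7.0, IQR = 4.0
print((lower, upper))               # (-3.0, 13.0)

clipped = [min(max(v, lower), upper) for v in values]
print(clipped[-1])                  # 13.0
```

Because the quartiles barely move when the largest value changes, the outlier at 40.0 does not widen its own bounds.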
Examples
>>> import polars as pl
>>> from gators.clippers import IQRClipper
>>> # Sample DataFrame with outliers
>>> X = pl.DataFrame({
...     'A': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 100.0],
...     'B': [-100.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0],
... })
>>> # Clip using 1.5 IQRs (default, standard box plot threshold)
>>> clipper = IQRClipper(inplace=False)
>>> clipper.fit(X)
IQRClipper(n_iqrs=1.5, subset=['A', 'B'], drop_columns=True, inplace=False)
>>> transformed_X = clipper.transform(X)
>>> print(transformed_X)
shape: (12, 2)
┌─────────────┬─────────────┐
│ A__clip_iqr ┆ B__clip_iqr │
│ ---         ┆ ---         │
│ f64         ┆ f64         │
╞═════════════╪═════════════╡
│ 10.0        ┆ 1.25        │
│ 11.0        ┆ 10.0        │
│ 12.0        ┆ 11.0        │
│ 13.0        ┆ 12.0        │
│ 14.0        ┆ 13.0        │
│ 15.0        ┆ 14.0        │
│ 16.0        ┆ 15.0        │
│ 17.0        ┆ 16.0        │
│ 18.0        ┆ 17.0        │
│ 19.0        ┆ 18.0        │
│ 20.0        ┆ 19.0        │
│ 28.75       ┆ 20.0        │
└─────────────┴─────────────┘
>>> # More conservative clipping with 3 IQRs
>>> clipper_3iqr = IQRClipper(n_iqrs=3.0, inplace=False)
>>> clipper_3iqr.fit(X)
IQRClipper(n_iqrs=3.0, subset=['A', 'B'], drop_columns=True, inplace=False)
>>> transformed_X_3iqr = clipper_3iqr.transform(X)
>>> print(transformed_X_3iqr)
shape: (12, 2)
┌─────────────┬─────────────┐
│ A__clip_iqr ┆ B__clip_iqr │
│ ---         ┆ ---         │
│ f64         ┆ f64         │
╞═════════════╪═════════════╡
│ 10.0        ┆ -15.0       │
│ 11.0        ┆ 10.0        │
│ 12.0        ┆ 11.0        │
│ 13.0        ┆ 12.0        │
│ 14.0        ┆ 13.0        │
│ 15.0        ┆ 14.0        │
│ 16.0        ┆ 15.0        │
│ 17.0        ┆ 16.0        │
│ 18.0        ┆ 17.0        │
│ 19.0        ┆ 18.0        │
│ 20.0        ┆ 19.0        │
│ 43.0        ┆ 20.0        │
└─────────────┴─────────────┘
- class gators.clippers.MADClipper
Bases: BaseModel, BaseEstimator, TransformerMixin
Clip numeric values based on Median Absolute Deviation (MAD).
This transformer caps values that are more than n_mads times the MAD away from the median. MAD is a robust measure of variability that is less sensitive to outliers than standard deviation.
MAD = median(abs(X - median(X)))
Clipping bounds: [median - n_mads*MAD, median + n_mads*MAD]
- Parameters:
n_mads (float, default=3.0) – Number of MADs from the median to use for clipping bounds. Must be a positive number.
subset (Optional[List[str]], default=None) – List of numeric columns to clip. If None, all numeric columns are selected.
inplace (bool, default=True) – If True, clip values in the original columns. If False, create new columns with suffix ‘__clip_mad’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after clipping. Ignored when inplace=True.
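The MAD formula and clipping bounds given above translate directly into a few lines of standard-library Python. mad_bounds is a hypothetical helper that follows the unscaled definition MAD = median(abs(X - median(X))); whether the library applies any additional scaling constant is not stated here, so none is assumed.

```python
import statistics


def mad_bounds(values, n_mads=3.0):
    """Return (median - n_mads*MAD, median + n_mads*MAD) for one column."""
    median = statistics.median(values)
    # Unscaled median absolute deviation, as in the formula above.
    mad = statistics.median([abs(v - median) for v in values])
    return median - n_mads * mad, median + n_mads * mad


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0]
lower, upper = mad_bounds(values)   # median = 5.0, MAD = 2.0
print((lower, upper))               # (-1.0, 11.0)

clipped = [min(max(v, lower), upper) for v in values]
print(clipped[-1])                  # 11.0
```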
Examples
>>> import polars as pl
>>> from gators.clippers import MADClipper
>>> # Sample DataFrame with outliers
>>> X = pl.DataFrame({
...     'A': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 100.0],
...     'B': [-100.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0],
... })
>>> # Clip using 3 MADs (default)
>>> clipper = MADClipper(inplace=False)
>>> clipper.fit(X)
MADClipper(n_mads=3.0, subset=['A', 'B'], drop_columns=True, inplace=False)
>>> transformed_X = clipper.transform(X)
>>> print(transformed_X)
shape: (12, 2)
┌─────────────┬─────────────┐
│ A__clip_mad ┆ B__clip_mad │
│ ---         ┆ ---         │
│ f64         ┆ f64         │
╞═════════════╪═════════════╡
│ 10.0        ┆ -12.5       │
│ 11.0        ┆ 10.0        │
│ 12.0        ┆ 11.0        │
│ 13.0        ┆ 12.0        │
│ 14.0        ┆ 13.0        │
│ 15.0        ┆ 14.0        │
│ 16.0        ┆ 15.0        │
│ 17.0        ┆ 16.0        │
│ 18.0        ┆ 17.0        │
│ 19.0        ┆ 18.0        │
│ 20.0        ┆ 19.0        │
│ 27.5        ┆ 20.0        │
└─────────────┴─────────────┘
>>> # More aggressive clipping with 2 MADs
>>> clipper_2mad = MADClipper(n_mads=2.0, inplace=False)
>>> clipper_2mad.fit(X)
MADClipper(n_mads=2.0, subset=['A', 'B'], drop_columns=True, inplace=False)
>>> transformed_X_2mad = clipper_2mad.transform(X)
>>> print(transformed_X_2mad)
shape: (12, 2)
┌─────────────┬─────────────┐
│ A__clip_mad ┆ B__clip_mad │
│ ---         ┆ ---         │
│ f64         ┆ f64         │
╞═════════════╪═════════════╡
│ 10.0        ┆ -5.0        │
│ 11.0        ┆ 10.0        │
│ 12.0        ┆ 11.0        │
│ 13.0        ┆ 12.0        │
│ 14.0        ┆ 13.0        │
│ 15.0        ┆ 14.0        │
│ 16.0        ┆ 15.0        │
│ 17.0        ┆ 16.0        │
│ 18.0        ┆ 17.0        │
│ 19.0        ┆ 18.0        │
│ 20.0        ┆ 19.0        │
│ 22.5        ┆ 20.0        │
└─────────────┴─────────────┘
- class gators.clippers.QuantileClipper
Bases: BaseModel, BaseEstimator, TransformerMixin
Clip numeric values based on quantile thresholds.
This transformer caps values below the lower quantile and above the upper quantile. This is useful for removing extreme outliers while preserving the bulk of the data distribution.
- Parameters:
lower_quantile (float, default=0.01) – Lower quantile threshold (0 to 1). Values below this quantile are clipped.
upper_quantile (float, default=0.99) – Upper quantile threshold (0 to 1). Values above this quantile are clipped.
subset (Optional[List[str]], default=None) – List of numeric columns to clip. If None, all numeric columns are selected.
inplace (bool, default=True) – If True, clip values in the original columns. If False, create new columns with suffix ‘__clip_quantile’.
drop_columns (bool, default=True) – If inplace=False, whether to drop the original columns after clipping. Ignored when inplace=True.
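Quantile clipping can be sketched the same way: compute the two quantiles once, then cap values outside them. The quantile and quantile_bounds helpers below are hypothetical, and linear interpolation between neighbouring sorted values is an assumption about how the thresholds are computed.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def quantile_bounds(values, lower_quantile=0.01, upper_quantile=0.99):
    """Return the (lower, upper) clipping thresholds for one column."""
    ordered = sorted(values)
    return quantile(ordered, lower_quantile), quantile(ordered, upper_quantile)


values = [-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
lower, upper = quantile_bounds(values, lower_quantile=0.1, upper_quantile=0.9)
print((lower, upper))   # (1.0, 9.0)

clipped = [min(max(v, lower), upper) for v in values]
print(clipped[0], clipped[-1])   # 1.0 9.0
```

On small columns the default 0.01 and 0.99 thresholds sit very close to the minimum and maximum, so they only trim the extremes slightly; wider cut-offs such as 0.1 and 0.9 clip more.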
Examples
>>> import polars as pl
>>> from gators.clippers import QuantileClipper
>>> # Sample DataFrame with outliers
>>> X = pl.DataFrame({
...     'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],
...     'B': [-50.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
... })
>>> # Clip using 1st and 99th percentiles (default)
>>> clipper = QuantileClipper(inplace=False)
>>> clipper.fit(X)
QuantileClipper(lower_quantile=0.01, upper_quantile=0.99, subset=['A', 'B'], drop_columns=True, inplace=False)
>>> transformed_X = clipper.transform(X)
>>> print(transformed_X)
shape: (10, 2)
┌──────────────────┬──────────────────┐
│ A__clip_quantile ┆ B__clip_quantile │
│ ---              ┆ ---              │
│ f64              ┆ f64              │
╞══════════════════╪══════════════════╡
│ 1.09             ┆ -45.1            │
│ 2.0              ┆ 2.0              │
│ 3.0              ┆ 3.0              │
│ 4.0              ┆ 4.0              │
│ 5.0              ┆ 5.0              │
│ 6.0              ┆ 6.0              │
│ 7.0              ┆ 7.0              │
│ 8.0              ┆ 8.0              │
│ 9.0              ┆ 9.0              │
│ 9.91             ┆ 9.91             │
└──────────────────┴──────────────────┘
>>> # More aggressive clipping with 5th and 95th percentiles
>>> clipper_5_95 = QuantileClipper(lower_quantile=0.05, upper_quantile=0.95, inplace=False)
>>> clipper_5_95.fit(X)
QuantileClipper(lower_quantile=0.05, upper_quantile=0.95, subset=['A', 'B'], drop_columns=True, inplace=False)
>>> transformed_X_5_95 = clipper_5_95.transform(X)
>>> print(transformed_X_5_95)
shape: (10, 2)
┌──────────────────┬──────────────────┐
│ A__clip_quantile ┆ B__clip_quantile │
│ ---              ┆ ---              │
│ f64              ┆ f64              │
╞══════════════════╪══════════════════╡
│ 1.45             ┆ -21.5            │
│ 2.0              ┆ 2.0              │
│ 3.0              ┆ 3.0              │
│ 4.0              ┆ 4.0              │
│ 5.0              ┆ 5.0              │
│ 6.0              ┆ 6.0              │
│ 7.0              ┆ 7.0              │
│ 8.0              ┆ 8.0              │
│ 9.0              ┆ 9.0              │
│ 8.55             ┆ 9.55             │
└──────────────────┴──────────────────┘
>>> # Clip only specific columns
>>> clipper_subset = QuantileClipper(subset=['A'], inplace=False)
>>> clipper_subset.fit(X)
QuantileClipper(lower_quantile=0.01, upper_quantile=0.99, subset=['A'], drop_columns=True, inplace=False)
>>> transformed_X_subset = clipper_subset.transform(X)
>>> print(transformed_X_subset)
shape: (10, 2)
┌───────┬──────────────────┐
│ B     ┆ A__clip_quantile │
│ ---   ┆ ---              │
│ f64   ┆ f64              │
╞═══════╪══════════════════╡
│ -50.0 ┆ 1.09             │
│ 2.0   ┆ 2.0              │
│ 3.0   ┆ 3.0              │
│ 4.0   ┆ 4.0              │
│ 5.0   ┆ 5.0              │
│ 6.0   ┆ 6.0              │
│ 7.0   ┆ 7.0              │
│ 8.0   ┆ 8.0              │
│ 9.0   ┆ 9.0              │
│ 10.0  ┆ 9.91             │
└───────┴──────────────────┘