gators.pipeline package

Module contents

class gators.pipeline.Pipeline

Bases: gators.transformer._base_transformer._BaseTransformer

Pipeline of transformers for Polars DataFrames.

Sequentially applies a list of transforms. This is a lightweight alternative to sklearn.pipeline.Pipeline specifically designed for Gators transformers that work with Polars DataFrames.

Parameters:
  • steps (list[tuple[str, Any]]) – List of (name, transform) tuples that are chained in the order they are specified. Each transform must implement fit and transform methods.

  • verbose (bool, default=False) – If True, emits a one-line summary per step to stdout showing the step name, row count, column count, total null count, and wall-clock time. When False there is zero measurement overhead.
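The verbose behaviour described above can be pictured with a minimal sketch. It is not the Gators implementation: run_step and fill_age are hypothetical helpers, a list of dicts stands in for a Polars DataFrame, and the exact format of the summary line is an assumption. The point is only that timing and counting happen on the verbose path and nowhere else.

```python
import time


def run_step(name, func, rows, verbose=False):
    """Apply func to rows; print a one-line summary only when verbose."""
    if not verbose:
        return func(rows)  # no timing or counting on this path

    start = time.perf_counter()
    out = func(rows)
    elapsed = time.perf_counter() - start

    n_cols = len(out[0]) if out else 0
    n_nulls = sum(value is None for row in out for value in row.values())
    # Assumed summary layout; the real output format is not documented here.
    print(f"[{name}] rows={len(out)} cols={n_cols} nulls={n_nulls} time={elapsed:.4f}s")
    return out


def fill_age(rows):
    """Hypothetical step: replace a missing age with 0."""
    return [{**row, "age": 0 if row["age"] is None else row["age"]} for row in rows]


rows = [{"age": 31, "city": None}, {"age": None, "city": "Oslo"}]
filled = run_step("fill_age", fill_age, rows, verbose=True)
```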

Examples

>>> from gators.pipeline import Pipeline
>>> from gators.imputers import NumericImputer, StringImputer
>>> from gators.encoders import WOEEncoder
>>>
>>> steps = [
...     ('numeric_imputer', NumericImputer(strategy='median')),
...     ('string_imputer', StringImputer(strategy='constant', value='MISSING')),
...     ('woe_encoder', WOEEncoder(subset=['cat_col']))
... ]
>>> pipe = Pipeline(steps=steps)
>>> _ = pipe.fit(X_train, y=y_train)  # fit returns the fitted pipeline
>>> X_transformed = pipe.transform(X_train)

fit(X: polars.DataFrame, y: polars.Series | None = None) → gators.pipeline.pipeline.Pipeline

Fit all transformers in the pipeline.

Fits each transformer sequentially, transforming the data before fitting the next transformer. This ensures each transformer sees the output of the previous transformer.

Parameters:
  • X (pl.DataFrame) – Input DataFrame to fit.

  • y (pl.Series, default=None) – Target series for supervised transformers (e.g., WOEEncoder).

Returns:

The fitted pipeline instance.

Return type:

Pipeline
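
The sequential fitting described for fit() can be sketched without Gators or Polars. Everything below is illustrative: MeanImputer, MaxScaler, and fit_sequentially are hypothetical stand-ins, and plain lists take the place of DataFrames. The sketch shows why each step must see the previous step's output: the scaler could not be fitted on data that still contains missing values.

```python
class MeanImputer:
    """Toy transformer: replaces None with the mean learned in fit."""

    def fit(self, X, y=None):
        observed = [v for v in X if v is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, X):
        return [self.mean_ if v is None else v for v in X]


class MaxScaler:
    """Toy transformer: divides every value by the maximum learned in fit."""

    def fit(self, X, y=None):
        # max() raises TypeError if X still contains None,
        # so this step depends on receiving imputed data.
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [v / self.max_ for v in X]


def fit_sequentially(steps, X, y=None):
    """Fit each step on the output of the previous step."""
    for _name, transformer in steps:
        transformer.fit(X, y)
        X = transformer.transform(X)
    return steps


steps = [("impute", MeanImputer()), ("scale", MaxScaler())]
fit_sequentially(steps, [1.0, None, 3.0])
```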

transform(X: polars.DataFrame) → polars.DataFrame

Transform data by applying all transformers in sequence.

Parameters:

X (pl.DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

pl.DataFrame

fit_transform(X: polars.DataFrame, y: polars.Series | None = None) → polars.DataFrame

Fit all transformers and transform the data.

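Fits each transformer and immediately transforms the data with it before moving to the next step. Because fit() already transforms the data between steps, calling fit() followed by transform() applies those transforms a second time; fit_transform() avoids that repeated work.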

Parameters:
  • X (pl.DataFrame) – Input DataFrame to fit and transform.

  • y (pl.Series, default=None) – Target series for supervised transformers.

Returns:

Transformed DataFrame.

Return type:

pl.DataFrame
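
The efficiency difference can be made concrete by counting transform calls. This is a sketch under assumptions, not the library's code: CountingStep and the three module-level functions are hypothetical, and the real fit() may, for instance, skip transforming after the final step.

```python
class CountingStep:
    """Toy pass-through transformer that records how often transform runs."""

    def __init__(self):
        self.transform_calls = 0

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        self.transform_calls += 1
        return X


def fit(steps, X, y=None):
    # Each step is fitted on the previous step's output.
    for _name, step in steps:
        X = step.fit(X, y).transform(X)


def transform(steps, X):
    for _name, step in steps:
        X = step.transform(X)
    return X


def fit_transform(steps, X, y=None):
    # Same loop as fit, but the final transformed data is returned.
    for _name, step in steps:
        X = step.fit(X, y).transform(X)
    return X


two_pass = [("a", CountingStep()), ("b", CountingStep())]
fit(two_pass, [1, 2, 3])
transform(two_pass, [1, 2, 3])

one_pass = [("a", CountingStep()), ("b", CountingStep())]
result = fit_transform(one_pass, [1, 2, 3])
```

In this toy setup each step in two_pass transforms the data twice, while each step in one_pass transforms it once.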

get_params(deep: bool = True) → dict

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, returns parameters of all sub-estimators. If False, only returns pipeline-level parameters.

Returns:

Parameter names mapped to their values. Nested parameters use double-underscore notation (e.g., step_name__param).

Return type:

dict
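
The double-underscore naming can be illustrated with a small sketch. Imputer and pipeline_params are hypothetical stand-ins, and the exact set of keys the real get_params() returns (for example, whether the step objects themselves appear) is an assumption.

```python
class Imputer:
    """Toy transformer exposing its constructor parameters."""

    def __init__(self, strategy="median"):
        self.strategy = strategy

    def get_params(self):
        return {"strategy": self.strategy}


def pipeline_params(steps, verbose=False, deep=True):
    """Collect pipeline-level parameters and, if deep, nested ones."""
    params = {"steps": steps, "verbose": verbose}
    if deep:
        for name, transformer in steps:
            for key, value in transformer.get_params().items():
                # Nested key: <step name>__<parameter name>
                params[f"{name}__{key}"] = value
    return params


steps = [("impute", Imputer(strategy="median"))]
deep_params = pipeline_params(steps, deep=True)
shallow_params = pipeline_params(steps, deep=False)
```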

clone() → gators.pipeline.pipeline.Pipeline

Return a new unfitted pipeline with the same hyperparameters.

Each transformer is re-instantiated using only its public constructor parameters (obtained via get_params()). Attributes that hold fitted state (e.g. _statistics, mapping_) are not copied, so the returned pipeline is guaranteed to be unfitted.

This is the recommended alternative to copy.deepcopy for cross-validation workflows where you need multiple independent copies of the same pipeline configuration.

Returns:

A new, unfitted Pipeline instance with identical hyperparameters.

Return type:

Pipeline
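
The difference between re-instantiating from constructor parameters and copy.deepcopy can be sketched as follows. Imputer and clone_transformer are hypothetical, and the _statistic attribute name is chosen only for illustration; the sketch assumes each transformer can be rebuilt from what get_params() returns.

```python
import copy


class Imputer:
    """Toy transformer whose fitted state lives in a private attribute."""

    def __init__(self, strategy="median"):
        self.strategy = strategy

    def get_params(self):
        return {"strategy": self.strategy}

    def fit(self, X):
        ordered = sorted(X)
        self._statistic = ordered[len(ordered) // 2]  # fitted state
        return self


def clone_transformer(transformer):
    """Rebuild a transformer from its constructor parameters only."""
    return type(transformer)(**transformer.get_params())


fitted = Imputer(strategy="median").fit([3, 1, 2])
fresh = clone_transformer(fitted)   # hyperparameters only
copied = copy.deepcopy(fitted)      # hyperparameters and fitted state
```

Here fresh carries the same strategy but no _statistic, while copied still holds the fitted value, which is the situation a cross-validation loop usually wants to avoid.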

Examples

>>> from gators.pipeline import Pipeline
>>> from gators.imputers import NumericImputer
>>> from gators.scalers import StandardScaler
>>> pipe = Pipeline(steps=[
...     ('impute', NumericImputer(strategy='median')),
...     ('scale', StandardScaler()),
... ])
>>> pipe_clone = pipe.clone()
>>> pipe_clone is pipe
False
>>> pipe_clone.named_steps['impute'] is pipe.named_steps['impute']
False

set_params(**params)

Set parameters for this estimator.

Parameters:

**params (dict) – Estimator parameters. Use double underscore notation for nested parameters (e.g., step_name__param_name=value).

Returns:

The pipeline instance.

Return type:

Pipeline

Raises:

ValueError – If an invalid parameter name is provided.
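
How a nested parameter name might be resolved, and when ValueError would be raised, is sketched below. set_nested_params and Imputer are hypothetical; the sketch handles only step_name__param_name keys and ignores pipeline-level parameters such as verbose, and the real validation rules may differ.

```python
class Imputer:
    """Toy transformer with a single hyperparameter."""

    def __init__(self, strategy="median"):
        self.strategy = strategy


def set_nested_params(steps, **params):
    """Set step_name__param_name values on the matching transformers."""
    named = dict(steps)
    for key, value in params.items():
        step_name, sep, param_name = key.partition("__")
        if not sep or step_name not in named:
            raise ValueError(f"Invalid parameter {key!r}")
        if not hasattr(named[step_name], param_name):
            raise ValueError(f"Invalid parameter {param_name!r} for step {step_name!r}")
        setattr(named[step_name], param_name, value)
    return steps


steps = [("impute", Imputer(strategy="median"))]
set_nested_params(steps, impute__strategy="mean")
```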