
Gators is a fast data preprocessing and feature engineering library built on top of Polars. It uses Polars' multi-core processing to streamline the ML workflow from raw data to production-ready models.

Built by the PSP Data Team at PayPal, Gators makes data preprocessing and feature engineering both faster and simpler.

Key Features

  • 🚀 Lightning Fast: Built on Polars for multi-core parallel processing

  • 🔄 Unified API: Consistent sklearn-style .fit() and .transform() interface

  • 📦 Production Ready: Deploy the same Python code from notebook to production

  • 🎯 Comprehensive: 60+ preprocessing transformers covering common preprocessing and feature engineering needs

  • 🔗 Pipeline Support: Chain transformers seamlessly with the Pipeline class

  • 🎓 Easy to Learn: If you know sklearn, the API will feel familiar
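To make the sklearn-style contract concrete, here is a minimal, dependency-free sketch of the fit/transform and pipeline pattern. It is not Gators' implementation: the classes `MedianImputer`, `MinMaxScaler`, and `SimplePipeline` are hypothetical stand-ins, and plain dicts of lists are used in place of Polars DataFrames so the example runs with the standard library alone.

```python
from statistics import median


class MedianImputer:
    """Hypothetical transformer: learns per-column medians, then fills None."""

    def fit(self, columns):
        # Learn one median per column from the non-missing values.
        self.medians_ = {
            name: median(v for v in values if v is not None)
            for name, values in columns.items()
        }
        return self

    def transform(self, columns):
        # Apply the learned medians; nothing is re-estimated here.
        return {
            name: [self.medians_[name] if v is None else v for v in values]
            for name, values in columns.items()
        }


class MinMaxScaler:
    """Hypothetical transformer: rescales each column using fitted bounds."""

    def fit(self, columns):
        self.bounds_ = {n: (min(v), max(v)) for n, v in columns.items()}
        return self

    def transform(self, columns):
        out = {}
        for name, values in columns.items():
            lo, hi = self.bounds_[name]
            span = (hi - lo) or 1.0  # avoid dividing by zero on constant columns
            out[name] = [(v - lo) / span for v in values]
        return out


class SimplePipeline:
    """Hypothetical pipeline: runs (name, transformer) steps in order."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, columns):
        for _, step in self.steps:
            columns = step.fit(columns).transform(columns)
        return columns

    def transform(self, columns):
        for _, step in self.steps:
            columns = step.transform(columns)
        return columns


train = {"amount": [10.0, None, 30.0, 50.0]}
pipeline = SimplePipeline([("impute", MedianImputer()), ("scale", MinMaxScaler())])
train_out = pipeline.fit_transform(train)

# New data reuses the statistics learned during fitting.
new_out = pipeline.transform({"amount": [None, 20.0]})
```

The point of the split is that `fit` learns statistics once from training data, while `transform` only applies them, so the same fitted object can serve a notebook and a production job.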

Quick Start

import polars as pl
from gators.data_cleaning import DropHighNaNRatio, VarianceFilter
from gators.encoders import OneHotEncoder
from gators.imputers import NumericImputer
from gators.scalers import StandardScaler
from gators.pipeline import Pipeline

# Load your data
X = pl.read_csv("data.csv")

# Build a preprocessing pipeline
pipeline = Pipeline([
    ('drop_nan', DropHighNaNRatio(threshold=0.5)),
    ('impute', NumericImputer(strategy='median')),
    ('variance', VarianceFilter(threshold=0.01)),
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler())
])

# Fit and transform
X_processed = pipeline.fit_transform(X)

# Deploy the same pipeline in production!

What Can Gators Do?

70+ transformers across the following categories:

  • 🧹 Data Cleaning - Quality filters, deduplication, and more

  • ✂️ Clippers - Custom min/max bounds, Gaussian, IQR, MAD, Quantile, and more

  • 🧩 Encoders - OneHot, Target, WOE, CatBoost, and more

  • 🎯 Numeric Features - Polynomial, rule-based features, and more

  • 📝 String Features - Text properties, pattern detection, n-grams, and more

  • 📅 DateTime Features - Temporal patterns, cyclical encoding, holidays, and more

  • 🔄 Imputation - Numeric, string, boolean, and group-based strategies

  • 📊 Discretization - Equal-width, quantile, tree-based binning, and more

  • ⚖️ Scaling - Standard, min-max, Box-Cox, and more

  • 🔗 Pipeline - Chain transformers seamlessly
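As an illustration of one technique named in the list, the sketch below shows the general idea behind cyclical encoding of a periodic value such as hour of day. It uses only the standard library and a hypothetical helper, `cyclical_encode`; it is not taken from Gators, whose own transformer names and parameters may differ.

```python
import math


def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) columns."""
    angles = [2 * math.pi * v / period for v in values]
    return [math.sin(a) for a in angles], [math.cos(a) for a in angles]


hours = [0, 6, 12, 23]
hour_sin, hour_cos = cyclical_encode(hours, period=24)


def circle_distance(i, j):
    # Euclidean distance between two encoded points.
    return math.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])


# 23:00 and 00:00 end up close together, unlike in the raw integer encoding.
late_to_midnight = circle_distance(3, 0)
midnight_to_six = circle_distance(0, 1)
```

Encoding the value as a point on a circle keeps the end of one period adjacent to the start of the next, which a raw integer column does not.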
