House Price Prediction with Gators#

This notebook demonstrates how to use the gators library for feature engineering in a regression problem. We’ll predict house prices from the Kaggle House Prices dataset using various gators transformers.

Table of Contents#

  1. Import Libraries

  2. Load and Preprocess Data

  3. Build Feature Engineering Pipeline

  4. Feature Selection with Stability Index

  5. Train Model and Evaluate

  6. Generate Predictions

  7. Summary

Key Features Demonstrated:#

  • Data cleaning and type casting

  • Missing value imputation (simple and group-based)

  • Mathematical feature engineering (ratios, statistics)

  • Feature encoding (one-hot and count encoding)

  • Feature stability analysis for robust feature selection

  • Regression modeling with XGBoost

Dataset: Kaggle House Prices - Advanced Regression Techniques

1. Import Libraries#

Import the necessary libraries including gators transformers for feature engineering.

[1]:
import polars as pl
import numpy as np
from IPython.display import display

from gators.encoders import OneHotEncoder, CountEncoder
from gators.data_cleaning import DropColumns, CastColumns
from gators.feature_generation import MathFeatures, RatioFeatures
from gators.imputers import StringImputer, NumericImputer, GroupByImputer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import root_mean_squared_error, make_scorer

from xgboost import XGBRegressor
from gators.feature_selection.feature_stability_index import feature_stability_index

2. Load and Preprocess Data#

Load the house prices dataset and perform initial data cleaning:

  • Handle ‘NA’ string values as null

  • Apply log transformation to the target (SalePrice) to normalize distribution

  • Clean inconsistent category values

  • Fix problematic year values in GarageYrBlt

[2]:
# Load data with 'NA' strings interpreted as null values
X_train = pl.read_csv('../../../kaggle/house/train.csv', null_values=['NA'])
X_test = pl.read_csv('../../../kaggle/house/test.csv', null_values=['NA'])

# Apply log transformation to target variable for better distribution
X_train = X_train.with_columns(pl.col('SalePrice').log().alias('target'))

# Separate the target, then drop the ID column and the near-constant Utilities feature
y_train = X_train["target"]
X_train = X_train.drop(['Id', 'Utilities', "target", "SalePrice"])
X_test = X_test.drop(['Utilities'])

def clean_data(X):
    """Clean inconsistent values and fix data quality issues"""
    return X.with_columns(
        # Fix Exterior2nd inconsistent naming
        pl.when(pl.col('Exterior2nd') == 'Brk Cmn')
        .then(pl.lit('BrkComm'))
        .otherwise(pl.col('Exterior2nd'))
        .alias('Exterior2nd'),

        # Fix implausible GarageYrBlt values: keep years <= 2010, fall back to
        # YearBuilt otherwise (a null year also falls through to the otherwise branch)
        pl.when(pl.col('GarageYrBlt') <= 2010)
        .then(pl.col('GarageYrBlt'))
        .otherwise(pl.col('YearBuilt'))
        .alias('GarageYrBlt')
    )

X_train = clean_data(X_train)
X_test = clean_data(X_test)

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
Training data shape: (1460, 78)
Test data shape: (1459, 79)

3. Build Feature Engineering Pipeline#

The gators Pipeline orchestrates multiple feature transformers:

Data Type Management:#

  • CastColumns: Convert boolean and numeric codes to strings for categorical encoding

Missing Value Imputation:#

  • StringImputer: Fill missing categorical values with constants or most frequent values

  • NumericImputer: Fill missing numeric values with constants (0 for counts/areas)

  • GroupByImputer: Fill missing values using group-based statistics (e.g., median by neighborhood)

Feature Engineering:#

  • RatioFeatures: Create ratio features (e.g., living area / lot area)

  • MathFeatures: Generate statistical features from related columns (sum, std, range, min, max)

Encoding:#

  • OneHotEncoder: Create binary indicator features for categories

  • CountEncoder: Replace categories with their frequency counts

[3]:
# Identify columns by data type for appropriate transformations
dtypes = dict(zip(X_train.columns, X_train.dtypes))
boolean_columns = [col for col, dtype in dtypes.items() if dtype == pl.Boolean]
int8_columns = [col for col, dtype in dtypes.items() if dtype == pl.UInt8]
columns_to_encode = [col for col, dtype in dtypes.items() if dtype == pl.String] + boolean_columns

# Numeric columns that should be treated as categories
num_columns_to_cast = ['MSSubClass', 'OverallCond', 'YrSold', 'MoSold']

# Build the pipeline
steps = [
    # Cast data types for proper handling
    ('cast_boolean_to_string', CastColumns(
        subset=boolean_columns,
        dtype=pl.String,
        inplace=True
    )),
    ('cast_num_to_string', CastColumns(
        subset=num_columns_to_cast,
        dtype=pl.String,
        inplace=True
    )),
    ('cast_int_to_float', CastColumns(
        subset=int8_columns,
        dtype=pl.Float64,
        inplace=True
    )),

    # Impute missing string values
    ('string_imputer_missing', StringImputer(
        strategy='constant',
        value='MISSING',
        subset=['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
                 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                 'BsmtFinType2', 'MasVnrType'],
        inplace=True
    )),
    ('string_imputer_const', StringImputer(
        strategy='constant',
        value='Typ',
        subset=['Functional'],
        inplace=True
    )),
    ('string_imputer_most_freq', StringImputer(
        strategy='most_frequent',
        subset=['MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st', 'SaleType'] + num_columns_to_cast,
        inplace=True
    )),

    # Impute missing numeric values
    ('numerical_imputer_const', NumericImputer(
        strategy='constant',
        value=0,
        subset=['GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1',
                 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
                 'BsmtHalfBath', 'MasVnrArea'],
        inplace=True
    )),

    # Group-based imputation: use neighborhood median for lot frontage
    ('group_imputer', GroupByImputer(
        group_by_column='Neighborhood',
        strategy='median',
        subset=['LotFrontage'],
        inplace=True
    )),

    # Create ratio features
    ('ratio_features', RatioFeatures(
        numerator_columns=['GrLivArea'],
        denominator_columns=['LotArea'],
        new_column_names=['LivLotArea']
    )),

    # Generate statistical features from related area measurements
    ('stat', MathFeatures(
        groups=[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF']],
        operations=['sum', 'std', 'range', 'min', 'max']
    )),

    # Encode categorical variables
    ('onehot', OneHotEncoder(
        subset=columns_to_encode,
        drop_columns=False
    )),
    ('count_encoder', CountEncoder(
        subset=columns_to_encode,
        drop_columns=True,
        inplace=True
    )),
]

pipe = Pipeline(steps)
X_train = pipe.fit_transform(X_train)
X_test = pipe.transform(X_test)

print(f"\nEngineered training features: {X_train.shape[1]}")
print(f"Engineered test features (Id still included): {X_test.shape[1]}")

Engineered training features: 349
Engineered test features (Id still included): 350

4. Feature Selection with Stability Index#

Use the feature_stability_index to identify robust features that consistently contribute to model performance across multiple training runs. This helps prevent overfitting by selecting only stable, reliable features.
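As a rough illustration of the idea (gators has its own definition; this is not its implementation), a stability score can be built by refitting a model on each cross-validation fold and measuring how consistently each feature receives non-zero importance. A sketch on synthetic data, with a plain decision tree standing in for XGBoost:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: 10 features, only 3 truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
importances = []
for train_idx, _ in kf.split(X):
    model = DecisionTreeRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    importances.append(model.feature_importances_)
imp = np.array(importances)  # shape: (n_folds, n_features)

# Stability: fraction of folds in which the feature was used at all
stability = (imp > 0).mean(axis=0)
mean_importance = imp.mean(axis=0)
stable = np.where((stability == 1.0) & (mean_importance > 0))[0]
print("features stable across all folds:", stable)
```

Features that appear with non-zero importance in every fold are the analogue of the `fsi == 1.0` rows selected below.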

[ ]:
# Define conservative model parameters to prevent overfitting
conservative_params = {
    'max_depth': 4,
    'min_child_weight': 3,
    'learning_rate': 0.05,
    'n_estimators': 500,
    'gamma': 0.1,
    'reg_alpha': 0.5,
    'reg_lambda': 1.0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'tree_method': 'hist',
    'random_state': 42,
    'n_jobs': -1
}

# Calculate feature stability index
estimator = XGBRegressor(**conservative_params)

from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fsi_results = feature_stability_index(estimator, skf=kf, X=X_train, y=y_train)

# Filter to stable features with non-zero importance
fsi_results = fsi_results.filter(
    (pl.col("importance") != 0) & (pl.col("fsi") != 0)
)
fsi_results = fsi_results.sort(by="importance", descending=True)

# Select stable features
selected_features = fsi_results["feature"].to_list()
print(f"Number of selected stable features: {len(selected_features)}")
print(f"\nTop 10 features by importance:")
display(fsi_results.head(10))
Number of selected stable features: 151

Top 10 features by importance:
shape: (10, 3)
feature                              fsi    importance
str                                  f64    f32
"ExterQual"                          1.0    0.14994
"OverallQual"                        1.0    0.133964
"TotalBsmtSF_1stFlrSF_2ndFlrSF_…     1.0    0.069442
"GarageCars"                         1.0    0.02874
"GrLivArea"                          1.0    0.024384
"TotalBsmtSF"                        1.0    0.022596
"CentralAir"                         1.0    0.021065
"CentralAir__Y"                      1.0    0.020707
"BsmtQual__Gd"                       0.8    0.017674
"GarageQual"                         1.0    0.016767

5. Train Model and Evaluate#

Train an XGBoost regression model using only the stable features and evaluate performance with cross-validation.

[5]:
# Sanity check: make sure the target did not leak into the feature set
"target" in X_train.columns
[5]:
False
[6]:
# Prepare training data with selected features
X_train = X_train.select(selected_features).with_columns(pl.all().cast(pl.Float64))

# Reuse the conservative parameters for final training
final_params = conservative_params.copy()

# Create RMSE scorer (lower is better)
rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)

# Train and evaluate with cross-validation
estimator = XGBRegressor(**final_params)
cv_scores = cross_val_score(estimator, X_train, y_train, scoring=rmse_scorer, cv=5)

print(f"Cross-Validation RMSE: {-cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Train final model on all training data
estimator.fit(X_train, y_train)
print("\nFinal model trained successfully!")
Cross-Validation RMSE: 0.1286 (+/- 0.0095)

Final model trained successfully!
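Because the model is trained on log(SalePrice), the CV RMSE above is in log units, which makes it roughly a multiplicative (relative) error on the price scale. A quick back-of-the-envelope check using the RMSE from the run above:

```python
import numpy as np

# An error of 0.1286 in log-space corresponds to a multiplicative factor on price
factor = np.exp(0.1286)
print(f"typical over/under-prediction factor: {factor:.3f}")  # ~1.137, i.e. ~13.7%
```

So a typical prediction is off by about 13-14% of the true sale price, regardless of whether the house costs $50k or $500k; that scale-invariance is one reason for log-transforming the target.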

6. Generate Predictions#

Apply the trained model to generate predictions for the test set and create a submission file.

[7]:
# Prepare test data
id_test = X_test["Id"]
X_test = X_test.select(selected_features).with_columns(pl.all().cast(pl.Float64))

# Generate predictions (remember to reverse log transformation)
log_predictions = estimator.predict(X_test)
predictions = np.exp(log_predictions)  # Reverse log transformation

# Create submission file
submission = pl.DataFrame({
    "Id": id_test,
    "SalePrice": predictions
})
submission.write_csv("house_price_submission.csv")

print("Submission file created successfully!")
print(f"\nPredictions summary:")
print(f"Mean: ${predictions.mean():,.2f}")
print(f"Median: ${np.median(predictions):,.2f}")
print(f"Min: ${predictions.min():,.2f}")
print(f"Max: ${predictions.max():,.2f}")
Submission file created successfully!

Predictions summary:
Mean: $177,096.95
Median: $155,665.30
Min: $47,344.79
Max: $484,264.44

Summary#

This notebook demonstrates the power of the gators library for feature engineering in regression tasks:

Key Takeaways:#

  1. Comprehensive Imputation: Multiple strategies for handling missing values (constant, most frequent, group-based)

  2. Smart Type Handling: Automatic casting and conversion of data types for proper feature engineering

  3. Mathematical Features: Create ratio and statistical features from related columns

  4. Feature Stability: Use FSI (Feature Stability Index) for robust feature selection

  5. Flexible Encoding: Support for both one-hot and count encoding strategies

  6. Pipeline Integration: Seamlessly integrates with scikit-learn Pipeline and XGBoost

The gators library simplifies complex feature engineering workflows while keeping the code clear and reproducible; the GroupByImputer transformer in particular makes context-aware missing-value handling straightforward.