House Price Prediction with Gators#

This notebook demonstrates how to use the gators library for feature engineering in a regression problem. We’ll predict house prices from the Kaggle House Prices dataset using various gators transformers.

Table of Contents#

  1. Import Libraries

  2. Load and Preprocess Data

  3. Build Feature Engineering Pipeline

  4. Feature Selection with Stability Index

  5. Train Model and Evaluate

  6. Generate Predictions

  7. Summary

Key Features Demonstrated:#

  • Data cleaning and type casting

  • Missing value imputation (simple and group-based)

  • Mathematical feature engineering (ratios, statistics)

  • Feature encoding (one-hot and count encoding)

  • Feature stability analysis for robust feature selection

  • Regression modeling with XGBoost

Dataset: Kaggle House Prices - Advanced Regression Techniques

1. Import Libraries#

Import the necessary libraries including gators transformers for feature engineering.

[1]:
import polars as pl
import numpy as np
from IPython.display import display

from gators.encoders import OneHotEncoder, CountEncoder
from gators.data_cleaning import DropColumns, CastColumns
from gators.feature_generation import MathFeatures, RatioFeatures
from gators.imputers import StringImputer, NumericImputer, GroupByImputer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import root_mean_squared_error, make_scorer

from xgboost import XGBRegressor
from gators.feature_selection.feature_stability_index import feature_stability_index

2. Load and Preprocess Data#

Load the house prices dataset and perform initial data cleaning:

  • Handle ‘NA’ string values as null

  • Apply log transformation to the target (SalePrice) to normalize distribution

  • Clean inconsistent category values

  • Fix problematic year values in GarageYrBlt

[2]:
# Load data with 'NA' strings interpreted as null values
X_train = pl.read_csv('../../../kaggle/house/train.csv', null_values=['NA'])
X_test = pl.read_csv('../../../kaggle/house/test.csv', null_values=['NA'])

# Apply log transformation to target variable for better distribution
X_train = X_train.with_columns(pl.col('SalePrice').log().alias('target'))

# Separate the target, then drop the ID column and the near-constant Utilities feature
y_train = X_train["target"]
X_train = X_train.drop(['Id', 'Utilities', "target", "SalePrice"])
X_test = X_test.drop(['Utilities'])

def clean_data(X):
    """Clean inconsistent values and fix data quality issues"""
    return X.with_columns(
        # Fix Exterior2nd inconsistent naming
        pl.when(pl.col('Exterior2nd') == 'Brk Cmn')
        .then(pl.lit('BrkComm'))
        .otherwise(pl.col('Exterior2nd'))
        .alias('Exterior2nd'),

        # Fix implausible GarageYrBlt values: keep years <= 2010, fall back to
        # YearBuilt otherwise (a null year also falls through to the otherwise branch)
        pl.when(pl.col('GarageYrBlt') <= 2010)
        .then(pl.col('GarageYrBlt'))
        .otherwise(pl.col('YearBuilt'))
        .alias('GarageYrBlt')
    )

X_train = clean_data(X_train)
X_test = clean_data(X_test)

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
Training data shape: (1460, 78)
Test data shape: (1459, 79)

3. Build Feature Engineering Pipeline#

The gators Pipeline orchestrates multiple feature transformers:

Data Type Management:#

  • CastColumns: Convert boolean and numeric codes to strings for categorical encoding

Missing Value Imputation:#

  • StringImputer: Fill missing categorical values with constants or most frequent values

  • NumericImputer: Fill missing numeric values with constants (0 for counts/areas)

  • GroupByImputer: Fill missing values using group-based statistics (e.g., median by neighborhood)

Feature Engineering:#

  • RatioFeatures: Create ratio features (e.g., living area / lot area)

  • MathFeatures: Generate statistical features from related columns (sum, std, range, min, max)

Encoding:#

  • OneHotEncoder: Create binary indicator features for categories

  • CountEncoder: Replace categories with their frequency counts

[3]:
# Identify columns by data type for appropriate transformations
dtypes = dict(zip(X_train.columns, X_train.dtypes))
boolean_columns = [col for col, dtype in dtypes.items() if dtype == pl.Boolean]
int8_columns = [col for col, dtype in dtypes.items() if dtype == pl.UInt8]
columns_to_encode = [col for col, dtype in dtypes.items() if dtype == pl.String] + boolean_columns

# Numeric columns that should be treated as categories
num_columns_to_cast = ['MSSubClass', 'OverallCond', 'YrSold', 'MoSold']

# Build the pipeline
steps = [
    # Cast data types for proper handling
    ('cast_boolean_to_string', CastColumns(
        subset=boolean_columns,
        dtype=pl.String,
        inplace=True
    )),
    ('cast_num_to_string', CastColumns(
        subset=num_columns_to_cast,
        dtype=pl.String,
        inplace=True
    )),
    ('cast_int_to_float', CastColumns(
        subset=int8_columns,
        dtype=pl.Float64,
        inplace=True
    )),

    # Impute missing string values
    ('string_imputer_missing', StringImputer(
        strategy='constant',
        value='MISSING',
        subset=['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
                 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                 'BsmtFinType2', 'MasVnrType'],
        inplace=True
    )),
    ('string_imputer_const', StringImputer(
        strategy='constant',
        value='Typ',
        subset=['Functional'],
        inplace=True
    )),
    ('string_imputer_most_freq', StringImputer(
        strategy='most_frequent',
        subset=['MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st', 'SaleType'] + num_columns_to_cast,
        inplace=True
    )),

    # Impute missing numeric values
    ('numerical_imputer_const', NumericImputer(
        strategy='constant',
        value=0,
        subset=['GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1',
                 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
                 'BsmtHalfBath', 'MasVnrArea'],
        inplace=True
    )),

    # Group-based imputation: use neighborhood median for lot frontage
    ('group_imputer', GroupByImputer(
        group_by_column='Neighborhood',
        strategy='median',
        subset=['LotFrontage'],
        inplace=True
    )),

    # Create ratio features
    ('ratio_features', RatioFeatures(
        numerator_columns=['GrLivArea'],
        denominator_columns=['LotArea'],
        new_column_names=['LivLotArea']
    )),

    # Generate statistical features from related area measurements
    ('stat', MathFeatures(
        groups=[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF']],
        operations=['sum', 'std', 'range', 'min', 'max']
    )),

    # Encode categorical variables
    ('onehot', OneHotEncoder(
        subset=columns_to_encode,
        drop_columns=False
    )),
    ('count_encoder', CountEncoder(
        subset=columns_to_encode,
        drop_columns=True,
        inplace=True
    )),
]

pipe = Pipeline(steps)
X_train = pipe.fit_transform(X_train)
X_test = pipe.transform(X_test)

print(f"\nEngineered training features: {X_train.shape[1]}")
print(f"Engineered test features (Id still included): {X_test.shape[1]}")

Engineered training features: 349
Engineered test features (Id still included): 350

4. Feature Selection with Stability Index#

Use the feature_stability_index to identify robust features that consistently contribute to model performance across multiple training runs. This helps prevent overfitting by selecting only stable, reliable features.
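As a rough illustration of the idea (gators has its own definition; this is not its implementation), a stability score can be built by refitting a model on each cross-validation fold and measuring how consistently each feature receives non-zero importance. A sketch on synthetic data, with a plain decision tree standing in for XGBoost:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: 10 features, only 3 truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
importances = []
for train_idx, _ in kf.split(X):
    model = DecisionTreeRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    importances.append(model.feature_importances_)
imp = np.array(importances)  # shape: (n_folds, n_features)

# Stability: fraction of folds in which the feature was used at all
stability = (imp > 0).mean(axis=0)
mean_importance = imp.mean(axis=0)
stable = np.where((stability == 1.0) & (mean_importance > 0))[0]
print("features stable across all folds:", stable)
```

Features that appear with non-zero importance in every fold are the analogue of the `fsi == 1.0` rows selected below.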

[ ]:
# Define conservative model parameters to prevent overfitting
conservative_params = {
    'max_depth': 4,
    'min_child_weight': 3,
    'learning_rate': 0.05,
    'n_estimators': 500,
    'gamma': 0.1,
    'reg_alpha': 0.5,
    'reg_lambda': 1.0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'tree_method': 'hist',
    'random_state': 42,
    'n_jobs': -1
}

# Calculate feature stability index
estimator = XGBRegressor(**conservative_params)

from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fsi_results = feature_stability_index(estimator, skf=kf, X=X_train, y=y_train)

# Filter to stable features with non-zero importance
fsi_results = fsi_results.filter(
    (pl.col("importance") != 0) & (pl.col("fsi") != 0)
)
fsi_results = fsi_results.sort(by="importance", descending=True)

# Select stable features
selected_features = fsi_results["feature"].to_list()
print(f"Number of selected stable features: {len(selected_features)}")
print(f"\nTop 10 features by importance:")
display(fsi_results.head(10))
Number of selected stable features: 151

Top 10 features by importance:
shape: (10, 3)
feature                              fsi    importance
str                                  f64    f32
"ExterQual"                          1.0    0.14994
"OverallQual"                        1.0    0.133964
"TotalBsmtSF_1stFlrSF_2ndFlrSF_…     1.0    0.069442
"GarageCars"                         1.0    0.02874
"GrLivArea"                          1.0    0.024384
"TotalBsmtSF"                        1.0    0.022596
"CentralAir"                         1.0    0.021065
"CentralAir__Y"                      1.0    0.020707
"BsmtQual__Gd"                       0.8    0.017674
"GarageQual"                         1.0    0.016767

5. Train Model and Evaluate#

Train an XGBoost regression model using only the stable features and evaluate performance with cross-validation.

[5]:
# Sanity check: make sure the target did not leak into the feature set
"target" in X_train.columns
[5]:
False
[6]:
# Prepare training data with selected features
X_train = X_train.select(selected_features).with_columns(pl.all().cast(pl.Float64))

# Reuse the conservative parameters for final training
final_params = conservative_params.copy()

# Create RMSE scorer (lower is better)
rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)

# Train and evaluate with cross-validation
estimator = XGBRegressor(**final_params)
cv_scores = cross_val_score(estimator, X_train, y_train, scoring=rmse_scorer, cv=5)

print(f"Cross-Validation RMSE: {-cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Train final model on all training data
estimator.fit(X_train, y_train)
print("\nFinal model trained successfully!")
Cross-Validation RMSE: 0.1286 (+/- 0.0095)

Final model trained successfully!
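Because the model is trained on log(SalePrice), the CV RMSE above is in log units, which makes it roughly a multiplicative (relative) error on the price scale. A quick back-of-the-envelope check using the RMSE from the run above:

```python
import numpy as np

# An error of 0.1286 in log-space corresponds to a multiplicative factor on price
factor = np.exp(0.1286)
print(f"typical over/under-prediction factor: {factor:.3f}")  # ~1.137, i.e. ~13.7%
```

So a typical prediction is off by about 13-14% of the true sale price, regardless of whether the house costs $50k or $500k; that scale-invariance is one reason for log-transforming the target.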

6. Generate Predictions#

Apply the trained model to generate predictions for the test set and create a submission file.

[7]:
# Prepare test data
id_test = X_test["Id"]
X_test = X_test.select(selected_features).with_columns(pl.all().cast(pl.Float64))

# Generate predictions (remember to reverse log transformation)
log_predictions = estimator.predict(X_test)
predictions = np.exp(log_predictions)  # Reverse log transformation

# Create submission file
submission = pl.DataFrame({
    "Id": id_test,
    "SalePrice": predictions
})
submission.write_csv("house_price_submission.csv")

print("Submission file created successfully!")
print(f"\nPredictions summary:")
print(f"Mean: ${predictions.mean():,.2f}")
print(f"Median: ${np.median(predictions):,.2f}")
print(f"Min: ${predictions.min():,.2f}")
print(f"Max: ${predictions.max():,.2f}")
Submission file created successfully!

Predictions summary:
Mean: $177,096.95
Median: $155,665.30
Min: $47,344.79
Max: $484,264.44

Summary#

This notebook demonstrates the power of the gators library for feature engineering in regression tasks:

Key Takeaways:#

  1. Comprehensive Imputation: Multiple strategies for handling missing values (constant, most frequent, group-based)

  2. Smart Type Handling: Automatic casting and conversion of data types for proper feature engineering

  3. Mathematical Features: Create ratio and statistical features from related columns

  4. Feature Stability: Use FSI (Feature Stability Index) for robust feature selection

  5. Flexible Encoding: Support for both one-hot and count encoding strategies

  6. Pipeline Integration: Seamlessly integrates with scikit-learn Pipeline and XGBoost

The gators library simplifies complex feature engineering workflows while keeping the code clear and reproducible; the GroupByImputer transformer in particular makes context-aware missing-value handling straightforward.