House Price Prediction with Gators#
This notebook demonstrates how to use the gators library for feature engineering in a regression problem. We’ll predict house prices from the Kaggle House Prices dataset using various gators transformers.
Key Features Demonstrated:#
Data cleaning and type casting
Missing value imputation (simple and group-based)
Mathematical feature engineering (ratios, statistics)
Feature encoding (one-hot and count encoding)
Feature stability analysis for robust feature selection
Regression modeling with XGBoost
Dataset: Kaggle House Prices - Advanced Regression Techniques
1. Import Libraries#
Import the necessary libraries including gators transformers for feature engineering.
[1]:
import polars as pl
import numpy as np
from IPython.display import display
from gators.encoders import OneHotEncoder, CountEncoder
from gators.data_cleaning import DropColumns, CastColumns
from gators.feature_generation import MathFeatures, RatioFeatures
from gators.imputers import StringImputer, NumericImputer, GroupByImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import root_mean_squared_error, make_scorer
from xgboost import XGBRegressor
from gators.feature_selection.feature_stability_index import feature_stability_index
2. Load and Preprocess Data#
Load the house prices dataset and perform initial data cleaning:
Handle ‘NA’ string values as null
Apply log transformation to the target (SalePrice) to normalize distribution
Clean inconsistent category values
Fix problematic year values in GarageYrBlt
[2]:
# Load data with 'NA' strings interpreted as null values
X_train = pl.read_csv('../../../kaggle/house/train.csv', null_values=['NA'])
X_test = pl.read_csv('../../../kaggle/house/test.csv', null_values=['NA'])
# Apply log transformation to target variable for better distribution
X_train = X_train.with_columns(pl.col('SalePrice').log().alias('target'))
# Drop ID column and constant features
y_train = X_train["target"]
X_train = X_train.drop(['Id', 'Utilities', "target", "SalePrice"])
X_test = X_test.drop(['Utilities'])
def clean_data(X):
"""Clean inconsistent values and fix data quality issues"""
return X.with_columns(
# Fix Exterior2nd inconsistent naming
pl.when(pl.col('Exterior2nd') == 'Brk Cmn')
.then(pl.lit('BrkComm'))
.otherwise(pl.col('Exterior2nd'))
.alias('Exterior2nd'),
# Fix future GarageYrBlt values (replace with YearBuilt if > 2010)
pl.when(pl.col('GarageYrBlt') <= 2010)
.then(pl.col('GarageYrBlt'))
.otherwise(pl.col('YearBuilt'))
.alias('GarageYrBlt')
)
X_train = clean_data(X_train)
X_test = clean_data(X_test)
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
Training data shape: (1460, 78)
Test data shape: (1459, 79)
3. Build Feature Engineering Pipeline#
The gators Pipeline orchestrates multiple feature transformers:
Data Type Management:#
CastColumns: Convert boolean and numeric codes to strings for categorical encoding
Missing Value Imputation:#
StringImputer: Fill missing categorical values with constants or most frequent values
NumericImputer: Fill missing numeric values with constants (0 for counts/areas)
GroupByImputer: Fill missing values using group-based statistics (e.g., median by neighborhood)
Feature Engineering:#
RatioFeatures: Create ratio features (e.g., living area / lot area)
MathFeatures: Generate statistical features from related columns (sum, std, range, min, max)
Encoding:#
OneHotEncoder: Create binary indicator features for categories
CountEncoder: Replace categories with their frequency counts
[3]:
# Identify columns by data type for appropriate transformations
dtypes = dict(zip(X_train.columns, X_train.dtypes))
boolean_columns = [col for col, dtype in dtypes.items() if dtype == pl.Boolean]
int8_columns = [col for col, dtype in dtypes.items() if dtype == pl.UInt8]  # UInt8 columns, cast to Float64 below
columns_to_encode = [col for col, dtype in dtypes.items() if dtype == pl.String] + boolean_columns
# Numeric columns that should be treated as categories
num_columns_to_cast = ['MSSubClass', 'OverallCond', 'YrSold', 'MoSold']
# Build the pipeline
steps = [
# Cast data types for proper handling
('cast_boolean_to_string', CastColumns(
subset=boolean_columns,
dtype=pl.String,
inplace=True
)),
('cast_num_to_string', CastColumns(
subset=num_columns_to_cast,
dtype=pl.String,
inplace=True
)),
('cast_int_to_float', CastColumns(
subset=int8_columns,
dtype=pl.Float64,
inplace=True
)),
# Impute missing string values
('string_imputer_missing', StringImputer(
strategy='constant',
value='MISSING',
subset=['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'MasVnrType'],
inplace=True
)),
('string_imputer_const', StringImputer(
strategy='constant',
value='Typ',
subset=['Functional'],
inplace=True
)),
('string_imputer_most_freq', StringImputer(
strategy='most_frequent',
subset=['MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st', 'SaleType'] + num_columns_to_cast,
inplace=True
)),
# Impute missing numeric values
('numerical_imputer_const', NumericImputer(
strategy='constant',
value=0,
subset=['GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
'BsmtHalfBath', 'MasVnrArea'],
inplace=True
)),
# Group-based imputation: use neighborhood median for lot frontage
('group_imputer', GroupByImputer(
group_by_column='Neighborhood',
strategy='median',
subset=['LotFrontage'],
inplace=True
)),
# Create ratio features
('ratio_features', RatioFeatures(
numerator_columns=['GrLivArea'],
denominator_columns=['LotArea'],
new_column_names=['LivLotArea']
)),
# Generate statistical features from related area measurements
('stat', MathFeatures(
groups=[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF']],
operations=['sum', 'std', 'range', 'min', 'max']
)),
# Encode categorical variables
('onehot', OneHotEncoder(
subset=columns_to_encode,
drop_columns=False
)),
('count_encoder', CountEncoder(
subset=columns_to_encode,
drop_columns=True,
inplace=True
)),
]
pipe = Pipeline(steps)
X_train = pipe.fit_transform(X_train)
X_test = pipe.transform(X_test)
print(f"\nEngineered training features: {X_train.shape[1]}")
print(f"Engineered test features (incl. Id): {X_test.shape[1]}")
Engineered training features: 349
Engineered test features (incl. Id): 350
4. Feature Selection with Stability Index#
Use the feature_stability_index to identify features that receive consistent, non-zero importance across cross-validation folds. Selecting only these stable features helps guard against overfitting to the quirks of a single train/validation split.
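gators provides feature_stability_index out of the box. As a rough mental model (not necessarily the library's exact definition), a feature's stability can be taken as the fraction of CV folds in which a fitted model assigns it non-zero importance. A minimal sketch on synthetic data with a plain sklearn tree:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
nonzero_counts = np.zeros(X.shape[1])
importance_sum = np.zeros(X.shape[1])
for train_idx, _ in kf.split(X):
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    nonzero_counts += model.feature_importances_ > 0
    importance_sum += model.feature_importances_

fsi = nonzero_counts / kf.get_n_splits()        # fraction of folds using the feature
importance = importance_sum / kf.get_n_splits() # mean importance across folds
```

Features with fsi close to 1.0 are used in nearly every fold; features with low fsi contribute only sporadically and are candidates for removal.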
[ ]:
# Define conservative model parameters to prevent overfitting
conservative_params = {
'max_depth': 4,
'min_child_weight': 3,
'learning_rate': 0.05,
'n_estimators': 500,
'gamma': 0.1,
'reg_alpha': 0.5,
'reg_lambda': 1.0,
'subsample': 0.8,
'colsample_bytree': 0.8,
'objective': 'reg:squarederror',
'eval_metric': 'rmse',
'tree_method': 'hist',
'random_state': 42,
'n_jobs': -1
}
# Calculate feature stability index
from sklearn.model_selection import KFold

estimator = XGBRegressor(**conservative_params)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fsi_results = feature_stability_index(estimator, skf=kfold, X=X_train, y=y_train)
# Filter to stable features with non-zero importance
fsi_results = fsi_results.filter(
(pl.col("importance") != 0) & (pl.col("fsi") != 0)
)
fsi_results = fsi_results.sort(by="importance", descending=True)
# Select stable features
selected_features = fsi_results["feature"].to_list()
print(f"Number of selected stable features: {len(selected_features)}")
print(f"\nTop 10 features by importance:")
display(fsi_results.head(10))
Number of selected stable features: 151
Top 10 features by importance:
| feature | fsi | importance |
|---|---|---|
| str | f64 | f32 |
| "ExterQual" | 1.0 | 0.14994 |
| "OverallQual" | 1.0 | 0.133964 |
| "TotalBsmtSF_1stFlrSF_2ndFlrSF_… | 1.0 | 0.069442 |
| "GarageCars" | 1.0 | 0.02874 |
| "GrLivArea" | 1.0 | 0.024384 |
| "TotalBsmtSF" | 1.0 | 0.022596 |
| "CentralAir" | 1.0 | 0.021065 |
| "CentralAir__Y" | 1.0 | 0.020707 |
| "BsmtQual__Gd" | 0.8 | 0.017674 |
| "GarageQual" | 1.0 | 0.016767 |
5. Train Model and Evaluate#
Train an XGBoost regression model using only the stable features and evaluate performance with cross-validation.
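One point worth noting before evaluating: because the target is log(SalePrice), RMSE on this target is equivalent to RMSLE on the raw prices (up to the +1 shift in the strict log1p definition, which is negligible at house-price magnitudes). A quick numeric check with made-up prices:

```python
import numpy as np

prices_true = np.array([100_000.0, 200_000.0])
prices_pred = np.array([110_000.0, 190_000.0])

# RMSE computed on log-prices ...
rmse_on_logs = np.sqrt(np.mean((np.log(prices_pred) - np.log(prices_true)) ** 2))

# ... equals RMSLE computed on the raw prices, since
# log(pred) - log(true) == log(pred / true)
rmsle = np.sqrt(np.mean(np.log(prices_pred / prices_true) ** 2))
```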
[5]:
# Sanity check: the target must not appear among the engineered features
"target" in X_train.columns
[5]:
False
[6]:
# Prepare training data with selected features
X_train = X_train.select(selected_features).with_columns(pl.all().cast(pl.Float64))
# Reuse the conservative parameters for final training
final_params = conservative_params.copy()
# Create RMSE scorer (lower is better)
rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)
# Train and evaluate with cross-validation
estimator = XGBRegressor(**final_params)
cv_scores = cross_val_score(estimator, X_train, y_train, scoring=rmse_scorer, cv=5)
print(f"Cross-Validation RMSE: {-cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Train final model on all training data
estimator.fit(X_train, y_train)
print("\nFinal model trained successfully!")
Cross-Validation RMSE: 0.1286 (+/- 0.0095)
Final model trained successfully!
6. Generate Predictions#
Apply the trained model to generate predictions for the test set and create a submission file.
[7]:
# Prepare test data
id_test = X_test["Id"]
X_test = X_test.select(selected_features).with_columns(pl.all().cast(pl.Float64))
# Generate predictions (remember to reverse log transformation)
log_predictions = estimator.predict(X_test)
predictions = np.exp(log_predictions) # Reverse log transformation
# Create submission file
submission = pl.DataFrame({
"Id": id_test,
"SalePrice": predictions
})
submission.write_csv("house_price_submission.csv")
print("Submission file created successfully!")
print(f"\nPredictions summary:")
print(f"Mean: ${predictions.mean():,.2f}")
print(f"Median: ${np.median(predictions):,.2f}")
print(f"Min: ${predictions.min():,.2f}")
print(f"Max: ${predictions.max():,.2f}")
Submission file created successfully!
Predictions summary:
Mean: $177,096.95
Median: $155,665.30
Min: $47,344.79
Max: $484,264.44
Summary#
This notebook demonstrates the power of the gators library for feature engineering in regression tasks:
Key Takeaways:#
Comprehensive Imputation: Multiple strategies for handling missing values (constant, most frequent, group-based)
Smart Type Handling: Automatic casting and conversion of data types for proper feature engineering
Mathematical Features: Create ratio and statistical features from related columns
Feature Stability: Use FSI (Feature Stability Index) for robust feature selection
Flexible Encoding: Support for both one-hot and count encoding strategies
Pipeline Integration: Seamlessly integrates with scikit-learn Pipeline and XGBoost
The gators library simplifies complex feature-engineering workflows while keeping the code clear and reproducible, with the GroupByImputer transformer proving particularly useful for group-based handling of missing data.