Kaggle Challenge: House Prices

Thu, Jun 13, 2019 · 11-minute read

This blog post walks through my code for the Kaggle House Prices challenge. The aim is to get familiar with the most commonly used Python data science stack: data preprocessing and cleaning with pandas, and model building and evaluation with scikit-learn using different algorithms.

First, we start by importing the libraries needed for this project.

import pandas as pd
import numpy as np

Loading the data into dataframes.

test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")
test.shape, train.shape
test.columns, train.columns

We can see that the train dataset contains one additional variable compared to the test dataset, the “SalePrice” variable. In our analysis / prediction this serves as the dependent variable we want to predict given the houses’ characteristics.

train.SalePrice.describe()

1. Data preprocessing and cleaning

We’ll start the data cleaning by checking whether the dependent variable in the train dataset contains any missing values.

train.SalePrice.isnull().sum()

All observations contain data for the target variable, so we can continue by taking a look at all the other variables contained in the train and test datasets.

miss_count_train = train.isnull().sum().sort_values(ascending=False)
perc_miss_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missings_train = pd.concat([miss_count_train, perc_miss_train], axis=1, keys=["Total", "Percent"])

miss_count_test = test.isnull().sum().sort_values(ascending=False)
perc_miss_test = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missings_test = pd.concat([miss_count_test, perc_miss_test], axis=1, keys=["Total", "Percent"])
missings_train.head(10)
missings_test.head(10)

As a rule of thumb, we completely drop columns that contain more than 15% missing values rather than imputing them with any kind of computation, e.g. using means. Therefore, we will delete the variables “PoolQC”, “MiscFeature”, “Alley”, “Fence”, “FireplaceQu” and “LotFrontage”.

train = train.drop(columns=["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "LotFrontage"])
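Equivalently, these columns could have been selected programmatically from the missings_train table computed above; a minimal sketch:

# columns in which more than 15% of the values are missing
cols_over_15 = missings_train[missings_train["Percent"] > 0.15].index.tolist()
print(cols_over_15)

This should recover exactly the six columns dropped above.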

The variables “GarageCond”, “GarageType”, “GarageQual”, “GarageYrBlt” and “GarageFinish” contain exactly the same number of missing values, which seems kind of odd. Therefore, we’ll take a closer look at these variables.

for var in ["GarageCond", "GarageType", "GarageQual", "GarageYrBlt", "GarageFinish"]:
    print(pd.crosstab(index=train[var], columns="count"))

We can see that for “GarageCond” and “GarageQual” the most frequently occurring value is “TA”, meaning that the condition and quality of the garages are average/typical. We will therefore replace the missing values of these two variables with “TA” as well. The variable “GarageYrBlt” refers to the year in which the garage was built. Since we also have the year in which the houses themselves were built, we can drop this variable without losing much explanatory information. In addition, we also drop the “GarageFinish” and “GarageType” variables.
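Before dropping “GarageYrBlt”, a quick way to back up the claim that it adds little beyond “YearBuilt” is to look at the correlation between the two:

# correlation between the garage construction year and the house construction year
print(train[["GarageYrBlt", "YearBuilt"]].corr())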

train = train.drop(columns=["GarageYrBlt", "GarageFinish", "GarageType"])
train["GarageCond"] = train.GarageCond.fillna(value="TA")
train["GarageQual"] = train.GarageQual.fillna(value="TA")

In the same way as above, we take a closer look at the “Bsmt*” variables.

for var in ["BsmtFinType2", "BsmtExposure", "BsmtCond", "BsmtFinType1", "BsmtQual"]:
    print(pd.crosstab(index=train[var], columns="count"))

We delete the “BsmtFinType*” variables since these are highly subjective and do not add much information to our model. The missing values of “BsmtCond” will be imputed with the most common value “TA”. The rows containing missing values for “BsmtQual” and “BsmtExposure” will be deleted from the dataset.

train = train.drop(columns=["BsmtFinType1", "BsmtFinType2"])
train.BsmtCond = train["BsmtCond"].fillna(value="TA")
for var in ["BsmtQual", "BsmtExposure"]:
    train = train.drop(train.loc[train[var].isnull()].index)

The variable “Electrical” contains only one missing value, therefore we only delete this specific row of data. We proceed in the same way with “MasVnrType” and “MasVnrArea”.

for var in ["Electrical", "MasVnrType", "MasVnrArea"]:
    train = train.drop(train.loc[train[var].isnull()].index)

We run the missing-value check from above again to confirm that all missing values have been handled.

miss_count_train = train.isnull().sum().sort_values(ascending=False)
perc_miss_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missings_train = pd.concat([miss_count_train, perc_miss_train], axis=1, keys=["Total", "Percent"])
sum(missings_train.Percent > 0)
0

This looks good.

We have now handled all the missing data in the training set. As the next step we clean the test data. Since we want to evaluate our model on Kaggle after the modeling, we cannot drop any observations; we need a predicted house price for every row of the test data. Therefore, we will impute the missing values in the test dataset with the most frequent category for categorical features and the mean for numerical features.

test = test.drop(columns=["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "LotFrontage"])
test = test.drop(columns=["GarageYrBlt", "GarageFinish", "GarageType"])
test["GarageCond"] = test.GarageCond.fillna(value="TA")
test["GarageQual"] = test.GarageQual.fillna(value="TA")
test = test.drop(columns=["BsmtFinType1", "BsmtFinType2"])
test.BsmtCond = test["BsmtCond"].fillna(value="TA")

test.shape, train.shape
# categorical:
for var in ["BsmtExposure", "BsmtQual", "MasVnrType", "MSZoning", "Utilities", "Functional", "SaleType", "Exterior2nd", "Exterior1st", "KitchenQual"]:
    test[var] = test[var].fillna(value=test[var].value_counts().index[0])

# numerical
for var in ["MasVnrArea", "BsmtHalfBath", "BsmtFullBath", "BsmtUnfSF", "GarageArea", "GarageCars", "BsmtFinSF1", "BsmtFinSF2", "TotalBsmtSF"]:
    test[var] = test[var].fillna(test[var].mean())
miss_count_test = test.isnull().sum().sort_values(ascending=False)
perc_miss_test = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missings_test = pd.concat([miss_count_test, perc_miss_test], axis=1, keys=["Total", "Percent"])
sum(missings_test.Percent > 0)
0
test.shape, train.shape

Perfect. We no longer have any missing values in our test set and have kept all observations for which we are going to predict the house sale price.
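As a side note, the same mode/mean strategy can also be written more compactly with scikit-learn’s SimpleImputer. A minimal sketch, where the column lists are just examples (and which is not needed here, since the test set is already complete):

from sklearn.impute import SimpleImputer

# example column lists - any categorical / numerical columns with missing values could be used
cat_cols = ["BsmtExposure", "BsmtQual", "MasVnrType"]
num_cols_example = ["MasVnrArea", "GarageArea", "GarageCars"]

# most frequent category for categorical features, mean for numerical features
test[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(test[cat_cols])
test[num_cols_example] = SimpleImputer(strategy="mean").fit_transform(test[num_cols_example])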

For now, we will not go deeply into feature engineering. The only variable we create in addition to those already in the dataset is one containing the total area of the living space.

train["TotalSF"] = train["TotalBsmtSF"] + train["GrLivArea"]
test["TotalSF"] = test["TotalBsmtSF"] + test["GrLivArea"]

We will also take a look at the skewness and kurtosis of the numerical features to see whether they are (at least somewhat) close to normally distributed.

num_cols = train.dtypes != "object"
# SalePrice is not contained in the test set, so we drop it from the column mask for the test data
num_cols_test = num_cols.drop(labels="SalePrice")
skewness_train = train.loc[:, num_cols].skew().sort_values(ascending=False)
skewness_test = test.loc[:, num_cols_test].skew().sort_values(ascending=False)
skewness_train, skewness_test
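The kurtosis mentioned above can be inspected in the same way; pandas’ kurt() reports excess kurtosis, which is zero for a normal distribution:

kurtosis_train = train.loc[:, num_cols].kurt().sort_values(ascending=False)
kurtosis_test = test.loc[:, num_cols_test].kurt().sort_values(ascending=False)
kurtosis_train.head(10), kurtosis_test.head(10)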

We see that there are some variables with a very high skewness in the training data as well as in the test data. The skewness of a distribution is always compared to the skewness of the normal distribution, which is zero. To get our data to look more normal, we will log(feature + 1) transform it.
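To illustrate the effect on a single feature, we can compare the skewness of “GrLivArea” before and after the transformation (using “GrLivArea” purely as an example):

# skewness of the above-ground living area before and after log(x + 1)
print(train["GrLivArea"].skew(), np.log1p(train["GrLivArea"]).skew())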

# log1p transforming the training data
for var in num_cols[num_cols].index:
    train[var] = np.log1p(train[var])
# log1p transforming the test data (using the column mask without SalePrice)
for var in num_cols_test[num_cols_test].index:
    test[var] = np.log1p(test[var])

2. Model selection

We continue by building a machine learning pipeline using scikit-learn. A pipeline object sequentially applies a list of transformers and a final estimator.

We will play around with the k-nearest neighbors and random forest algorithms, tune their hyperparameters using cross-validation and pick the best performing one.

Before we can work with our data, we first need to separate the feature variables from the target variable. This only needs to be done for the training data: the test data does not contain SalePrice, since that is exactly what we aim to predict for it.

y = train.SalePrice.values
X = train.drop("SalePrice", axis=1)

To fit models and make predictions, the last step is to convert all categorical features into numeric ones so that scikit-learn can handle them. We do this using pandas’ get_dummies() function. To make sure we end up with the same columns in both the training and the test dataset, we first concatenate both, then apply get_dummies() and then separate them again.

X.shape, test.shape
# creating a dummy to distinguish between train and test data
X["train"] = 1
test["train"] = 0
# concatenating dataframes and creating dummies from categorical features
combined = pd.concat([X, test])
df = pd.get_dummies(combined, drop_first=True)


df.shape
X = df[df["train"]==1]
X = X.drop("train", axis=1)
test = df[df["train"]==0]
test = test.drop("train", axis=1)
X.shape, test.shape

Before training and tuning our models, we split the data into a training set and a hold-out test set for evaluation.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

2.1 k-nearest Neighbors (KNN)

We will start with a relatively simple algorithm: k-nearest neighbors, or KNN. For regression, KNN predicts a house’s price as the average of the target values of its k nearest neighbors in feature space.

from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

Looking at the data, we notice that our feature variables’ ranges vary substantially. Therefore, we will add a transformer to our pipeline which standardizes the data. Standardization centers each variable around zero with unit variance: the mean of each feature is subtracted and the result is divided by its standard deviation.
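To make this concrete, here is a quick illustrative check on a toy array showing that StandardScaler computes exactly (x - mean) / std:

import numpy as np
from sklearn import preprocessing

x = np.array([[1.0], [2.0], [3.0], [4.0]])
z_manual = (x - x.mean()) / x.std()                          # subtract the mean, divide by the standard deviation
z_scaler = preprocessing.StandardScaler().fit_transform(x)   # the same computation via the transformer
print(np.allclose(z_manual, z_scaler))                       # True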

After that, we instantiate our KNN estimator, create a list containing the steps applied by the pipeline, and then define the pipeline object.

# instantiate the scaling transformer
scaler = preprocessing.StandardScaler()
# instantiate the KNN estimator
knn = KNeighborsRegressor(n_neighbors=10)
# creating a list containing the steps the pipeline is to apply
steps_knn =  [("scaler", scaler), ("knn", knn)]
# define the pipeline object
pipeline_knn = Pipeline(steps_knn)

The main hyperparameter of the KNN algorithm that can and should be tuned is the number of neighbors to consider. We therefore define a dictionary containing the hyperparameters to tune and the values that should be tested.

neighbors = {"knn__n_neighbors":list(range(1,21))}

Next, we are going to set up our Cross Validation (CV) object using 5-fold CV and fit it to our data.

cv_knn = GridSearchCV(pipeline_knn, neighbors, cv=5, n_jobs=-1)
cv_knn.fit(X_train, y_train)
cv_knn.best_params_
print(cv_knn.refit)

From the above output we conclude that the default of the “refit” argument is True. This means that, by default, our CV object automatically refits the model on the entire training set using the best parameters found by CV, which is n_neighbors = 9 in our case. Therefore, we can now directly use the CV object to make predictions for unseen data.
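As a side note, the refitted pipeline and the cross-validation score of the best parameter setting can also be inspected directly on the fitted GridSearchCV object:

best_knn = cv_knn.best_estimator_   # the pipeline refit on the entire training set
print(cv_knn.best_score_)           # mean cross-validated score of the best parameters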

As in the Kaggle challenge, we will use the root mean squared error computed on the logarithm of the sale prices (log RMSE) to evaluate our model’s performance on the test set. To do this we first need to predict the sale prices for the unseen test data.

from sklearn.metrics import mean_squared_error
y_pred = cv_knn.predict(X_test)
rmse_knn = np.sqrt(mean_squared_error(np.log(y_test), np.log(y_pred)))
print("The KNN algorithm with " + str(cv_knn.best_params_) + " yields a (log) RMSE of: " + str(rmse_knn))

2.2 Random forest

After trying out the KNN algorithm, we now continue with the random forest algorithm. Using similar steps as before, we build a pipeline object that applies the transformations and the estimator to our dataset sequentially.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# instantiate the RandomForest Regressor
rf_reg = RandomForestRegressor(random_state=123)
# creating a list containing the steps the pipeline should apply
steps_rf = [("scaler", scaler), ("rf_reg", rf_reg)]
# create the pipeline object
pipeline_rf = Pipeline(steps_rf)

For random forests there is a large number of hyperparameters that can be tuned. In this project we are going to tune the number of trees in the forest [n_estimators], the number of features considered at every split [max_features], the maximum number of levels in a tree [max_depth], the minimum number of samples required to split a node [min_samples_split], the minimum number of observations required at each leaf node [min_samples_leaf], and whether bootstrap samples should be used when training each tree [bootstrap].

We first define the ranges for each hyperparameter that should be considered in our CV tuning and then create our search grid containing the before created lists.

n_estimators = list(np.arange(200, 2001, 200))
max_features = ["auto", "sqrt", "log2"]
max_depth = list(np.arange(10, 101, 10))
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4, 8]
bootstrap = [True, False]
param_dist = {"rf_reg__n_estimators": n_estimators,
              "rf_reg__max_features": max_features,
              "rf_reg__max_depth": max_depth,
              "rf_reg__min_samples_split": min_samples_split,
              "rf_reg__min_samples_leaf": min_samples_leaf,
              "rf_reg__bootstrap": bootstrap}

Testing all possible combinations of these hyperparameters would amount to 10*3*11*3*4*2 = 7,920 combinations, so instead of an exhaustive search we will use RandomizedSearchCV, which randomly samples a fixed number of combinations from the defined grid.

from sklearn.model_selection import RandomizedSearchCV
cv_random_rf = RandomizedSearchCV(pipeline_rf, param_dist, cv=3, n_iter=100, n_jobs=-1)

Just as before with the KNN algorithm, we can now fit the pipeline to our data.

cv_random_rf.fit(X_train, y_train)
cv_random_rf.best_params_

Based on the chosen parameters from RandomizedSearchCV we can now manually decrease the range of the hyperparameters to be tested and use GridSearchCV as before to find the best parameters for our model.

n_estimators_2 = list(np.arange(100, 301, 50))
max_depth_2 = list(np.arange(40, 81, 10))
max_features_2 = ["sqrt"]
min_samples_split_2 = [2, 3]
min_samples_leaf_2 = [1, 2, 3]
bootstrap_2 = [True, False]
param_grid = {"rf_reg__n_estimators": n_estimators_2,
              "rf_reg__max_features": max_features_2,
              "rf_reg__max_depth": max_depth_2,
              "rf_reg__min_samples_split": min_samples_split_2,
              "rf_reg__min_samples_leaf": min_samples_leaf_2,
              "rf_reg__bootstrap": bootstrap_2}
param_grid

We can now instantiate and then fit our GridSearchCV object as before.

cv_rf = GridSearchCV(pipeline_rf, param_grid, cv=3, n_jobs=-1)
cv_rf.fit(X_train, y_train)
cv_rf.best_params_
print(cv_rf.refit)

We see that the parameters chosen by RandomizedSearchCV were already pretty good and did not need much fine-tuning. Furthermore, GridSearchCV has again refit the random forest model on the whole training data using the parameters that were found to work best.

We can now create predictions for our hold-out test set and evaluate the model using the log RMSE as before.

from sklearn.metrics import mean_squared_error
y_pred_rf = cv_rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(np.log(y_test), np.log(y_pred_rf)))
print("The RandomForest algorithm with " + str(cv_rf.best_params_) + " yields a (log) RMSE of: " + str(rmse_rf))

Great! We see a clear improvement in our evaluation metric, i.e. the random forest regressor performs much better than the KNN model we tried first.

3. Making predictions and creating the submission file

Since we want to compare our model’s performance with other people’s models, we need to make predictions for the test dataset provided by Kaggle and upload our results. According to the challenge’s description, our submission should be a file containing each observation’s Id and the predicted SalePrice. We first create a dataframe and then save it as a .csv file.

ident = np.arange(1461, 2920)
print(ident)
test_pred = list(cv_rf.predict(test))
len(ident), len(test_pred)
submiss = {"Id": ident, "SalePrice": test_pred}
submission = pd.DataFrame(submiss)
print(submission.head())
print(submission.tail())

Looking at our submission file, we see that the predicted SalePrice values look odd. This is due to the log(x + 1) transformation we applied earlier: the model predicts on the log scale. To get the predictions back to the original scale, we apply the inverse transformation, exp(x) - 1 (np.expm1).
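As a quick check, expm1 exactly undoes the log1p transformation (using an arbitrary example price):

price = 208500.0
print(np.expm1(np.log1p(price)))  # recovers the original value (up to floating point precision)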

submission.SalePrice = np.expm1(submission.SalePrice)
submission.to_csv("submission_rf2.csv", index=False)

The submission file can then be uploaded to Kaggle to check how well the model performs on the unseen test dataset.