πŸ₯• Dortmund real estate market analysis: data preprocessing with caret

October 24, 2017
By

(This article was first published on Iegor Rudnytskyi, and kindly contributed to R-bloggers)

This is rather a short note, which is more related to an amazing package caret, than to our data set. The package allows for manipulating the model with less typing, for instance cross-validation or data preprocessing can be done by just specifying a couple of arguments in the key function of package train.

Perhaps, the median imputation and $k$-nearest neighbors algorithm I will live for the better times, since Dortmund real estate data set contains no missed values. Furthermore, caret makes it possible to apply various transformation of data, e.g. centering, scaling, principle component analysis (PCA), independent component analysis (ICA) etc. As one can remember, the model with the smallest out-of-sample RMSE is GAM with inverse Gaussian responses and log-link function. GAM assumes applying smooth functions to regressors, and thus, centering and scaling won’t improve our metric. On the other hand, I am not a big fan of centering and scaling, since both process make the sample dependent. In other words, subtracting the sample mean from each observation makes this observation dependent on the whole sample.

On the other hand, PCA might be very useful, since rooms and area are correlated. Can this improve the model? Let’s experiment and see. As usual, we start from loading packages and data.

packages <- c("mgcv", "magrittr", "vtreat", "caret")
sapply(packages, library, character.only = TRUE, logical.return = TRUE)
# options(scipen=999)
rm(list = ls())

setwd("/Users/irudnyts/Documents/data/")
property <- read.csv("dortmund.csv")

Below our GAM with IG outcome and log-link function is rewritten in caret syntax yielding the same RMSE:

model <- train(price ~ rooms + area, data = property,
               method = "gam", family = inverse.gaussian(link = "log"))
pred <- predict(model, property[, c("area", "rooms")])
(pred - property[, "price"]) ^ 2 %>% mean() %>% sqrt()
# [1] 134.9973

I don’t utilize the power of the function trainControl(), which can be used for cross-validation, in order to obtain consistency with previous posts. The model with PCA transformation is shown below:

model_pca <- train(price ~ rooms + area, data = property,
                   method = "gam", family = inverse.gaussian(link = "log"), 
                   preProcess = "pca")
pred <- predict(model_pca, property[, c("area", "rooms")])
(pred - property[, "price"]) ^ 2 %>% mean() %>% sqrt()
# [1] 146.4144

The in-sample RMSE is higher than our best model, which is not very encouraging. Even though it makes a little sense to go further, we calculate our-of-sample RMSE.

set.seed(3)
folds <- kWayCrossValidation(nRows = nrow(property), nSplits = 3)

pred <- rep(NA, nrow(property))
for(fold in folds) {
    model_pca <- train(price ~ rooms + area, data = property[fold$train, ],
                       method = "gam", family = inverse.gaussian(link = "log"), 
                       preProcess = "pca")
    
    pred[fold$app] <- predict(model_pca, property[fold$app, ])
}

sqrt(mean((pred - property$price) ^ 2))
# [1] 151.7058

I can summarize in one short sentence: PCA is not helpful for this case.

To leave a comment for the author, please follow the link and comment on their blog: Iegor Rudnytskyi.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)