# Taking Expectations to the Next Level

**jacobsimmering.com**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Higher Expectations

I came across this post on Thursday and found it to be quite interesting. Clearly rental prices vary according to where you live. That isn't too surprising. I started thinking a bit more about it and thought that Boston and the nearby communities would have to have some differences — the area near campus would probably be high bedroom places split among students and other areas would be more “adult” lodging with lower intended occupancy rates.

Some people might know what parts of Boston are the snobby parts and which are the low rent student areas, but I don't have a clue. This might be a wonderful use of EM like I posted about yesterday.

And I was lucky enough the the scripts and data were openly shared by the author Jeff Kaufman. This seems like the prefect time to load it up and see what we can do with it.

Lets first load the data and get a feel for it.

# data at # http://raw.github.com/jeffkaufman/apartment_prices/master/apts-2013-01-29.txt bostonRent <- read.table("~/bostonRents") # from the GitHub code we know the variable names, so lets make them human # readable names(bostonRent) <- c("rent", "bedrooms", "id", "long", "lat") summary(bostonRent[c(1, 2)])

## rent bedrooms ## Min. : 250 Min. :0.00 ## 1st Qu.: 1700 1st Qu.:1.00 ## Median : 2400 Median :2.00 ## Mean : 2697 Mean :2.31 ## 3rd Qu.: 3300 3rd Qu.:3.00 ## Max. :23000 Max. :9.00

So there are a few apartments with zero bedrooms. Those could either be studio apartments, apartments where the number of bedrooms was incorrectly entered or other properties. Since we don't know and they are relatively rare `{r} NROW(bostonRent$bedrooms[bostonRent[2] == 0])`

out of `{r} nrow(bostonRent)`

we will just delete them.

bostonRentNonZero <- bostonRent[bostonRent$bedrooms != 0, ] summary(bostonRentNonZero[c(1, 2)])

## rent bedrooms ## Min. : 550 Min. :1.00 ## 1st Qu.: 1850 1st Qu.:1.00 ## Median : 2500 Median :2.00 ## Mean : 2808 Mean :2.55 ## 3rd Qu.: 3456 3rd Qu.:3.00 ## Max. :23000 Max. :9.00

That looks better. Lets put together the simple model where rent is simply a function of the number of bedrooms:

# just to visually confirm that this is a sensible model library(ggplot2) ggplot(bostonRentNonZero, aes(x = bedrooms, y = rent)) + geom_point(position = "jitter")

# that looks promising, a few extreme values but mostly linear and well # behaved lets make it second order to account for potentially decreasing # marginal cost of bedrooms bedroomModel <- lm(rent ~ I(bedrooms^2) + bedrooms, data = bostonRentNonZero) summary(bedroomModel)

## ## Call: ## lm(formula = rent ~ I(bedrooms^2) + bedrooms, data = bostonRentNonZero) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2761 -631 -158 367 19967 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1463.72 75.21 19.46 <2e-16 *** ## I(bedrooms^2) 14.59 7.78 1.87 0.061 . ## bedrooms 479.48 52.11 9.20 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1130 on 3081 degrees of freedom ## Multiple R-squared: 0.33, Adjusted R-squared: 0.33 ## F-statistic: 760 on 2 and 3081 DF, p-value: <2e-16

Considering how simple the model is, it does relatively well with an adjusted R squared of `{r} signif(summary(bedroomModel)$adj, 3)`

. I mean, we aren't considering the number of bathrooms, on or off-street parking or the size of the unit. Glad to know that housing prices are still somewhat sensible.

Now what happens if we try and discover what are the different sections of Boston using two different sets of data to run EM on. Also, I don't know anything about Boston so lets set the max number of clusters to be 20 as that seems reasonable.

library(mclust)

## Package 'mclust' version 4.0

# for some reason mclust treats the matrix as univariate unless we build # it this way simple just considering location simpleEM <- matrix(NA, nrow = nrow(bostonRentNonZero), ncol = 2) simpleEM[, 1] <- bostonRentNonZero$long simpleEM[, 2] <- bostonRentNonZero$lat # complex considering location and the number of bedrooms complexEM <- matrix(NA, nrow = nrow(bostonRentNonZero), ncol = 3) complexEM[, 1] <- bostonRentNonZero$long complexEM[, 2] <- bostonRentNonZero$lat complexEM[, 3] <- bostonRentNonZero$bedrooms simpleModel <- Mclust(simpleEM, G = 1:20) summary(simpleModel)

## ---------------------------------------------------- ## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------- ## ## Mclust VEV (ellipsoidal, equal shape) model with 19 components: ## ## log.likelihood n df BIC ## 15303 3084 95 29842 ## ## Clustering table: ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## 110 83 88 181 239 130 521 112 229 79 336 107 108 81 373 60 60 47 ## 19 ## 140

complexModel <- Mclust(complexEM, G = 1:20) summary(complexModel)

## ---------------------------------------------------- ## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------- ## ## Mclust EEV (ellipsoidal, equal volume and shape) model with 12 components: ## ## log.likelihood n df BIC ## 35443 3084 86 70194 ## ## Clustering table: ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 737 725 541 41 430 60 261 21 152 79 32 5

bostonRentNonZero$simple <- as.factor(simpleModel$classification) bostonRentNonZero$complex <- as.factor(complexModel$classification)

We found somewhat fewer clusters using the model that considered both location and number of bedrooms (`{r} complexModel$G`

vs `{r} simpleModel$G`

). Lets see what these clusters look like in space:

ggplot(bostonRentNonZero) + geom_point(aes(x = long, y = lat, color = simple))

ggplot(bostonRentNonZero) + geom_point(aes(x = long, y = lat, color = complex))

I can plausibly buy that those map to distinct areas of Boston. I am less sure about the value of including bedrooms in the EM model but we will check it out anyway.

Lets see up the EM-improved regression models:

simpleLm <- lm(rent ~ I(bedrooms^2) + bedrooms + simple, data = bostonRentNonZero) complexLm <- lm(rent ~ I(bedrooms^2) + bedrooms + complex, data = bostonRentNonZero) summary(simpleLm)

## ## Call: ## lm(formula = rent ~ I(bedrooms^2) + bedrooms + simple, data = bostonRentNonZero) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3399 -498 -123 333 19007 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 685.43 113.41 6.04 1.7e-09 *** ## I(bedrooms^2) 8.40 6.88 1.22 0.22258 ## bedrooms 613.00 46.60 13.15 < 2e-16 *** ## simple2 948.74 143.00 6.63 3.8e-11 *** ## simple3 39.29 140.41 0.28 0.77964 ## simple4 -339.03 118.85 -2.85 0.00436 ** ## simple5 470.97 115.80 4.07 4.9e-05 *** ## simple6 1820.51 127.32 14.30 < 2e-16 *** ## simple7 26.43 105.18 0.25 0.80160 ## simple8 51.02 132.03 0.39 0.69917 ## simple9 378.53 114.08 3.32 0.00092 *** ## simple10 -279.98 144.73 -1.93 0.05314 . ## simple11 367.45 107.82 3.41 0.00066 *** ## simple12 -63.40 133.27 -0.48 0.63433 ## simple13 724.24 133.23 5.44 5.9e-08 *** ## simple14 534.48 144.63 3.70 0.00022 *** ## simple15 1393.39 106.51 13.08 < 2e-16 *** ## simple16 899.21 157.51 5.71 1.2e-08 *** ## simple17 1351.83 157.49 8.58 < 2e-16 *** ## simple18 343.93 171.01 2.01 0.04440 * ## simple19 1060.51 125.08 8.48 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 981 on 3063 degrees of freedom ## Multiple R-squared: 0.499, Adjusted R-squared: 0.496 ## F-statistic: 153 on 20 and 3063 DF, p-value: <2e-16

summary(complexLm)

## ## Call: ## lm(formula = rent ~ I(bedrooms^2) + bedrooms + complex, data = bostonRentNonZero) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2546 -618 -169 357 19804 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 374 1200 0.31 0.756 ## I(bedrooms^2) -121 138 -0.88 0.382 ## bedrooms 1740 1337 1.30 0.193 ## complex2 -825 925 -0.89 0.372 ## complex3 -1308 1572 -0.83 0.405 ## complex4 -552 179 -3.09 0.002 ** ## complex5 -1800 1944 -0.93 0.355 ## complex6 -2463 1577 -1.56 0.118 ## complex7 -1119 926 -1.21 0.227 ## complex8 -3054 1958 -1.56 0.119 ## complex9 -1728 2043 -0.85 0.398 ## complex10 -1578 1873 -0.84 0.400 ## complex11 -1039 1446 -0.72 0.473 ## complex12 NA NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1110 on 3071 degrees of freedom ## Multiple R-squared: 0.354, Adjusted R-squared: 0.351 ## F-statistic: 140 on 12 and 3071 DF, p-value: <2e-16

I don't particularly like the effect the clustering method that includes bedrooms. I feel like it is too co-linear with the bedrooms variable. In fact, the variance inflation factor suggests that this is true.

In addition to the technical issues, including the bedroom number seems to move the clusters too far from geographic space. So I think we can safely conclude that adding bedrooms to the cluster definition doesn't help and might hurt our performance.

However, the model that clustered simply on geographic space did wonderfully. We increased our adjusted R squared to `{r} summary(simpleLm)$adj`

, an increase of `{r} summary(simpleLm)$adj - summary(bedroomModel)$adj`

. That is pretty impressive. Lets test the value of the addition more formally with an ANOVA:

anova(simpleLm, bedroomModel)

## Analysis of Variance Table ## ## Model 1: rent ~ I(bedrooms^2) + bedrooms + simple ## Model 2: rent ~ I(bedrooms^2) + bedrooms ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 3063 2.95e+09 ## 2 3081 3.95e+09 -18 -9.96e+08 57.5 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I think it is pretty safe to say that the addition of the geographically defined clusters has use in improving our housing price model.

I feel like this is a pretty interesting use of EM in real world data. I know nothing about Boston housing pricing but using EM to automagically find geographic clusters and the number of bedrooms, I was able to build a fairly simple linear model to explain roughly 50% of the variance in rent prices. The expected pattern of prices holds up in the model — there are places you would pay more (as much as 1,800 dollars more) to live and other places where prices are very low. We also learned that Boston is a bit pricey, your zero bedroom apartment will run you 685 a month.

**leave a comment**for the author, please follow the link and comment on their blog:

**jacobsimmering.com**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.