IBM developerWorks has several new articles on data mining with WEKA by Michael Abernethy. I decided to implement the example from the first article in the series using R. I realize that I could have called WEKA from R (using the RWeka package) to emulate the process in the article exactly, but I wanted a better understanding of multiple linear regression in R itself.
So I created a file called data.txt with the data as it appears in the article:
Next, I read the file into R and assigned some names to the columns.
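A minimal sketch of that step, assuming the file is comma-separated with no header row. The three rows below are made up for illustration and are not the article's data; the column names are the ones used later in this post, and the data frame is called `houses` here only to avoid masking R's built-in `data()` function.

```r
# Hypothetical rows, invented for this sketch (not the article's data).
rows <- c(
  "3000,9000,5,0,0,200000",
  "3200,10000,5,1,1,225000",
  "2400,14000,4,1,0,190000"
)

# Write them to a temporary file standing in for data.txt.
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# Read the file without a header and name the columns in one call.
houses <- read.csv(
  path,
  header = FALSE,
  col.names = c("houseSize", "lotSize", "bedrooms",
                "updateGranite", "upgradeBathroom", "sellingPrice")
)

str(houses)
```

Passing `col.names` to `read.csv` avoids a separate `names(houses) <- ...` assignment, although either approach works.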
I checked my results with the summary statistics that appear in the article.
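One way to get the same four statistics per column is sketched below. The data frame is a small made-up stand-in, not the article's data. Note that R's `sd` uses the sample (n - 1) denominator.

```r
# Hypothetical stand-in data, invented for this sketch.
houses <- data.frame(
  houseSize    = c(3000, 3200, 2400, 2200),
  sellingPrice = c(200000, 225000, 190000, 195000)
)

# Compute min, max, mean and standard deviation for every column.
stats <- sapply(houses, function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})

# Transpose so that each row is a column of the data set.
print(t(stats))
```

`summary(houses)` gives the min, max and mean as well, but it reports quartiles rather than the standard deviation.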
The min, max, mean and standard deviation all match up. I then moved on to the section of the article that discusses creating the regression model. I figured I would start by fitting a model on all of the available columns.
res.lm <- lm(sellingPrice ~ ., data = data)
The coefficients did not match the model in the article. In addition, the updateGranite column is not statistically significant according to the article, yet it stayed in the R model: lm fits every term named in the formula and does not drop predictors on its own. I eventually tried different variations of the model and the available commands, and found that the following produced the desired results (meaning they matched the IBM article and Weka’s output).
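To see which terms a full model keeps and how significant each one is, the coefficient table from `summary` can be inspected directly. The sketch below uses simulated data, not the article's data: `x1` and `x2` drive the response and `noise` is unrelated to it, yet `lm` still reports a coefficient for all three.

```r
# Simulated data for illustration: y depends on x1 and x2 but not on noise.
set.seed(1)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n, sd = 0.5)

# Fit on every available column, as in the first attempt above.
fit <- lm(y ~ ., data = d)

# One row per term: estimate, standard error, t value and p-value.
coefs <- summary(fit)$coefficients
print(round(coefs, 4))
```

The `Pr(>|t|)` column holds the p-values. Deciding what to do with a weak predictor is left to the analyst or to a separate selection step.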
res.lm <- step(lm(sellingPrice ~ houseSize + lotSize + bedrooms + upgradeBathroom, data = data))
I manually removed updateGranite from the model and then applied the step command. step performs stepwise model selection, adding or dropping terms to minimize the Akaike information criterion (AIC). Maybe this kind of selection is something that statisticians routinely include or exclude when doing this sort of work, but this was not obvious in the statistics texts and examples I consulted.
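As a rough illustration of what `step` does, the sketch below runs backward selection on a full model fitted to simulated data (again, not the article's data) and compares the AIC before and after. Whether this kind of selection reproduces Weka's choice on the real data set is not guaranteed. As far as I know, Weka's LinearRegression uses its own M5-based attribute selection by default, which is a different criterion.

```r
# Simulated data for illustration: y depends on x1 and x2 but not on noise.
set.seed(1)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n, sd = 0.5)

# Start from the model with every predictor.
full <- lm(y ~ ., data = d)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

# Show which terms survived and compare the two fits.
print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Leaving `trace` at its default prints the AIC of each candidate model at every step, which makes it easier to see why a term was dropped or kept.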
There are a number of other R packages that might provide results that match up with those presented in the article. If anyone has insights into the differences between how R and Weka are used for this type of task, please add a comment.