Data Mining with WEKA example implemented in R

June 9, 2010

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

IBM developerWorks has several new articles on Data Mining with WEKA by Michael Abernethy. I decided to implement the example from the first article in the series using R. I realize that I could have driven WEKA from R (using the RWeka package) to emulate the process in the article exactly, but I wanted a better understanding of the underlying process, multiple linear regression, as it is done in R itself.

So I created a file called data.txt with the data as it appears in the article:

3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000

Next, I read the file into R and assigned some names to the columns.

data=read.csv(file='data.txt', header=FALSE)
names(data)=c('houseSize','lotSize','bedrooms',
'updateGranite','upgradeBathroom','sellingPrice')

I checked my results with the summary statistics that appear in the article.

summary(data$houseSize)
sd(data$houseSize)
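
To compare every column against the article's summary statistics in one pass, something like the following sketch could be used. It assumes the same seven rows as data.txt, inlined here through the text argument of read.csv so that the snippet runs on its own. The names rows, houses and stats are placeholders of mine, not part of the original workflow.

```r
# Same seven rows as data.txt, inlined so the snippet is self-contained.
rows <- c("3529,9191,6,0,0,205000",
          "3247,10061,5,1,1,224900",
          "4032,10150,5,0,1,197900",
          "2397,14156,4,1,0,189900",
          "2200,9600,4,0,1,195000",
          "3536,19994,6,1,1,325000",
          "2983,9365,5,0,1,230000")
houses <- read.csv(text = rows, header = FALSE,
                   col.names = c('houseSize', 'lotSize', 'bedrooms',
                                 'updateGranite', 'upgradeBathroom', 'sellingPrice'))

# One row per column, holding min, max, mean and standard deviation.
stats <- t(sapply(houses, function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
}))
print(stats)
```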

The min, max, mean and standard deviation all match up. I then moved on to the section of the article that discusses creating the regression model, and figured I would start with an analysis of all of the available columns.

res.lm = lm(sellingPrice ~ ., data= data)
summary(res.lm)
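
One thing worth checking at this point is how little data the full model has to work with: seven rows and six estimated coefficients (the intercept plus five predictors) leave a single residual degree of freedom, so the p-values in the summary rest on very thin evidence. A minimal sketch of that check, inlining the rows from data.txt and using placeholder names of my own (rows, houses, full):

```r
# Same seven rows as data.txt, inlined so the snippet is self-contained.
rows <- c("3529,9191,6,0,0,205000",
          "3247,10061,5,1,1,224900",
          "4032,10150,5,0,1,197900",
          "2397,14156,4,1,0,189900",
          "2200,9600,4,0,1,195000",
          "3536,19994,6,1,1,325000",
          "2983,9365,5,0,1,230000")
houses <- read.csv(text = rows, header = FALSE,
                   col.names = c('houseSize', 'lotSize', 'bedrooms',
                                 'updateGranite', 'upgradeBathroom', 'sellingPrice'))

# Full model: every remaining column is used as a predictor.
full <- lm(sellingPrice ~ ., data = houses)

# Coefficient table: estimate, standard error, t value and p-value per term.
print(coef(summary(full)))

# Residual degrees of freedom = observations minus estimated coefficients.
print(df.residual(full))
```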

The coefficients did not match the model in the article. In addition, the article treats the updateGranite column as not statistically significant and leaves it out, but lm keeps every predictor named in the formula, so R did not eliminate it. After trying different variations of the model and the available commands, I found that the following produced the desired results (meaning they matched up with the IBM article and Weka's functionality).

res.lm = step(lm(sellingPrice ~ houseSize + lotSize + bedrooms + upgradeBathroom, data= data))
summary(res.lm)

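I manually removed updateGranite from the model and wrapped the fit in the step command. According to its documentation, step chooses a model by AIC (the Akaike information criterion) in a stepwise search; because no scope is given here, the search only goes backward, that is, it can drop further terms but cannot add updateGranite back. Maybe this kind of model selection is something that statisticians routinely include or exclude when doing this sort of work, but this was not obvious from the statistics texts and examples I consulted.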
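
To see the criterion that step works with, the AIC of the full and the reduced model can be compared directly. The sketch below inlines the rows of data.txt and uses placeholder names of my own (rows, houses, full, reduced, aic_table, selected); trace = 0 only silences the step-by-step log. It makes no claim about what Weka does internally.

```r
# Same seven rows as data.txt, inlined so the snippet is self-contained.
rows <- c("3529,9191,6,0,0,205000",
          "3247,10061,5,1,1,224900",
          "4032,10150,5,0,1,197900",
          "2397,14156,4,1,0,189900",
          "2200,9600,4,0,1,195000",
          "3536,19994,6,1,1,325000",
          "2983,9365,5,0,1,230000")
houses <- read.csv(text = rows, header = FALSE,
                   col.names = c('houseSize', 'lotSize', 'bedrooms',
                                 'updateGranite', 'upgradeBathroom', 'sellingPrice'))

full    <- lm(sellingPrice ~ ., data = houses)
reduced <- lm(sellingPrice ~ houseSize + lotSize + bedrooms + upgradeBathroom,
              data = houses)

# AIC for both fits side by side (lower is better).
aic_table <- AIC(full, reduced)
print(aic_table)

# Backward selection by AIC, starting from the reduced model.
# step() reports AIC via extractAIC(), so its numbers can differ from
# AIC() by a constant for the same data.
selected <- step(reduced, trace = 0)
print(formula(selected))
```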

There are a number of other R packages that might provide results that match up with those presented in the article. If anyone has insights into the differences between how R and Weka are used for this type of task, please add a comment.
