The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform.
We do not have the full set of data yet, so this is a simple warmup session to predict the days in hospital in year 2 based on the year 1 data.
Prerequisites
Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.
Data preparation
Let’s load the data into R and do some basic housekeeping:
#!/usr/bin/Rscript ## example001.R  simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited  http://www.cybaea.net/ ############################## #### DATA PREPARATION ##++++ ## Members members < read.csv(file = "HHP_release1/Members_Y1.csv", colClasses = rep("factor", 3), comment.char = "") ## ##++++ ## Claims claims.Y1 < read.csv(file = "HHP_release1/Claims_Y1.csv", colClasses = c( rep("factor", 7), "integer", # paydelay "character", # LengthOfStay "character", # dsfs "factor", # PrimaryConditionGroup "character" # CharlsonIndex ), comment.char = "") ## Utility function make.numeric < function (cv, FUN = mean) { ### make a character vector numeric by splitting on '' sapply(strsplit(gsub("[^[:digit:]]+", " ", cv, perl = TRUE), " ", fixed = TRUE), function (x) FUN(as.numeric(x))) } ## Length of stay as days { z < make.numeric(claims.Y1$LengthOfStay) z.week < grepl("week", claims.Y1$LengthOfStay, fixed = TRUE) z[z.week] < z[z.week] * 7 # Weeks are 7 days z[is.nan(z)] < 0 claims.Y1$LengthOfStay.days < z } los.levels < c("", "1 day", sprintf("%d days", 2:6), "1 2 weeks", "2 4 weeks", "4 8 weeks", "812 weeks", "1226 weeks", "26+ weeks") stopifnot(all(claims.Y1$LengthOfStay %in% los.levels)) claims.Y1$LengthOfStay < factor(claims.Y1$LengthOfStay, levels = los.levels, labels = c("0 days", los.levels[1]), ordered = TRUE) ## Months since first claim claims.Y1$dsfs.months < make.numeric(claims.Y1$dsfs) ## dsfs is an ordered factor and gives the ordering of the claims dsfs.levels < c("0 1 month", sprintf("%d%2d months", 1:11, 2:12)) claims.Y1$dsfs < factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE) ## Index as numeric claims.Y1$CharlsonIndex.numeric < make.numeric(claims.Y1$CharlsonIndex) claims.Y1$CharlsonIndex < factor(claims.Y1$CharlsonIndex, ordered = TRUE) ## ##++++ ## Days in hospital dih.Y2 < read.csv(file = "HHP_release1/DayInHospital_Y2.csv", colClasses = c("factor", "integer"), comment.char = "") names(dih.Y2)[1] < "MemberID" # Fix broken file ## save(members, claims.Y1, dih.Y2, file = "HHPR1.RData") ##############################
Scoring
We will need a function to score our predictions p
against the actual values a
. The formula is on the evaluation page and we implement it as:
#!/usr/bin/Rscript ## example001.R  simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited  http://www.cybaea.net/ ############################## #### FUNCTION TO CALCULATE SCORE HPPScore < function (p, a) { ### Scorng function after ### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation ### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226 sqrt(mean((log(1+p, 10)  log(1+a, 10))^2)) } ##############################
The simplest benchmarks
The simplest models don’t really model at all: they just use the average and are simple benchmarks.
#!/usr/bin/Rscript ## example001.R  simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited  http://www.cybaea.net/ y < dih.Y2$DaysInHospital_Y2 # Actual p < rep(mean(y), NROW(dih.Y2)) cat(sprintf("Score using mean : %8.6f\n", HPPScore(p, y))) # Score using mean : 0.278725 p < rep(median(y), NROW(dih.Y2)) cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y))) # Score using median: 0.267969
Simple singlevariable linear models
OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:
#!/usr/bin/Rscript ## example001.R  simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited  http://www.cybaea.net/ library("reshape2") vars < dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days") names(vars)[2] < "LengthOfStay" data < merge(vars, dih.Y2) model < lm(DaysInHospital_Y2 ~ LengthOfStay, data = data) p < predict(model) cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y))) # Score using lm(LengthOfStay): 0.279062 model < glm(DaysInHospital_Y2 ~ LengthOfStay, family = quasipoisson(), data = data) p < predict(model, type="response") cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y))) # Score using glm(LengthOfStay): 0.278914
Let the competition begin.
Jump to comments.
You may also like these posts:

Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how.

Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE100 constituent companies and find that the relation still holds four years later and across a continent.

R code for Chapter 2 of NonLife Insurance Pricing with GLM
We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as NonLife Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…

Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …

R: Eliminating observed values with zero variance
I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...