# Getting started with the Heritage Health Price competition

April 8, 2011
By

(This article was first published on CYBAEA Data and Analysis, and kindly contributed to R-bloggers)

The US$3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform. We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data. ## Prerequisites Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file. ## Data preparation Let’s load the data into R and do some basic housekeeping: #!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ ############################## #### DATA PREPARATION ##++++ ## Members members <- read.csv(file = "HHP_release1/Members_Y1.csv", colClasses = rep("factor", 3), comment.char = "") ##---- ##++++ ## Claims claims.Y1 <- read.csv(file = "HHP_release1/Claims_Y1.csv", colClasses = c( rep("factor", 7), "integer", # paydelay "character", # LengthOfStay "character", # dsfs "factor", # PrimaryConditionGroup "character" # CharlsonIndex ), comment.char = "") ## Utility function make.numeric <- function (cv, FUN = mean) { ### make a character vector numeric by splitting on '-' sapply(strsplit(gsub("[^[:digit:]]+", " ", cv, perl = TRUE), " ", fixed = TRUE), function (x) FUN(as.numeric(x))) } ## Length of stay as days { z <- make.numeric(claims.Y1$LengthOfStay)
z.week <- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE) z[z.week] <- z[z.week] * 7 # Weeks are 7 days z[is.nan(z)] <- 0 claims.Y1$LengthOfStay.days <- z
}
los.levels <- c("", "1 day", sprintf("%d days", 2:6),
"1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",
"12-26 weeks", "26+ weeks")
stopifnot(all(claims.Y1$LengthOfStay %in% los.levels)) claims.Y1$LengthOfStay <- factor(claims.Y1$LengthOfStay, levels = los.levels, labels = c("0 days", los.levels[-1]), ordered = TRUE) ## Months since first claim claims.Y1$dsfs.months <- make.numeric(claims.Y1$dsfs) ## dsfs is an ordered factor and gives the ordering of the claims dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12)) claims.Y1$dsfs <- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE) ## Index as numeric claims.Y1$CharlsonIndex.numeric <- make.numeric(claims.Y1$CharlsonIndex) claims.Y1$CharlsonIndex <- factor(claims.Y1$CharlsonIndex, ordered = TRUE) ##---- ##++++ ## Days in hospital dih.Y2 <- read.csv(file = "HHP_release1/DayInHospital_Y2.csv", colClasses = c("factor", "integer"), comment.char = "") names(dih.Y2)[1] <- "MemberID" # Fix broken file ##---- save(members, claims.Y1, dih.Y2, file = "HHPR1.RData") ##############################  ## Scoring We will need a function to score our predictions p against the actual values a. The formula is on the evaluation page and we implement it as: #!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ ############################## #### FUNCTION TO CALCULATE SCORE HPPScore <- function (p, a) { ### Scorng function after ### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation ### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226 sqrt(mean((log(1+p, 10) - log(1+a, 10))^2)) } ##############################  ## The simplest benchmarks The simplest models don’t really model at all: they just use the average and are simple benchmarks. #!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ y <- dih.Y2$DaysInHospital_Y2           # Actual
p <- rep(mean(y), NROW(dih.Y2))
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))
# Score using mean  : 0.278725

p <- rep(median(y), NROW(dih.Y2))
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))
# Score using median: 0.267969


## Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
library("reshape2")

vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days")
names(vars)[2] <- "LengthOfStay"
data <- merge(vars, dih.Y2)

model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)
p <- predict(model)
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using lm(LengthOfStay): 0.279062

model <- glm(DaysInHospital_Y2 ~ LengthOfStay,
family = quasipoisson(),
data = data)
p <- predict(model, type="response")
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using glm(LengthOfStay): 0.278914


Let the competition begin.

# You may also like these posts:

1. Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how.

2. We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.

3. We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…

4. I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …

5. I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…