# Getting started with the Heritage Health Price competition

April 8, 2011
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The US\$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform.

We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.

## Prerequisites

Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.

## Data preparation

Let’s load the data into R and do some basic housekeeping:

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP

##############################
#### DATA PREPARATION

##++++
## Members
colClasses = rep("factor", 3),
comment.char = "")
##----
##++++
## Claims
colClasses = c(
rep("factor", 7),
"integer",    # paydelay
"character",  # LengthOfStay
"character",  # dsfs
"factor",     # PrimaryConditionGroup
"character"   # CharlsonIndex
),
comment.char = "")
## Utility function
make.numeric <- function (cv, FUN = mean) {
### make a character vector numeric by splitting on '-'
sapply(strsplit(gsub("[^[:digit:]]+",
" ",
cv,
perl = TRUE),
" ",
fixed = TRUE),
function (x) FUN(as.numeric(x)))
}
## Length of stay as days
{
z <- make.numeric(claims.Y1\$LengthOfStay)
z.week <- grepl("week", claims.Y1\$LengthOfStay, fixed = TRUE)
z[z.week] <- z[z.week] * 7          # Weeks are 7 days
z[is.nan(z)] <- 0
claims.Y1\$LengthOfStay.days <- z
}
los.levels <- c("", "1 day", sprintf("%d days", 2:6),
"1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",
"12-26 weeks", "26+ weeks")
stopifnot(all(claims.Y1\$LengthOfStay %in% los.levels))
claims.Y1\$LengthOfStay <- factor(claims.Y1\$LengthOfStay,
levels = los.levels,
labels = c("0 days", los.levels[-1]),
ordered = TRUE)
## Months since first claim
claims.Y1\$dsfs.months <- make.numeric(claims.Y1\$dsfs)
## dsfs is an ordered factor and gives the ordering of the claims
dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))
claims.Y1\$dsfs <- factor(claims.Y1\$dsfs, levels = dsfs.levels, ordered = TRUE)
## Index as numeric
claims.Y1\$CharlsonIndex.numeric <- make.numeric(claims.Y1\$CharlsonIndex)
claims.Y1\$CharlsonIndex <- factor(claims.Y1\$CharlsonIndex, ordered = TRUE)
##----
##++++
## Days in hospital
colClasses = c("factor", "integer"),
comment.char = "")
names(dih.Y2) <- "MemberID"          # Fix broken file
##----
save(members, claims.Y1, dih.Y2,
file = "HHPR1.RData")
##############################
```

## Scoring

We will need a function to score our predictions `p` against the actual values `a`. The formula is on the evaluation page and we implement it as:

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP

##############################
#### FUNCTION TO CALCULATE SCORE
HPPScore <- function (p, a) {
### Scorng function after
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226
sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))
}
##############################
```

## The simplest benchmarks

The simplest models don’t really model at all: they just use the average and are simple benchmarks.

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP

y <- dih.Y2\$DaysInHospital_Y2           # Actual
p <- rep(mean(y), NROW(dih.Y2))
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))
# Score using mean  : 0.278725

p <- rep(median(y), NROW(dih.Y2))
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))
# Score using median: 0.267969
```

## Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
library("reshape2")

vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days")
names(vars) <- "LengthOfStay"
data <- merge(vars, dih.Y2)

model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)
p <- predict(model)
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using lm(LengthOfStay): 0.279062

model <- glm(DaysInHospital_Y2 ~ LengthOfStay,
family = quasipoisson(),
data = data)
p <- predict(model, type="response")
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using glm(LengthOfStay): 0.278914
```

Let the competition begin. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.