# Getting started with the Heritage Health Price competition

April 8, 2011
By

(This article was first published on CYBAEA Data and Analysis, and kindly contributed to R-bloggers)

The US\$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform.

We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.

## Prerequisites

Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.

## Data preparation

Let’s load the data into R and do some basic housekeeping:

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP

##############################
#### DATA PREPARATION

##++++
## Members
colClasses = rep("factor", 3),
comment.char = "")
##----
##++++
## Claims
colClasses = c(
rep("factor", 7),
"integer",    # paydelay
"character",  # LengthOfStay
"character",  # dsfs
"factor",     # PrimaryConditionGroup
"character"   # CharlsonIndex
),
comment.char = "")
## Utility function
make.numeric <- function (cv, FUN = mean) {
### make a character vector numeric by splitting on '-'
sapply(strsplit(gsub("[^[:digit:]]+",
" ",
cv,
perl = TRUE),
" ",
fixed = TRUE),
function (x) FUN(as.numeric(x)))
}
## Length of stay as days
{
z <- make.numeric(claims.Y1\$LengthOfStay)
z.week <- grepl("week", claims.Y1\$LengthOfStay, fixed = TRUE)
z[z.week] <- z[z.week] * 7          # Weeks are 7 days
z[is.nan(z)] <- 0
claims.Y1\$LengthOfStay.days <- z
}
los.levels <- c("", "1 day", sprintf("%d days", 2:6),
"1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",
"12-26 weeks", "26+ weeks")
stopifnot(all(claims.Y1\$LengthOfStay %in% los.levels))
claims.Y1\$LengthOfStay <- factor(claims.Y1\$LengthOfStay,
levels = los.levels,
labels = c("0 days", los.levels[-1]),
ordered = TRUE)
## Months since first claim
claims.Y1\$dsfs.months <- make.numeric(claims.Y1\$dsfs)
## dsfs is an ordered factor and gives the ordering of the claims
dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))
claims.Y1\$dsfs <- factor(claims.Y1\$dsfs, levels = dsfs.levels, ordered = TRUE)
## Index as numeric
claims.Y1\$CharlsonIndex.numeric <- make.numeric(claims.Y1\$CharlsonIndex)
claims.Y1\$CharlsonIndex <- factor(claims.Y1\$CharlsonIndex, ordered = TRUE)
##----
##++++
## Days in hospital
colClasses = c("factor", "integer"),
comment.char = "")
names(dih.Y2)[1] <- "MemberID"          # Fix broken file
##----
save(members, claims.Y1, dih.Y2,
file = "HHPR1.RData")
##############################
```

## Scoring

We will need a function to score our predictions `p` against the actual values `a`. The formula is on the evaluation page and we implement it as:

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP

##############################
#### FUNCTION TO CALCULATE SCORE
HPPScore <- function (p, a) {
### Scorng function after
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226
sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))
}
##############################
```

## The simplest benchmarks

The simplest models don’t really model at all: they just use the average and are simple benchmarks.

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP

y <- dih.Y2\$DaysInHospital_Y2           # Actual
p <- rep(mean(y), NROW(dih.Y2))
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))
# Score using mean  : 0.278725

p <- rep(median(y), NROW(dih.Y2))
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))
# Score using median: 0.267969
```

## Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

```#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
library("reshape2")

vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days")
names(vars)[2] <- "LengthOfStay"
data <- merge(vars, dih.Y2)

model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)
p <- predict(model)
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using lm(LengthOfStay): 0.279062

model <- glm(DaysInHospital_Y2 ~ LengthOfStay,
family = quasipoisson(),
data = data)
p <- predict(model, type="response")
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using glm(LengthOfStay): 0.278914
```

Let the competition begin.

# You may also like these posts:

1. Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how.

2. We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.

3. We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…

4. I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …

5. I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...