Getting started with the Heritage Health Price competition

April 8, 2011
By

(This article was first published on CYBAEA Data, and kindly contributed to R-bloggers)

The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform.

We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.

Prerequisites

Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.

Data preparation

Let’s load the data into R and do some basic housekeeping:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

##############################
#### DATA PREPARATION

##++++
## Members
members   <- read.csv(file = "HHP_release1/Members_Y1.csv",
                      colClasses = rep("factor", 3),
                      comment.char = "")
##----
##++++
## Claims
claims.Y1 <- read.csv(file = "HHP_release1/Claims_Y1.csv",
                      colClasses = c(
                          rep("factor", 7),
                          "integer",    # paydelay
                          "character",  # LengthOfStay
                          "character",  # dsfs
                          "factor",     # PrimaryConditionGroup
                          "character"   # CharlsonIndex
                          ),
                      comment.char = "")
## Utility function
make.numeric <- function (cv, FUN = mean) {
### make a character vector numeric by splitting on '-'
    sapply(strsplit(gsub("[^[:digit:]]+",
                         " ",
                         cv,
                         perl = TRUE),
                    " ",
                    fixed = TRUE),
           function (x) FUN(as.numeric(x)))
}
## Length of stay as days
{
    z <- make.numeric(claims.Y1$LengthOfStay)
    z.week <- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE)
    z[z.week] <- z[z.week] * 7          # Weeks are 7 days
    z[is.nan(z)] <- 0
    claims.Y1$LengthOfStay.days <- z
}
los.levels <- c("", "1 day", sprintf("%d days", 2:6),
                "1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",
                "12-26 weeks", "26+ weeks")
stopifnot(all(claims.Y1$LengthOfStay %in% los.levels))
claims.Y1$LengthOfStay <- factor(claims.Y1$LengthOfStay,
                                 levels = los.levels,
                                 labels = c("0 days", los.levels[-1]),
                                 ordered = TRUE)
## Months since first claim
claims.Y1$dsfs.months <- make.numeric(claims.Y1$dsfs)
## dsfs is an ordered factor and gives the ordering of the claims
dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))
claims.Y1$dsfs <- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE)
## Index as numeric
claims.Y1$CharlsonIndex.numeric <- make.numeric(claims.Y1$CharlsonIndex)
claims.Y1$CharlsonIndex <- factor(claims.Y1$CharlsonIndex, ordered = TRUE)
##----
##++++
## Days in hospital
dih.Y2    <- read.csv(file = "HHP_release1/DayInHospital_Y2.csv",
                      colClasses = c("factor", "integer"),
                      comment.char = "")
names(dih.Y2)[1] <- "MemberID"          # Fix broken file
##----
save(members, claims.Y1, dih.Y2,
     file = "HHPR1.RData")
##############################

Scoring

We will need a function to score our predictions p against the actual values a. The formula is on the evaluation page and we implement it as:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

##############################
#### FUNCTION TO CALCULATE SCORE
HPPScore <- function (p, a) {
### Scorng function after
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))
}
##############################

The simplest benchmarks

The simplest models don’t really model at all: they just use the average and are simple benchmarks.

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

y <- dih.Y2$DaysInHospital_Y2           # Actual
p <- rep(mean(y), NROW(dih.Y2))
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))
# Score using mean  : 0.278725

p <- rep(median(y), NROW(dih.Y2))
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))
# Score using median: 0.267969

Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
library("reshape2")

vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days")
names(vars)[2] <- "LengthOfStay"
data <- merge(vars, dih.Y2)

model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)
p <- predict(model)
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using lm(LengthOfStay): 0.279062

model <- glm(DaysInHospital_Y2 ~ LengthOfStay,
             family = quasipoisson(),
             data = data)
p <- predict(model, type="response")
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using glm(LengthOfStay): 0.278914

Let the competition begin.

To leave a comment for the author, please follow the link and comment on their blog: CYBAEA Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)