Getting started with the Heritage Health Price competition

April 8, 2011
By

(This article was first published on CYBAEA Data and Analysis, and kindly contributed to R-bloggers)

The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform.

We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.

Prerequisites

Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.

Data preparation

Let’s load the data into R and do some basic housekeeping:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

##############################
#### DATA PREPARATION

##++++
## Members
members   <- read.csv(file = "HHP_release1/Members_Y1.csv",
                      colClasses = rep("factor", 3),
                      comment.char = "")
##----
##++++
## Claims
claims.Y1 <- read.csv(file = "HHP_release1/Claims_Y1.csv",
                      colClasses = c(
                          rep("factor", 7),
                          "integer",    # paydelay
                          "character",  # LengthOfStay
                          "character",  # dsfs
                          "factor",     # PrimaryConditionGroup
                          "character"   # CharlsonIndex
                          ),
                      comment.char = "")
## Utility function
make.numeric <- function (cv, FUN = mean) {
### make a character vector numeric by splitting on '-'
    sapply(strsplit(gsub("[^[:digit:]]+",
                         " ",
                         cv,
                         perl = TRUE),
                    " ",
                    fixed = TRUE),
           function (x) FUN(as.numeric(x)))
}
## Length of stay as days
{
    z <- make.numeric(claims.Y1$LengthOfStay)
    z.week <- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE)
    z[z.week] <- z[z.week] * 7          # Weeks are 7 days
    z[is.nan(z)] <- 0
    claims.Y1$LengthOfStay.days <- z
}
los.levels <- c("", "1 day", sprintf("%d days", 2:6),
                "1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",
                "12-26 weeks", "26+ weeks")
stopifnot(all(claims.Y1$LengthOfStay %in% los.levels))
claims.Y1$LengthOfStay <- factor(claims.Y1$LengthOfStay,
                                 levels = los.levels,
                                 labels = c("0 days", los.levels[-1]),
                                 ordered = TRUE)
## Months since first claim
claims.Y1$dsfs.months <- make.numeric(claims.Y1$dsfs)
## dsfs is an ordered factor and gives the ordering of the claims
dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))
claims.Y1$dsfs <- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE)
## Index as numeric
claims.Y1$CharlsonIndex.numeric <- make.numeric(claims.Y1$CharlsonIndex)
claims.Y1$CharlsonIndex <- factor(claims.Y1$CharlsonIndex, ordered = TRUE)
##----
##++++
## Days in hospital
dih.Y2    <- read.csv(file = "HHP_release1/DayInHospital_Y2.csv",
                      colClasses = c("factor", "integer"),
                      comment.char = "")
names(dih.Y2)[1] <- "MemberID"          # Fix broken file
##----
save(members, claims.Y1, dih.Y2,
     file = "HHPR1.RData")
##############################

Scoring

We will need a function to score our predictions p against the actual values a. The formula is on the evaluation page and we implement it as:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

##############################
#### FUNCTION TO CALCULATE SCORE
HPPScore <- function (p, a) {
### Scorng function after
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))
}
##############################

The simplest benchmarks

The simplest models don’t really model at all: they just use the average and are simple benchmarks.

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

y <- dih.Y2$DaysInHospital_Y2           # Actual
p <- rep(mean(y), NROW(dih.Y2))
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))
# Score using mean  : 0.278725

p <- rep(median(y), NROW(dih.Y2))
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))
# Score using median: 0.267969

Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
library("reshape2")

vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days")
names(vars)[2] <- "LengthOfStay"
data <- merge(vars, dih.Y2)

model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)
p <- predict(model)
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using lm(LengthOfStay): 0.279062

model <- glm(DaysInHospital_Y2 ~ LengthOfStay,
             family = quasipoisson(),
             data = data)
p <- predict(model, type="response")
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using glm(LengthOfStay): 0.278914

Let the competition begin.

Jump to comments.

You may also like these posts:

  1. [0.45] Big data for R

    Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how.

  2. [0.44] Employee productivity as function of number of workers revisited

    We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.

  3. [0.40] R code for Chapter 2 of Non-Life Insurance Pricing with GLM

    We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…

  4. [0.40] Area Plots with Intensity Coloring

    I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …

  5. [0.38] R: Eliminating observed values with zero variance

    I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…

To leave a comment for the author, please follow the link and comment on his blog: CYBAEA Data and Analysis.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.