Computed columns for dataframes

May 8, 2013
By

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

Everyone loves to aggregate data. Everyone loves to create new columns based on other columns. Everyone hates to do the same thing twice. In my continuing work on multilevel view of loss reserving, I reached a point where I realized that I needed a robust mechanism to aggregate computed columns. SQL server and (I’m assuming) other DBMSs have such a construct. Further, it’s something I use routinely in an Excel pivot table. Replicating this in R, doesn’t appear to be straightforward.

Here’s an example, that will be familiar to anyone working in insurance:

myData = data.frame(State = c("NY","NY", "TX", "TX")
                    , Premium = c(100,200,150,75)
                    , Loss = c(80,175,80,80)
                    , ALAE = c(10, 20, 15, 5))

What I’d like to do is calculate loss and loss+ALAE ratios. The former is simply the losses divided by the premium. The latter is the sum of loss and ALAE divided by the premium. I’ll also calculate the ratio of ALAE to loss+ALAE.

The first thing I do is create something called a ComputedColumn. This is simply a list with two elements, 1) the column name and 2) an expression which defines the calculation. The expression will be evaluated later during construction of a new class which I’ve defined. The class is called a “Mesa”. For now, I’m using it to create computed columns and collapse along selected dimensions. In the future, I’d like to use it to generate nice looking TEX or other output for tables.

I’ll note that Ben Escoto did quite a bit of very good work on well-formatted dataframes in his package FAViR. If you haven’t seen it, take a look at it here: http://www.favir.net/start.

Mesa works by evaluating the expressions which are passed into it and constructing a new dataframe for the results. The expressions are kept in the Mesa object so that they may be reapplied whenever we want to alter the Mesa.

ComputedColumn = function(ColumnName, ColumnFormula)
{
  arguments <- as.list(match.call())
  ColumnFormula = arguments$ColumnFormula
  x = list(ColumnName = ColumnName, ColumnFormula = ColumnFormula)
  x
}

new_Mesa = function(df, ComputedColumns)
{
  columnVals = lapply(ComputedColumns, function(x) {
    y = eval(x$ColumnFormula, df)})
  
  newMesa = do.call("cbind", columnVals)
  colnames(newMesa) = sapply(ComputedColumns, "[[", "ColumnName")
  newMesa = cbind(df, newMesa)
  
  mesa = new("Mesa", MesaData = newMesa, MesaColumns = ComputedColumns)
  mesa
}

When we run the following code, we'll get a table which has the first for columns of myData, augmented with three new ones.

col1 = ComputedColumn("LossRatio", Loss / Premium)
col2 = ComputedColumn("ALAERatio", ALAE / (Loss+ALAE))
myCols = list(col1, col2, ComputedColumn("LossAndLAERatio", (Loss + ALAE) / Premium))

myMesa = new_Mesa(myData, myCols)
rstudio::viewData(myMesa@MesaData)
StatePremiumLossALAELossRatioALAERatioLossAndLAERatio
NY100801080%11%90%
NY2001752088%10%98%
TX150801553%16%63%
TX75805107%6%113%

Cool. But what if I'd like to know how New York and Texas are doing in total?

CollapseMesa = function(x, Along)
{
  computedCols = x@MesaColumns
  df = x@MesaData

  cols = colnames(df)
  computedColnames = sapply(x@MesaColumns, "[[", "ColumnName")
  cols = cols[!cols %in% computedColnames]
  cols = cols[!cols %in% Along]
  
  dfSum = aggregate(df[,cols], by=list(df[,Along]), sum)
  colnames(dfSum) = gsub("Group.1", Along, colnames(dfSum))
  newMesa = new_Mesa(dfSum, x@MesaColumns)
  newMesa
}
stateMesa = CollapseMesa(myMesa, "State")
rstudio::viewData(stateMesa@MesaData)
StatePremiumLossALAELossRatioALAERatioLossAndLAERatio

NY3002553085%11%95%
TX2251602071%11%80%

At present, the grouping mechanism is extremely primitive. It presumes that I’m only interested in collapsing along one dimension and that I only ever want to take the sum. I’ll get around to fixing that at some point in the near future, but this fits my needs for now.

Have a look here: https://github.com/PirateGrunt/Mesa


To leave a comment for the author, please follow the link and comment on his blog: PirateGrunt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.