Computed columns for dataframes

Posted on May 8, 2013 by PirateGrunt in R bloggers | 0 Comments

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Everyone loves to aggregate data. Everyone loves to create new columns based on other columns. Everyone hates to do the same thing twice. In my continuing work on multilevel view of loss reserving, I reached a point where I realized that I needed a robust mechanism to aggregate computed columns. SQL server and (I’m assuming) other DBMSs have such a construct. Further, it’s something I use routinely in an Excel pivot table. Replicating this in R, doesn’t appear to be straightforward.

Here’s an example, that will be familiar to anyone working in insurance:

myData = data.frame(State = c("NY","NY", "TX", "TX")
                    , Premium = c(100,200,150,75)
                    , Loss = c(80,175,80,80)
                    , ALAE = c(10, 20, 15, 5))

What I’d like to do is calculate loss and loss+ALAE ratios. The former is simply the losses divided by the premium. The latter is the sum of loss and ALAE divided by the premium. I’ll also calculate the ratio of ALAE to loss+ALAE.

The first thing I do is create something called a ComputedColumn. This is simply a list with two elements, 1) the column name and 2) an expression which defines the calculation. The expression will be evaluated later during construction of a new class which I’ve defined. The class is called a “Mesa”. For now, I’m using it to create computed columns and collapse along selected dimensions. In the future, I’d like to use it to generate nice looking TEX or other output for tables.

I’ll note that Ben Escoto did quite a bit of very good work on well-formatted dataframes in his package FAViR. If you haven’t seen it, take a look at it here: http://www.favir.net/start.

Mesa works by evaluating the expressions which are passed into it and constructing a new dataframe for the results. The expressions are kept in the Mesa object so that they may be reapplied whenever we want to alter the Mesa.

ComputedColumn = function(ColumnName, ColumnFormula)
{
  arguments <- as.list(match.call())
  ColumnFormula = arguments$ColumnFormula
  x = list(ColumnName = ColumnName, ColumnFormula = ColumnFormula)
  x
}

new_Mesa = function(df, ComputedColumns)
{
  columnVals = lapply(ComputedColumns, function(x) {
    y = eval(x$ColumnFormula, df)})
  
  newMesa = do.call("cbind", columnVals)
  colnames(newMesa) = sapply(ComputedColumns, "[[", "ColumnName")
  newMesa = cbind(df, newMesa)
  
  mesa = new("Mesa", MesaData = newMesa, MesaColumns = ComputedColumns)
  mesa
}

When we run the following code, we'll get a table which has the first for columns of myData, augmented with three new ones.

col1 = ComputedColumn("LossRatio", Loss / Premium)
col2 = ComputedColumn("ALAERatio", ALAE / (Loss+ALAE))
myCols = list(col1, col2, ComputedColumn("LossAndLAERatio", (Loss + ALAE) / Premium))

myMesa = new_Mesa(myData, myCols)
rstudio::viewData(myMesa@MesaData)

State	Premium	Loss	ALAE	LossRatio	ALAERatio	LossAndLAERatio
NY	100	80	10	80%	11%	90%
NY	200	175	20	88%	10%	98%
TX	150	80	15	53%	16%	63%
TX	75	80	5	107%	6%	113%

Cool. But what if I'd like to know how New York and Texas are doing in total?

CollapseMesa = function(x, Along)
{
  computedCols = x@MesaColumns
  df = x@MesaData

  cols = colnames(df)
  computedColnames = sapply(x@MesaColumns, "[[", "ColumnName")
  cols = cols[!cols %in% computedColnames]
  cols = cols[!cols %in% Along]
  
  dfSum = aggregate(df[,cols], by=list(df[,Along]), sum)
  colnames(dfSum) = gsub("Group.1", Along, colnames(dfSum))
  newMesa = new_Mesa(dfSum, x@MesaColumns)
  newMesa
}
stateMesa = CollapseMesa(myMesa, "State")
rstudio::viewData(stateMesa@MesaData)

State	Premium	Loss	ALAE	LossRatio	ALAERatio	LossAndLAERatio
NY	300	255	30	85%	11%	95%
TX	225	160	20	71%	11%	80%

At present, the grouping mechanism is extremely primitive. It presumes that I’m only interested in collapsing along one dimension and that I only ever want to take the sum. I’ll get around to fixing that at some point in the near future, but this fits my needs for now.

Have a look here: https://github.com/PirateGrunt/Mesa

To leave a comment for the author, please follow the link and comment on their blog: PirateGrunt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Computed columns for dataframes

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)