Object Orientation in R – Notes from a novice

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Having posted some code to Git a few days ago and having been wholly dissatisfied with it, I began to do what I often do with code I don’t like: I started rewriting it bigger and weirder and more philosophically pure. Part of this search for Platonic code led me to explore object-oriented programming in R. To me, OOP is very similar to relational database theory; it was love at first sight. That doesn’t mean that I fully appreciate all of the academic nuances, or that I’m very good at it, or even that I agree with all of it, but I’m an unrepentant fan. My dabblings with R to date have been purely modular. I’ve now written my first object. Here’s how I went about it.

I revisited the Coursera lecture slides about creating an S4 object. That was enough for me to define an object and overload a generic method.

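library(lubridate)  # supplies the "Period" class used by the two interval slots

# Each slot is passed to representation() as a named argument: slot name = class
setClass("Triangle", 
         representation(TriangleData = "data.frame"
                        , TriangleName = "character"
                        , LossPeriodType = "character"
                        , LossPeriodInterval = "Period"
                        , DevelopmentInterval = "Period"))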

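# Overload the generic "show" so that typing the object's name prints a short summary
setMethod("show", "Triangle"
          , function(object){
            cat("This is a loss triangle\n")
            cat("Its name is", object@TriangleName, "\n")
            cat("Its columns are", colnames(object@TriangleData), "\n")
            # Only the first few rows of the underlying data are printed
            print(head(object@TriangleData))
          })
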

One fairly important “gotcha”: the call to setClass appears to require that the arguments are named. On more than one occasion when I failed to do so, I got one of R’s typically cryptic error messages, an example of which looks like:

Error in summary(chainLadder@LinearFit) :
error in evaluating the argument ‘object’ in selecting a method for function ‘summary’: Error: no slot of name “LinearFit” for this object of class “TriangleModel”
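
One plausible explanation for this kind of error, sketched under my own assumptions rather than taken from the MRMR code: an unnamed argument to representation() is treated as a class to extend, not as a slot, so the slot I thought I had defined never exists. The class names NamedSlot and UnnamedSlot below are made up purely for illustration.

```r
library(methods)

# Hypothetical toy classes, not part of MRMR
setClass("NamedSlot", representation(label = "character"))  # 'label' becomes a slot
setClass("UnnamedSlot", representation("character"))        # "character" becomes a superclass

# Compare which slots each class actually ended up with
print(slotNames("NamedSlot"))
print(slotNames("UnnamedSlot"))

# The unnamed version is simply a character vector with a class attribute
x <- new("UnnamedSlot", "hello")

# Asking for the slot that was never defined raises a "no slot of name" error
slotError <- tryCatch(x@label, error = function(e) conditionMessage(e))
print(slotError)
```

Checking slotNames() right after setClass() seems like a cheap way to confirm that the class has the slots you intended.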

The code above is sufficient to construct an object, but it’s not terribly useful or robust. Among other things, I’d like to ensure that the object will respond in some fashion to inappropriate inputs and do a bit more than just house several data elements and print a summary. So, let’s start to dig deeper.

Beyond the Coursera material, I consulted three papers, by Leisch, Genolini and Hankin.

The Leisch paper is a good introduction to OOP in S3. However, because (according to Roger Peng, anyway) S4 is a more “pure” form of OOP, I looked for other sources. The Genolini and Hankin papers are sufficient to give someone enough information to get going with creating an object. In addition, the code for the lubridate package is on GitHub and is a splendid example of clear, detailed S4 objects in action.

With that knowledge in hand, I returned to my Triangle object. To ensure that the object behaves the way I want it to, I wrote a constructor function which will build the object using some sensible inputs. That winds up being a fairly lengthy function, so I’ll not post it here. If you’re curious, here’s the Gist: https://gist.github.com/4622610. Basically, I need to allow the user to specify which columns contain the loss period and development period information. If some of the inputs have not been specified properly, I return an informative error. I also wrote a very rudimentary validation function to ensure that the type of loss period is something sensible. (Brief aside: all of my code is in English and I’m not all that happy about that. Does anyone have any good suggestions about how to write multilingual code?)
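
The real constructor lives in the Gist; what follows is only a minimal sketch of the same idea, under my own assumptions. The class name CheckedTriangle, the list of allowed loss period types and the exact error wording are all hypothetical and are not taken from MRMR.

```r
library(methods)

# Hypothetical cut-down class with just enough slots to show the pattern
setClass("CheckedTriangle",
         representation(TriangleData = "data.frame"
                        , TriangleName = "character"
                        , LossPeriodType = "character"))

# Rudimentary validity check: the loss period type must be a single known value.
# The allowed values are an assumption made for this sketch.
setValidity("CheckedTriangle", function(object) {
  allowed <- c("accident", "policy", "report")
  if (length(object@LossPeriodType) != 1 || !(object@LossPeriodType %in% allowed)) {
    return(paste("LossPeriodType must be one of:", paste(allowed, collapse = ", ")))
  }
  TRUE
})

# Constructor: check that the named columns exist before building the object
CheckedTriangle <- function(TriangleData, TriangleName, LossPeriodType,
                            LossPeriodColumn, DevelopmentColumn) {
  missingCols <- setdiff(c(LossPeriodColumn, DevelopmentColumn), colnames(TriangleData))
  if (length(missingCols) > 0) {
    stop("Column(s) not found in TriangleData: ", paste(missingCols, collapse = ", "))
  }
  # new() runs the validity function because slots are supplied
  new("CheckedTriangle", TriangleData = TriangleData
      , TriangleName = TriangleName
      , LossPeriodType = LossPeriodType)
}

toyData <- data.frame(LossPeriodStart = 1:3, DevelopmentLag = 1:3, Paid = c(10, 20, 30))
good <- CheckedTriangle(toyData, "demo", "accident", "LossPeriodStart", "DevelopmentLag")

# A nonsensical loss period type is rejected by the validity function
badType <- tryCatch(CheckedTriangle(toyData, "demo", "banana", "LossPeriodStart", "DevelopmentLag")
                    , error = function(e) conditionMessage(e))
print(badType)
```

Splitting the work this way keeps checks on the object itself (validity) separate from checks on how the caller described the data (the constructor).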

While I’m at it, I overload some generic functions, including one for plotting. This means that my default plot for a triangle will be something which looks sensible. Cool. I can also write custom behaviors, such as a “LatestDiagonal” function, which will return the most recent observation for a set of loss (or origin) periods. To write a custom method, you must first define a new generic function. This seems a bit odd to me, but whatever. I can imagine a way that it makes sense somewhere within R’s engine.
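
Roughly, the generic-then-method pattern looks like the sketch below. This is not the MRMR implementation: the class DiagTriangle, the column names LossPeriod and DevelopmentLag, and the "largest lag per loss period" rule are assumptions I have made to keep the example self-contained.

```r
library(methods)

# Hypothetical cut-down class holding only the data
setClass("DiagTriangle", representation(TriangleData = "data.frame"))

# Step 1: declare the generic so that R knows how to dispatch on it
setGeneric("LatestDiagonal", function(x, ...) standardGeneric("LatestDiagonal"))

# Step 2: attach a method for the class
setMethod("LatestDiagonal", "DiagTriangle", function(x, ...) {
  df <- x@TriangleData
  # For every row, find the largest development lag within its loss period
  latestLag <- ave(df$DevelopmentLag, df$LossPeriod, FUN = max)
  # Keep only the rows that sit on that latest lag
  df[df$DevelopmentLag == latestLag, , drop = FALSE]
})

toy <- new("DiagTriangle", TriangleData = data.frame(
  LossPeriod = c(2001, 2001, 2001, 2002, 2002, 2003)
  , DevelopmentLag = c(1, 2, 3, 1, 2, 1)
  , CumulativePaid = c(100, 150, 170, 110, 160, 120)))

latest <- LatestDiagonal(toy)
print(latest)
```

Existing generics such as show and plot only need the setMethod step, which is presumably why a brand new verb feels like more work.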

Finally, I set a method to assign a name to the triangle. This is a bit crazy and I’ll be the first to admit that there may be something here which I don’t get. Hankin writes these functions but doesn’t use them, relying instead on the “@” operator to access object properties directly. This makes me wonder what the point of writing an accessor function is, other than clean-looking code.
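
One possible answer, offered as a sketch rather than a verdict: a setter can re-run the validity check, whereas direct “@” assignment only checks the class of the new value. The class NamedTriangle and its one-string rule are hypothetical; only the setName(tri) = value calling style mirrors the demo script.

```r
library(methods)

# Hypothetical class with a single slot and a simple validity rule
setClass("NamedTriangle", representation(TriangleName = "character"))

setValidity("NamedTriangle", function(object) {
  if (length(object@TriangleName) != 1) "TriangleName must be a single string" else TRUE
})

# Replacement generic: this is what allows the syntax setName(obj) <- value
setGeneric("setName<-", function(object, value) standardGeneric("setName<-"))

setReplaceMethod("setName", "NamedTriangle", function(object, value) {
  object@TriangleName <- value
  validObject(object)  # the setter re-checks the whole object
  object
})

tri <- new("NamedTriangle", TriangleName = "first")
setName(tri) <- "second"

# Direct slot assignment accepts any character vector, even one that breaks the rule
sneaky <- tri
sneaky@TriangleName <- c("a", "b")

# The setter refuses the same value
setterResult <- tryCatch({
  setName(tri) <- c("a", "b")
  "accepted"
}, error = function(e) "rejected")
print(setterResult)
```

For a one-slot toy this is overkill, but it suggests the accessor earns its keep once the object carries rules that “@” knows nothing about.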

So what do I think of OOP in R? I tend to view OOP as having four key properties: Encapsulation, Inheritance, Polymorphism and Methods. I barely need to add that this is hardly a canonical list, merely one biased person’s view of what they like to see in an OO language.

OOP purists and other academics will have a different view about what’s important and how well it’s implemented in R. I’ve barely scratched the surface in my own development and look forward to bringing this technology to bear on problems where this is appropriate. Comments are more than welcome. I’m certain that I’ve gotten a few things horribly wrong.

All the code may be found in the MRMR project here: https://github.com/PirateGrunt/MRMR
Brief demo:

# Demo script

#=============================
# Source the necessary code
source("https://raw.github.com/PirateGrunt/MRMR/master/RegressionSupport.r")
source("https://raw.github.com/PirateGrunt/MRMR/master/NAIC.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/ReservingVisualization.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/Triangle.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/TriangleModel.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/TriangleProjection.R")

#=============================
# Get some data from the big NAIC database 
# and get a triangle we can project
df = GetNAICData("wkcomp_pos.csv")
bigCompany = as.character(df[which(df$CumulativePaid == max(df$CumulativePaid)),"GroupName"])

df.BigCo = subset(df, GroupName == bigCompany)

df.UpperTriangle = subset(df.BigCo, DevelopmentYear <=1997)
df.LowerTriangle = subset(df.BigCo, DevelopmentYear > 1997)

#=============================
# Construct the triangle and display 
# some basic properties
tri = Triangle(TriangleData = df.UpperTriangle
               , TriangleName = bigCompany
               , LossPeriodType = "accident"
               , LossPeriodInterval = years(1)
               , DevelopmentInterval = years(1)
               , LossPeriodColumn = "LossPeriodStart"
               , DevelopmentColumn = "DevelopmentLag")

tri@TriangleName
tri

is(tri, "Triangle")
is.Triangle(tri)

plt = ShowTriangle(tri@TriangleData, bigCompany)

plot(tri)
head(LatestDiagonal(tri))
length(LatestDiagonal(tri)[,1])

plt = ShowTriangle(tri@TriangleData, bigCompany, Cumulative=FALSE)
#Note the apparent calendar year impact in 1996. This is invisible in the cumulative display.

setName(tri) = "AnotherName"
tri@TriangleName
setName(tri) = bigCompany
tri@TriangleName
tri@TriangleName = "Another name"

#===========================
# Now let's fit a model

chainLadder = TriangleModel("CumulPaid"
                            , BaseTriangle = tri
                            , ResponseName = "CumulativePaid"
                            , PredictorName = "DirectEP"
                            , CategoryName = "DevelopmentLag"
                            , MinimumCategoryFrequency = 1
                            , delta = 0)
summary(chainLadder@LinearFit)

