Object Orientation in R – Notes from a novice

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers.]

Having posted some code to Git a few days ago and having been wholly dissatisfied with it, I began to do what I often do with code I don’t like. I started re-writing it bigger and weirder and more philosophically pure. Part of this search for Platonic code led me to explore object oriented programming in R. To me, OOP is very similar to relational database theory; it was love at first sight. That doesn’t mean that I fully appreciate all of the academic nuances, or that I’m very good at it, or even that I agree with all of it, but I’m an unrepentant fan. My dabblings with R to date have been purely modular. I’ve now written my first object. Here’s how I went about it.

I revisited the Coursera lecture slides about creating an S4 object. That was enough for me to define an object and overload a generic method.

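library(lubridate)  # provides the "Period" class used by the two interval slots

# Define the class and its slots
setClass("Triangle"
         , representation(TriangleData = "data.frame"
                        , TriangleName = "character"
                        , LossPeriodType = "character"
                        , LossPeriodInterval = "Period"
                        , DevelopmentInterval = "Period"))

# Overload the generic "show" so that printing a Triangle gives a short summary
setMethod("show", "Triangle"
          , function(object){
            cat("This is a loss triangle\n")
            cat("Its name is", object@TriangleName, "\n")
            cat("Its columns are", colnames(object@TriangleData), "\n")
          })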


One fairly important “gotcha”: the call to the setClass function appears to require that the arguments are named. On more than one occasion when I failed to do so, I got one of R’s typically cryptic error messages, an example of which looks like:

Error in summary(chainLadder@LinearFit) :
error in evaluating the argument ‘object’ in selecting a method for function ‘summary’: Error: no slot of name “LinearFit” for this object of class “TriangleModel”
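
A minimal sketch of one way this kind of error can arise, assuming the missing names were in the call to representation: unnamed arguments to representation are treated as classes to extend rather than as slots, so the slot you expected never exists. The class names NamedSlots and UnnamedSlots and the slot name Values are made up for illustration and are not part of MRMR.

```r
library(methods)

# Named argument: "Values" becomes a slot of type numeric
setClass("NamedSlots", representation(Values = "numeric"))

# Unnamed argument: "numeric" is treated as a superclass, so no "Values" slot is created
setClass("UnnamedSlots", representation("numeric"))

named = new("NamedSlots", Values = c(1, 2, 3))
unnamed = new("UnnamedSlots", c(1, 2, 3))

print(slotNames(named))
print(slotNames(unnamed))

# Asking for the slot that was never defined raises an error;
# here the message is captured instead of stopping the script
result = tryCatch(unnamed@Values, error = function(e) conditionMessage(e))
print(result)
```

Checking slotNames right after setClass is a cheap way to confirm that the class has the slots you meant it to have.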

The code above is sufficient to construct an object, but it’s not terribly useful or robust. Among other things, I’d like to ensure that the object will respond in some fashion to inappropriate inputs and do a bit more than just house several data elements and print a summary. So, let’s start to dig deeper.

Beyond the Coursera material, I consulted three papers, one each by Leisch, Genolini and Hankin.

The Leisch paper is a good introduction to OOP in S3. However, because (according to Roger Peng, anyway) S4 is a more “pure” form of OOP, I looked for other sources. The Genolini and Hankin papers are sufficient to give someone enough information to get going with creating an object. In addition, the code for the lubridate package is on GitHub and is a splendid example of clear, detailed S4 objects in action.

With that knowledge in hand, I returned to my Triangle object. To ensure that the object behaves the way I want it to, I wrote a constructor function which will build the object using some sensible inputs. That winds up being a fairly lengthy function, so I’ll not post it here. If you’re curious, here’s the Gist: https://gist.github.com/4622610. Basically, I need to allow the user to specify which columns contain loss and development time period information. If some of the inputs have not been specified properly, I return an informative error. I wrote a very rudimentary validation function to ensure that the type of loss period is something sensible. (Brief aside: all of my code is in English and I’m not all that happy about that. Does anyone have any good suggestions about how to write multilingual code?)
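
The general shape of a constructor plus a validity check can be sketched as follows. This is not the code in the Gist: the class SimpleTriangle is a cut-down, hypothetical stand-in for Triangle, and the list of allowed loss period types is an assumption made for the example.

```r
library(methods)

# Hypothetical, simplified stand-in for the Triangle class
setClass("SimpleTriangle"
         , representation(TriangleData = "data.frame"
                          , TriangleName = "character"
                          , LossPeriodType = "character"))

# Rudimentary validation: the allowed types below are assumed for illustration
setValidity("SimpleTriangle", function(object){
  allowed = c("accident", "policy", "report")
  if (length(object@LossPeriodType) != 1 || !(object@LossPeriodType %in% allowed)) {
    return(paste("LossPeriodType must be one of:", paste(allowed, collapse = ", ")))
  }
  TRUE
})

# Constructor: check the user's column names before building the object
SimpleTriangle = function(TriangleData, TriangleName, LossPeriodType = "accident"
                          , LossPeriodColumn, DevelopmentColumn){
  missingCols = setdiff(c(LossPeriodColumn, DevelopmentColumn), colnames(TriangleData))
  if (length(missingCols) > 0) {
    stop("Column(s) not found in TriangleData: ", paste(missingCols, collapse = ", "))
  }
  # new() runs the validity function because slots are supplied
  new("SimpleTriangle"
      , TriangleData = TriangleData
      , TriangleName = TriangleName
      , LossPeriodType = LossPeriodType)
}

dfDemo = data.frame(LossPeriodStart = c(1995, 1995, 1996)
                    , DevelopmentLag = c(1, 2, 1)
                    , CumulativePaid = c(100, 150, 120))

triDemo = SimpleTriangle(dfDemo, "Demo", "accident", "LossPeriodStart", "DevelopmentLag")
```

Calling the constructor with a column name that is not in the data frame, or with a loss period type outside the allowed list, stops with an error rather than returning a half-built object.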

While I’m at it, I overload some generic functions, including one for plotting. This means that my default plot for a triangle will be something which looks sensible. Cool. I can also write custom behaviors, such as a “LatestDiagonal” function, which will return the most recent observation for a set of loss (or origin) periods. To write a custom method, you must first define a new generic function. This seems a bit odd to me, but whatever. I can imagine a way that it makes sense somewhere within R’s engine.
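
The generic-then-method pattern can be sketched as below. The class DiagTriangle and the column names LossPeriod and DevelopmentLag are assumptions for the example, and the body is only one plausible reading of “latest diagonal”; it is not the MRMR implementation.

```r
library(methods)

# Hypothetical minimal class holding only the triangle data
setClass("DiagTriangle", representation(TriangleData = "data.frame"))

# Step 1: declare the generic, so that R has something to dispatch on
setGeneric("LatestDiagonal", function(x, ...) standardGeneric("LatestDiagonal"))

# Step 2: supply the method for our class
setMethod("LatestDiagonal", "DiagTriangle", function(x, ...){
  df = x@TriangleData
  # For each loss period, find the largest development lag observed
  latestLag = ave(df$DevelopmentLag, df$LossPeriod, FUN = max)
  # Keep only the rows sitting on that latest lag
  df[df$DevelopmentLag == latestLag, , drop = FALSE]
})

dfDiag = data.frame(LossPeriod = c(1995, 1995, 1995, 1996, 1996, 1997)
                    , DevelopmentLag = c(1, 2, 3, 1, 2, 1)
                    , CumulativePaid = c(100, 150, 170, 110, 160, 120))

diagonal = LatestDiagonal(new("DiagTriangle", TriangleData = dfDiag))
print(diagonal)
```

The generic carries no behavior of its own; it only gives other classes a place to attach their own LatestDiagonal methods later.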

Finally, I set a method to assign a name to the triangle. This is a bit crazy and I’ll be the first to admit that there may be something here which I don’t get. Hankin writes these functions, but doesn’t use them. They simply use the “@” operator to access object properties directly. This makes me wonder what the point of writing an access function is, other than clean looking code.
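
One possible answer, sketched under assumptions: a replacement method can re-run validation on every assignment, which direct use of “@” does not do. The class NamedTriangle and its validity rule are invented for the example; only the name setName comes from my own code.

```r
library(methods)

# Hypothetical class with a single slot
setClass("NamedTriangle", representation(TriangleName = "character"))

# Assumed rule for illustration: the name must be one non-empty string
setValidity("NamedTriangle", function(object){
  if (length(object@TriangleName) != 1 || nchar(object@TriangleName) == 0) {
    return("TriangleName must be a single non-empty string")
  }
  TRUE
})

# Replacement generics are named with a trailing "<-"
setGeneric("setName<-", function(object, value) standardGeneric("setName<-"))

setReplaceMethod("setName", "NamedTriangle", function(object, value){
  object@TriangleName = value
  validObject(object)  # re-check the whole object after the change
  object
})

triNamed = new("NamedTriangle", TriangleName = "First")
setName(triNamed) = "Second"

# An empty name is rejected by the setter; the object keeps its old name
rejected = tryCatch({ setName(triNamed) = ""; FALSE }, error = function(e) TRUE)
```

So the access function buys a checkpoint, though nothing forces a user to go through it.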

So what do I think of OOP in R? I tend to view OOP as having four key properties: Encapsulation, Inheritance, Polymorphism and Methods. I barely need to add that this is hardly a canonical list, merely one biased person’s view of what they like to see in an OO language.

  • Encapsulation: This is the ability to hide the internals of an object and to control how an object’s properties are manipulated. R gets an F here. One may control property assignments, but that comes in one of a few ways: either through coding a setReplaceMethod function, a setX function or through a single setValidity function. It would be possible to code a validation function for a specific property, which other functions could call, but (unless I’ve missed something) that function won’t be private. A single setValidity function is inefficient, both in terms of run-time and in development effort. Moreover, there’s nothing stopping the user from modifying an object’s internal properties via a direct call to the “@” reference to an object’s data.
  • Inheritance: I’ll give R an incomplete as I’ve not yet had a need to construct an object hierarchy with inherited properties and behaviors, though I expect that I will. At present, the model object defaults to OLS. Obviously, I’d like to extend that to other structures. Watch this space.
  • Polymorphism: This gets an A. I get the feeling that this is why there’s any OO in R at all. You can barely get through a bit of documentation without reference to “generic” methods. OO allows a developer to overload standard functions like plot, summary, sum, etc. R’s support here is fairly straightforward and welcome. Provision of a default plot method for a triangle object is helpful to ensure that users get a useful output without much effort on their part.
  • Methods: I never know quite what to call this, but to me it’s the ability of an object to have behaviors. When I was first learning about OO, this seemed to be what separated an object from a mere structure. A structure is a composition of primitive data types, but an object can actually DO something. Here, R is a bit mixed, so I’ll give it a C. Again, I may be missing something, but there doesn’t seem to be any straightforward support for private methods which would allow an object to manage its own data. The requirement to declare a generic and then a specific method is bizarre, but I’m content to write it off as a minor sacrifice to the R gods.
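
The encapsulation complaint can be made concrete with a small sketch. The class GuardedTriangle and its rule are hypothetical; the point is only that a direct “@” assignment checks the type of the new value but does not run the validity function.

```r
library(methods)

# Hypothetical class with a validity rule on one slot
setClass("GuardedTriangle"
         , representation(TriangleName = "character"
                          , LossPeriodType = "character"))

setValidity("GuardedTriangle", function(object){
  if (!(object@LossPeriodType %in% c("accident", "policy"))) {
    return("LossPeriodType must be 'accident' or 'policy'")
  }
  TRUE
})

triGuarded = new("GuardedTriangle", TriangleName = "Demo", LossPeriodType = "accident")

# Direct slot assignment goes through without complaint,
# even though the new value breaks the validity rule
triGuarded@LossPeriodType = "banana"

# The problem only surfaces when validity is checked explicitly
stillValid = tryCatch(validObject(triGuarded), error = function(e) FALSE)
print(stillValid)
```

Any function that receives such an object therefore has to call validObject itself if it wants to trust the contents.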

OOP purists and other academics will have a different view about what’s important and how well it’s implemented in R. I’ve barely scratched the surface in my own development and look forward to bringing this technology to bear on problems where this is appropriate. Comments are more than welcome. I’m certain that I’ve gotten a few things horribly wrong.

All the code may be found in the MRMR project here: https://github.com/PirateGrunt/MRMR
Brief demo:

# Demo script

# Source the necessary code

# Get some data from the big NAIC database 
# and get a triangle we can project
df = GetNAICData("wkcomp_pos.csv")
bigCompany = as.character(df[which(df$CumulativePaid == max(df$CumulativePaid)),"GroupName"])

df.BigCo = subset(df, GroupName == bigCompany)

df.UpperTriangle = subset(df.BigCo, DevelopmentYear <=1997)
df.LowerTriangle = subset(df.BigCo, DevelopmentYear > 1997)

# Construct the triangle and display 
# some basic properties
tri = Triangle(TriangleData = df.UpperTriangle
               , TriangleName = bigCompany
               , LossPeriodType = "accident"
               , LossPeriodInterval = years(1)
               , DevelopmentInterval = years(1)
               , LossPeriodColumn = "LossPeriodStart"
               , DevelopmentColumn = "DevelopmentLag")


is(tri, "Triangle")

plt = ShowTriangle(tri@TriangleData, bigCompany)


plt = ShowTriangle(tri@TriangleData, bigCompany, Cumulative=FALSE)
# Note the apparent calendar year impact in 1996. This is invisible in the cumulative display.

setName(tri) = "AnotherName"
setName(tri) = bigCompany
tri@TriangleName = "Another name"

# Now let's fit a model

chainLadder = TriangleModel("CumulPaid"
                            , BaseTriangle = tri
                            , ResponseName = "CumulativePaid"
                            , PredictorName = "DirectEP"
                            , CategoryName = "DevelopmentLag"
                            , MinimumCategoryFrequency = 1
                            , delta = 0)
