Thomas Lumley just dropped a blog post, Blank cheque inheritance and statistical objects, which begins as follows.
One of the problems with object-oriented programming for statistical methods is that inheritance is backwards. Everything is fine for data structures, and Bioconductor has many examples of broad (often abstract) base classes for biological information or annotation that are specialised by inheritance to more detailed and useful classes. Statistical methods go the other way.
In base R,
glmfor generalised linear models is a generalisation of
lmfor linear models, but the
glmclass inherits from the
This isn’t a problem with inheritance, it’s a problem with how R uses it.
Fundamentals of OOP and inheritance
The fundamental rule for inheritance in object oriented programming (OOP) is that a class X should inherit from class Y only if every X is a Y. That means you should be able to use an X wherever the code calls for an Y. For instance, a
Poodle class might inherit from the
Dog class, because every poodle is a dog. Any function you can apply to dogs, you can apply to poodles. Every function that is defined to return a poodle will also return a dog. Behavior as arguments and returns is governed by the concepts of covariance and contravariance in programming language theory. Inheritance must respect these relations for a coherent OOP design.
A classic blunder is to define a class for real numbers and one for complex numbers and have the complex numbers inherit from the real numbers. Every real number is a complex number, but not every complex number is a real number. So doing this will break standard OOP implementations. The reason beginners in OOP make this mistake is that it’s natural to think of the implementation of a complex number as taking a real number and adding an imaginary component. If you want to start with real numbers, a better way to define a complex number would be using composition to include two real components. That is, it contains two real numbers rather than inheriting from real numbers to get its real component. This is exactly how
std::complex is defined in C++, with a constructor that takes the real and complex components as two objects of type
T might be
double for double-precision complex numbers or it might be an autodiff type like
The God object anti-pattern
I’m also not fond of how
lm returns a god object. God objects are a widely cited anti-pattern in OOP, largely because they’re so frequently seen in the wild. The inclination that leads to this is to have something like a “blackboard” into which all kinds of state can be written so the user can get it all in one place. A common example is including all the input in a function’s output. No need to do that because the user already had all of that information because they couldn’t have called the function otherwise. God objects’ are usually a terrible idea as it’s nearly impossible to ensure consistency of such an object without defeating its purpose, because its purpose is to behave like a global variable repository. R doesn’t even try—you can take an
lm fit object and change various aspects of it and leave it in an inconsistent state without warning, e.g.,
> fit <- lm(dist ~ speed, data = cars) > fit$coefficients (Intercept) speed -17.579095 3.932409 > fit$coefficients = c(1, 2, 3, 4) > fit$coefficients  1 2 3 4
The Stan interfaces in Python and R also return god objects. I lost the design argument to the other developers, who argued, “That’s the way it’s done in R and Python.”
R’s argument chaining vs. OOP method chaining
Speaking of OOP, chaining with pipes in R follows the object-oriented pattern of method chaining, but instead of using object returns that are the class defining the next method in the chain, it just passes along the return to use as the first argument in the next chained function. It’s no longer object oriented. This doesn’t break any OO patterns, per se. It might be awkward if you need to pack enough into a return to go onto the next function. In OOP, developers often break long method chains into groups of coherent calls with named returns when the returns are not all instances of the same class. The reason to break up long chains is, ironically given how they’re motivated in R, to help with readability and self-documentation. Code readability is the single best thing you can do to make code maintainable, because code will be read much more often than it gets written. You can bridge the gap between what R does with chaining and the standard way to do method chaining in OOP by looking at how Python classes are defined with an explicit
self argument (like the
this pointer to the class instance in C++, but C++ doesn’t require it as an explicit argument on methods).
P.S. I tried commenting on Lumley’s blog but was defeated by Disqus. I thought it might be of general interest, so am including it here.