Check your return types when modeling in R

January 27, 2015
By

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Just a warning: double check your return types in R, especially when using different modeling packages.

We consider ourselves pretty familiar with R. We have years of experience, many other programming languages to compare R to, and we have taken Hadley Wickham’s Master R Developer Workshop (highly recommended). We already knew R’s predict function is pretty idiosyncratic (takes different arguments per model type, returns different types depending on model and arguments, which is why we wrapped it in our Bad Bayes article).

But here is unnecessarily nasty puzzle we ran into recently.

library('mgcv')
library('ROCR')

d <- data.frame(x=1:10,y=(1:10)>=5)

model <- lm(y~x,data=d)
d$predLM <- predict(model,type='response')
plot(performance(prediction(d$predLM,d$y),'tpr','fpr'))


model <- gam(y~x,family=binomial,data=d)
d$predGAM <- predict(model,type='response')
plot(performance(prediction(d$predGAM,d$y),'tpr','fpr'))
## Error in plot(performance(prediction(d$predGAM, d$y), "tpr", "fpr")) : 
##   error in evaluating the argument 'x' in selecting a method for 
##   function 'plot': Error in prediction(d$predGAM, d$y) : 
##   Format of predictions is invalid.

It is a silly example, but one really wonders why the plot of the lm model works and the exact same code fails to plot the gam model. Now (as with most runtime bugs brought on by overly dynamic languages) we ran into this problem while in the middle doing something else (while doing data analysis, not while coding). So we were not in the right frame of mind to deduce the solution without further experiment.

Now that we are calm we can try and look for the problem. The first step of effective debugging is to put aside what you had been working on and admit you are now debugging. So you write in your notebook what you had been trying to do, and temporarily clear that from your mind.

Professor Norman Matloff describes debugging as:

Finding your bug is a process of confirming the many things you believe are true, until you find one which is not true.

What do we believe? We believe that d$predLM and d$predGAM should both give us a plot. So in some sense we believe they have the same structure. The superficially look to have the same structure:

> print(d)
    x     y      predLM      predGAM
1   1 FALSE -0.05454545 2.220446e-16
2   2 FALSE  0.09090909 2.220446e-16
3   3 FALSE  0.23636364 2.220446e-16
4   4 FALSE  0.38181818 2.085853e-10
5   5  TRUE  0.52727273 1.000000e+00
6   6  TRUE  0.67272727 1.000000e+00
7   7  TRUE  0.81818182 1.000000e+00
8   8  TRUE  0.96363636 1.000000e+00
9   9  TRUE  1.10909091 1.000000e+00
10 10  TRUE  1.25454545 1.000000e+00

Let’s look closer:

> print(d$predLM)
 [1] -0.05454545  0.09090909  0.23636364  0.38181818  0.52727273  0.67272727  0.81818182
 [8]  0.96363636  1.10909091  1.25454545
> print(d$predGAM)
           1            2            3            4            5            6            7 
2.220446e-16 2.220446e-16 2.220446e-16 2.085853e-10 1.000000e+00 1.000000e+00 1.000000e+00 
           8            9           10 
1.000000e+00 1.000000e+00 1.000000e+00 

That is weird, print formats them different. Let’s see what these items really are.

> print(typeof(d$predGAM))
[1] "double"
> print(typeof(d$predLM))
[1] "double"
> print(class(d$predLM))
[1] "numeric"
> print(class(d$predGAM))
[1] "array"
> print(str(d$predLM))
 num [1:10] -0.0545 0.0909 0.2364 0.3818 0.5273 ...
NULL
> print(str(d$predGAM))
 num [1:10(1d)] 2.22e-16 2.22e-16 2.22e-16 2.09e-10 1.00 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:10] "1" "2" "3" "4" ...
NULL

Using all of typeof, class, and str (which we didn’t know about when we wrote Survive R) gives us the story. d$predGAM isn’t a vector in R’s specific peculiar sense of the word:

> is.vector(d$predLM)
[1] TRUE
> is.vector(d$predGAM)
[1] FALSE

Had we known to look, we could have found the problem in one step with str(d):

> print(str(d))
'data.frame':	10 obs. of  4 variables:
 $ x      : int  1 2 3 4 5 6 7 8 9 10
 $ y      : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
 $ predLM : num  -0.0545 0.0909 0.2364 0.3818 0.5273 ...
 $ predGAM: num [1:10(1d)] 2.22e-16 2.22e-16 2.22e-16 2.09e-10 1.00 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr  "1" "2" "3" "4" ...
NULL

R’s type system is strange. typeof returns what primitive type is used to implement the item at hand (in this case a vector of doubles). class returns what classes have been declared for this item.

To computer scientist: d$predGAM is a double vector that has some additional attributes (such as the array class declaration and shape parameters). It would commonly be thought of as a derived or refined type of vector. The writers of mgcv were probably thinking in these terms and figured it is okay to return a more refined type than one would expect from the generic predict signature. This is how most object oriented languages work. It is hard to call this code incorrect when the help(predict) documentation (for the generic base-method) is:

The form of the value returned by predict depends on the class of its argument. See the documentation of the particular methods for details of what is produced by that method.

To R: d$predGAM is an array which is a class that is different than a double vector (though it is implemented in terms of a double vector, similar to C++’s idea of private or implementation inheritance). In R arrays support most of the same operations as vectors and can even interoperate with them (you can add an array to a vector). However, the ROCR package likely explicitly checked (at runtime) the types of its arguments. This is an actual correct instance of irony: an added type safety check (meant to defend against and give good error messages in the case of unexpected types) triggered an error on a type that probably could have quietly been used safely. (Note: in general I very much like such checks, they tend to cut down on errors and move detection of errors much closer to error origins- making debugging and maintenance much easier).

The two packages failed to interoperate because they happened to have slightly incompatible (but likely each internally consistent) visions the R type system.

A final question is: why didn’t stuffing the value into a data.frame coerce the types or get rid of some of the additional annotations? The reason is that in R data.frames are in fact lists of columns and they only appear to be two-dimensional row-oriented structures due to some clever over-riding of the two-dimensional [,] operator. Despite expectations data.frame columns are not always simple primitive vectors and can hold complex composite objects such as POSIXlt which would break a lot more code if it didn’t have a built-in conversion to numeric).

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)