Unexpected behavior with summary function in R

[This article was first published on Bearded Analytics » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I often find myself working with data that includes dates and times. Sometimes I am interested in looking at what happened on on a particular calendar day. I usually avoid dealing with actual datetime formats when working at the day level and prefer to use an integer for representing a day. For instance the first day of each quarter can be represented as integers (numeric also works for this example). For instance if I wanted to know the oldest date in my dataset I can just take the minimum since I am using a numerical data structure.

mydates  <- c(20140101L,20140401L,20140701L,20141001L)
min(mydates)

[1] 20140101

For some reason which I don't recall, I tried using the summary() function on my dates. The only values that would be valid are the minimum and the maximum, or so I thought.

summary(mydates)

The output:
# Min.            1st Qu.       Median      Mean          3rd Qu.       Max.
# 20140000 20140000 20140000 20140000 20140000 20140000

This behavior which seemed odd to me, is caused by the way summary() deals with numerical data. So I decided to look at what summary is actually doing. To view the code behind the summary function type:

summary.default

The portion of code that we are interested in is this:

else if (is.numeric(object)) {
nas <- is.na(object)
object <- object[!nas]
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
"Max.")
if (any(nas))
c(qq, `NA's` = sum(nas))
else qq
}

This code chunk gets executed when the object that we pass to summary is numeric.  If we substitute in our object 'mydates' we get the following code.

digits = max(3L, getOption("digits") -     3L)
nas <- is.na(mydates)
mydates<- mydates[!nas]
qq <- stats::quantile(mydates)
qq <- signif(c(qq[1L:3L], mean(mydates), qq[4L:5L]), digits)

If you step through the code line by line, you will notice that after line 4, summary produces what we would expect to see for a min and max value. However, after you execute line 5, the numbers are changed because they are not the actual numbers but they are changed to be significant figures. For example try:

signif(20140101,digits)

[1] 20140000

So be careful when using generic functions if you don't know what they are doing. I would encourage to take a look at the code behind some of the R functions you use the most. For instance using the fivenum() function does not change my min and max values the same way summary did.

To leave a comment for the author, please follow the link and comment on their blog: Bearded Analytics » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)