When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

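For our example we create a data frame. The situation: I am working in the Pacific time zone on Saturday, October 31st, 2015, and I want to work with time data recorded in an Asian time zone. (Note that Etc/GMT+8 follows the POSIX sign convention and is actually eight hours behind UTC; Etc/GMT-8 would be the Asian offset. The example shows the problem either way.)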

print(date())
## [1] "Sat Oct 31 08:14:38 2015"
d <- data.frame(group='x',
                time=as.POSIXct(strptime('2006/10/01 09:00:00',
                                         format='%Y/%m/%d %H:%M:%S',
                                         tz="Etc/GMT+8"),
                                tz="Etc/GMT+8"))  # I'd like to say UTC+8 or CST
print(d)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d$time)
## [1] "2006-10-01 09:00:00 GMT+8"
str(d$time)
##  POSIXct[1:1], format: "2006-10-01 09:00:00"
print(unclass(d$time))
## [1] 1159722000
## attr(,"tzone")
## [1] "Etc/GMT+8"

Suppose I try to aggregate the data to find the earliest time for each group. I have a problem: aggregate loses the time zone and gives a bad answer.

d2 <- aggregate(time~group,data=d,FUN=min)
print(d2)
##   group                time
## 1     x 2006-10-01 10:00:00
print(d2$time)
## [1] "2006-10-01 10:00:00 PDT"

This is bad. Our time has lost its time zone and changed from 09:00:00 to 10:00:00. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.

Software for Data Analysis, John M. Chambers, Springer 2008, page 3.
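
To see what actually changed, it can help to compare the underlying seconds-since-epoch value and the time zone attribute before and after the aggregate() call. The following is a minimal sketch that rebuilds the example data frame (using as.POSIXct() directly on a string, which is my shortcut and not the construction used above). The variable names same_instant, tz_before, and tz_after are mine. What aggregate() returns may differ across R versions, so no expected output is shown.

```r
# Rebuild the example data frame (shortcut construction, assumed equivalent).
d <- data.frame(group = 'x',
                time = as.POSIXct('2006-10-01 09:00:00', tz = 'Etc/GMT+8'))
d2 <- aggregate(time ~ group, data = d, FUN = min)

# Is it still the same instant? Compare seconds since the epoch.
same_instant <- as.numeric(d$time) == as.numeric(d2$time)
print(same_instant)

# Time zone attribute before and after aggregation.
tz_before <- attr(d$time, 'tzone')
tz_after <- attr(d2$time, 'tzone')   # NULL means the zone was dropped
print(tz_before)
print(tz_after)
```

If same_instant is TRUE while tz_after is NULL, the stored instant survived and only the zone used for display was lost, which is enough to make the printed clock time misleading.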

The issue is that a POSIXct time is essentially a numeric array carrying its time zone around as an attribute. Most base R code has problems when a numeric array has extra attributes, so R's statistical code tends to drop attributes when it can. It is odd that the class() is kept (which is itself an attribute-style structure) while the time zone is lost, but R is full of hand-specified corner cases.
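
The attribute story can be seen directly in a few lines of base R. This is only a sketch of one way attributes get dropped (unlist() followed by restoring the class). I am assuming it resembles what happens inside aggregate(), and have not verified that for every R version. The names x and u are mine.

```r
# A POSIXct value is seconds since the epoch plus attributes.
x <- as.POSIXct('2006-10-01 09:00:00', tz = 'Etc/GMT+8')
print(attributes(x))   # shows the class and the tzone attribute
print(unclass(x))      # the underlying number, still carrying tzone

# unlist() returns a bare numeric vector: class and tzone are both gone.
u <- unlist(list(x))
print(attributes(u))

# Restoring only the class leaves the time zone behind, so printing
# would fall back to the session's local time zone.
class(u) <- class(x)
print(attr(u, 'tzone'))
```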

dplyr gets the right answer.

library('dplyr')
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
##     filter
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
by_group = group_by(d,group)
d3 <- summarize(by_group,min(time))
print(d3)
## Source: local data frame [1 x 2]
##
##   group           min(time)
## 1     x 2006-10-01 09:00:00
print(d3[[2]])
## [1] "2006-10-01 09:00:00 GMT+8"

And plyr also works.

library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
d4 <- ddply(d,.(group),summarize,time=min(time))
print(d4)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d4$time)
## [1] "2006-10-01 09:00:00 GMT+8"
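
If you are stuck with aggregate() (say, in code you cannot add packages to), one possible patch is to copy the time zone attribute back onto the result by hand. This is a hedged sketch, not a recommendation: it assumes every input time shares a single tzone attribute, and it rebuilds the example data with as.POSIXct() on a string as a shortcut. The name shown is mine.

```r
# Rebuild the example data frame (shortcut construction, assumed equivalent).
d <- data.frame(group = 'x',
                time = as.POSIXct('2006-10-01 09:00:00', tz = 'Etc/GMT+8'))
d2 <- aggregate(time ~ group, data = d, FUN = min)

# Copy the time zone attribute from the input column back onto the result.
attr(d2$time, 'tzone') <- attr(d$time, 'tzone')

# format() on a POSIXct uses the tzone attribute when one is present.
shown <- format(d2$time, usetz = TRUE)
print(shown)
```

Having to remember this step at every call site is exactly the kind of hand-maintained corner case that makes the dplyr and plyr versions easier to trust.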