Strange behavior from the cut function with dates in R

August 12, 2014

(This article was first published on Paleocave Blog » R, and kindly contributed to R-bloggers)

I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. The function isn’t working as I intended, and I doubt that it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting), then I’ll show you my current base R workaround. To generate a reproducible example, I’ll use this latemail function I gleaned from this Stack Overflow post.

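latemail <- function(N, st="2013/01/01", et="2013/12/31") {
 # convert the start and end dates to POSIXct
 st <- as.POSIXct(as.Date(st))
 et <- as.POSIXct(as.Date(et))
 # length of the interval in seconds
 dt <- as.numeric(difftime(et, st, units="secs"))
 # N sorted random offsets (in seconds) from the start
 ev <- sort(runif(N, 0, dt))
 # return the sorted random date-times
 st + ev
 }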

And generate some data…


set.seed(7110)
#generate 1000 random POSIXct dates and times
bar<-data.frame("date"=latemail(1000, st="2013/03/02", et="2013/03/30"))
# assign an integer code to each row based on the day portion of the POSIXct value
bar$dateCut <- cut(bar$date, "day", labels = FALSE)

I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.
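To make that expectation concrete, here is a minimal sketch on three hypothetical timestamps (not taken from bar). The time zone is fixed to UTC, an assumption made only so the result does not depend on the local settings of the machine running it.

```r
# Three hypothetical timestamps: two on March 1, one on March 2.
# tz = "UTC" is an assumption for this sketch; the post's data uses the local zone.
x <- as.POSIXct(c("2013-03-01 19:10:31",
                  "2013-03-01 23:41:50",
                  "2013-03-02 00:30:53"), tz = "UTC")

# labels = FALSE makes cut() return integer codes instead of a factor
codes <- cut(x, "day", labels = FALSE)
codes
```

Both March 1 rows share code 1 and the March 2 row gets code 2, which is the pattern I expected to hold across the whole data set.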

head(bar, 10)
     date                 dateCut
1    2013-03-01 19:10:31  1
2    2013-03-01 19:31:31  1
3    2013-03-01 19:55:02  1
4    2013-03-01 20:09:36  1
5    2013-03-01 20:13:32  1
6    2013-03-01 22:15:42  1
7    2013-03-01 22:16:06  1
8    2013-03-01 23:41:50  1
9    2013-03-02 00:30:53  2
10   2013-03-02 01:08:52  2

Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.  

For additional locations where I see the expected behavior you can check

bar[ c(259, 260, 294, 295), ]
259  2013-03-08 23:22:15  8
260  2013-03-09 00:11:08  9
294  2013-03-09 23:59:11  9
295  2013-03-10 00:56:19  10

Now the weirdness.

bar[320:326, ]
320  2013-03-10 22:14:22  10
321  2013-03-10 22:28:03  10
322  2013-03-11 00:08:27  10
323  2013-03-11 00:30:08  10
324  2013-03-11 00:56:23  10
325  2013-03-11 01:19:54  11
326  2013-03-11 01:22:43  11

At row 322 the date changes from March 10 to March 11, but the dateCut factor doesn’t change until row 325. After 1:00 AM things seem to behave as expected. At first I thought some sort of floor rounding was pushing times just after midnight back to the previous day, but notice that the earlier examples included times between midnight and 1:00 AM that were cut as expected. More examples of the weirdness:
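One possibility worth ruling out, which this post does not confirm, is a daylight saving change inside the date range. In US time zones the clocks moved forward on 2013-03-10, so that local day was only 23 hours long. The sketch below assumes the time zone America/Chicago (the zone actually in use is not stated here) and measures the gap between consecutive local midnights.

```r
# Assumed time zone for illustration only; substitute your own (see Sys.timezone()).
tz <- "America/Chicago"

# Local midnights around the 2013-03-10 clock change
midnights <- as.POSIXct(c("2013-03-09", "2013-03-10",
                          "2013-03-11", "2013-03-12"), tz = tz)

# Hours between consecutive local midnights
gap_hours <- as.numeric(difftime(midnights[-1],
                                 midnights[-length(midnights)],
                                 units = "hours"))

# Whether daylight saving time is in effect at each midnight
is_dst <- as.POSIXlt(midnights)$isdst > 0

data.frame(midnight = format(midnights, "%Y-%m-%d %H:%M %Z"), is_dst)
gap_hours
```

If day boundaries were spaced a fixed 86,400 seconds apart starting before the change, every boundary after March 10 would land at 1:00 AM local time rather than midnight, which would match the rows above. Whether that is what cut.POSIXt does in a given R version is something to check against its source, not something this sketch establishes.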

bar[398:405,]
398  2013-03-12 23:56:20  12
399  2013-03-13 00:53:47  12
400  2013-03-13 01:30:33  13
401  2013-03-13 01:45:31  13
bar[430:435,]
430  2013-03-13 23:45:48  13
431  2013-03-14 00:28:40  13
432  2013-03-14 00:46:24  13
433  2013-03-14 00:55:16  13
434  2013-03-14 01:33:19  14
435  2013-03-14 02:02:45  14

I see even stranger behavior when I truncate to just the date.

bar$datetrunc=trunc(bar$date, "day")  
bar$truncCut <- cut(bar$datetrunc, "day", labels = FALSE) 

Again, things work fine for a while

head(bar, 10)
   date             dateCut datetrunc truncCut
1  2013-03-01 19:10:31 1   2013-03-01  1
2  2013-03-01 19:31:31 1   2013-03-01  1
3  2013-03-01 19:55:02 1   2013-03-01  1
4  2013-03-01 20:09:36 1   2013-03-01  1
5  2013-03-01 20:13:32 1   2013-03-01  1
6  2013-03-01 22:15:42 1   2013-03-01  1
7  2013-03-01 22:16:06 1   2013-03-01  1
8  2013-03-01 23:41:50 1   2013-03-01  1
9  2013-03-02 00:30:53 2   2013-03-02  2
10 2013-03-02 01:08:52 2   2013-03-02  2

But eventually wind up worse than ever.

bar[320:330,]
    date               dateCut datetrunc truncCut
320 2013-03-10 22:14:22  10  2013-03-10  10
321 2013-03-10 22:28:03  10  2013-03-10  10
322 2013-03-11 00:08:27  10  2013-03-11  10
323 2013-03-11 00:30:08  10  2013-03-11  10
324 2013-03-11 00:56:23  10  2013-03-11  10
325 2013-03-11 01:19:54  11  2013-03-11  10
326 2013-03-11 01:22:43  11  2013-03-11  10
327 2013-03-11 02:29:34  11  2013-03-11  10
328 2013-03-11 02:34:23  11  2013-03-11  10
329 2013-03-11 02:51:47  11  2013-03-11  10
330 2013-03-11 03:11:00  11  2013-03-11  10

The dateCut factor changes 3 rows too late, but the truncCut factor stays stuck at 10 for a long time (47 rows). At row 369, the dateCut factor changes to 12 (correctly) and the truncCut factor finally turns over to 11.

bar[365:375,]
    date              dateCut datetrunc truncCut
365 2013-03-11 19:49:05  11  2013-03-11  10
366 2013-03-11 21:19:31  11  2013-03-11  10
367 2013-03-11 21:31:58  11  2013-03-11  10
368 2013-03-11 22:06:44  11  2013-03-11  10
369 2013-03-12 02:45:14  12  2013-03-12  11
370 2013-03-12 03:14:56  12  2013-03-12  11
371 2013-03-12 04:02:03  12  2013-03-12  11
372 2013-03-12 05:12:03  12  2013-03-12  11
373 2013-03-12 05:31:53  12  2013-03-12  11
374 2013-03-12 05:56:08  12  2013-03-12  11
375 2013-03-12 06:40:45  12  2013-03-12  11

My initial sidestep involved the rank() function (it achieved the desired result, but was S L O W). I won’t torture you with it here. I consulted with Dr. Erin Hodgess and devised this workaround, which is pretty speedy.

foo <- unique(bar$datetrunc)
bar$truncMatch <- match(bar$datetrunc, foo)

Here’s that strange section where the truncCut factor behaved so poorly. No problem for my new truncMatch factor.

 

bar[320:330,]
    date            dateCut datetrunc truncCut truncMatch
320 2013-03-10 22:14:22  10  2013-03-10  10   10
321 2013-03-10 22:28:03  10  2013-03-10  10   10
322 2013-03-11 00:08:27  10  2013-03-11  10   11
323 2013-03-11 00:30:08  10  2013-03-11  10   11
324 2013-03-11 00:56:23  10  2013-03-11  10   11
325 2013-03-11 01:19:54  11  2013-03-11  10   11
326 2013-03-11 01:22:43  11  2013-03-11  10   11
327 2013-03-11 02:29:34  11  2013-03-11  10   11
328 2013-03-11 02:34:23  11  2013-03-11  10   11
329 2013-03-11 02:51:47  11  2013-03-11  10   11
330 2013-03-11 03:11:00  11  2013-03-11  10   11
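For comparison, here is a small self-contained sketch of the same idea built on calendar dates instead of truncated date-times. It is not part of the original workaround: the timestamps are hypothetical, the time zone America/Chicago is an assumption, and idx_offset is an extra variant added only to show how the numbering differs when a day has no rows.

```r
# Assumed time zone and hypothetical timestamps, for illustration only
tz <- "America/Chicago"
x <- as.POSIXct(c("2013-03-09 23:30:00", "2013-03-10 00:30:00",
                  "2013-03-11 00:08:27", "2013-03-11 01:19:54",
                  "2013-03-12 02:45:14"), tz = tz)

# Calendar day of each timestamp in that time zone
day <- as.Date(x, tz = tz)

# Number days by order of first appearance, as in the match() workaround
idx_match <- match(day, unique(day))

# Number days by calendar offset from the first day
# (a day with no rows would leave a gap in the numbering)
idx_offset <- as.integer(day - min(day)) + 1L

data.frame(x, day, idx_match, idx_offset)
```

The two indices agree here because every day in the range has at least one row. With match(), days that have no rows are skipped in the numbering; with the offset version they keep their place, which is closer to what cut with “day” is meant to produce.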


 
