In any programming tool, dates, times, and timezones are hard. Deceptively hard. They’ve been shaped by politics and whimsy for hundreds of years: timezones can shift with minimal notice, countries have skipped or repeated certain days, some are offset by weird increments, some observe Daylight Saving Time, leap years, leap seconds, the list goes on. Luckily, we rarely need to worry about most of those details because other teams of very smart people have spent a lot of time providing nice abstractions for us that handle most of the weird edge cases.
Here at Methods we’ve been bit by some timezone oddities in R recently, so I wanted to focus on that part of R’s “date/time stack”, starting with a crash-course on datetimes and timezones before showing how to avoid common errors, with a special focus on reading from and writing to files.
Timezones in base R
Let’s start with a single string representing noon on the first day of this year. We want to make this a proper date-time object, which in R means the
POSIXct class; we can do this with
x <- "2018-01-01 12:00:00" as.POSIXct(x) ##  "2018-01-01 12:00:00 EST"
Notice the timezone code at the end; it defaulted to “EST” because of my default timezone, which might be different for you.
Sys.timezone() ##  "America/Detroit"
x isn’t supposed to represent Eastern time, we can specify a different timezone as an argument to
as.POSIXct(x, tz = "America/Chicago") ##  "2018-01-01 12:00:00 CST"
Notice the only thing that changed is the timezone; the clock time is the same. CST and EST are an hour apart, though, so our two conversions represent different instances of time; noon in Detroit occurs one hour later than noon in Chicago. We can check this by subtracting the dates:
as.POSIXct(x, tz = "America/Chicago") - as.POSIXct(x) ## Time difference of 1 hours
I’m not really in Detroit, I’m in Ypsilanti (another city in Michigan); what happens if we use that instead?
as.POSIXct(x, tz = "America/Ypsilanti") ##  "2018-01-01 12:00:00 America"
Okay that’s weird; what does the “America” timezone code even mean? Is it the same as the first conversion we did? If it is, there should be a 0-hour difference between them.
as.POSIXct(x, tz = "America/Ypsilanti") - as.POSIXct(x) ## Time difference of -5 hours
That’s not right! “America/Ypsilanti” isn’t a valid timezone, but rather than throwing an error (or even a warning), R created an instance of time with no UTC offset.
UTC is a special standard that has no timezone, no daylight savings time, or any of the other weird things that make timezones hard. In R, and most other programming languages, UTC is the common, underlying representation of (date)time, and things like timezones are defined in terms of UTC (EST is UTC-5, hence the “-5” above). It’s roughly the same as GMT (Greenwich Mean Time), but not quite identical (GMT can have Daylight Saving Time!)
This above behavior in R can be really dangerous; even if you intend to use a proper timezone name, misspelling it (e.g. “America/Chicgo”) will produce this same bad behavior. We need to make sure to only use valid timezone names. The list of options can be accessed with the
OlsonName() function; on my machine, this gives 606 different options!
We can see “America/Chicago” and “America/Detroit” are in there, but “America/Ypsilanti” isn’t.
c("America/Chicago", "America/Detroit", "America/Ypsilanti") %in% OlsonNames() ##  TRUE TRUE FALSE
These kinds of bad-but-quiet misunderstandings between the programmer and the language exist all over base R, but can often be prevented with additional packages. The
tidyverse family in particular aims to prevent silent errors like this, and their date/time package
lubridate is no exception.
Rather than using
as.POSIXct to turn our string into a datetime object, we’ll use
library(lubridate) as_datetime(x) # ymd_hms works as well ##  "2018-01-01 12:00:00 UTC"
as_datetime used UTC by default, rather than my local timezone. This is great because it means that command will return the same thing on any other computer, regardless what that computers’ timezone is set to.
lubridate makes code more portable!
as_datetime function still produces
class(as_datetime(x)) ##  "POSIXct" "POSIXt"
which means any non-
lubridate functions for working with these datetime objects will still work.
We can still specify whatever timezone we desire,
as_datetime(x, tz = "America/Detroit") ##  "2018-01-01 12:00:00 EST"
but more importantly,
lubridate lets us know if we use a bad timezone name!
as_datetime(x, tz = "America/Ypsilanti") ## Error in C_force_tz(time, tz = tzone, roll): CCTZ: Unrecognized output timezone: "America/Ypsilanti"
This alone, saving us from ourselves, might be reason enough to use
lubridate whenever you deal with timezones in R, but it can do a lot more, most of which we won’t cover here.
Why should we care?
For a lot of programs utilizing datetimes, you can get away with unconsidered or incorrect timezones. I can’t enumerate all the ways that could come back to bite you, but one is Daylight Saving Time.
Suppose we’re given the following datetimes as strings, perhaps read from some file:
x <- c("2018-03-10 23:30:00", "2018-03-11 00:30:00", "2018-03-11 01:30:00", "2018-03-11 03:30:00", "2018-03-11 04:30:00", "2018-03-11 05:30:00") x ##  "2018-03-10 23:30:00" "2018-03-11 00:30:00" "2018-03-11 01:30:00" ##  "2018-03-11 03:30:00" "2018-03-11 04:30:00" "2018-03-11 05:30:00"
To do much with these, we need to convert from
POSIXct. We’ll use
lubridate again, but instead of parsing with the very general
as_datetime() we’ll use the more idiomatic
ymd_hms() which expects inputs to be formatted as year-month-day, hour-minute-second (exactly what we have!).
ymd_hms(x) ##  "2018-03-10 23:30:00 UTC" "2018-03-11 00:30:00 UTC" ##  "2018-03-11 01:30:00 UTC" "2018-03-11 03:30:00 UTC" ##  "2018-03-11 04:30:00 UTC" "2018-03-11 05:30:00 UTC"
It’s possible that those times are supposed to be UTC and we got exactly what we wanted. If instead these strings represent clock-times from a place that observes Daylight Saving Time, we’ve got a problem. These times span the “spring ahead” transition, where clocks immediately transition from 2am to 3am. This means that in a DST-observing timezone all these clock-times are evenly spaced, since 1:30am \(\rightarrow\) 3:30am is only one hour.
Since UTC does not observe DST, elements of
dst are currently not evenly-spaced (it’s not a regular timeseries), which we can see quickly with the
diff(ymd_hms(x)) ## Time differences in hours ##  1 1 2 1 1
Using a DST-observing timezone on conversion will fix this.
diff(ymd_hms(x, tz = "America/Detroit")) ## Time differences in hours ##  1 1 1 1 1
Depending on what you want to do with these datetimes, that may not matter, but it could be absolutely critical.
lubridate won’t help prevent is using valid timezone names which don’t align with what you’re trying to do. “EST” (Eastern Standard Time) and “EDT” (Eastern Daylight Time) are both in
OlsonNames(), but will ignore the DST shift in our vector of datetimes:
diff(ymd_hms(x, tz = "EST")) ## Time differences in hours ##  1 1 2 1 1 diff(ymd_hms(x, tz = "EDT")) ## Time differences in hours ##  1 1 2 1 1
When we specify “EST”, the the first 3 are correct, but the latter half is wrong because they’re really “EDT” datetimes; the opposite occurs when we specify “EDT”.
In almost all cases, use location-based timezone names like “America/City” to properly handle not only DST, but other potential oddities that may have occurred in that location, like those that have shifted between observing and not observing DST.
Timezones and I/O
It’s rare that we need to create new datetime objects from scratch; they’re nearly always coming into a project from some external dataset, generally in Excel or CSV formats.
We can make a little CSV using
writeLines(); we’ll have just one column of our DST-crossing datetimes, and we’ll call it
writeLines(c("datetime", x), "data/dst-example.csv")
There’s very few excuses to be using
read.csv() in 2018, so we’ll use the far-superior
readr which will automatically parse our datetimes into
library(readr) tbl <- read_csv("data/dst-example.csv") ## Parsed with column specification: ## cols( ## datetime = col_datetime(format = "") ## ) tbl ## # A tibble: 6 x 1 ## datetime ## <dttm> ## 1 2018-03-10 23:30:00 ## 2 2018-03-11 00:30:00 ## 3 2018-03-11 01:30:00 ## 4 2018-03-11 03:30:00 ## 5 2018-03-11 04:30:00 ## 6 2018-03-11 05:30:00
What timezone are those?
lubridate makes this easy with their
tz(tbl$datetime) ##  "UTC"
If we know these aren’t UTC times, but are instead clock times from New York, we need to do some sort of conversion here, as
readr doesn’t allow you to specify the timezone when reading.
lubridate provides two “timezone conversion” functions:
with_tz()applies a timezone so the clock time changes, but the instant of time (e.g. UTC) is the same, e.g. 5pm EST \(\rightarrow\) 2pm PST.
force_tz()changes the instant of time so the clock time remains constant, e.g. 5pm EST \(\rightarrow\) 5pm PST.
In this case we need
force_tz because the times in our file are non-UTC (New York, in this case) clock times.
library(dplyr) tbl <- tbl %>% mutate(datetime = force_tz(datetime, "America/New_York")) tbl ## # A tibble: 6 x 1 ## datetime ## <dttm> ## 1 2018-03-10 23:30:00 ## 2 2018-03-11 00:30:00 ## 3 2018-03-11 01:30:00 ## 4 2018-03-11 03:30:00 ## 5 2018-03-11 04:30:00 ## 6 2018-03-11 05:30:00
Notice the display isn’t any different;
tibble doesn’t print timezones for datetime columns. The timezone has changed,
tz(tbl$datetime) ##  "America/New_York"
and they’re now evenly spaced, too:
diff(tbl$datetime) ## Time differences in hours ##  1 1 1 1 1
I’m not sure we’ve ever had a client send us a file with UTC datetimes in it, so whenever we read in a client-provided file with datetimes, we immediately call
force_tz to correct things.
Reading from Excel
This workflow is essentially the same with Excel files. We’ll use the
readxl package, another
tidyverse gem, which is much nicer than the Java-based libraries that came before it (e.g.
xlsx). I’ve already created an excel file (which you can download here) whose first sheet is identical to the CSV we just made, so we’ll read that in.
library(readxl) tbl <- read_excel("data/timezone-examples.xlsx") tbl ## # A tibble: 6 x 1 ## datetime ## <dttm> ## 1 2018-03-10 23:30:00 ## 2 2018-03-11 00:30:00 ## 3 2018-03-11 01:30:00 ## 4 2018-03-11 03:30:00 ## 5 2018-03-11 04:30:00 ## 6 2018-03-11 05:30:00
readxl::read_excel() is smart enough to see we have datetimes, and parse them appropriately, but we have the same problem as before:
tz(tbl$datetime) ##  "UTC"
Thus we need the same fix. An alternative to
force_tz() is to assign to
tz(tbl$datetime) <- "America/New_York" tbl ## # A tibble: 6 x 1 ## datetime ## <dttm> ## 1 2018-03-10 23:30:00 ## 2 2018-03-11 00:30:00 ## 3 2018-03-11 01:30:00 ## 4 2018-03-11 03:30:00 ## 5 2018-03-11 04:30:00 ## 6 2018-03-11 05:30:00 diff(tbl$datetime) ## Time differences in hours ##  1 1 1 1 1
This is very base-R style and doesn’t fit nicely into pipelines of commands with
%>%, so we generally opt for
mutate, but this form can also be useful.
Split Date & Time
We often receive data where the date is in one column, but the time is in the next column over. This presents some new challenges because R doesn’t have date-free time classes; we want both pieces in the same column.
We find this happens most often in Excel files; the second sheet of our
timezone-examples.xlsx file provides an example with the same data as before, but the date & time in two separate columns.
tbl <- read_excel("data/timezone-examples.xlsx", sheet = 2) tbl ## # A tibble: 6 x 2 ## date time ## <dttm> <dttm> ## 1 2018-03-10 00:00:00 1899-12-31 23:30:00 ## 2 2018-03-11 00:00:00 1899-12-31 00:30:00 ## 3 2018-03-11 00:00:00 1899-12-31 01:30:00 ## 4 2018-03-11 00:00:00 1899-12-31 03:30:00 ## 5 2018-03-11 00:00:00 1899-12-31 04:30:00 ## 6 2018-03-11 00:00:00 1899-12-31 05:30:00
We sure didn’t expect the time to be read as datetimes with some strange, irrelevant date! We’d prefer to have a single
datetime column where the dates come from
date and the times come from
time. There’s at least two ways we could do this:
- change the dates in the
timecolumn to be the dates in the
datecolumn, then rename
- create a new
datetimecolumn by taking the pieces we need from
We think the latter is a bit more “tidy” or idiomatic, so that’s what we’ll do. It’s a bit clunky, but we can extract the dates and times from the appropriate column, paste them together, then re-parse those strings into datetime (
POSIXct) objects. This will also let us specify the timezone when we do that, avoiding the need for
force_tz. The hardest part is pulling only the time-part out of the
POSIXct objects in our
time column; we’ll use
format for that.
tbl <- tbl %>% mutate(datetime = paste(date, format(time, "%T")) %>% ymd_hms(tz = "America/New_York")) %>% select(datetime) # drop date/time cols tbl ## # A tibble: 6 x 1 ## datetime ## <dttm> ## 1 2018-03-10 23:30:00 ## 2 2018-03-11 00:30:00 ## 3 2018-03-11 01:30:00 ## 4 2018-03-11 03:30:00 ## 5 2018-03-11 04:30:00 ## 6 2018-03-11 05:30:00 tz(tbl$datetime) ##  "America/New_York" diff(tbl$datetime) ## Time differences in hours ##  1 1 1 1 1
There we go! One ugly line of code, but that’s all we needed.
With CSV’s this is a bit easier because the
time column will come back as a
character vector rather than a
POSIXct vector (or you can force it to
POSIXct with the
col_types argument); then you don’t need the
format part above. If we tried to force that behavior using
col_types argument, we’d get back some hard-to-interpret numbers due to the horrible way Excel stores time.
Writing back out
So we’ve got our inputs formatted the way we want, then we do something, and at some point we probably want to write that back out. This is super easy with
What was written to that text file might not be what we expected; let’s investigate with
read_lines("data/example-output.csv") ##  "datetime" "2018-03-11T04:30:00Z" "2018-03-11T05:30:00Z" ##  "2018-03-11T06:30:00Z" "2018-03-11T07:30:00Z" "2018-03-11T08:30:00Z" ##  "2018-03-11T09:30:00Z"
Two things to notice: the format is ISO 8601 (the “T” and “Z”), and they’re stored as UTC (no timezone or offset, indicated by the lack of anything after the “Z”). Just like
readr assumes everything is UTC coming in, it also makes things UTC going out.
For programmers and programming languages, this is ideal: it’s consistent, unambiguous, and portable.
If this file is going to be used by non-programmers, however, this can be a problem. If you open this in Excel, you’ll see the UTC times, and unlike all sane programming languages Excel can’t store UTC but display EST/EDT/PST/etc. “What are all these weird times? Why are they off by X hours sometimes, and X+1 hours other times?” are not questions we like getting after handing off a deliverable to a client.
We have just two options:
- explain to the next user (client, colleague, etc.) the excellence and utility of UTC and convince them it’s the right way to do things
- write the datetimes out a bit differently
Depending on the next user, option 1 can be a total no-go, in which case we need a way to write our datetimes as clock times rather than instants of time (UTC).
The simplest way is to convert to
tbl %>% mutate_if(is.POSIXct, as.character) %>% # handy if you have multiple datetime cols write_csv("data/example-output.csv") read_lines("data/example-output.csv") ##  "datetime" "2018-03-10 23:30:00" "2018-03-11 00:30:00" ##  "2018-03-11 01:30:00" "2018-03-11 03:30:00" "2018-03-11 04:30:00" ##  "2018-03-11 05:30:00"
That seems to have done what we want, and if necessary we could replace
as.character with something else to get a different format.
This works about as well when writing to Excel files (e.g. with
writexl::write_xlsx), but in general we try to avoid writing directly to Excel files. If we know the next user will be using Excel, we prefer
writexl is the tool of choice for files with multiple tables spread over different sheets.
This only scratched the surface of issues that can arise when dealing with datetimes & timezones, but if there’s one lesson we hope sticks with you, it’s use
lubridate. It saves lives; check out this decade-old article from Revolution Analytics as further proof that life was a bit tougher before
If you want to dig deeper into the scary world of datetimes we’re mostly abstracted from, check out this excellent article from Zach Holman to learn about that one time Samoa skipped a day and so much more.