Parsing Dates and Times

March 21, 2015
By

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

Motivation

R has excellent for dates and times via the built-in Date and POSIXt
classes. Their usage, however, is not always as straightforward as one
would want. Certain conversions are more cumbersome than we would like: while
as.Date("2015-03-22"), would it not be nice if as.Date("20150322") (a
format often used in logfiles) also worked, or for that matter
as.Date(20150322L) using an integer variable, or even
as.Date("2015-Mar-22") and as.Date("2015Mar22")?

Similarly, many date and time formats suitable for POSIXct (the short form)
and POSIXlt (the long form with accessible components) often require rather too
much formatting, and/or defaults. Why for example does
as.POSIXct(as.numeric(Sys.time()), origin="1970-01-01") require the
origin argument on the conversion back (from fractional seconds since the
epoch) into datetime—when it is not required when creating the
double-precision floating point representation of time since the epoch?

But thanks to Boost and its excellent
Boost Date_Time
library—which we already mentioned in
this post about the BH package— we can
address parsing of dates and times. It permitted us to write a new function
toPOSIXct() which now part of the
RcppBDT package (albeit right
now just the GitHub version but we
expect this to migrate to CRAN “soon” as well).

Implementation

We will now discuss the outline of this implementation. For full details,
see
the source file.

Headers and Constants

#include 
#include 
#include 

// [[Rcpp::depends(BH)]]

namespace bt = boost::posix_time;

const std::locale formats[] = {    // this shows a subset only, see the source file for full list
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d %H:%M:%S%f")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y/%m/%d %H:%M:%S%f")),

    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%b/%d/%Y")),
};
const size_t nformats = sizeof(formats)/sizeof(formats[0]);

Note that we show only two datetime formats along with two date
formats. The actual implementation has many more.

Core Converter

The actual conversion from string to a double (the underlying format in
POSIXct) is done by the following function. It loops over all given
formats, and returns the computed value after the first match. In case of
failure, a floating point NA is returned.

double stringToTime(const std::string s) {

    bt::ptime pt, ptbase;

    // loop over formats and try them til one fits
    for (size_t i=0; pt == ptbase && i < nformats; ++i) {
        std::istringstream is(s);
        is.imbue(formats[i]);
        is >> pt;
    }
    
    if (pt == ptbase) {
        return NAN;
    } else { 
        const bt::ptime timet_start(boost::gregorian::date(1970,1,1));
        bt::time_duration diff = pt - timet_start;

        // Define BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG to use nanoseconds
        // (and then use diff.total_nanoseconds()/1.0e9;  instead)
        return diff.total_microseconds()/1.0e6;
    }
}

Convenience Wrappers

We want to be able to convert from numeric as well as string formats. For
this, we write a templated (and vectorised) function which invokes the actual
conversion function for each argument. It also deals (somewhat
heuristically) with two corner cases: we want 20150322 be converted from
either integer or numeric, but need in the latter case distinguish this value
and its rangue from the (much larger) value for seconds since the epoch.
That creates a minir ambiguity: we will not be able to convert back for inputs
from seconds since the epoch for the first few years since January 1, 1970.
But as these are rare in the timestamp form we can accept the trade-off.

template <int RTYPE>
Rcpp::DatetimeVector toPOSIXct_impl(const Rcpp::Vector<RTYPE>& sv) {

    int n = sv.size();
    Rcpp::DatetimeVector pv(n);
    
    for (int i=0; i<n; i++) {
        std::string s = boost::lexical_cast<std::string>(sv[i]);
        //Rcpp::Rcout << sv[i] << " -- " << s << std::endl;

        // Boost Date_Time gets the 'YYYYMMDD' format wrong, even
        // when given as an explicit argument. So we need to test here.
        // While we are at it, may as well test for obviously wrong data.
        int l = s.size();
        if ((l < 8) ||          // impossibly short
            (l == 9)) {         // 8 or 10 works, 9 cannot
            Rcpp::stop("Inadmissable input: %s", s);
        } else if (l == 8) {    // turn YYYYMMDD into YYYY/MM/DD
            s = s.substr(0, 4) + "/" + s.substr(4, 2) + "/" + s.substr(6,2);
        }
        pv[i] = stringToTime(s);
    }
    return pv;
}

User-facing Function

Finally, we can look at the user-facing function. It accepts input in either
integer, numeric or character vector form, and then dispatches accordingly to
the templated internal function we just discussed. Other inputs are
unsuitable and trigger an error.

// [[Rcpp::export]]
Rcpp::DatetimeVector toPOSIXct(SEXP x) {
    if (Rcpp::is<Rcpp::CharacterVector>(x)) {
        return toPOSIXct_impl<STRSXP>(x);
    } else if (Rcpp::is<Rcpp::IntegerVector>(x)) {
        return toPOSIXct_impl<INTSXP>(x); 
    } else if (Rcpp::is<Rcpp::NumericVector>(x)) {
        // here we have two cases: either we are an int like
        // 200150315 'mistakenly' cast to numeric, or we actually
        // are a proper large numeric (ie as.numeric(Sys.time())
        Rcpp::NumericVector v(x);
        if (v[0] < 21990101) {  // somewhat arbitrary cuttoff
            // actual integer date notation: convert to string and parse
            return toPOSIXct_impl<REALSXP>(x);
        } else {
            // we think it is a numeric time, so treat it as one
            return Rcpp::DatetimeVector(x);
        }
    } else {
        Rcpp::stop("Unsupported Type");
        return R_NilValue;//not reached
    }
}

Illustration

A simply illustration follows. A fuller demonstration is
part of the RcppBDT package.
This already shows support for subsecond granularity and a variety of date formats.

## parsing character
s <- c("2004-03-21 12:45:33.123456",    # ISO
       "2004/03/21 12:45:33.123456",    # variant
       "20040321",                      # just dates work fine as well
       "Mar/21/2004",                   # US format, also support month abbreviation or full
       "rapunzel")                      # will produce a NA

p <- toPOSIXct(s)

options("digits.secs"=6)                # make sure we see microseconds in output
print(format(p, tz="UTC"))              # format UTC times as UTC (helps for Date types too)
[1] "2004-03-21 12:45:33.123456" "2004-03-21 12:45:33.123456"
[3] "2004-03-21 00:00:00.000000" "2004-03-21 00:00:00.000000"
[5] NA                          

We can also illustrate integer and numeric inputs:

## parsing integer types
s <- c(20150315L, 20010101L, 20141231L)
p <- toPOSIXct(s)
print(format(p, tz="UTC"))
[1] "2015-03-15" "2001-01-01" "2014-12-31"
## parsing numeric types
s <- c(20150315, 20010101, 20141231)
print(format(p, tz="UTC"))
[1] "2015-03-15" "2001-01-01" "2014-12-31"

Note that we always forced display using UTC rather local time, the R default.

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)