[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my job, I often work with data sampled at regular intervals. Samples may range from 5-minute intervals to daily intervals, depending on the specific task. While working with this kind of data is straightforward when its in a database (and I can use SQL), I have been in a couple of situations where the data is spread across .csv files. In these cases, I lean on `R` to scrape and compile the data. Although some other languages might be superior for such a task, I often need to produce some kind of visualization or report at the end, so choosing to handle all of the data with `R` is a no-brainer for me–I can easily transition to using `{ggplot2}` and `{rmarkdown}` to generate some pretty output.

## Minute-Level Data

When working with data sampled at 5-minute intervals (resulting in 288 intervals in a day), I’ve found that I’ve used a common “idiom” to generate a time-based “dictionary”. For example, here’s how I might create such a date dictionary for the year 2013. 1

```library("dplyr")
library("lubridate")

date_1 <- lubridate::ymd("2013-01-01")
date_2 <- lubridate::ymd("2013-12-31")

ymd_seq <- seq.Date(date_1, date_2, by = "day")
ymd_grid <-
data_frame(
ymd = lubridate::ymd(ymd_seq),
yyyy = lubridate::year(ymd_seq),
mm = lubridate::month(ymd_seq),
dd = lubridate::day(ymd_seq)
)

hhmmss <-
expand.grid(
hh = seq(1L, 24L, 1L),
min = seq(5L, 60L, by = 5L),
sec = 0L) %>%
as_tibble()

dates_dict <-
ymd_grid %>%
right_join(hhmmss %>% mutate(yyyy = lubridate::year(date_1), by = "year")) %>%
arrange(yyyy, mm, dd, hh, min)
dates_dict

## # A tibble: 105,120 x 8
##    ymd         yyyy    mm    dd    hh   min   sec by
##    <date>     <dbl> <dbl> <int> <int> <int> <int> <chr>
##  1 2013-01-01  2013     1     1     1     5     0 year
##  2 2013-01-01  2013     1     1     1    10     0 year
##  3 2013-01-01  2013     1     1     1    15     0 year
##  4 2013-01-01  2013     1     1     1    20     0 year
##  5 2013-01-01  2013     1     1     1    25     0 year
##  6 2013-01-01  2013     1     1     1    30     0 year
##  7 2013-01-01  2013     1     1     1    35     0 year
##  8 2013-01-01  2013     1     1     1    40     0 year
##  9 2013-01-01  2013     1     1     1    45     0 year
## 10 2013-01-01  2013     1     1     1    50     0 year
## # ... with 1.051e+05 more rows
```

And, just to prove that there are 288 5-minute intervals for each day in the year, I can use `dplyr::count()` twice in succession. 2

```dates_dict %>% count(yyyy, mm, dd) %>% count()

## # A tibble: 1 x 1
##      nn
##   <int>
## 1   365
```

I then extract data (from individual files) using the time-based dictionary as a “helper” for custom functions for creating file paths and processing the data after importing it.

After the dirty work of is done, I can transition to the fun part–exploring and interpreting the data. This process often turns out to be a cycle of visualization, data transformation, and modeling.

class=“section level3”>

### An Example (With the `nycflights13` Package)

To provide an example, I’ll use the `flights` data set from the `{nycflight13}` package. 3 This package includes information regarding all flights leaving from New York City airports in 2013, as well as information regarding weather, airlines, airports, and planes.

Let’s say that I that I’m interested in the average flight departure delay time at the JFK airport. I might hypothesize that there is a relationship between departure delay time with different time periods, such as hour in the day and days in the week.

First, I’ll perform the necessary transformation to the `flights` data to investigate my hypothesis. Specifically, I need to create columns for hour and weekday. (For hour (`hh`), I simply use the scheduled departure time (`sched_dep_time`).)

```library("nycflights13")

flights_jfk <-
nycflights13::flights %>%
filter(origin == "JFK") %>%
mutate(hh = round(sched_dep_time / 100, 0) - 1) %>%
mutate(ymd = lubridate::ymd(sprintf("%04.0f-%02.0f-%02.0f", year, month, day))) %>%
mutate(wd = lubridate::wday(ymd, label = TRUE))
```

Next, I might create a heat map plotting hours against weekdays.

To investigate the patterns more “scientifically”, I might perform a one-way Analysis of Variance (ANOVA) on different time variables. I would make sure to test time periods other than just weekday (`wd`) and hour (`hh`), such as `month` and `day`.

```summary(aov(dep_delay ~ month, data = flights_jfk))

##                 Df    Sum Sq Mean Sq F value   Pr(>F)
## month            1     55213   55213   36.25 1.74e-09 ***
## Residuals   109414 166664445    1523
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness

summary(aov(dep_delay ~ day, data = flights_jfk))

##                 Df    Sum Sq Mean Sq F value Pr(>F)
## day              1        40    40.2   0.026  0.871
## Residuals   109414 166719617  1523.8
## 1863 observations deleted due to missingness

summary(aov(dep_delay ~ wd, data = flights_jfk))

##                 Df    Sum Sq Mean Sq F value Pr(>F)
## wd               6    246401   41067   26.99 <2e-16 ***
## Residuals   109409 166473256    1522
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness

summary(aov(dep_delay ~ hh, data = flights_jfk))

##                 Df    Sum Sq Mean Sq F value Pr(>F)
## hh               1   5701754 5701754    3874 <2e-16 ***
## Residuals   109414 161017904    1472
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

The “statistically significant” p values for the `wd` and `hh` variables provides incentive to investigate them more closely. I might try an ANOVA F-statistic test comparing linear regression models using each variable as a lone predictor with a linear model where both are used as predictors.

```lm_wd <- lm(dep_delay ~ wd, data = flights_jfk)
lm_hh <- lm(dep_delay ~ hh, data = flights_jfk)
lm_both <- lm(dep_delay ~ wd + hh, data = flights_jfk)
anova(lm_both, lm_wd, test = "F")

## Analysis of Variance Table
##
## Model 1: dep_delay ~ wd + hh
## Model 2: dep_delay ~ wd
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)
## 1 109408 160778456
## 2 109409 166473256 -1  -5694800 3875.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm_both, lm_hh, test = "F")

## Analysis of Variance Table
##
## Model 1: dep_delay ~ wd + hh
## Model 2: dep_delay ~ hh
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)
## 1 109408 160778456
## 2 109414 161017904 -6   -239447 27.157 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Of course, this process would go on. For instance, I would certainly need to investigate if there is a relationship between departure delay and specific airlines.

## Final Thoughts

Working with interval data was initially a challenge for me, but after working with it more and more often, I find that it’s not so bad after all. It gets more interesting when there is missing data or data samples at irregular intervals, but that’s a story for another day.

1. The technique that I show here would have to be adjusted slightly if working with more than one year at a time, but it wouldn’t be difficult to do so. I tend to only use this design pattern for one year at a time anyways. ^
2. By the way, how cool is following one `dplyr::count()` call with another? I only found out about that nice little trick recently. ^
3. Thanks to Hadley Wickham for compiling this data package, as well as for developing the `{lubridate}` and {ggplot2}` packages.) ^

To leave a comment for the author, please follow the link and comment on their blog: r on Tony ElHabr.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)