# Dealing with Interval Data and the nycflights13 package using R

**r on Tony ElHabr**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my job, I often work with data sampled at regular intervals. Samples

may range from 5-minute intervals to daily intervals, depending on the

specific task. While working with this kind of data is straightforward

when its in a database (and I can use SQL), I have been in a couple of

situations where the data is spread across .csv files. In these cases, I

lean on `R`

to scrape and compile the data. Although some other

languages might be superior for such a task, I often need to produce

some kind of visualization or report at the end, so choosing to handle

all of the data with `R`

is a no-brainer for me–I can easily transition

to using `{ggplot2}`

and `{rmarkdown}`

to generate some pretty output.

## Minute-Level Data

When working with data sampled at 5-minute intervals (resulting in 288

intervals in a day), I’ve found that I’ve used a common “idiom” to

generate a time-based “dictionary”. For example, here’s how I might

create such a date dictionary for the year 2013. ^{1}

library("dplyr") library("lubridate") date_1 <- lubridate::ymd("2013-01-01") date_2 <- lubridate::ymd("2013-12-31") ymd_seq <- seq.Date(date_1, date_2, by = "day") ymd_grid <- data_frame( ymd = lubridate::ymd(ymd_seq), yyyy = lubridate::year(ymd_seq), mm = lubridate::month(ymd_seq), dd = lubridate::day(ymd_seq) ) hhmmss <- expand.grid( hh = seq(1L, 24L, 1L), min = seq(5L, 60L, by = 5L), sec = 0L) %>% as_tibble() dates_dict <- ymd_grid %>% right_join(hhmmss %>% mutate(yyyy = lubridate::year(date_1), by = "year")) %>% arrange(yyyy, mm, dd, hh, min) dates_dict

## # A tibble: 105,120 x 8 ## ymd yyyy mm dd hh min sec by ## <date> <dbl> <dbl> <int> <int> <int> <int> <chr> ## 1 2013-01-01 2013 1 1 1 5 0 year ## 2 2013-01-01 2013 1 1 1 10 0 year ## 3 2013-01-01 2013 1 1 1 15 0 year ## 4 2013-01-01 2013 1 1 1 20 0 year ## 5 2013-01-01 2013 1 1 1 25 0 year ## 6 2013-01-01 2013 1 1 1 30 0 year ## 7 2013-01-01 2013 1 1 1 35 0 year ## 8 2013-01-01 2013 1 1 1 40 0 year ## 9 2013-01-01 2013 1 1 1 45 0 year ## 10 2013-01-01 2013 1 1 1 50 0 year ## # ... with 1.051e+05 more rows

And, just to prove that there are 288 5-minute intervals for each day in

the year, I can use `dplyr::count()`

twice in succession. ^{2}

dates_dict %>% count(yyyy, mm, dd) %>% count()

## # A tibble: 1 x 1 ## nn ## <int> ## 1 365

I then extract data (from individual files) using the time-based

dictionary as a “helper” for custom functions for creating file paths

and processing the data after importing it.

After the dirty work of is done, I can transition to the fun

part–exploring and interpreting the data. This process often turns out

to be a cycle of visualization, data transformation, and modeling.

class=“section level3”>

### An Example (With the `nycflights13`

Package)

To provide an example, I’ll use the `flights`

data set from the

`{nycflight13}`

package. ^{3} This package includes

information regarding all flights leaving from New York City airports in

2013, as well as information regarding weather, airlines, airports, and

planes.

Let’s say that I that I’m interested in the average flight departure

delay time at the JFK airport. I might hypothesize that there is a

relationship between departure delay time with different time periods,

such as hour in the day and days in the week.

First, I’ll perform the necessary transformation to the `flights`

data

to investigate my hypothesis. Specifically, I need to create columns for

hour and weekday. (For hour (`hh`

), I simply use the scheduled departure

time (`sched_dep_time`

).)

library("nycflights13") flights_jfk <- nycflights13::flights %>% filter(origin == "JFK") %>% mutate(hh = round(sched_dep_time / 100, 0) - 1) %>% mutate(ymd = lubridate::ymd(sprintf("%04.0f-%02.0f-%02.0f", year, month, day))) %>% mutate(wd = lubridate::wday(ymd, label = TRUE))

Next, I might create a heat map plotting hours against weekdays.

To investigate the patterns more “scientifically”, I might perform a

one-way Analysis of Variance (ANOVA) on different time variables. I

would make sure to test time periods other than just weekday (`wd`

) and

hour (`hh`

), such as `month`

and `day`

.

summary(aov(dep_delay ~ month, data = flights_jfk))

## Df Sum Sq Mean Sq F value Pr(>F) ## month 1 55213 55213 36.25 1.74e-09 *** ## Residuals 109414 166664445 1523 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## 1863 observations deleted due to missingness

summary(aov(dep_delay ~ day, data = flights_jfk))

## Df Sum Sq Mean Sq F value Pr(>F) ## day 1 40 40.2 0.026 0.871 ## Residuals 109414 166719617 1523.8 ## 1863 observations deleted due to missingness

summary(aov(dep_delay ~ wd, data = flights_jfk))

## Df Sum Sq Mean Sq F value Pr(>F) ## wd 6 246401 41067 26.99 <2e-16 *** ## Residuals 109409 166473256 1522 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## 1863 observations deleted due to missingness

summary(aov(dep_delay ~ hh, data = flights_jfk))

## Df Sum Sq Mean Sq F value Pr(>F) ## hh 1 5701754 5701754 3874 <2e-16 *** ## Residuals 109414 161017904 1472 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## 1863 observations deleted due to missingness

The “statistically significant” p values for the `wd`

and `hh`

variables

provides incentive to investigate them more closely. I might try an

ANOVA F-statistic test comparing linear regression models using each

variable as a lone predictor with a linear model where both are used as

predictors.

lm_wd <- lm(dep_delay ~ wd, data = flights_jfk) lm_hh <- lm(dep_delay ~ hh, data = flights_jfk) lm_both <- lm(dep_delay ~ wd + hh, data = flights_jfk) anova(lm_both, lm_wd, test = "F")

## Analysis of Variance Table ## ## Model 1: dep_delay ~ wd + hh ## Model 2: dep_delay ~ wd ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 109408 160778456 ## 2 109409 166473256 -1 -5694800 3875.2 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm_both, lm_hh, test = "F")

## Analysis of Variance Table ## ## Model 1: dep_delay ~ wd + hh ## Model 2: dep_delay ~ hh ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 109408 160778456 ## 2 109414 161017904 -6 -239447 27.157 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Of course, this process would go on. For instance, I would certainly

need to investigate if there is a relationship between departure delay

and specific airlines.

## Final Thoughts

Working with interval data was initially a challenge for me, but after

working with it more and more often, I find that it’s not so bad after

all. It gets more interesting when there is missing data or data samples

at irregular intervals, but that’s a story for another day.

- The technique that I show here would have to be adjusted slightly if working with more than one year at a time, but it wouldn’t be difficult to do so. I tend to only use this design pattern for one year at a time anyways.

^{^} - By the way, how cool is following one
`dplyr::count()`

call with another? I only found out about that nice little trick recently.

^{^} - Thanks to Hadley Wickham for compiling this data package, as well as for developing the
`{lubridate}`

and {ggplot2}` packages.)

^{^}

**leave a comment**for the author, please follow the link and comment on their blog:

**r on Tony ElHabr**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.