# Dealing with Interval Data and the nycflights13 package using R

**r on Tony ElHabr**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my job, I often work with data sampled at regular intervals. Samples

may range from 5-minute intervals to daily intervals, depending on the

specific task. While working with this kind of data is straightforward

when its in a database (and I can use SQL), I have been in a couple of

situations where the data is spread across .csv files. In these cases, I

lean on `R`

to scrape and compile the data. Although some other

languages might be superior for such a task, I often need to produce

some kind of visualization or report at the end, so choosing to handle

all of the data with `R`

is a no-brainer for me–I can easily transition

to using `{ggplot2}`

and `{rmarkdown}`

to generate some pretty output.

## Minute-Level Data

When working with data sampled at 5-minute intervals (resulting in 288

intervals in a day), I’ve found that I’ve used a common “idiom” to

generate a time-based “dictionary”. For example, here’s how I might

create such a date dictionary for the year 2013. ^{1}

```
library("dplyr")
library("lubridate")
date_1 <- lubridate::ymd("2013-01-01")
date_2 <- lubridate::ymd("2013-12-31")
ymd_seq <- seq.Date(date_1, date_2, by = "day")
ymd_grid <-
data_frame(
ymd = lubridate::ymd(ymd_seq),
yyyy = lubridate::year(ymd_seq),
mm = lubridate::month(ymd_seq),
dd = lubridate::day(ymd_seq)
)
hhmmss <-
expand.grid(
hh = seq(1L, 24L, 1L),
min = seq(5L, 60L, by = 5L),
sec = 0L) %>%
as_tibble()
dates_dict <-
ymd_grid %>%
right_join(hhmmss %>% mutate(yyyy = lubridate::year(date_1), by = "year")) %>%
arrange(yyyy, mm, dd, hh, min)
dates_dict
```

```
## # A tibble: 105,120 x 8
## ymd yyyy mm dd hh min sec by
##
```
## 1 2013-01-01 2013 1 1 1 5 0 year
## 2 2013-01-01 2013 1 1 1 10 0 year
## 3 2013-01-01 2013 1 1 1 15 0 year
## 4 2013-01-01 2013 1 1 1 20 0 year
## 5 2013-01-01 2013 1 1 1 25 0 year
## 6 2013-01-01 2013 1 1 1 30 0 year
## 7 2013-01-01 2013 1 1 1 35 0 year
## 8 2013-01-01 2013 1 1 1 40 0 year
## 9 2013-01-01 2013 1 1 1 45 0 year
## 10 2013-01-01 2013 1 1 1 50 0 year
## # ... with 1.051e+05 more rows

And, just to prove that there are 288 5-minute intervals for each day in

the year, I can use `dplyr::count()`

twice in succession. ^{2}

```
dates_dict %>% count(yyyy, mm, dd) %>% count()
```

```
## # A tibble: 1 x 1
## nn
##
```
## 1 365

I then extract data (from individual files) using the time-based

dictionary as a “helper” for custom functions for creating file paths

and processing the data after importing it.

After the dirty work of is done, I can transition to the fun

part–exploring and interpreting the data. This process often turns out

to be a cycle of visualization, data transformation, and modeling.

class=“section level3”>

### An Example (With the `nycflights13`

Package)

To provide an example, I’ll use the `flights`

data set from the

`{nycflight13}`

package. ^{3} This package includes

information regarding all flights leaving from New York City airports in

2013, as well as information regarding weather, airlines, airports, and

planes.

Let’s say that I that I’m interested in the average flight departure

delay time at the JFK airport. I might hypothesize that there is a

relationship between departure delay time with different time periods,

such as hour in the day and days in the week.

First, I’ll perform the necessary transformation to the `flights`

data

to investigate my hypothesis. Specifically, I need to create columns for

hour and weekday. (For hour (`hh`

), I simply use the scheduled departure

time (`sched_dep_time`

).)

```
library("nycflights13")
flights_jfk <-
nycflights13::flights %>%
filter(origin == "JFK") %>%
mutate(hh = round(sched_dep_time / 100, 0) - 1) %>%
mutate(ymd = lubridate::ymd(sprintf("%04.0f-%02.0f-%02.0f", year, month, day))) %>%
mutate(wd = lubridate::wday(ymd, label = TRUE))
```

Next, I might create a heat map plotting hours against weekdays.

To investigate the patterns more “scientifically”, I might perform a

one-way Analysis of Variance (ANOVA) on different time variables. I

would make sure to test time periods other than just weekday (`wd`

) and

hour (`hh`

), such as `month`

and `day`

.

```
summary(aov(dep_delay ~ month, data = flights_jfk))
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## month 1 55213 55213 36.25 1.74e-09 ***
## Residuals 109414 166664445 1523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

```
summary(aov(dep_delay ~ day, data = flights_jfk))
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## day 1 40 40.2 0.026 0.871
## Residuals 109414 166719617 1523.8
## 1863 observations deleted due to missingness
```

```
summary(aov(dep_delay ~ wd, data = flights_jfk))
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## wd 6 246401 41067 26.99 <2e-16 ***
## Residuals 109409 166473256 1522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

```
summary(aov(dep_delay ~ hh, data = flights_jfk))
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## hh 1 5701754 5701754 3874 <2e-16 ***
## Residuals 109414 161017904 1472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

The “statistically significant” p values for the `wd`

and `hh`

variables

provides incentive to investigate them more closely. I might try an

ANOVA F-statistic test comparing linear regression models using each

variable as a lone predictor with a linear model where both are used as

predictors.

```
lm_wd <- lm(dep_delay ~ wd, data = flights_jfk)
lm_hh <- lm(dep_delay ~ hh, data = flights_jfk)
lm_both <- lm(dep_delay ~ wd + hh, data = flights_jfk)
anova(lm_both, lm_wd, test = "F")
```

```
## Analysis of Variance Table
##
## Model 1: dep_delay ~ wd + hh
## Model 2: dep_delay ~ wd
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109408 160778456
## 2 109409 166473256 -1 -5694800 3875.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
anova(lm_both, lm_hh, test = "F")
```

```
## Analysis of Variance Table
##
## Model 1: dep_delay ~ wd + hh
## Model 2: dep_delay ~ hh
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109408 160778456
## 2 109414 161017904 -6 -239447 27.157 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Of course, this process would go on. For instance, I would certainly

need to investigate if there is a relationship between departure delay

and specific airlines.

## Final Thoughts

Working with interval data was initially a challenge for me, but after

working with it more and more often, I find that it’s not so bad after

all. It gets more interesting when there is missing data or data samples

at irregular intervals, but that’s a story for another day.

- The technique that I show here would have to be adjusted slightly if working with more than one year at a time, but it wouldn’t be difficult to do so. I tend to only use this design pattern for one year at a time anyways.

^{^} - By the way, how cool is following one
`dplyr::count()`

call with another? I only found out about that nice little trick recently.

^{^} - Thanks to Hadley Wickham for compiling this data package, as well as for developing the
`{lubridate}`

and {ggplot2}` packages.)

^{^}

**leave a comment**for the author, please follow the link and comment on their blog:

**r on Tony ElHabr**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.