Dealing with Interval Data and the nycflights13 package using R
In my job, I often work with data sampled at regular intervals. Samples may range from 5-minute intervals to daily intervals, depending on the specific task. While working with this kind of data is straightforward when it’s in a database (and I can use SQL), I have been in a couple of situations where the data is spread across .csv files. In these cases, I lean on R to scrape and compile the data. Although some other languages might be superior for such a task, I often need to produce some kind of visualization or report at the end, so choosing to handle all of the data with R is a no-brainer for me: I can easily transition to using {ggplot2} and {rmarkdown} to generate some pretty output.
Minute-Level Data
When working with data sampled at 5-minute intervals (resulting in 288 intervals in a day), I’ve found myself using a common “idiom” to generate a time-based “dictionary”. For example, here’s how I might create such a date dictionary for the year 2013. 1
library("dplyr") library("lubridate") date_1 <- lubridate::ymd("2013-01-01") date_2 <- lubridate::ymd("2013-12-31") ymd_seq <- seq.Date(date_1, date_2, by = "day") ymd_grid <- data_frame( ymd = lubridate::ymd(ymd_seq), yyyy = lubridate::year(ymd_seq), mm = lubridate::month(ymd_seq), dd = lubridate::day(ymd_seq) ) hhmmss <- expand.grid( hh = seq(1L, 24L, 1L), min = seq(5L, 60L, by = 5L), sec = 0L) %>% as_tibble() dates_dict <- ymd_grid %>% right_join(hhmmss %>% mutate(yyyy = lubridate::year(date_1), by = "year")) %>% arrange(yyyy, mm, dd, hh, min) dates_dict ## # A tibble: 105,120 x 8 ## ymd yyyy mm dd hh min sec by ## <date> <dbl> <dbl> <int> <int> <int> <int> <chr> ## 1 2013-01-01 2013 1 1 1 5 0 year ## 2 2013-01-01 2013 1 1 1 10 0 year ## 3 2013-01-01 2013 1 1 1 15 0 year ## 4 2013-01-01 2013 1 1 1 20 0 year ## 5 2013-01-01 2013 1 1 1 25 0 year ## 6 2013-01-01 2013 1 1 1 30 0 year ## 7 2013-01-01 2013 1 1 1 35 0 year ## 8 2013-01-01 2013 1 1 1 40 0 year ## 9 2013-01-01 2013 1 1 1 45 0 year ## 10 2013-01-01 2013 1 1 1 50 0 year ## # ... with 1.051e+05 more rows
And, just to prove that there are 288 5-minute intervals for each day in the year, I can use dplyr::count() twice in succession. 2
dates_dict %>% count(yyyy, mm, dd) %>% count() ## # A tibble: 1 x 1 ## nn ## <int> ## 1 365
I then extract data from the individual files, using the time-based dictionary as a “helper” for custom functions that create file paths and process the data after importing it.
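As a rough illustration, here’s a minimal sketch of what such helpers might look like. The directory name, the one-.csv-per-day naming scheme, and the create_filepath() function are all hypothetical, assumed purely for the sake of the example (and {readr} is assumed for the import):

# Hypothetical helper: map a date to a file path. The directory and the
# one-file-per-day naming scheme are assumptions made for illustration.
create_filepath <- function(ymd, dir = "data") {
  file.path(dir, paste0(format(ymd, "%Y-%m-%d"), ".csv"))
}

# Build one path per distinct date in the dictionary, import each file,
# and stack the results into a single data frame.
paths <- dates_dict %>% distinct(ymd) %>% pull(ymd) %>% create_filepath()
data_all <- paths %>% lapply(readr::read_csv) %>% bind_rows()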
After the dirty work is done, I can transition to the fun part: exploring and interpreting the data. This process often turns out to be a cycle of visualization, data transformation, and modeling.
An Example (With the nycflights13 Package)
To provide an example, I’ll use the flights data set from the {nycflights13} package. 3 This package includes information regarding all flights leaving from New York City airports in 2013, as well as information regarding weather, airlines, airports, and planes.
Let’s say that I’m interested in the average flight departure delay at the JFK airport. I might hypothesize that there is a relationship between departure delay and different time periods, such as hour of the day and day of the week.
First, I’ll perform the necessary transformations to the flights data to investigate my hypothesis. Specifically, I need to create columns for hour and weekday. (For hour (hh), I simply use the scheduled departure time (sched_dep_time).)
library("nycflights13") flights_jfk <- nycflights13::flights %>% filter(origin == "JFK") %>% mutate(hh = round(sched_dep_time / 100, 0) - 1) %>% mutate(ymd = lubridate::ymd(sprintf("%04.0f-%02.0f-%02.0f", year, month, day))) %>% mutate(wd = lubridate::wday(ymd, label = TRUE))
Next, I might create a heat map plotting hours against weekdays.
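Here’s a minimal sketch of what that heat map might look like with {ggplot2}; summarising the mean delay by weekday and hour before plotting with geom_tile() is my assumption of a reasonable approach, not necessarily the original plot:

library("ggplot2")

# Average the departure delay within each weekday-by-hour cell, then
# map the cell means to fill color.
flights_jfk %>%
  group_by(wd, hh) %>%
  summarise(dep_delay_mean = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = hh, y = wd, fill = dep_delay_mean)) +
  geom_tile() +
  labs(x = "hour of scheduled departure", y = "day of week", fill = "mean delay (min)")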
To investigate the patterns more “scientifically”, I might perform a one-way Analysis of Variance (ANOVA) on different time variables. I would make sure to test time periods other than just weekday (wd) and hour (hh), such as month and day.
summary(aov(dep_delay ~ month, data = flights_jfk))
##                 Df    Sum Sq Mean Sq F value   Pr(>F)    
## month            1     55213   55213   36.25 1.74e-09 ***
## Residuals   109414 166664445    1523                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
summary(aov(dep_delay ~ day, data = flights_jfk))
##                 Df    Sum Sq Mean Sq F value Pr(>F)
## day              1        40    40.2   0.026  0.871
## Residuals   109414 166719617  1523.8              
## 1863 observations deleted due to missingness
summary(aov(dep_delay ~ wd, data = flights_jfk))
##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## wd               6    246401   41067   26.99 <2e-16 ***
## Residuals   109409 166473256    1522                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
summary(aov(dep_delay ~ hh, data = flights_jfk))
##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## hh               1   5701754 5701754    3874 <2e-16 ***
## Residuals   109414 161017904    1472                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
The “statistically significant” p values for the wd and hh variables provide incentive to investigate them more closely. I might try an ANOVA F-test comparing the linear regression models that use each variable as a lone predictor with a linear model where both are used as predictors.
lm_wd <- lm(dep_delay ~ wd, data = flights_jfk) lm_hh <- lm(dep_delay ~ hh, data = flights_jfk) lm_both <- lm(dep_delay ~ wd + hh, data = flights_jfk) anova(lm_both, lm_wd, test = "F") ## Analysis of Variance Table ## ## Model 1: dep_delay ~ wd + hh ## Model 2: dep_delay ~ wd ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 109408 160778456 ## 2 109409 166473256 -1 -5694800 3875.2 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 anova(lm_both, lm_hh, test = "F") ## Analysis of Variance Table ## ## Model 1: dep_delay ~ wd + hh ## Model 2: dep_delay ~ hh ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 109408 160778456 ## 2 109414 161017904 -6 -239447 27.157 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Of course, this process would go on. For instance, I would certainly need to investigate whether there is a relationship between departure delay and specific airlines.
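As a starting point for that question, here’s a minimal sketch (my own assumption of a first pass, not part of the original analysis), using the carrier column in flights and the airlines lookup table from the package:

# Rank carriers at JFK by mean departure delay.
flights_jfk %>%
  inner_join(nycflights13::airlines, by = "carrier") %>%
  group_by(name) %>%
  summarise(dep_delay_mean = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(dep_delay_mean))

# And, in the same spirit as before, a one-way ANOVA on the carrier factor.
summary(aov(dep_delay ~ carrier, data = flights_jfk))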
Final Thoughts
Working with interval data was initially a challenge for me, but after working with it more and more often, I find that it’s not so bad after all. It gets more interesting when there is missing data or data sampled at irregular intervals, but that’s a story for another day.
1. The technique that I show here would have to be adjusted slightly if working with more than one year at a time, but it wouldn’t be difficult to do so. I tend to only use this design pattern for one year at a time anyways.
2. By the way, how cool is following one dplyr::count() call with another? I only found out about that nice little trick recently.
3. Thanks to Hadley Wickham for compiling this data package, as well as for developing the {lubridate} and {ggplot2} packages.