Preparing Datetime Data for Analysis with padr and dplyr

Posted on March 19, 2017 by That’s so Random in R bloggers | 0 Comments

[This article was first published on That’s so Random, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Two months ago padr was introduced, followed by an improved version that allowed for applying pad on group level. See the introduction blogs or the vignette("padr") for more package information. In this blog I give four more elaborate examples on how to go from raw data to insight with padr, dplyr and ggplot2. They might serve as recipes for time series problems you want to solve. The dataset emergency is available in padr, and contains about a year of emergency data in Montgomery County, PA. For this blog I only use the twelve most prevalent emergencies.

library(tidyverse)
library(padr)
top12 <- emergency %>% count(title) %>% arrange(desc(n)) %>% slice(1:12) %>% 
  select(title) %>% inner_join(emergency)

1) A proportion plot over the months

Lets start with a fairly simple problem, we are calculating the relative distribution of events within each month. We normalize the counts within each month, and then plot through time. geom_bar does most of the heavy lifiting here, we just need the months counts for each event. (I checked, all events have calls in each month, there is no need for padding.)

top12 %>% thicken('month', col = 'month') %>% 
  count(title, month) %>%
  ggplot(aes(month, n)) +
  geom_bar(aes(fill = title), 
           col = "black",
           stat = 'identity', 
           position = "fill")

2) Get a distribution, per event, per time unit

For each clock hour, for each event, we want to compute the average of the number of occurences. The first step is to make the hourly distribution out of the raw data. For every hour we need to count the number of calls per event. This would give no records if there were no calls within a givin hour. If we did not account for this, the calculated averages would be too high. (Recently called “the zero-bug” in this blog post). Therefore we use pad and fill_by_value to insert records and give them the value 0.

hourly_distribution <- top12 %>% thicken('hour') %>% 
  count(title, time_stamp_hour) %>% 
  pad(group = 'title') %>% fill_by_value(n)

Next step is extracting the clock hour from each timestamp, and calculate the mean per event, per clock hour.

hourly_distribution %>% 
  mutate(clock_hour = lubridate::hour(time_stamp_hour) ) %>% 
  group_by(clock_hour, title) %>% 
  summarise(mn = mean(n)) %>% 
  ggplot(aes(clock_hour, mn)) +
  geom_line() + 
  facet_wrap(~title, scales = 'free')

3) Get the 30th observation per time unit

For this dataset it might not be terribly interesting, but there are many situations were it is very useful to extract the first, last, or nth observation per time unit. Lets find the 30th observation per week, where weeks are starting on Mondays. We need an offset in thicken here, to make the weeks start on Mondays.

first_day <- top12$time_stamp %>% as.Date %>% min
first_day %>% weekdays

## [1] "Thursday"

top12 %>% thicken('week', start_val = first_day - 3) %>% 
  group_by(title, time_stamp_week) %>% 
  filter(min_rank(time_stamp) == 30) %>% 
  mutate(weekday = weekdays(time_stamp) %>% 
      factor(levels = (c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
                         'Saturday', 'Sunday') %>% rev) )) %>%
  ggplot(aes(weekday)) +
  geom_bar(aes(fill = title)) +
  coord_flip() +
  guides(fill = FALSE) +
  facet_wrap(~title)

So here we have a distribution on which day the 30th call came in, for each of the twelve events.

4) Make a moving average

Moving averages are very helpful for smoothing time series. It is often a better indication of the underlying trend than the raw data. I recently learned about the RcppRoll package, when I was browsing through R for Data Science. This is a nice package by Kevin Ushey, that makes it terribly easy to calculate rolling stats on a vector. Here we want the moving average of the daily count of each of the events.

top12 %>% thicken('day', col = 'day') %>% 
  count(day, title) %>% 
  pad(group = 'title') %>% fill_by_value(n) %>% 
  group_by(title) %>% 
  mutate(moving_average = RcppRoll::roll_meanr(x = n, n = 30, fill = NA)) %>% 
  ggplot(aes(day, n)) + 
  geom_line() +
  geom_line(aes(y = moving_average), col = 'red') +
  facet_wrap(~title, scales = 'free')

There we have some recipes for preparing data containing timestamp for analysis. Do you have a time series-related challenge not adressed here? I would love to hear from you, and try to figure out an effective way for solving it! Just send an email or post a comment.

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Preparing Datetime Data for Analysis with padr and dplyr

1) A proportion plot over the months

2) Get a distribution, per event, per time unit

3) Get the 30th observation per time unit

4) Make a moving average

Related

1) A proportion plot over the months

2) Get a distribution, per event, per time unit

3) Get the 30th observation per time unit

4) Make a moving average

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)