During eRum 2016, Adam Zagdański gave a very good tutorial about time series modeling. Among other things, I learned that the forecast package (created by Rob Hyndman) got cool new plots based on the ggplot2 package.
Let’s use it to play with mailbox statistics for my gmail account!
1. Get the data
Follow this link to download the data from your gmail account as a single mbox file.
It may be large (15GB in my case), but for further steps it’s enough to keep only headers.
grep + cat will do the job.
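If you prefer to stay in R for this step, the same filtering can be sketched as below. This is only an illustration, not the command used for this post: the function name `extract_date_headers` and the chunk size are my own choices, and it assumes that only the `Date:` header lines are needed later.

```r
# Keep only the "Date:" header lines of an mbox file.
# The file is read in chunks, so it never has to fit in memory at once.
# (Hypothetical helper, written for illustration.)
extract_date_headers <- function(mbox_path, chunk_size = 100000L) {
  con <- file(mbox_path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break
    # useBytes = TRUE avoids errors on lines with invalid encodings
    kept <- c(kept, grep("^Date: ", chunk, value = TRUE, useBytes = TRUE))
  }
  kept
}
```

Note that a body line which happens to start with `Date: ` would be kept as well, so treat the result as approximate.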
2. Read headers
The readLines() function can handle the headers. Then the lubridate package is useful to extract and convert dates to the R format.
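To show what the conversion involves, here is a rough sketch in base R (not the lubridate call used for this post). It assumes headers of the form `Date: Mon, 2 May 2016 10:15:30 +0200`, keeps only the calendar date as written by the sender and ignores the time zone offset. The helper name `parse_mail_date` is made up for this example.

```r
# Turn "Date:" header lines into Date objects.
# Assumption: the header contains "<day> <Mon> <year>" with an English
# three-letter month name; lines that do not match become NA.
parse_mail_date <- function(header_lines) {
  pattern <- "([0-9]{1,2}) ([A-Za-z]{3}) ([0-9]{4})"
  hits <- regmatches(header_lines, regexec(pattern, header_lines))
  to_iso <- function(parts) {
    if (length(parts) != 4) return(NA_character_)
    # month.abb is a locale-independent constant: "Jan", "Feb", ...
    month <- match(parts[3], month.abb)
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", parts[4], month, as.integer(parts[2]))
  }
  as.Date(vapply(hits, to_iso, character(1)), format = "%Y-%m-%d")
}

dates <- parse_mail_date(c("Date: Mon, 2 May 2016 10:15:30 +0200",
                           "Date: not a real date"))
```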
3. Basic gg-exploration
I’ve started with daily aggregates – number of emails per day.
The ts() function converts a vector of aggregates to a time series object.
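A minimal sketch of the aggregation, using three made-up dates. The helper `count_per_day` is hypothetical, and `frequency = 7` is my assumption (one seasonal cycle per week), not necessarily the setting used for the plots in this post.

```r
# Count emails per day, including days without any email (count 0).
count_per_day <- function(dates) {
  all_days <- seq(min(dates), max(dates), by = "day")
  counts <- table(factor(format(dates), levels = format(all_days)))
  data.frame(day = all_days, n = as.integer(counts))
}

# Toy input: two emails on May 1st, none on May 2nd, one on May 3rd
dates <- as.Date(c("2016-05-01", "2016-05-01", "2016-05-03"))
daily <- count_per_day(dates)

# Assumed weekly seasonality, hence frequency = 7
emails_ts <- ts(daily$n, frequency = 7)
```

Filling the gaps with zeros matters here: ts() assumes equally spaced observations, so a skipped day would silently shift everything after it.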
Then I used the autoplot() function to plot the time series. Since it’s a ggplot2 plot, you can easily add a smooth trend to it as an extra layer.
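For illustration, one way to do this is sketched below on simulated counts (the real mailbox data is not reproduced here). The choice of `geom_smooth()` with a loess smoother is my assumption about a reasonable layer to add.

```r
library(forecast)
library(ggplot2)

# Simulated daily counts, standing in for the real mailbox data
set.seed(1)
emails_ts <- ts(rpois(140, lambda = 80), frequency = 7)

# autoplot() returns a ggplot object, so further layers can be added
p <- autoplot(emails_ts) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = "Week", y = "Emails per day")
print(p)
```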
There is some trend, but what about seasonality?
geom_boxplot() is useful to check whether there are differences among days of the week or months.
It turns out that the number of emails per day is very different for weekdays and weekends.
August is also the lightest month for email: on average, only 60 emails per day.
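A boxplot by day of the week could look like the sketch below. The data is simulated (with deliberately fewer emails on weekends), and the column names `day`, `n` and `weekday` are made up for this example.

```r
library(ggplot2)

# Simulated daily counts with quieter weekends
set.seed(1)
days <- seq(as.Date("2016-01-04"), by = "day", length.out = 140)
wday <- as.POSIXlt(days)$wday  # 0 = Sunday, ..., 6 = Saturday
daily <- data.frame(
  day = days,
  n = rpois(140, lambda = ifelse(wday %in% c(0, 6), 20, 90))
)

# Fixed English labels, so the plot does not depend on the system locale
daily$weekday <- factor(wday, levels = c(1:6, 0),
                        labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

p <- ggplot(daily, aes(x = weekday, y = n)) +
  geom_boxplot() +
  labs(x = "Day of week", y = "Emails per day")
print(p)
```

The same pattern works for months: build a factor from `months(daily$day)` or `format(daily$day, "%m")` and put it on the x axis.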
4. Time Series
Decomposition plots drawn with autoplot() show the trend and seasonal components extracted from the time series. A multiplicative seasonal component is probably more appropriate here, but the additive one is presented below since its values are easier to read off the y axis.
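As a sketch of how such a decomposition can be produced, the example below uses `decompose()` from base R on simulated data with a weekly pattern. Whether this post used `decompose()`, `stl()` or something else is not stated, so treat the choice as an assumption.

```r
library(forecast)

# Simulated series: weekly pattern + slow upward trend + noise
set.seed(1)
weekly_pattern <- rep(c(90, 95, 100, 95, 85, 25, 20), times = 20)
emails_ts <- ts(weekly_pattern + 0.1 * seq_along(weekly_pattern) +
                  rnorm(140, sd = 5),
                frequency = 7)

# Additive decomposition: series = trend + seasonal + remainder
dec <- decompose(emails_ts, type = "additive")

# forecast provides an autoplot() method for decomposed series
p <- autoplot(dec)
print(p)
```

Switching to `type = "multiplicative"` gives the multiplicative variant; the seasonal values are then factors around 1 rather than counts around 0.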
A lot of models can be fitted with the forecast package. Of the different choices, the scariest one is the forecast from the Holt method. Scary because of the trend.
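A Holt forecast can be sketched as follows, again on simulated data with an upward trend. The 28-day horizon is an arbitrary choice for this example.

```r
library(forecast)

# Simulated daily counts with a steady upward trend
set.seed(1)
emails_ts <- ts(60 + 0.2 * seq_len(140) + rnorm(140, sd = 5), frequency = 7)

# Holt's linear trend method: the estimated trend is extrapolated
# over the whole horizon (h = 28 days, chosen for illustration)
fc <- holt(emails_ts, h = 28)
print(autoplot(fc))
```

Holt's method has no seasonal component, so the weekday/weekend pattern is ignored; passing `damped = TRUE` to `holt()` flattens the trend over the horizon and makes the forecast a bit less scary.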