Today we are introducing
tibbletime v0.0.2, and we’ve got a ton of new features in store for you. We have functions for converting to flexible time periods with the
~period formula~ and making/calculating custom rolling functions with
rollify() (plus a bunch more new functionality!). We’ll take the new functionality for a spin with some weather data (from the
weatherData package). However, the new tools make
tibbletime useful in a number of broad applications such as forecasting, financial analysis, business analysis and more! We truly view
tibbletime as the next phase of time series analysis in the
tidyverse. If you like what we do, please connect with us on social media to stay up on the latest Business Science news, events and information!
We are excited to announce the release of
tibbletime v0.0.2 on CRAN. Loads of new
functionality has been added, including:
Generic period support: Perform time-based calculations by a number of supported periods using a new ~period formula~.
Creating series: Use create_series() to quickly create a tbl_time object initialized with a regular time series.
Rolling calculations: Turn any function into a rolling version of itself with rollify().
A number of smaller tweaks and helper functions to make life easier.
As we further develop
tibbletime, it is becoming clearer that the package
is a tool that should be used in addition to the rest of the tidyverse.
The combination of the two makes time series analysis in the tidyverse much easier to do!
In this post
Today we will take a look at weather data for New York and San
Francisco from 2013. It will be an exploratory analysis
to show off some of the new features in
tibbletime, but the package
itself has much broader application. As we will see, its
functionality can be a valuable data manipulation tool to help with:
Product and sales forecasting
Financial analysis with custom rolling functions
Grouping data into time buckets to analyze change over time, which is great for any part of an organization including sales, marketing, manufacturing, and HR!
Data and packages
The datasets used are from a neat package called weatherData. While
weatherData has functionality to pull weather data for a number of cities, we will use the built-in datasets. We encourage you to explore the
weatherData API if you’re interested in collecting weather data.
To get started, load the following packages:
tibbletime: Time-aware tibbles for the tidyverse
tidyverse: Loads the core tidyverse packages, including dplyr, tidyr, purrr, and ggplot2
corrr: Tidy correlations
weatherData: Slick package for getting weather data
Also, load the datasets from
weatherData, “NewYork2013” and “SFO2013”.
Combine and convert
To tidy up, we first join our data sets together using
bind_rows(). Passing a named list of tibbles along with specifying the
.id argument allows
bind_rows() to create a new
City reference column for us.
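As a rough illustration of that step, here is a minimal sketch using dplyr::bind_rows() on two toy data frames. The object names (nyc, sfo, weather_toy), the column names, and the values are made up for the example and are not the weatherData datasets.

```r
library(dplyr)

# Toy stand-ins for the two city datasets (hypothetical values)
nyc <- data.frame(
  Time = as.POSIXct(c("2013-01-01 00:51:00", "2013-01-01 01:51:00"), tz = "UTC"),
  Temperature = c(41, 39.9)
)
sfo <- data.frame(
  Time = as.POSIXct(c("2013-01-01 00:56:00", "2013-01-01 01:56:00"), tz = "UTC"),
  Temperature = c(48.9, 46.9)
)

# The names of the list become the values of the new "City" column
weather_toy <- bind_rows(list(NYC = nyc, SFO = sfo), .id = "City")
weather_toy
```

Because the city label comes from the list names, adding a third city only requires adding another named element to the list.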
Next, we will convert to
tbl_time and group by our
City variable. Note that we know this is a
tbl_time object by the
Index: Time line that gets printed along with the tibble.
The first new idea to introduce is the
~period formula~. This tells the
tibbletime functions how you want to time-group your data. It is specified as
multiple ~ period, with examples being
1~d for “every 1 day,” and
4~m for “every 4 months.”
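An expression such as 4~m is an ordinary R formula, so base R can pull the two sides apart. The sketch below only shows that idea; it is not tibbletime's own parsing code, and the variable names are illustrative.

```r
# A period formula written as multiple ~ period
period <- 4 ~ m

multiple <- period[[2]]                # left-hand side: the number 4
unit     <- as.character(period[[3]])  # right-hand side: the symbol m, as a string

multiple
unit
```

The right-hand side never has to exist as a variable, because a formula stores it unevaluated.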
In our original data, it looks like
weather is an hourly dataset, with each new
data point coming in on the 51st minute of the hour for NYC and the 56th minute
for SFO. Unfortunately, a number of points don’t follow this. Check out the following rows:
What we want is 1 row per hour, and in this case we get two rows for NYC hour 12.
We can use
as_period() to ensure that we only have 1 row for each hour.
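To make the "one row per hour" idea concrete, here is a base R sketch on toy data that keeps the first observation seen in each hour. It does not call as_period(); keeping the first row rather than the last is an assumption made for the example, and the values are invented.

```r
# Toy readings with one extra observation inside hour 12 (hypothetical values)
obs <- data.frame(
  Time = as.POSIXct(
    c("2013-01-01 11:51:00", "2013-01-01 12:15:00",
      "2013-01-01 12:51:00", "2013-01-01 13:51:00"),
    tz = "UTC"
  ),
  Temperature = c(40, 41, 42, 43)
)

# Label each row with its date and hour, then keep the first row per label
hour_label <- format(obs$Time, "%Y-%m-%d %H")
hourly <- obs[!duplicated(hour_label), ]
hourly
```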
Now that we have our data in an hourly format, we probably don’t care about
which minute it came in on. We can floor the date column using a helper function,
time_floor(). Credit to Hadley Wickham because this is essentially a convenient wrapper around
lubridate::floor_date(). Setting the period to 1~h floors
each row to the beginning of the last hour.
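Flooring to the hour can be sketched in base R with trunc(). This is only an illustration of what flooring does to a timestamp, not the time_floor() implementation, and the timestamps are made up.

```r
times <- as.POSIXct(c("2013-01-01 00:51:00", "2013-01-01 12:56:00"), tz = "UTC")

# trunc() drops everything below the hour; convert back to POSIXct afterwards
floored <- as.POSIXct(trunc(times, units = "hours"))
format(floored, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```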
Visualize the data
Now that we have cleaned up a bit, let’s visualize the data.
Seems like hourly data is a bit overwhelming for this kind of chart. Let’s convert to daily and try again.
That’s better. It looks like NYC has a much wider range of temperatures than SFO. Both seem to be hotter in summer months.
The dplyr::summarise() function is very useful for taking grouped summaries.
time_summarise() takes this a step further by allowing you to summarise by period.
Below we take a look at the average and standard deviation of the temperatures calculated at monthly and bimonthly intervals.
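The idea behind a period summary is "bucket by time, then summarise within each bucket". A minimal base R sketch of a monthly mean and standard deviation follows; it uses toy values and hypothetical column names, and it does not use time_summarise().

```r
# Toy daily temperatures across two months (hypothetical values)
daily <- data.frame(
  Date = as.Date(c("2013-01-10", "2013-01-20", "2013-02-10", "2013-02-20")),
  Temperature = c(30, 40, 35, 45)
)

# Bucket each date by month, then summarise within each bucket
month <- format(daily$Date, "%Y-%m")
monthly <- data.frame(
  Month = sort(unique(month)),
  Mean  = as.vector(tapply(daily$Temperature, month, mean)),
  SD    = as.vector(tapply(daily$Temperature, month, sd))
)
monthly
```

A bimonthly version would only change how the bucket label is computed.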
A closer look at July
July seemed to be one of the hottest months for NYC, so let’s take a closer look at it.
To just grab July dates, use
time_filter(). If you haven’t seen this before, a
time formula is used to specify the dates to filter for. The one-sided formula below expands to include all dates in the range
2013-07-01 00:00:00 ~ 2013-07-31 23:59:59.
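The expansion of a month shorthand into a start and end timestamp can be sketched in base R. The helper month_bounds() below is hypothetical and written only for this illustration; it is not part of tibbletime.

```r
# Expand a "YYYY-MM" shorthand into the first and last second of that month
month_bounds <- function(ym, tz = "UTC") {
  start <- as.POSIXct(paste0(ym, "-01 00:00:00"), tz = tz)
  next_month <- seq(start, by = "month", length.out = 2)[2]
  list(start = start, end = next_month - 1)  # one second before the next month
}

b <- month_bounds("2013-07")
format(b$end, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Filter toy timestamps to those that fall inside the bounds
times <- as.POSIXct(
  c("2013-06-30 23:59:59", "2013-07-15 12:00:00", "2013-08-01 00:00:00"),
  tz = "UTC"
)
in_july <- times[times >= b$start & times <= b$end]
in_july
```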
To visualize July’s weather, we will make a boxplot of the temperatures.
Specifically, we will slice July into intervals of 2 days, and create a series
of boxplots based on the data inside those intervals. To do this, we will use
time_collapse(), which collapses a column of dates into a column of the same
length, but where every row in a time interval shares the same date. You can use this resulting
column for further grouping or labeling operations.
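Here is a base R sketch of collapsing timestamps into 2-day intervals. It is an illustration under assumptions, not the time_collapse() implementation: the buckets are counted from the first day in the data, and each row is assigned the latest timestamp in its bucket. Both choices are assumptions made for the example.

```r
# Toy timestamps spanning four days (hypothetical values)
times <- as.POSIXct(
  c("2013-07-01 06:00:00", "2013-07-02 18:00:00",
    "2013-07-03 06:00:00", "2013-07-04 18:00:00"),
  tz = "UTC"
)

# Assign each row to a 2-day bucket counted from the first day in the data
day <- as.numeric(as.Date(times, tz = "UTC"))
bucket <- (day - min(day)) %/% 2

# Replace every timestamp with the latest timestamp in its bucket,
# so the result has the same length as the input
collapsed <- ave(as.numeric(times), bucket, FUN = max)
collapsed <- as.POSIXct(collapsed, origin = "1970-01-01", tz = "UTC")
format(collapsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Because the output has one value per input row, it can be used directly as a grouping column or as a boxplot label.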
Let’s visualize to see if we can gain any insights. Wow, San Fran maintained a constant cool average of 60 degrees in the hottest month of the year!
Period and rolling correlations
Finally, we will look at the correlation of temperatures in our two cities in a few different ways.
First, let’s look at the overall correlation. The
corrr package provides a nice way to accomplish this with data frames.
Next, let’s look at monthly correlations. The general idea will be
to nest each month into its own data frame, apply
correlate() to each
nested data frame, and then unnest the results. We will use
time_nest() to easily perform the monthly nesting.
For each month, calculate the correlation tibble and then
focus() on the NYC column. Then unnest and floor the results.
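The nest, apply, unnest pattern can be approximated in base R with split() and sapply(). The sketch below uses toy data with one temperature column per city and calls stats::cor() directly, so it stands in for the corrr and time_nest() workflow rather than reproducing it.

```r
# Toy wide data: one row per timestamp, one column per city (hypothetical values)
temps <- data.frame(
  Time = as.POSIXct(c("2013-01-01", "2013-01-02", "2013-01-03",
                      "2013-02-01", "2013-02-02", "2013-02-03"), tz = "UTC"),
  NYC = c(30, 32, 34, 40, 38, 36),
  SFO = c(50, 51, 52, 55, 56, 57)
)

# Split into one data frame per month, then correlate the two columns in each
by_month <- split(temps, format(temps$Time, "%Y-%m"))
monthly_cor <- sapply(by_month, function(df) cor(df$NYC, df$SFO))
monthly_cor
```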
It seems that summer and fall months tend to have higher correlation than colder months.
And finally we will calculate the rolling correlation of NYC and SFO temperatures. The “width” of our roll will be 360 hours, which is 15 days (about half a month) since we are in hourly format.
There are a number of ways to do this, but for this example we will use
rollify(), which takes any function that you give it and creates a rolling version of it. The first argument to
rollify() is the function that you want to modify, and it is passed to
rollify() in the same way that you would pass a function to
purrr::map(). The second argument is the window size. Call the rolling function just as you would call a non-rolling version of
cor() from inside mutate().
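To show the general technique of turning a function into a rolling version of itself, here is a small function factory in base R. make_rolling() is a hypothetical helper written for this illustration, not rollify() itself; it assumes a trailing window and fills positions that lack a full window with NA.

```r
# Function factory: wrap a two-argument function f in a rolling version
make_rolling <- function(f, window) {
  function(x, y) {
    n <- length(x)
    out <- rep(NA_real_, n)          # positions without a full window stay NA
    if (n < window) return(out)
    for (i in window:n) {
      idx <- (i - window + 1):i      # the trailing window ending at position i
      out[i] <- f(x[idx], y[idx])
    }
    out
  }
}

rolling_cor <- make_rolling(cor, window = 3)
res <- rolling_cor(c(1, 2, 3, 4, 5), c(2, 4, 6, 5, 4))
res
```

The returned function is called exactly like cor(), which is what makes this pattern convenient to use on columns of a data frame.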
It looks like the correlation is definitely not stable throughout the year,
so that initial correlation value of
.65 definitely has to be taken
with a grain of salt!
Rolling Functions: Pros/Cons and Recommendations
There are a number of ways to do rolling functions, and our recommendation depends on your needs. If you are interested in:
Flexibility: Use rollify(). You can literally turn any function into a “tidy” rolling function. Think everything from rolling statistics to rolling regressions. Whatever you can dream up, it can do. The speed is fast, but not quite as fast as other options.
Performance: Use the roll package, which uses RcppParallel as its backend, making it the fastest option available. The only downside is flexibility, since you cannot create custom rolling functions and are bound to those available.
We’ve touched on a few of the new features in
tibbletime v0.0.2. Notably:
rollify() for rolling functions
as_period() with generic periods
time_collapse() for collapsing date columns
A full change log can be found in the NEWS file on GitHub or CRAN.
We are always open to new ideas and encourage you to submit an issue on our GitHub repo.
Have fun with tibbletime!
Mind you this is only v0.0.2. We have a lot of work to do, but we couldn’t
wait any longer to share this. Feel free to kick the tires on
tibbletime, and let us know your thoughts. Please submit any comments, issues or bug reports to us on GitHub here. Enjoy!
Business Science takes the headache out of data science. We specialize in applying machine learning and data science in business applications. We help businesses that seek to build out this capability but may not have the resources currently to implement predictive analytics. Business Science works with clients as diverse as startups to Fortune 500 and seeks to guide organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!