Introducing sparklyr.flint: A time-series extension for sparklyr

[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers].

In other words, given a timestamp t and a row in the result whose time equals t, the value_sum column of that row contains the sum of values from ts_rdd within the time window [t - 2, t].

Intro to sparklyr.flint

The purpose of sparklyr.flint is to make the time-series functionality of Flint easily accessible from sparklyr. To see sparklyr.flint in action, skim through the example in the previous section, then work through the following steps, which reproduce each step of that example in R and arrive at the same summarization as the final result:

  • First of all, install sparklyr and sparklyr.flint if you haven’t done so already.

    install.packages("sparklyr")
    install.packages("sparklyr.flint")
  • Connect from sparklyr to an Apache Spark instance running locally, remembering to attach sparklyr.flint before running sparklyr::spark_connect, and then import our example time-series data into Spark:

    library(sparklyr)
    library(sparklyr.flint)
    
    sc <- spark_connect(master = "local", version = "2.4")
    sdf <- copy_to(sc, data.frame(time = seq(4), value = seq(4)^2))
  • Convert sdf above into a TimeSeriesRDD:

    ts_rdd <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "time")
  • And finally, run the ‘sum’ summarizer to obtain a summation of values in all past-2-second time windows:

    result <- summarize_sum(ts_rdd, column = "value", window = in_past("2s"))
    
    print(result %>% collect())
    ## # A tibble: 4 x 3
    ##   time                value value_sum
    ##   <dttm>              <dbl>     <dbl>
    ## 1 1970-01-01 00:00:01     1         1
    ## 2 1970-01-01 00:00:02     4         5
    ## 3 1970-01-01 00:00:03     9        14
    ## 4 1970-01-01 00:00:04    16        29
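
The value_sum column above can be cross-checked in plain R, without Spark. The following is only a minimal sketch of the window semantics described earlier (sum of values with time in [t - 2, t]); the helper name window_sum is hypothetical and is not part of sparklyr.flint:

```r
# Plain-R cross-check of the past-2-second windowed sum (no Spark involved).
# `window_sum` is a hypothetical helper written for this illustration only.
df <- data.frame(time = seq(4), value = seq(4)^2)

# For a timestamp t, sum the values whose time falls within [t - width, t].
window_sum <- function(t, times, values, width = 2) {
  sum(values[times >= t - width & times <= t])
}

# Apply the helper to every timestamp in the data.
df$value_sum <- vapply(
  df$time,
  window_sum,
  numeric(1),
  times = df$time,
  values = df$value
)

print(df)
# value_sum is 1, 5, 14, 29, matching the Spark result above
```

This loops over all rows for every timestamp, so it is only suitable for checking small examples by hand; it says nothing about how Flint computes the same result at scale.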

Why create a sparklyr extension?

The alternative to making sparklyr.flint a sparklyr extension is to bundle all the time-series functionality it provides with sparklyr itself. We decided that this would not be a good idea, for the following reasons:

  • Not all sparklyr users will need those time-series functionalities
  • com.twosigma:flint:0.6.0 and all Maven packages it transitively relies on are quite heavy dependency-wise
  • Implementing an intuitive R interface for Flint also takes a non-trivial number of R source files, and making all of that part of sparklyr itself would be too much

So, considering all of the above, building sparklyr.flint as an extension of sparklyr seems to be a much more reasonable choice.
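
For readers unfamiliar with the extension mechanism, the sketch below shows how a sparklyr extension package typically declares its Spark-side dependencies. It is an illustrative assumption about the general pattern, not the actual sparklyr.flint source: the only detail taken from this post is the Maven coordinate, and a real extension may ship additional jars or packages.

```r
# Hypothetical sketch of the dependency hooks inside an extension package.
# Not copied from sparklyr.flint; the package list is an assumption.

# sparklyr looks up this function in each registered extension when a
# connection is opened, and adds the returned dependencies to the session.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    packages = c("com.twosigma:flint:0.6.0")
  )
}

# Register the extension with sparklyr when the package is loaded.
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}
```

Because registration happens at load time and the dependencies are resolved when the connection is created, this pattern would also explain why the extension has to be attached before calling spark_connect.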

Current state of sparklyr.flint and its future directions

Recently sparklyr.flint has had its first successful release on CRAN. At the moment, sparklyr.flint only supports the summarizeCycle and summarizeWindow functionalities of Flint, and does not yet support asof join and other useful time-series operations. While sparklyr.flint contains R interfaces to most of the summarizers in Flint (see the package documentation for the list of summarizers currently supported), a few are still missing (e.g., OLSRegressionSummarizer, among others).
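
To make the distinction between the two supported operations concrete: a window summary aggregates rows within a time range around each timestamp (as in the past-2-second example earlier), whereas a cycle summary aggregates rows that share exactly the same timestamp. The plain-R sketch below illustrates the cycle case on made-up data; it does not use sparklyr.flint or Spark, and the data and column names are assumptions chosen for illustration:

```r
# Plain-R illustration of cycle-style summarization (no Spark involved).
# Toy data: timestamps 1 and 2 each appear twice, timestamp 3 once.
df <- data.frame(
  time  = c(1, 1, 2, 2, 3),
  value = c(10, 20, 30, 40, 50)
)

# One output row per distinct timestamp, summing the values that share it.
cycle_sum <- aggregate(value ~ time, data = df, FUN = sum)
names(cycle_sum)[2] <- "value_sum"

print(cycle_sum)
# value_sum is 30, 70, 50 for time 1, 2, 3
```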

In general, the goal of building sparklyr.flint is for it to be a thin “translation layer” between sparklyr and Flint. It should be as simple and intuitive as it possibly can be, while supporting a rich set of Flint time-series functionalities.

We cordially welcome any open-source contribution towards sparklyr.flint. Please visit https://github.com/r-spark/sparklyr.flint/issues if you would like to initiate discussions, report bugs, or propose new features related to sparklyr.flint, and https://github.com/r-spark/sparklyr.flint/pulls if you would like to send pull requests.

Acknowledgement

  • First and foremost, the author wishes to thank Javier (@javierluraschi) for proposing the idea of creating sparklyr.flint as the R interface for Flint, and for his guidance on how to build it as an extension to sparklyr.

  • Both Javier (@javierluraschi) and Daniel (@dfalbel) have offered numerous helpful tips on making the initial submission of sparklyr.flint to CRAN successful.

  • We really appreciate the enthusiasm from sparklyr users who were willing to give sparklyr.flint a try shortly after it was released on CRAN (and there were quite a few downloads of sparklyr.flint in the past week according to CRAN stats, which was quite encouraging for us to see). We hope you enjoy using sparklyr.flint.

  • The author is also grateful for valuable editorial suggestions from Mara (@batpigandme), Sigrid (@skeydan), and Javier (@javierluraschi) on this blog post.

Thanks for reading!
