Bulk Downloading Adobe Analytics Data

July 21, 2016

(This article was first published on randyzwitch.com, and kindly contributed to R-bloggers)

This blog post also serves as release notes for RSiteCatalyst v1.4.9, as only one feature was added (batch report request and download). But it’s a feature big enough for its own post!

Recently, I was asked how I would approach replicating the market basket analysis blog post I wrote for 33 Sticks, but using a lot more data. Like, months and months of order-level data. While you might be able to request multiple months’ worth of data in a single RSiteCatalyst call, it’s a lot more elegant to request data from the Adobe Analytics API in several calls. With the new batch-submit and batch-receive functionality in RSiteCatalyst, this process can be a LOT faster.

Non-Batched Method

Prior to version 1.4.9 of RSiteCatalyst, API calls could only be made in a serial fashion:

library(RSiteCatalyst)
library(dplyr)

# Authenticate with credentials stored in environment variables
SCAuth(Sys.getenv("USER", ""), Sys.getenv("SECRET", ""))

combined_orders <- data.frame()
for(d in seq(as.Date("2016-06-01"), as.Date("2016-06-30"), by = "day")){

  # for() drops the Date class, so convert the number back to a date string
  d_ <- as.character(as.Date(d, origin = "1970-01-01"))
  print(d_)

  # Blocks until the API has finished calculating this day's report
  order_details <- QueueRanked(
    reportsuite.id = "reportsuite",
    date.from = d_,
    date.to = d_,
    elements = c('evar13', 'product'),
    metrics = c('revenue', 'units', 'orders'),
    top = c(50000, 10000),
    interval.seconds = 60
  )

  order_details$order_date <- d_
  # Append this day's rows; bind_rows() is provided by dplyr
  combined_orders <- bind_rows(combined_orders, order_details)
  rm(order_details)

}

The underlying assumption from a package development standpoint was that the user would be working interactively: submit a report request, then wait to get the answer back. Nothing in this code is inherently slow from an R standpoint; you simply had to wait for the Adobe Analytics API to finish calculating one report before the next one could be submitted.

Batch Method

Of course, most APIs can process multiple calls simultaneously, and the Adobe Analytics API is no exception. Thanks to user shashispace, it’s now possible to submit all of your report calls at once, then retrieve the results:

# Phase 1: submit every report request, keeping only the report ids
queued <- list()
for(d in seq(as.Date("2016-06-01"), as.Date("2016-06-30"), by = "day")){
  d_ <- as.character(as.Date(d, origin = "1970-01-01"))

  print(d_)

  # With enqueueOnly = TRUE the call returns the report id right away
  reportid <- QueueRanked(
    reportsuite.id = "reportsuite",
    date.from = d_,
    date.to = d_,
    elements = c('evar13', 'product'),
    metrics = c('revenue', 'units', 'orders'),
    top = c(50000, 10000),
    interval.seconds = 1,
    enqueueOnly = TRUE
  )

  # Append the id to the list of queued reports
  queued <- c(queued, reportid)

}

# Phase 2: retrieve each queued report and bind the results with dplyr
queued_df <- data.frame()
for (i in queued){
  queued_df <- bind_rows(queued_df, GetReport(i))
}

This code is nearly identical to the serial snippet above, except for 1) the addition of the enqueueOnly = TRUE keyword argument and 2) lowering the interval.seconds keyword argument to 1 second instead of 60. When you use the enqueueOnly keyword, a Queue* function returns the report.id instead of the report results; by accumulating these report.id values in a list, we can then retrieve the reports and bind them together using dplyr.
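The two phases described above (enqueue everything, then retrieve) can also be wrapped in a small helper. What follows is only a sketch that runs without API access: `batch_download`, `enqueue_fn`, `fetch_fn` and the two `fake_*` stubs are hypothetical names made up for illustration, not part of RSiteCatalyst. It also tags each report with its date, as the serial loop does.

```r
# Sketch of the enqueue-then-retrieve pattern, runnable without API access.
# `enqueue_fn` and `fetch_fn` are hypothetical stand-ins for a Queue* call
# made with enqueueOnly = TRUE and for GetReport(), respectively.
batch_download <- function(dates, enqueue_fn, fetch_fn) {
  dates <- as.character(dates)

  # Phase 1: submit every request up front, keeping only the report ids
  report_ids <- lapply(dates, enqueue_fn)

  # Phase 2: fetch each report and tag it with the date it covers
  reports <- Map(function(id, d) {
    report <- fetch_fn(id)
    report$order_date <- d
    report
  }, report_ids, dates)

  # Bind all daily reports into a single data frame
  do.call(rbind, unname(reports))
}

# Offline stubs for demonstration only (assumptions, not RSiteCatalyst code)
fake_enqueue <- function(d) paste0("report-", d)
fake_fetch <- function(id) {
  data.frame(report_id = id, orders = 1L, stringsAsFactors = FALSE)
}

days <- seq(as.Date("2016-06-01"), as.Date("2016-06-03"), by = "day")
result <- batch_download(days, fake_enqueue, fake_fetch)
print(result)
```

In real use, `enqueue_fn` would presumably wrap the QueueRanked call shown earlier with enqueueOnly = TRUE, and `fetch_fn` would be GetReport. Passing them in as arguments keeps the pattern testable without credentials.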

Performance gain: 4x speed-up

Although the code snippets are nearly identical, it is way faster to submit the reports all at once and then retrieve the results. Because every request is submitted up front, the API processes numerous calls simultaneously, and while you are retrieving the results of one call the others continue to process in the background.

I wouldn’t have thought this would make such a difference, but retrieving one month of daily order-level data went from taking 2420 seconds to 560 seconds! If you were to retrieve the same amount of daily data for an entire year, that would mean saving roughly 6 hours of processing time.
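For anyone who wants to check the arithmetic behind those figures, here is a quick sketch. The two timings are the ones quoted above; treating a year as twelve months that each behave like this one is a simplifying assumption.

```r
serial_seconds <- 2420  # one month of daily reports, serial (from the post)
batch_seconds  <- 560   # the same month, batched (from the post)

# Speed-up factor of the batch method over the serial method
speedup <- serial_seconds / batch_seconds

# Time saved per month, scaled to 12 similar months (assumption)
saved_per_month_seconds <- serial_seconds - batch_seconds
saved_per_year_hours <- saved_per_month_seconds * 12 / 3600

round(speedup, 1)               # 4.3
round(saved_per_year_hours, 1)  # 6.2
```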

Keep The Pull Requests Coming!

The last several RSiteCatalyst releases have been driven by contributions from the community and I couldn’t be happier! Given that I don’t spend much time in my professional life now using Adobe Analytics, having improvements driven by a community of users using the library daily is just so rewarding.

So if you have a suggestion for improvement (and especially if you find a bug), please submit an issue on GitHub. Submitting questions and issues to GitHub is the easiest way for me to provide support, and it gives other users the chance to answer your question before I can. It also lets others determine whether they are experiencing a new or previously known problem.
