A Workaround For When Anti-DDoS Also Means Anti-Data

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

More sites are turning to services like Cloudflare due to just how stupid-easy it is to DDoS a site. Sometimes the DDoS is intentional (malicious). Sometimes it’s because your bot didn’t play nice (stop that, btw). Sadly, at some point, most of us with “vital” sites are going to have to pay protection money to one of these services unless law enforcement or ISPs do a better job stopping DDoS (killing the plethora of pwnd IoT devices that make up one of the largest for-rent DDoS services out there would be a good start).

Soapbox aside, sites like this one — https://www.bitmarket.pl/docs.php?file=api_public.html — (which was giving an SO poster trouble) have DDoS protection enabled but they also want you to be able to automate the downloads (this one even calls it an “API”). However, try to grab one of the files there with your browser and you’ll likely see a Cloudflare interstitial page which eventually gets you the data.

Try the same thing with download.file() or httr::GET() and you’ll run into trouble since neither of those two functions have a way to perform the javascript challenge execution which ultimately is posted (well, GETted in this case) to a checker endpoint which eventually redirects to the original URL with enough ??? to ensure you won’t be bothered again.

Cloudflare has captcha and other types of interstitials, but if you happen on the 503+javascript challenge one, have I got a package for you! Meet: cfhttr?.

The singular function (for now) — cf_GET() — does the following:

  • Makes an httr::GET() call with the initial URL
  • Checks to ensure it’s both on Cloudflare and is using the javascript challenge protection scheme
  • Slices the javascript and tweaks it enough to enable running it in V8
  • Retrieves the challenge computation from V8
  • Posts (well, httr::GET()s it since that’s what Cloudflare expects) the challenge form with the proper Referer header and hopefully passes the test so you get your content.
devtools::install_github("hrbrmstr/cfhttr")

library(cfhttr)

res <- cf_GET("https://www.bitmarket.pl/graphs/BTCPLN/90m.json")
## Waiting 5 seconds...

str(httr::content(res, as="parsed"))
## List of 90
##  $ :List of 6
##   ..$ time : int 1512908160
##   ..$ open : chr "48000.00000000"
##   ..$ high : chr "48100.00000000"
##   ..$ low  : chr "48000.00000000"
##   ..$ close: chr "48100.00000000"
##   ..$ vol  : chr "0.00124821"
##  $ :List of 6
##   ..$ time : int 1512908220
##   ..$ open : chr "48100.00000000"
##   ..$ high : chr "48100.00000000"
##   ..$ low  : chr "48100.00000000"
##   ..$ close: chr "48100.00000000"
##   ..$ vol  : chr "0.00000000"
##  $ :List of 6
##   ..$ time : int 1512908280
##   ..$ open : chr "48100.00000000"
##   ..$ high : chr "48100.00000000"
##   ..$ low  : chr "48100.00000000"
##   ..$ close: chr "48100.00000000"
##   ..$ vol  : chr "0.00000000"
## ...

FIN

If you end up using this in workflows and run into a problem, it likely means that Cloudflare changed the challenge code page. Please file an issue so I can update the code.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)