Package update: longurl 0.3.0 is hitting CRAN mirrors

December 18, 2016

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The longurl package has been updated to version 0.3.0 as a result of a bug report noting that the URL expansion API it was using went pay-for-use. Since this was the second time a short URL expansion service either went belly-up or had breaking changes, the package is now completely client-side and is a very thin, highly focused wrapper around the httr::HEAD() function.

Why longurl?

On the D&D alignment scale, short links are chaotic evil. [Full disclosure: I use shortened links all the time, so the pot is definitely kettle-calling here.] Ostensibly, they make it easier to show memorable links on tiny, glowing rectangles or in printed prose, but they are mostly used to track you directly and to mask other tracking parameters that the target site uses to keep tabs on you. Short URLs are also used by those with even more malicious intent than greedy startups or mega-corporations.

In retrospect, giving a third-party API service access to the URLs you are interested in expanding just exacerbated the tracking problem. Many of these third-party URL expansion services do cache results for a while, so they can be a bit faster than a non-caching package (but there’s nothing stopping you from putting caching code around it if you are using it in a “production” capacity).
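
As a rough illustration of that last point, a small in-memory cache can be wrapped around any expander function. This is only a sketch and not part of longurl: make_cached_expander() and its expander argument are hypothetical names, and it assumes the expander takes a character vector of URLs and returns a data frame with one row per URL and an orig_url column (the shape longurl::expand_urls() returns in the output further down).

```r
# Hypothetical helper: wrap an expander function with an in-memory cache.
# Assumes `expander` takes a character vector of URLs and returns a data
# frame with one row per URL, including an `orig_url` column.
make_cached_expander <- function(expander) {
  cache <- new.env(parent = emptyenv())  # maps URL -> one-row data frame

  function(urls) {
    urls <- unique(urls)

    # only call the expander for URLs that are not cached yet
    cached <- vapply(urls, exists, logical(1), envir = cache, inherits = FALSE)
    to_fetch <- urls[!cached]

    if (length(to_fetch) > 0) {
      fresh <- expander(to_fetch)
      for (i in seq_len(nrow(fresh))) {
        key <- as.character(fresh$orig_url[i])
        assign(key, fresh[i, , drop = FALSE], envir = cache)
      }
    }

    # return the cached rows; URLs the expander returned nothing for are skipped
    found <- Filter(function(u) exists(u, envir = cache, inherits = FALSE), urls)
    do.call(rbind, lapply(found, get, envir = cache))
  }
}
```

Usage would be along the lines of cached_expand <- make_cached_expander(longurl::expand_urls), after which repeated calls to cached_expand() with the same URLs are answered from memory. The cache lives only for the R session and never expires, which may or may not suit a given workload.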

How does the updated package work without a URL expansion API?

By default, httr “verb” requests use the curl package, which is a wrapper for libcurl. The httr verb calls set the “please follow all HTTP status 3xx redirects that are found in responses” option (the equivalent of the libcurl CURLOPT_FOLLOWLOCATION option). There are other options that can be set to help configure minutiae around how redirect following works. So, just by calling httr::HEAD(some_url) you get built-in short URL expansion (if what you passed in was a short URL or a URL with a redirect).
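
As a rough illustration of those options, httr passes libcurl settings through httr::config(). The sketch below assumes the followlocation and maxredirs option names (libcurl's CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS as exposed by the curl package). The helper name head_with_limits(), the limit of 5 hops and the 5-second timeout are arbitrary choices for illustration.

```r
library(httr)

# Hypothetical helper: issue a HEAD request that follows redirects, but gives
# up after `max_hops` redirects or `seconds` seconds.
head_with_limits <- function(url, max_hops = 5L, seconds = 5) {
  HEAD(
    url,
    config(followlocation = 1L, maxredirs = max_hops),
    timeout(seconds)
  )
}

# Example (needs network access):
# res <- head_with_limits("http://lnk.direct/zFu")
# res$url           # final URL after all redirects
# status_code(res)  # HTTP status of the final response
```

Capping the number of hops is a cheap guard against redirect loops when expanding links you do not control.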

Take, for example, this innocent link: http://lnk.direct/zFu. We can see what goes on under the covers by passing in the verbose() option to an httr::HEAD() call:

httr::HEAD("http://lnk.direct/zFu", httr::verbose())

## -> HEAD /zFu HTTP/1.1
## -> Host: lnk.direct
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: shorturl=4e0aql3p49rat1c8kqcrmv4gn2
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx/1.0.15
## <- Date: Sun, 18 Dec 2016 19:03:48 GMT
## <- Content-Type: text/html; charset=UTF-8
## <- Connection: keep-alive
## <- X-Powered-By: PHP/5.6.20
## <- Expires: Thu, 19 Nov 1981 08:52:00 GMT
## <- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
## <- Pragma: no-cache
## <- Location: http://ow.ly/Ko70307eKmI
## <- 
## -> HEAD /Ko70307eKmI HTTP/1.1
## -> Host: ow.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 301 Moved Permanently
## <- Content-Length: 0
## <- Location: http://bit.ly/2gZq7qG
## <- Connection: close
## <- 
## -> HEAD /2gZq7qG HTTP/1.1
## -> Host: bit.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx
## <- Date: Sun, 18 Dec 2016 19:04:36 GMT
## <- Content-Type: text/html; charset=utf-8
## <- Content-Length: 127
## <- Connection: keep-alive
## <- Cache-Control: private, max-age=90
## <- Location: http://example.com/IT_IS_A_SURPRISE
## <- 
## -> HEAD /IT_IS_A_SURPRISE HTTP/1.1
## -> Host: example.com
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: _csrf/link=g3iBgezgD_OYN0vOh8yI930E1O9ZAKLr4uHmVioxwwQ; mc=null; dmvk=5856d9e39e747; ts=475630; v1st=03AE3C5AD67E224DEA304AEB56361C9F
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 200 OK
## ...
## <- 

If we reduce the clutter, we can see that it follows multiple redirects through multiple URL shorteners: lnk.direct to ow.ly to bit.ly, and finally to example.com.
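
One way to pull that chain out programmatically is to walk the per-hop headers that httr keeps on the response object in its all_headers element. The helper below is a sketch under that assumption: redirect_chain() is a hypothetical name, and it only expects a list with one entry per hop, each holding a status and a headers list.

```r
# Hypothetical helper: given a list shaped like an httr response's
# `all_headers` element (one entry per hop, each with `status` and `headers`),
# return one row per hop with its status code and Location header.
redirect_chain <- function(all_headers) {
  get_location <- function(h) {
    hdrs <- h$headers
    names(hdrs) <- tolower(names(hdrs))  # header names are case-insensitive
    loc <- hdrs[["location"]]
    if (is.null(loc)) NA_character_ else loc
  }

  data.frame(
    hop = seq_along(all_headers),
    status = vapply(all_headers, function(h) as.integer(h$status), integer(1)),
    location = vapply(all_headers, get_location, character(1)),
    stringsAsFactors = FALSE
  )
}

# Example (needs network access):
# res <- httr::HEAD("http://lnk.direct/zFu")
# redirect_chain(res$all_headers)
```

The final hop has no Location header, so its location is NA.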

Here’s what the output of a request to longurl::expand_urls() returns:

longurl::expand_urls("http://lnk.direct/zFu")
## # A tibble: 1 × 3
##                orig_url                        expanded_url status_code
##                   <chr>                               <chr>       <int>
## 1 http://lnk.direct/zFu http://example.com/IT_IS_A_SURPRISE         200

NOTE: the link does actually go somewhere, and somewhere not malicious, political or preachy (a rarity in general in this post-POTUS-election world of ours).

What else is different?

The longurl::expand_urls() function returns a tbl_df and now includes the HTTP status code of the final, resolved link. You can also pass in a custom HTTP referrer since many (many) sites will change behavior depending on the referrer.
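
To show the underlying idea without guessing at the package's argument names, here is a sketch that uses httr directly. It is not longurl's actual code: expand_one() and its referer argument are hypothetical, and the sketch simply sends a Referer header with the HEAD request and reports the final URL and status code in the same three columns as the output above.

```r
library(httr)

# Hypothetical helper (not longurl's actual code): expand one URL with a HEAD
# request, optionally sending a Referer header, and return a one-row data
# frame with the original URL, the resolved URL and the final status code.
# Failed requests (timeouts, DNS errors) yield NA values rather than an error.
expand_one <- function(url, referer = NULL, seconds = 5) {
  res <- tryCatch(
    if (is.null(referer)) {
      HEAD(url, timeout(seconds))
    } else {
      HEAD(url, add_headers(Referer = referer), timeout(seconds))
    },
    error = function(e) NULL
  )

  data.frame(
    orig_url = url,
    expanded_url = if (is.null(res)) NA_character_ else res$url,
    status_code = if (is.null(res)) NA_integer_ else status_code(res),
    stringsAsFactors = FALSE
  )
}

# Example (needs network access; the referer value is a placeholder):
# expand_one("http://lnk.direct/zFu", referer = "https://example.org/")
```

Comparing the rows returned with and without a referer is a quick way to spot sites that redirect differently depending on where a visitor appears to come from.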

What’s next?

This bug-fix release had to go out fairly quickly since the package was essentially broken. With the new foundation being built on client-side machinations, future enhancements will be to pull more features (in the machine learning sense) out of the curl or httr requests (I may switch directly to using curl if I need more granular data) and to include some basic visualizations for request trees (most likely using the DiagrammeR and ggplot2 packages). I may try to add a caching layer, but I believe that’s more of a situation-specific feature folks should add on their own, so I may just add a “check hook” capability that will add an extra function call to a cache-checking function of your choosing.

If you have a feature request, please add it to the GitHub repo.
