Unshorten (almost) any URL with R

December 13, 2011
By

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

Introduction

I was asked by a friend how to find the full final address of an URL which had been shortened via a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied I had no idea and maybe he should have a look over on StackOverflow.com or, possibly, the R-help list, and if that didn’t turn up anything to try an online unshortening service like http://unshort.me.

Two minutes later he came back with this solution from Stack Overflow which, surpsingly to me, contained an answer I had provided about 1.5 years ago!

This has always been my problem with programming, that I learn something useful and then completely forget it. I’m kind of hoping that by having this blog it will aid me in remembering these sorts of things.

The Objective

I want to decode a shortened URL to reveal it’s full final web address.

The Solution

The basic idea is to use the getURL function from the RCurl package and telling it to retrieve the header of the webpage it’s connection too and extract the URL location from there.

decode_short_url <- function(url, ...) {
  # PACKAGES #
  require(RCurl)

  # LOCAL FUNCTIONS #
  decode <- function(u) {
    Sys.sleep(0.5)
    x <- try( getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) )
    if(inherits(x, 'try-error') | length(grep(".*Location: (\\S+).*", x))<1) {
      return(u)
    } else {
      return(gsub('.*Location: (\\S+).*', '\\1', x))
    }
  }

  # MAIN #
  gc()
  # return decoded URLs
  urls <- c(url, ...)
  l <- vector(mode = "list", length = length(urls))
  l <- lapply(urls, decode)
  names(l) <- urls
  return(l)
}

And here’s how we use it:

# EXAMPLE #
decode_short_url("http://tinyurl.com/adcd",
                 "http://www.google.com")
# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"

You can always find the latest version of this function here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/decode_shortened_url/decode_shortened_url.R

Limitations

A comment on the R-bloggers facebook page for this blog post made me realise that this doesn’t work with every shortened URL such as when you need to be logged in for a service, e.g.,

http://1.cloudst.at/myeg

decode_short_url("http://tinyurl.com/adcd",
"http://www.google.com",
"http://1.cloudst.at/myeg")

# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
#
# $`http://1.cloudst.at/myeg`
# [1] "http://1.cloudst.at/myeg"

I still don’t know why this might be a useful thing to do but hopefully it’s useful to someone out there :)


To leave a comment for the author, please follow the link and comment on his blog: Consistently Infrequent » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.