URL Originality Analysis

June 23, 2015

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Campbell, Senior Consultant.

When I first heard about Twitter, it was described to me as love, life and the story of the world, summarized in 140 characters or less. Each snippet of information is like a beautiful haiku, perfectly capturing the essence of events. A moment captured in amber.

Of course now I know what Twitter is, perhaps less idyllic, but still interesting. Chris Musselle (who wrote the first blog post in this series) had been doing some investigation of what data science tools people are talking about. I was able to grab the data as a simple CSV.


A striking feature of the dataset that Chris captured using his Twitter-mining suite is that most tweets contain a URL. In fact, some tweets contain more than one URL. I wrote a simple function to list these, and then looked at the table of counts. More than 19,000 URLs were tweeted across 22,500 tweets.
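The URL-listing step might look something like the sketch below. The function name `extract_urls`, the regular expression and the three sample tweets are illustrative assumptions, not the code or data used for the analysis.

```r
# Hypothetical helper: pull every http(s) URL out of a vector of tweet texts.
# The pattern is deliberately simple and will miss some edge cases.
extract_urls <- function(text) {
  regmatches(text, gregexpr("https?://[^[:space:]]+", text))
}

# Made-up tweets, for illustration only
tweets <- c(
  "Loving #rstats today http://t.co/aaa111",
  "Two links: http://t.co/aaa111 and https://t.co/bbb222",
  "No link here, just words"
)

urls_per_tweet <- extract_urls(tweets)

# How many URLs does each tweet contain?
n_urls <- lengths(urls_per_tweet)
table(n_urls)

# Share of tweets with at least one URL
mean(n_urls > 0)

# How often was each URL tweeted?
url_counts <- sort(table(unlist(urls_per_tweet)), decreasing = TRUE)
url_counts
```

Returning a list with one element per tweet keeps the link between a URL and the tweet it came from, which is handy for any later reshaping.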


Rather than telling a story in 140 characters, 82% of tweets are cheating the character limit and using Twitter as advertising for rich media. So if the story-telling isn’t happening on Twitter, where is it happening? And what stories are being told? Are users tweeting links that they’ve found on Twitter, or are they sharing links which they have discovered elsewhere?

To determine the originality of a URL, we can estimate the distribution of novel posts and shares using Twitter’s automatic URL shortening feature. Many URLs are rather long, depending on the website file structure, and could easily exceed 140 characters. To allow URLs to be shared, all URLs posted in Twitter are automatically converted to short form. For example,

the link to the post "Am I a data scientist?"

was updated to

http://t.co/XQfmfy0wIR
The host http://t.co re-routes requests for the address XQfmfy0wIR to the r-bloggers post. This then leaves space for the user's comment.

Of the 19,000 URLs:

  • 8,000 are unique short URLs (i.e. manual URL entry)
  • The remaining 11,000 are re-tweeted URLs
  • One short URL (http://projectsuperior.com) was tweeted over 200 times
  • Five short URLs were tweeted more than 50 times
  • The mean URL is shared 2.4 times
  • The median URL is shared once
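Figures like those in the list above are simple summaries of a table of counts. A minimal sketch, using a made-up vector `short_urls` with one element per tweeted URL (the values are invented, not taken from the dataset):

```r
# Made-up data: one element per URL occurrence in the tweet set
short_urls <- c("t.co/a", "t.co/a", "t.co/a", "t.co/a",
                "t.co/b", "t.co/c", "t.co/c", "t.co/d")

counts <- table(short_urls)   # times each short URL was tweeted
shares <- as.vector(counts)   # the same counts as a plain integer vector

n_unique   <- length(shares)             # unique short URLs
n_repeats  <- sum(shares) - n_unique     # occurrences beyond the first
mean_share <- mean(shares)               # mean shares per URL
med_share  <- median(shares)             # median shares per URL
popular    <- names(counts)[shares > 2]  # URLs tweeted more than twice

c(unique = n_unique, repeats = n_repeats,
  mean = mean_share, median = med_share)
popular
```

A mean well above the median, as in the real data, points to a few heavily shared URLs and a long tail of URLs shared only once.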




To discover where these short URLs point, we need to decode each one to resolve its target destination. There are various tools in R for decoding short URLs. I used the decode_short_url function in the twitteR package. This function requests the URL from a web service and returns the long URL as a string. This can be slow for some sites, and took about a second per URL on average. In addition, not every short URL resolved at the first attempt; some took several requests. About 80 short URLs could not be resolved at all, perhaps because the target site had moved or been deleted.
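Since some short URLs need more than one request, a small retry wrapper is one way to handle the decoding step. The sketch below is an assumption about how this could be done, not the code used for the analysis: `resolve_with_retry`, its arguments and the stand-in `flaky_resolver` are hypothetical names, and the long URL is a placeholder. In practice `twitteR::decode_short_url` could be passed as `resolver`, assuming it hands back its input unchanged when a lookup fails.

```r
# Hypothetical wrapper: call `resolver` on a short URL until it returns
# something different from the input, or until `max_tries` is reached.
# `resolver` is any function that takes a URL string and returns a string.
resolve_with_retry <- function(url, resolver, max_tries = 3, wait = 0) {
  for (i in seq_len(max_tries)) {
    long <- tryCatch(resolver(url), error = function(e) NA_character_)
    if (length(long) == 1 && !is.na(long) && !identical(long, url)) {
      return(long)
    }
    Sys.sleep(wait)  # pause before trying again
  }
  NA_character_  # give up and mark the URL as unresolved
}

# Stand-in resolver so the sketch runs offline: it fails on its first call
calls <- 0
flaky_resolver <- function(url) {
  calls <<- calls + 1
  if (calls < 2) stop("service timed out")
  "http://example.com/long-post"  # placeholder long URL
}

resolved <- resolve_with_retry("http://t.co/XQfmfy0wIR", flaky_resolver)
resolved
```

Returning `NA` for URLs that never resolve makes it easy to count and set aside the failures afterwards.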

I used the data.table package with the cSplit function from the splitstackshape package to reshape the dataset by URL rather than by tweet. I then merged the table of decoded unique short URLs with the reshaped dataset by recoding the short codes with the factor function.
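The reshape and recode steps can be sketched in base R. This is a stand-in for illustration: it uses `strsplit` where the analysis used `cSplit`, and the column names and data are made up. Recoding with duplicated labels in `factor` relies on R 3.4.0 or later.

```r
# Made-up tweet table: the `urls` column holds comma-separated short URLs
tweets <- data.frame(
  tweet_id = 1:3,
  urls = c("t.co/a", "t.co/a,t.co/b", "t.co/c"),
  stringsAsFactors = FALSE
)

# Reshape from one row per tweet to one row per URL
url_list <- strsplit(tweets$urls, ",", fixed = TRUE)
by_url <- data.frame(
  tweet_id = rep(tweets$tweet_id, lengths(url_list)),
  short_url = unlist(url_list),
  stringsAsFactors = FALSE
)

# Made-up lookup table of decoded URLs; two short URLs share one target
decoded <- data.frame(
  short_url = c("t.co/a", "t.co/b", "t.co/c"),
  long_url = c("example.com/post-1", "example.com/post-2", "example.com/post-1"),
  stringsAsFactors = FALSE
)

# Recode short codes to long URLs: levels are the short URLs, labels are
# the long ones. Short URLs missing from the lookup become NA.
by_url$long_url <- as.character(
  factor(by_url$short_url, levels = decoded$short_url, labels = decoded$long_url)
)
by_url
```

Because several short URLs can carry the same label, the recoding collapses them onto a single long URL, which is what allows counting by destination.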


Of the 8,000 unique short URLs:

  • 6,400 are unique long URLs
  • The remaining 1,200 are URLs that were tweeted on more than one unique occasion
  • One URL (PayPal’s software blog) was tweeted on 65 unique occasions
  • Six URLs were tweeted on more than 15 unique occasions
  • The mean URL is tweeted on 1.2 occasions
  • The median URL is tweeted once



This originality analysis identified different types of popular URL communication on Twitter.

Some URLs were very re-tweetable, but not discoverable.

Some URLs were very discoverable.

  • The PayPal blog post was discovered 65 times and tweeted 173 times. A third of these discoveries were retweeted.
  • A presentation on net by Yuichi Ito was discovered 16 times and tweeted 33 times. A quarter of these discoveries were retweeted.
  • A thread in the German-language Entwicklerforum was linked on 29 unique occasions. In this case, the thread was being promoted by a single user, Entwickler himself (herself?). None of these discoveries were retweeted.
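The "discovered" and "tweeted" figures above can be derived from a table with one row per tweeted URL. The sketch below uses invented data and assumes one reading of the terms: a discovery is a distinct short URL pointing at the site, and a discovery counts as retweeted when its short URL appears more than once.

```r
# Made-up data: one row per tweeted URL, after decoding
by_url <- data.frame(
  short_url = c("t.co/a", "t.co/a", "t.co/a", "t.co/b", "t.co/c", "t.co/d", "t.co/e"),
  long_url  = c("site-1", "site-1", "site-1", "site-1", "site-1", "site-2", "site-2"),
  stringsAsFactors = FALSE
)

# Distinct short URLs per destination (discoveries)
discovered <- tapply(by_url$short_url, by_url$long_url,
                     function(s) length(unique(s)))

# All tweets per destination, retweets included
tweeted <- tapply(by_url$short_url, by_url$long_url, length)

# Fraction of discoveries whose short URL was tweeted more than once
share_retweeted <- tapply(by_url$short_url, by_url$long_url,
                          function(s) mean(as.vector(table(s)) > 1))

originality <- data.frame(
  long_url = names(discovered),
  discovered = as.vector(discovered),
  tweeted = as.vector(tweeted),
  share_retweeted = as.vector(share_retweeted),
  stringsAsFactors = FALSE
)
originality
```

In this toy data, site-1 is discovered three times with a third of those discoveries retweeted, while site-2 is discovered twice and never retweeted.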

Popularity by volume definitely does not tell the whole story about how users are interacting with a URL. And discoverability on its own is insufficient to demonstrate interest, as Entwickler would perhaps admit.

The high-level view can take a little while to take in from tables of links. I used a combination of tweet volume (word size) and URL discoverability (opacity) to display truly interesting websites using the wordcloud package.
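One way to map volume to word size and discoverability to opacity is sketched below. The data frame `sites`, its values and the opacity scaling are assumptions for illustration, not the settings behind the original plot. The cloud is only drawn if the wordcloud package is installed.

```r
# Made-up summary table: one row per site
sites <- data.frame(
  site = c("site-1", "site-2", "site-3"),
  tweeted = c(200, 40, 30),     # tweet volume, drives word size
  discovered = c(2, 20, 30),    # unique discoveries, drives opacity
  stringsAsFactors = FALSE
)

# Opacity: scale discoveries into [0.2, 1] so every word stays visible
alpha <- 0.2 + 0.8 * sites$discovered / max(sites$discovered)
word_cols <- rgb(0.1, 0.3, 0.6, alpha = alpha)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(
    words = sites$site,
    freq = sites$tweeted,
    min.freq = 1,            # keep low-volume sites in the cloud
    random.order = FALSE,    # place the largest words first
    colors = word_cols,
    ordered.colors = TRUE    # colors[i] belongs to words[i]
  )
}
```

With this mapping, a site that is tweeted a lot but rarely discovered shows up large and faint, while a frequently discovered site is drawn in solid colour.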



This approach could be useful for prioritizing your lunchtime reading, and for separating the genuinely interesting from the spambots!

