Data Science Tweet Analysis – What tools are people talking about?

[This article was first published on Mango Solutions, and kindly contributed to R-bloggers].

By Chris Musselle PhD, Mango UK

At Mango we use a variety of tools in-house to address our clients’ business needs. When these fall within the data science arena, the main candidates we turn to are the R and Python programming languages.

The question of which is the “best” language for doing data science is a hotly debated topic ([link] [link] [link] [link]), with both languages having their pros and cons. However, the capabilities of each are expanding all the time thanks to continuous open source development in both areas.

With both languages becoming increasingly popular for data analysis, we thought it would be interesting to track current trends and see what people are saying about these and other tools for data science on Twitter.

This post is the first of three that will look into the results of our analysis, but first a bit of background.

 

Twitter Analysis

Today many companies are routinely drawing on social media data sources such as Twitter and Facebook to enhance their business decision making in a number of ways. This type of analysis can be a component of market research, an avenue for collecting customer feedback or a way to promote campaigns and conduct targeted advertising.

To facilitate this type of analysis, Twitter offer a variety of Application Programming Interfaces or APIs that enable an application to programmatically interact with the services provided by Twitter. These APIs currently come in three main flavours.

  • REST API – Allows automated access to searching, reading and writing tweets
  • Streaming API – Allows tracking of multiple users and/or search terms in near real time, though results may only be a sample
  • Twitter Firehose – Allows tracking of all tweets past and future, no limits on search results returned.

These different approaches have different trade-offs. The REST API can only search past tweets, and is limited in how far back you can search as Twitter only keeps the last couple of weeks of data. The Streaming API tracks tweets as they happen, but Twitter only guarantees a sample of all current tweets will be collected [link]. This means that if your search term is very generic and matches a lot of tweets, then not all of these tweets will be returned [link].

The Twitter Firehose addresses the shortcomings of the previous two APIs, but at quite a substantial cost, whereas the other two are free to use. There are also a growing number of third party intermediaries that have access to the Twitter Firehose, and sell on the Twitter data they collect [link].

 

Our Approach

We chose to use the Streaming API to collect tweets containing the hashtags “python” and/or “rstats” and/or “datascience” over a 10 day period.

To harvest the data, a Python script was created to utilize the API and append tweets to a single file. Command line tools such as csvkit and jq were then used to clean and preprocess the data, with the analysis done in Python using the pandas library.
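As a rough illustration of the “append tweets to a single file” step, the sketch below stores each tweet as one JSON object per line and then reads the hashtags back. It is not the script used for this analysis: the function names are hypothetical, the toy tweets are invented, and it assumes each tweet carries its hashtags under entities → hashtags → text, as in Twitter's JSON tweet format. Only the standard library is used here rather than pandas.

```python
import json
import os
import tempfile


def append_tweets(tweets, path):
    """Append each tweet (a dict) to `path`, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for tweet in tweets:
            fh.write(json.dumps(tweet) + "\n")


def read_hashtags(path):
    """Yield the set of lower-cased hashtags carried by each stored tweet."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            # Assumed layout: tweet["entities"]["hashtags"] is a list of {"text": ...}
            tags = tweet.get("entities", {}).get("hashtags", [])
            yield {tag["text"].lower() for tag in tags}


# Made-up tweets for illustration only; not data from the study.
toy_tweets = [
    {"id": 1, "entities": {"hashtags": [{"text": "Python"}, {"text": "Django"}]}},
    {"id": 2, "entities": {"hashtags": [{"text": "rstats"}]}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets.jsonl")
    append_tweets(toy_tweets[:1], path)  # first batch
    append_tweets(toy_tweets[1:], path)  # a later batch appends to the same file
    hashtag_sets = list(read_hashtags(path))
```

Lower-casing the hashtags up front means that “Python” and “python” are counted as the same tag in any later tally.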

 

Preliminary Results: Hashtag Counts and Co-occurrence

From Figure 1, it is immediately obvious that “python” and “datascience” were more popular hashtags than “rstats” over the time period sampled. Interestingly, though, there was little overlap between these groups.


Figure 1: Venn diagram of tweet counts by hashtag
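The region counts behind a Venn diagram like this one can be derived by grouping tweets according to which of the tracked hashtags they contain. The following is a minimal sketch under that assumption; the function name and the toy input are hypothetical, and the numbers it produces are not those from Figure 1.

```python
from collections import Counter

TRACKED = frozenset({"python", "rstats", "datascience"})


def venn_counts(tweet_hashtags, tracked=TRACKED):
    """Count tweets per Venn region, keyed by the tracked hashtags present."""
    counts = Counter()
    for tags in tweet_hashtags:
        region = frozenset(t.lower() for t in tags) & tracked
        if region:  # ignore tweets with none of the tracked hashtags
            counts[region] += 1
    return counts


# Made-up hashtag sets for illustration only.
toy = [
    {"python", "django"},
    {"Python", "DataScience"},
    {"rstats"},
    {"datascience", "bigdata"},
    {"hiring"},
]
regions = venn_counts(toy)
```

Each key is one region of the diagram: a single-element key such as frozenset({"rstats"}) is a tweet that used only that tracked hashtag, while a two-element key is an overlap between two circles.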

 

This suggests that the majority of tweets that mentioned these subjects either did so in isolation or alongside other hashtags that were not tracked. We can get a sense of which is the case by looking at a count of the total number of unique hashtags that occurred alongside each tracked hashtag; this is shown in Table 1.


Table 1: Total unique hashtags used per tracked subset
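A count of this kind could be produced along the following lines. This is a sketch rather than the code behind Table 1: the function name and toy data are hypothetical, and it assumes that the other tracked hashtags also count as co-occurring hashtags when they appear in the same tweet.

```python
from collections import defaultdict

TRACKED = frozenset({"python", "rstats", "datascience"})


def unique_cooccurring(tweet_hashtags, tracked=TRACKED):
    """Return {tracked tag: number of distinct hashtags seen alongside it}."""
    seen = defaultdict(set)
    for tags in tweet_hashtags:
        tags = {t.lower() for t in tags}
        for key in tags & tracked:
            # Assumption: everything except the tag itself counts as co-occurring.
            seen[key] |= tags - {key}
    return {key: len(others) for key, others in seen.items()}


# Made-up hashtag sets for illustration only.
toy = [
    {"python", "django", "hiring"},
    {"python", "django"},
    {"python", "flask"},
    {"rstats", "datascience"},
    {"datascience", "bigdata", "rstats"},
]
unique_counts = unique_cooccurring(toy)
```

Because sets are used, a hashtag that appears alongside “python” in many tweets is still only counted once.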

 

These counts show that the “python” hashtag is mentioned alongside many more topics/hashtags than “rstats” and “datascience”. This makes sense when you consider that Python is a general purpose programming language, and as such has a broader range across application domains than R, which is more statistically focused. In between these is the “datascience” hashtag, a term that relates to many different skillsets and technologies, and so we would expect the number of unique hashtag co-occurrences to be quite high.

 

So what are people mentioning alongside these hashtags if not these technologies?

Table 2 shows the top hashtags mentioned alongside the three tracked hashtags. Here the numbers in the header are the total number of tweets that contained the tracked hashtag plus at least one other hashtag. As can be seen, all three subjects were commonly mentioned alongside other hashtags, so the vast majority of tweets carry multiple hashtags.


Table 2: Most frequent co-occurring hashtags for each tracked keyword. Numbers in the header are the total number of tweets containing at least one hashtag in addition to the one tracked.
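One way to build a ranking like this, together with the header count, is to tally every other hashtag in the tweets that contain a given tracked hashtag. The sketch below assumes that approach; the function name and toy data are hypothetical and the output does not reproduce the figures in Table 2.

```python
from collections import Counter


def top_cooccurring(tweet_hashtags, tracked_tag, n=10):
    """Return (tweets with the tag plus another hashtag, top-n co-occurring tags)."""
    tally = Counter()
    n_multi = 0
    for tags in tweet_hashtags:
        tags = {t.lower() for t in tags}
        if tracked_tag not in tags:
            continue
        others = tags - {tracked_tag}
        if others:  # only tweets carrying at least one other hashtag
            n_multi += 1
            tally.update(others)
    return n_multi, tally.most_common(n)


# Made-up hashtag sets for illustration only.
toy = [
    {"rstats", "datascience"},
    {"rstats", "datascience", "bigdata"},
    {"rstats"},
    {"python", "django"},
]
n_multi, top = top_cooccurring(toy, "rstats", n=2)
```

In this toy input, two of the three “rstats” tweets carry another hashtag, so the header count would be 2, with “datascience” ranked above “bigdata”.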

As we may expect, many co-occurring hashtags are closely related, though in general it’s interesting to see that “datascience” frequently co-occurs with more general concepts and/or ‘buzzwords’, with technologies mentioned further down the list.

Python, by contrast, occurs frequently alongside other web technologies, as well as “careers” and “hiring”, which may reflect a high demand for jobs that use Python and these related technologies for web development. Alternatively, it may simply be that many good web developers are active on Twitter, and as such recruitment companies favor this medium of advertising when trying to fill web development positions.

It’s interesting that tweets with the “rstats” hashtag mentioned “datascience” and “bigdata” more than any other, likely reflecting the increasing use of R in this arena. The other co-occurring hashtags for R can be grouped into: those that relate to its domain specific use (“statistics”, “analytics”, “machinelearning” etc.); possible ways of integrating it with other languages (“python”, “excel”, “d3js”); and other ways of referencing R itself (“r”, “rlang”).

 

Summary

So from looking at the counts of hashtags and their co-occurrences, it looks like:

  • Tweets containing Python or data science were roughly 5 times more frequent than those containing Rstats. There was also little relative overlap in the three hashtags tracked.
  • Tweets containing Python also mention a broader range of other topics, while R is more focused around data science, statistics and analytics.
  • Tweets mentioning data science most commonly include hashtags for general analytics concepts and ‘buzzwords’, with specific technologies only occasionally mentioned.
  • Tweets mentioning Python most commonly include hashtags for web development technologies, which may partly reflect a high volume of recruitment advertising.

 

Future Work

So far we have only looked at the hashtag content of the tweets, and there is much more data contained within them that can be analysed. Two other key components are the user mentions and the URLs in the message. Future posts will look into both of these to investigate the content being shared, along with who is retweeting/being retweeted by whom.

 
