By Chris Musselle PhD, Mango UK
At Mango we use a variety of tools in-house to address our clients’ business needs, and when these fall within the data science arena, we mainly turn to either the R or Python programming languages.
Which is the “best” language for doing data science is a hotly debated topic ([link] [link] [link] [link]), with both languages having their pros and cons. However, the capabilities of each are expanding all the time thanks to continuous open source development in both areas.
With both languages becoming increasingly popular for data analysis, we thought it would be interesting to track current trends and see what people are saying about these and other tools for data science on Twitter.
This post is the first of three that will look into the results of our analysis, but first a bit of background.
Today many companies are routinely drawing on social media data sources such as Twitter and Facebook to enhance their business decision making in a number of ways. This type of analysis can be a component of market research, an avenue for collecting customer feedback or a way to promote campaigns and conduct targeted advertising.
To facilitate this type of analysis, Twitter offers a variety of Application Programming Interfaces, or APIs, that enable an application to interact programmatically with the services provided by Twitter. These APIs currently come in three main flavours.
- REST API – Allows automated access to searching, reading and writing tweets
- Streaming API – Allows tracking of multiple users and/or search terms in near real time, though results may only be a sample
- Twitter Firehose – Allows tracking of all tweets, past and future, with no limits on the search results returned
These different approaches have different trade-offs. The REST API can only search past tweets, and is limited in how far back you can search as Twitter only keeps the last couple of weeks of data. The Streaming API tracks tweets as they happen, but Twitter only guarantees a sample of all current tweets will be collected [link]. This means that if your search term is very generic and matches a lot of tweets, then not all of these tweets will be returned [link].
The Twitter Firehose addresses the shortcomings of the previous two APIs, but at a substantial cost, whereas the other two are free to use. There is also a growing number of third party intermediaries that have access to the Twitter Firehose and sell on the Twitter data they collect [link].
We chose to use the Streaming API to collect tweets containing the hashtags “python” and/or “rstats” and/or “datascience” over a 10-day period.
To harvest the data, a Python script was created to call the API and append tweets to a single file. Command line tools such as csvkit and jq were then used to clean and preprocess the data, with the analysis done in Python using the pandas library.
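The harvesting and preprocessing code is not shown in this post, so the following is only a minimal sketch of how hashtags might be pulled out of a file of collected tweets. It assumes one JSON tweet per line, with hashtags stored under `entities.hashtags` as in Twitter's standard tweet JSON. The function names and the sample tweets are invented for illustration, and it uses only the standard library rather than the jq and pandas pipeline described above.

```python
import io
import json


def extract_hashtags(tweet):
    """Return the lowercased set of hashtags found in one tweet dict."""
    entities = tweet.get("entities") or {}
    return {tag["text"].lower() for tag in entities.get("hashtags", [])}


def read_hashtag_sets(lines):
    """Parse line-delimited JSON tweets into a list of hashtag sets."""
    hashtag_sets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the harvested file
        hashtag_sets.append(extract_hashtags(json.loads(line)))
    return hashtag_sets


# Two made-up tweets standing in for the harvested file (assumption).
sample_file = io.StringIO(
    '{"id": 1, "entities": {"hashtags": [{"text": "Python"}, {"text": "Django"}]}}\n'
    "\n"
    '{"id": 2, "entities": {"hashtags": [{"text": "rstats"}, {"text": "DataScience"}]}}\n'
)
hashtag_sets = read_hashtag_sets(sample_file)
print(hashtag_sets)
```

Lowercasing the hashtag text means that variants such as “Python” and “python” are counted together, which is a design choice made for this sketch and not something the post confirms.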
Preliminary Results: Hashtag Counts and Co-occurrence
From Figure 1, it is immediately obvious that “python” and “datascience” were more popular hashtags than “rstats” over the time period sampled. Interestingly, though, there was little overlap between these groups.
This suggests that the majority of tweets that mentioned these subjects either did so in isolation or alongside other hashtags that were not tracked. We can get a sense of which is the case by counting the total number of unique hashtags that occurred alongside each tracked hashtag, as shown in Table 1.
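As a rough illustration of the two kinds of count discussed here, the sketch below tallies how many tweets contain each combination of tracked hashtags (the overlap regions) and how many unique other hashtags appear alongside each tracked one. The function names and the five sample tweets are hypothetical, and counting the other tracked hashtags among the co-occurring ones is an assumption; the original analysis was done in pandas and may have differed.

```python
from collections import Counter

TRACKED = ("python", "rstats", "datascience")

# Made-up hashtag sets, one per tweet (assumption, not the real data).
hashtag_sets = [
    {"python", "django", "hiring"},
    {"python", "datascience", "bigdata"},
    {"rstats", "datascience"},
    {"datascience", "bigdata", "analytics"},
    {"python"},
]


def combination_counts(hashtag_sets, tracked=TRACKED):
    """Count tweets by which combination of tracked hashtags they contain."""
    counts = Counter()
    for tags in hashtag_sets:
        combo = tuple(t for t in tracked if t in tags)
        if combo:
            counts[combo] += 1
    return counts


def unique_cooccurring(hashtag_sets, tracked=TRACKED):
    """Count distinct hashtags seen alongside each tracked hashtag."""
    seen = {t: set() for t in tracked}
    for tags in hashtag_sets:
        for t in tracked:
            if t in tags:
                seen[t] |= tags - {t}
    return {t: len(others) for t, others in seen.items()}


overlaps = combination_counts(hashtag_sets)
unique_counts = unique_cooccurring(hashtag_sets)
print(overlaps)
print(unique_counts)
```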
These counts show that the “python” hashtag is mentioned alongside many more other topics and hashtags than “rstats” and “datascience”. This makes sense when you consider that Python is a general purpose programming language, and as such is used across a broader range of application domains than R, which is more statistically focused. In between these is the “datascience” hashtag, a term that relates to many different skillsets and technologies, and so we would expect its number of unique hashtag co-occurrences to be quite high.
So what are people mentioning alongside these hashtags if not these technologies?
Table 2 shows the top hashtags mentioned alongside the three tracked hashtags. Here the numbers in the header are the total number of tweets that contained the tracked hashtag plus at least one other hashtag. As can be seen, the vast majority of tweets on all three subjects included multiple hashtags.
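A ranking like the one in Table 2 could be produced along the following lines. This is a sketch under assumptions, not the code used for the post: `top_cooccurring` is a hypothetical helper, the sample tweets are invented, and ties are broken alphabetically purely to keep the output stable.

```python
from collections import Counter

# Made-up hashtag sets, one per tweet (assumption, not the real data).
hashtag_sets = [
    {"python", "django", "hiring"},
    {"python", "datascience", "bigdata"},
    {"rstats", "datascience"},
    {"datascience", "bigdata", "analytics"},
    {"python"},
]


def top_cooccurring(hashtag_sets, tracked_tag, n=10):
    """Return (tweets with the tag plus another hashtag, top n co-occurring tags)."""
    counts = Counter()
    tweets_with_others = 0
    for tags in hashtag_sets:
        if tracked_tag not in tags:
            continue
        others = tags - {tracked_tag}
        if others:
            tweets_with_others += 1
            counts.update(others)
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return tweets_with_others, ranked[:n]


total, top = top_cooccurring(hashtag_sets, "datascience", n=2)
print(total, top)
```

The first returned value corresponds to the header totals described above: tweets containing the tracked hashtag and at least one other hashtag.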
As we might expect, many co-occurring hashtags are closely related, though in general it’s interesting to see that “datascience” frequently co-occurs with more general concepts and/or ‘buzzwords’, with technologies mentioned further down the list.
Python, on the other hand, occurs frequently alongside other web technologies, as well as “careers” and “hiring”, which may reflect a high demand for jobs that use Python and these related technologies for web development. Alternatively, it may simply be that many good web developers are active on Twitter, and as such recruitment companies favor this medium of advertising when trying to fill web development positions.
It’s interesting that tweets with the “rstats” hashtag mentioned “datascience” and “bigdata” more than any other, likely reflecting the increasing use of R in this arena. The other co-occurring hashtags for R can be grouped into: those that relate to its domain specific use (“statistics”, “analytics”, “machinelearning” etc.); possible ways of integrating it with other languages and tools (“python”, “excel”, “d3js”); and other ways of referencing R itself (“r”, “rlang”)!
So from looking at the counts of hashtags and their co-occurrences, it looks like:
- Tweets containing Python or data science were roughly 5 times more frequent than those containing Rstats. There was also little relative overlap in the three hashtags tracked.
- Tweets containing Python also mention a broader range of other topics, while R is more focused around data science, statistics and analytics.
- Tweets mentioning data science most commonly include hashtags for general analytics concepts and ‘buzzwords’, with specific technologies only occasionally mentioned.
- Tweets mentioning Python most commonly include hashtags for web development technologies and are likely the result of a high volume of recruitment advertising.
So far we have only looked at the hashtag contents of each tweet, and there is much more data contained within a tweet that can be analysed. Two other key components are the user mentions and the URLs in the message. Future posts will look into both of these to investigate the content being shared, along with who is retweeting or being retweeted by whom.