Playing around with #rstats twitter data

Posted on February 28, 2015 by [email protected] in R bloggers | 0 Comments

[This article was first published on On the lambda » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the hashtag “#rstats” that denotes that a tweeter is tweeting about R.

As a warning, I don’t know much about how these data were collected, whether it was collected and random times during the day or whether it was biased toward particular times and, therefore, locations. I wouldn’t really read too much into this.

Most common co-occuring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occuring hashtags, I used the following perl script:

#!/usr/bin/perl

while(<>){
    chomp;
    $_ = lc($_);
    $_ =~ s/#rstats//g;
    my @matches;
    push @matches, /(#w+)/;
    print join "n" => @matches if @matches;
}

which uses the regular expression “(#w+)” to search for hashtags after removing “#rstats” from every tweet.

On the unix command-line, I put these other hashtags into a file and sorted via these commands:

cat data/R-hashtag-data.txt | ./PERL_SCRIPT_ABOVE.pl | tee other-hashtags.txt

sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt

After running these commands, I get a numbered list of co-occuring hashtags, sorted in descending order. The top 10 co-occuring hashtags were as follows (you can see the rest here :

5258 #datascience
1665 #python
1625 #bigdata
1542 #r
1451 #dataviz
1360 #ggplot2
 852 #statistics
 783 #dplyr
 749 #machinelearning
 743 #analytics

Neat-o. The presence of “#python” and “#ggplot2” in the top 10 made me wonder what the top 10 programming language and R package related hashtags were. Here they are, respectively:

1665 #python
 423 #d3js (plus 72 for #d3) (plus 2 for #js)
 343 #sas
 312 #julialang (plus 43 for #julia)
 240 #fsharp
 140 #spss  (plus 7 for #ibmspss)
 102 #stata
  75 #matlab
  55 #sql
  38 #java

1360 #ggplot2  (plus 298 for ggplot)  (plus for 6 #gglot2) (plus 4 for #ggpot)
 783 #dplyr
 663 #shiny
 557 #rcpp (plus 22 for rcpp11)
 251 #knitr
 156 #magrittr
 105 #lme4
  93 #ggvis   (plus 11 for #ggivs)
  65 #datatable
  46 #rneo4j

You can view the full list here and here.

I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about–at least as far as hashtags-applied-to-a-package go.

#strangehashtags
Before moving on to the next section, I wanted to share my favorite co-occuring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.

Most prolific #rstats tweeters
One of the first things I did with these data is a simple aggregation and sort to find the tweeters that used the hashtag most often:

library(dplyr)
THE_DATA %>%
  group_by(User) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) -> prolific.rstats.tweeters

Here is the top 10 (you can see the rest here.)

@Rbloggers	1081
@hadleywickham	498
@timelyportfolio	427
@recology_	419
@revodavid	210
@chlalanne	209
@adolfoalvarez	199
@RLangTip	175
@jmgomez	160

Nothing terribly surprising here.

Normalizing by total tweets
In a twitter discussion about these data, a twitter friend Tim Hopper posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would have a higher number of #rstats tweets if you control for total tweet volume. I wondered how this sorting would look.

Answering this question gave me an excuse to use Hadley Wickham’s new package, rvest (I literally just got why the package is named as much while typing this out) which makes web scraping easier–in part by leveraging the expressive power of the magrittr package.

To get the total number of tweets for a particular tweeter, I wrote the following function:

library(rvest)
library(magrittr)
get.num.tweets <- function(handle){
  tryCatch({
    unraw <- function(raw_str){
      raw_str <- sub(",", "", raw_str)    # remove commas if any
      if(grepl("K", raw_str)){
        return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
      }
      return(as.numeric(raw_str))
    }
    html(paste0("http://twitter.com/", sub("@", "", handle))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text() %>%
      unraw
    },
    error=function(cond){return(NA)})
}

The real logic (and beauty) of which is contained only in the last few lines:

    html(paste0("http://twitter.com/", sub("@", "", TWITTER_HANDLE))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text()

The CSS element that houses the number of total tweets from a useR’s twitter page was found easily using SelectorGadget.

After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I’m considerate) I divided number of #rstats tweets by the total number of tweets to come up with a normalized value.

The top 10 tweeteRs were as follows:

              User count num.of.tweets     ratio 
1     @medzihorsky     9            28 0.3214286 
2        @statworx     5            16 0.3125000 
3    @LearnRinaDay   114           404 0.2821782 
4  @RforExcelUsers     4            15 0.2666667 
5     @showmeshiny    27           102 0.2647059 
6           @tcrug     6            25 0.2400000 
7   @DailyRpackage   155           666 0.2327327 
8   @R_Programming    49           250 0.1960000 
9        @hexadata     8            41 0.1951220 
10     @Deep_RHelp    11            58 0.1896552

In case you were wondering, Trey Causey still “won” by a long shot:

> tweeters[which(tweeters$User=="@tdhopper"),]   
Source: local data frame [1 x 4]                 
                                                 
       User count num.of.tweets        ratio     
1 @tdhopper     8         26700 0.0002996255     
> tweeters[which(tweeters$User=="@treycausey"),] 
Source: local data frame [1 x 4]                 
                                                 
         User count num.of.tweets      ratio     
1 @treycausey    50         28700 0.00174216

Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using number of #rstats tweets as a proxy for who likes R the most or who are the biggest R “thought leaders” (whatever that is). Most tweets about R don’t use the #rstats hashtag, anyway.

Again, I would’t read too much into this 🙂