A glance at R-bloggers Twitter feed

[This article was first published on Maëlle, and kindly contributed to R-bloggers].

This is the second time I’ve written a post about the blog aggregator R-bloggers, probably because I’m all about R blogs now that I have one. My husband says my posts are so meta. My first post was about R blog names; in this one I shall focus on the last 1,000 tweets from R-bloggers.

Getting the tweets

Thanks to rtweet, this is fairly easy. I get rid of empty columns using janitor, which is a package you should really check out if you ever have to clean data.

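library(rtweet)

# last 1,000 tweets from the R-bloggers account
rbloggers <- get_timeline(user = "Rbloggers",
                          n = 1000)

# drop the columns that contain only missing values
rbloggers <- janitor::remove_empty_cols(rbloggers)

readr::write_csv(rbloggers, path = "data/2017-02-28-rbloggerstweets.csv")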

| created_at | status_id | text | source | | retweet_count | favorite_count | lang | user_id | screen_name | | | hashtags | urls | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2017-02-28 20:00:42 | 836667465333673984 | How to annotate a plot in ggplot2 https://t.co/78h18Plc5e #rstats #DataScience | r-bloggers.com | FALSE | 12 | 17 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-Crh | FALSE |
| 2017-02-28 17:00:40 | 836622157295857665 | How to create correlation network plots with corrr and ggraph (and which countries drink like https://t.co/K8g1OkvWMs #rstats #DataScience | r-bloggers.com | FALSE | 14 | 55 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-Cre | FALSE |
| 2017-02-28 16:11:45 | 836609848276037636 | forecast 8.0 https://t.co/XATCVdeoJ8 #rstats #DataScience | r-bloggers.com | FALSE | 15 | 38 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-Crc | FALSE |
| 2017-02-28 15:18:28 | 836596435307081729 | New R job: Applied Research Statistician/Methodologist https://t.co/Hfkanf74aM #rstats #DataScience #jobs | r-bloggers.com | FALSE | 2 | 7 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience,jobs | https://www.r-users.com/jobs/applied-research-statisticianmethodologist/ | FALSE |
| 2017-02-28 13:12:59 | 836564859647033344 | [WEBINAR] Trading in Live Markets using R https://t.co/nNGyQOqUmw #rstats #DataScience | r-bloggers.com | FALSE | 5 | 21 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-Cra | FALSE |
| 2017-02-28 00:40:34 | 836375506945716226 | Video Introduction to Bayesian Data Analysis, Part 2: Why use Bayes? https://t.co/fmHtKr99kw #rstats #DataScience | r-bloggers.com | FALSE | 49 | 101 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-CqZ | FALSE |

Now that I have the data, I’ll have a look at the content of the tweets, at their temporal patterns and at their popularity.

What are the most frequent words?

To find the most frequent words in the tweets I use what has now become my usual workflow with tidytext. I remove four words that correspond to the hashtags used in every tweet (#rstats and #datascience) and to links (https and t.co).

library(dplyr)
library(tidytext)
library(rcorpora)  # provides corpora()

rbloggers <- readr::read_csv("data/2017-02-28-rbloggerstweets.csv")
stopwords <- corpora("words/stopwords/en")$stopWords

# one row per word, counted, without stopwords and ubiquitous tokens
rbloggers_words <- rbloggers %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stopwords) %>%
  filter(!word %in% c("rstats", "datascience",
                      "t.co", "https"))
knitr::kable(head(rbloggers_words, n = 20))
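For readers without tidytext at hand, here is a minimal base R sketch of the same idea (tokenise, drop stopwords, count). It uses three tweet texts that appear elsewhere in this post; the names `tweets` and `toy_stopwords` are mine, and the tiny stopword list is an assumption that stands in for the rcorpora list. The tokenisation is also cruder than what `unnest_tokens()` does.

```r
# Three tweet texts copied from rows shown elsewhere in this post
tweets <- c(
  "Why R is the best data science language to learn today https://t.co/Bp5XouKIZ9 #rstats #DataScience",
  "Free guide to text mining with R https://t.co/o8pkSA3Rke #rstats #DataScience",
  "Deep Learning in R https://t.co/HTHKDy7mnN #rstats #DataScience"
)

# Hypothetical mini stopword list, NOT the rcorpora one
toy_stopwords <- c("why", "is", "the", "to", "with", "in")

# Lowercase, strip links, then split on anything that is not a letter or digit
clean <- gsub("https?://[^[:space:]]+", "", tolower(tweets))
words <- unlist(strsplit(clean, "[^a-z0-9]+"))
words <- words[nzchar(words)]

# Drop stopwords and the two hashtags present in every tweet
words <- words[!words %in% c(toy_stopwords, "rstats", "datascience")]

# Frequency table, most frequent word first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts)
```

On this toy input the only word that occurs more than once is "r", which says more about the sample than about R-bloggers.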

I’m not surprised by the trendy words; I guess you could mix a few of them up and get a pretty cool title, e.g. “How to make an app in RStudio with an interactive map in it” or “let’s analyse data with a regression and plot everything with ggplot2”. I think I’m more surprised that in 1,000 tweets, no word is that predominant.

When are blog posts published?

I’ll start with a warning: R-bloggers tweets appear a bit after the actual blog posts are published, a few hours later I’d say. I actually need a second warning regarding time of day: we have to keep in mind that R blogs can be written from anywhere on the planet, so in theory R-bloggers is an account that never sleeps.

library(lubridate)  # wday(), hour(), week()

rbloggers <- mutate(rbloggers, wday = as.factor(wday(created_at, label = TRUE)))
rbloggers <- mutate(rbloggers, hour = as.factor(hour(created_at)))
rbloggers <- mutate(rbloggers, week = week(created_at))
rbloggers <- mutate(rbloggers, day = as.Date(created_at))

Note that I can group by week number only because there are no tweets from more than one year ago in my data, so a given week number never refers to two different years.
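To see why a longer time span would be a problem, here is a small base R sketch. `week_of_year()` is a hypothetical helper of mine that mimics the documented behaviour of lubridate's `week()` (complete seven-day periods since January 1st, plus one); the year-aware key at the end is one possible workaround, not something used in this post.

```r
# Hypothetical base R stand-in for lubridate::week():
# complete 7-day periods since January 1st, plus one
week_of_year <- function(d) as.POSIXlt(d)$yday %/% 7 + 1

# Same calendar day, one year apart
dates <- as.Date(c("2016-01-22", "2017-01-22"))

# Both dates get the same week number, so group_by(week) would merge them
week_of_year(dates)

# Adding the year to the key keeps the two weeks apart
paste(format(dates, "%Y"), week_of_year(dates), sep = "-W")
```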

Here I’ll show the number of tweets by day of the week.

weekday_dat <- rbloggers %>%
  group_by(week, wday) %>%
  summarize(n = n(), created_at = created_at[1]) 

arrange(weekday_dat, desc(n)) %>%
  head() %>%
  knitr::kable()
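| week | wday | n | created_at |
| --- | --- | --- | --- |
| 4 | Sun | 98 | 2017-01-22 23:50:19 |
| 4 | Mon | 82 | 2017-01-23 23:10:18 |
| 3 | Sat | 65 | 2017-01-21 23:52:03 |
| 4 | Wed | 34 | 2017-01-25 22:10:27 |
| 4 | Fri | 19 | 2017-01-27 23:40:37 |
| 6 | Tues | 19 | 2017-02-07 23:50:44 |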

There are a few days with a lot of tweets, which I guess is due to one blog being added and all its posts being shared at once? In any case, I’ll remove these days from the figure by not showing outliers.

ggplot(weekday_dat) +
  geom_boxplot(aes(wday, n),
               outlier.shape = NA) +
  scale_y_continuous(limits =  quantile(weekday_dat$n, c(0, 0.9)))

[Figure: boxplots of the number of tweets per day, by day of the week, outliers not shown]

I’m not too sure what to conclude as regards a possible day-of-the-week pattern. Maybe I’d need more data, since I don’t even have a full year of data:

## [1] "2016-12-14 19:06:23 UTC"
## [1] "2017-02-28 20:00:42 UTC"

With more data maybe I could say whether R-bloggers, who I think are often not blogging for work, post more on the week-ends. Thinking of programming, week-ends and weekdays makes me think of this very good post of Julia Silge’s.
Similarly for hour of the day (results not shown) I’m a victim, I think, of the size of my dataset. Moreover, even with a bigger sample, I’d still have trouble finding a circadian rhythm since it’d mix tweets from several timezones, without any information about the location of the blog author. Too bad! And with years of data I could even look at seasonality!
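The timezone issue is easy to illustrate with base R. The sketch below takes the timestamp of the most recent tweet in my data, which is stored in UTC, and shows the local hour it corresponds to in two other places; `local_hour()` is a hypothetical helper and the two timezones are arbitrary examples, not known author locations.

```r
# Timestamp of the most recent tweet in the data, stored in UTC
tweet_time <- as.POSIXct("2017-02-28 20:00:42", tz = "UTC")

# Hypothetical helper: hour of the day as seen from a given timezone
local_hour <- function(time, tz) as.integer(format(time, "%H", tz = tz))

local_hour(tweet_time, "UTC")               # evening
local_hour(tweet_time, "America/New_York")  # mid-afternoon
local_hour(tweet_time, "Asia/Tokyo")        # early morning of the next day
```

The same instant is an evening, an afternoon or an early-morning tweet depending on where the author sits, so `hour(created_at)` alone cannot reveal anyone's daily routine.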

How popular are R-bloggers tweets?

I’ll be honest, this is the primary reason why I got interested in R-bloggers’ feed. I wondered how famous it made my poor young blog. Well, if I have to be honest I also wondered how visible an error of mine would be.

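# Assumption: histograms, since the text below discusses
# the shape of the two distributions
ggplot(rbloggers) +
  geom_histogram(aes(retweet_count))

[Figure: distribution of the number of retweets]

ggplot(rbloggers) +
  geom_histogram(aes(favorite_count))

[Figure: distribution of the number of favorites]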

Both look like negative binomial distributions, right? But I don’t want to model them, I’m in a minimalistic mood. Note that the median number of retweets is 8 and the median number of favorites is 19. Let’s see which were the most popular tweets.
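Without fitting anything, a quick way to back up the negative binomial hunch is to compare the mean and the variance of the counts: a Poisson variable has a variance close to its mean, while a negative binomial one is overdispersed. The sketch below uses simulated counts, not the R-bloggers data, and the parameters `size = 2` and `mu = 10` are arbitrary assumptions.

```r
set.seed(42)

# Simulated counts, NOT the R-bloggers data; size and mu are arbitrary
fake_retweets <- rnbinom(1000, size = 2, mu = 10)

# Poisson counts with the same mean, for comparison
fake_poisson <- rpois(1000, lambda = 10)

# Overdispersion: variance much larger than the mean
c(mean = mean(fake_retweets), variance = var(fake_retweets))

# Poisson: variance close to the mean
c(mean = mean(fake_poisson), variance = var(fake_poisson))
```

Running the same two summaries on `rbloggers$retweet_count` and `rbloggers$favorite_count` would show whether the real counts are overdispersed too.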

arrange(rbloggers, desc(retweet_count)) %>%
  head() %>%
  knitr::kable()
| created_at | status_id | text | source | | retweet_count | favorite_count | lang | user_id | screen_name | | | hashtags | urls | | wday | hour | week | day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016-12-29 08:02:09 | 8.143810e+17 | 7 Visualizations You Should Learn in R https://t.co/p7aww3ueiU #rstats #DataScience | r-bloggers.com | FALSE | 119 | 214 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-BEP | FALSE | Thurs | 8 | 52 | 2016-12-29 |
| 2017-01-03 17:42:14 | 8.163389e+17 | Why R is the best data science language to learn today https://t.co/Bp5XouKIZ9 #rstats #DataScience | r-bloggers.com | FALSE | 101 | 129 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-BGQ | FALSE | Tues | 17 | 1 | 2017-01-03 |
| 2017-01-25 03:10:20 | 8.240920e+17 | Free guide to text mining with R https://t.co/o8pkSA3Rke #rstats #DataScience | r-bloggers.com | FALSE | 99 | 191 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-C1p | FALSE | Wed | 3 | 4 | 2017-01-25 |
| 2017-02-24 10:50:47 | 8.350795e+17 | Announcing ggraph: A grammar of graphics for relational data https://t.co/5jKGoK6Dsu #rstats #DataScience | r-bloggers.com | FALSE | 94 | 141 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-Cpx | FALSE | Fri | 10 | 8 | 2017-02-24 |
| 2017-02-07 03:50:39 | 8.288132e+17 | Deep Learning in R https://t.co/HTHKDy7mnN #rstats #DataScience | r-bloggers.com | FALSE | 92 | 189 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-Cer | FALSE | Tues | 3 | 6 | 2017-02-07 |
| 2017-01-28 12:10:33 | 8.253151e+17 | The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users https://t.co/9hsKE2FgHt #rstats #DataScience | r-bloggers.com | FALSE | 91 | 122 | en | 144592995 | Rbloggers | NA | NA | rstats,DataScience | https://wp.me/pMm6L-C6O | FALSE | Sat | 12 | 4 | 2017-01-28 |

I’m happy to report I’d heard of most of these posts. I feel so well informed.

Now I guessed that the number of retweets and favorites were correlated, so I decided to draw a scatterplot which according to one of the posts above is a visualization you should learn.

ggplot(rbloggers) +
  geom_point(aes(retweet_count, favorite_count))

[Figure: scatterplot of favorite_count against retweet_count for the R-bloggers tweets]

Wow! I was so surprised by this plot that I decided to make the same for another R content aggregator account, R Weekly.

rweekly_org <- get_timeline(user = "rweekly_org",
                          n = 1000)

ggplot(rweekly_org) +
  geom_point(aes(retweet_count, favorite_count)) +
  ggtitle("rweekly_org's timeline")

[Figure: scatterplot of favorite_count against retweet_count for the rweekly_org tweets]

Well, here we have far fewer tweets, sadly. I still think there might be a golden ratio of some sort hidden here, so I’ll fit linear models to both datasets.

model <- lm(favorite_count ~ retweet_count, data = rbloggers)
broom::tidy(model) %>% knitr::kable()
model2 <- lm(favorite_count ~ retweet_count, data = rweekly_org)
broom::tidy(model2) %>% knitr::kable()

Now I just hope that if we collected more data for both accounts, the second coefficient estimate would be close to the golden ratio, about 1.618. Or I could let social media specialists explain to me why retweets and favorites have this correlation. Or where the mistake in my post is, which I’d like to know before it gets sort of viral thanks to R-bloggers.
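A less hopeful way to check the golden ratio idea would be to look at the confidence interval of the slope rather than at the point estimate alone. Here is a sketch on simulated data, not on the real tweets: the data frame `fake`, the true slope of 1.6 and the noise level are all assumptions of mine, chosen only so that the code runs on its own.

```r
set.seed(1)

# Simulated stand-in for the tweets, NOT real data:
# favorites are about 1.6 times the retweets, plus some noise
fake <- data.frame(retweet_count = rpois(500, lambda = 10))
fake$favorite_count <- 1.6 * fake$retweet_count + rnorm(500, sd = 2)

fit <- lm(favorite_count ~ retweet_count, data = fake)

# Slope estimate and its 95% confidence interval
slope <- coef(fit)[["retweet_count"]]
ci <- confint(fit, "retweet_count", level = 0.95)

# Is the golden ratio a plausible value for the slope?
golden_ratio <- (1 + sqrt(5)) / 2
inside <- ci[1, 1] <= golden_ratio && golden_ratio <= ci[1, 2]

slope
ci
inside
```

Replacing `fake` with `rbloggers` or `rweekly_org` gives the same check for the two models above.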
