Which science is all around? #BillMeetScienceTwitter

[This article was first published on Maëlle, and kindly contributed to R-bloggers.]

I’ll admit I didn’t really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy’s, that’s all I knew. But science is all around, and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists, to whose animals I dedicated a blog post? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag, with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.

Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn (part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?

Getting the tweets

It might sound a bit like trolling by now, but if you wanna get Twitter data, I recommend using rtweet because it’s a good package and because it’s going to replace twitteR which you might know from other blogs.

I only keep tweets in English, and moreover original ones, i.e. not retweets.

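library("rtweet")
# Search recent tweets using the hashtag (asking for up to 18000 of them)
billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent")
# Drop duplicated rows
billmeet <- unique(billmeet)
# Keep tweets in English that are not retweets
billmeet <- dplyr::filter(billmeet, lang == "en")
billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)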

I’ve ended up with 2491 tweets.

Classifying the tweets

I’ve chosen to use this taxonomy classifier, which classifies text according to generic topics and had quite a few stars on the Monkeylearn website. I don’t think it was trained on tweets, and it wasn’t trained to classify science topics in particular, which is not optimal, but it had the merit of being readily available. I’ve still not started training my own algorithms, and anyway, if I did, I’d start by creating a very crucial algorithm for determining animal fluffiness in pictures, not text mining stuff. This was a bit off topic, let’s go back to science Twitter!

When I decided to use my own package, I had forgotten it took charge of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I’d have to do that splitting myself, but no, since I did it once in the code of the package, I’ll never need to write that code ever again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a bit before getting all results.

output <- monkeylearn::monkeylearn_classify(request = billmeet$text,
                                            classifier_id = "cl_5icAVzKR")
str(output)
## Classes 'tbl_df', 'tbl' and 'data.frame':	4466 obs. of  4 variables:
##  $ category_id: int  64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ...
##  $ probability: num  0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ...
##  $ label      : chr  "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ...
##  $ text_md5   : chr  "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ...
##  - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame':	0 obs. of  0 variables
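
The splitting into groups of 20 that the package handles can be pictured with a few lines of base R. This is only a sketch of the idea, not the package's actual code: `split_into_batches()` is a hypothetical helper and the texts are made up.

```r
# Hypothetical helper, not part of monkeylearn: cut a character vector
# into consecutive groups of at most `size` elements.
split_into_batches <- function(texts, size = 20) {
  # ceiling(i / size) is 1 for the first `size` elements, 2 for the next ones...
  group <- ceiling(seq_along(texts) / size)
  unname(split(texts, group))
}

fake_tweets <- paste("tweet number", 1:45)  # made-up input
batches <- split_into_batches(fake_tweets)

length(batches)   # number of requests that would be needed
lengths(batches)  # size of each batch
```

Each element of `batches` could then be sent as one request, and the results bound together afterwards.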

In the output, the package creator decided not to put the whole text corresponding to each line but its digested form, computed with the MD5 algorithm. So to join the output to the tweets again, I’ll first have to digest the tweets, which I do by just copying the code from the package. After all, I wrote it. It may be the only time in my whole life I successfully used vapply.

billmeet <- dplyr::mutate(billmeet, text_md5 = vapply(X = text,
                                                    FUN = digest::digest,
                                                    FUN.VALUE = character(1),
                                                    USE.NAMES = FALSE,
                                                    algo = "md5"))
billmeet <- dplyr::select(billmeet, text, text_md5)
output <- dplyr::left_join(output, billmeet, by = "text_md5")
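
To make the join key less mysterious, here is a minimal sketch on made-up texts. It assumes the digest package is installed; `fake_tweets` and `fake_output` are invented stand-ins for `billmeet` and for what the API returns, and base R's `merge()` plays the role of `left_join()`.

```r
# Made-up stand-in for the tweets table
fake_tweets <- data.frame(text = c("I study whales", "I build lasers"),
                          stringsAsFactors = FALSE)

# Same recipe as above: one MD5 hash per text
fake_tweets$text_md5 <- vapply(X = fake_tweets$text,
                               FUN = digest::digest,
                               FUN.VALUE = character(1),
                               USE.NAMES = FALSE,
                               algo = "md5")

# Made-up stand-in for the classifier output, which only carries the hash
fake_output <- data.frame(label = c("Aquatic Mammals", "Chemistry"),
                          text_md5 = fake_tweets$text_md5,
                          stringsAsFactors = FALSE)

# Joining on the hash brings the original text back next to each label
joined <- merge(fake_output, fake_tweets, by = "text_md5", all.x = TRUE)
joined
```

The join only works because both sides were hashed with exactly the same call, which is why copying the code from the package is the safe option.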

Looking at this small sample, some things make sense, others make less sense, either because the classification isn’t good or because the tweet looks like spam. Since my own field isn’t text analysis, I’ll consider myself happy with these results, but I’d of course be happy to read any better version of it.

As in my #first7jobs post, I’ll make a very arbitrary decision and keep only the labels to which a probability higher than 0.5 was attributed.

output <- dplyr::filter(output, probability > 0.5)

This covers a share of 0.45 of the original sample of tweets. I can only hope it’s a representative sample.
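
The post does not show how this share was computed. One plausible way, given here purely as an assumption, is to divide the number of distinct tweets that still have a label by the number of tweets collected. The data below are made up.

```r
# Made-up example: 4 collected tweets, identified by their hash
all_tweets <- data.frame(text_md5 = c("a", "b", "c", "d"),
                         stringsAsFactors = FALSE)

# Made-up classifier output: one row per (tweet, label) pair
labels <- data.frame(text_md5 = c("a", "a", "b", "c", "d"),
                     probability = c(0.9, 0.6, 0.3, 0.7, 0.2),
                     stringsAsFactors = FALSE)

kept <- labels[labels$probability > 0.5, ]

# Count distinct tweets, not rows, since a tweet can keep several labels
share <- length(unique(kept$text_md5)) / nrow(all_tweets)
share
```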

How many labels do I have by tweet?

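library("dplyr")  # provides %>%, n() and desc(), used from here on

# No grouping column is passed to group_by() below, so n() counts all the
# rows of output; grouping by text_md5 would count labels per tweet.
dplyr::group_by(output) %>%
  dplyr::summarise(nlabels = n()) %>%
  dplyr::group_by(nlabels) %>%
  dplyr::summarise(n_tweets = n()) %>%
  knitr::kable()

| nlabels | n_tweets |
|--------:|---------:|
|    1415 |        1 |

The table has a single row: 1415 labels are left after filtering. Since the rows were not grouped by tweet, this is a total rather than a per-tweet count, so it does not tell whether each tweet has only one label.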
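
To count labels per tweet, the rows have to be grouped by a tweet identifier such as `text_md5`. A minimal sketch on made-up data, assuming dplyr is installed:

```r
# Made-up classifier output: one row per (tweet, label) pair
fake_output <- data.frame(text_md5 = c("a", "a", "b", "c", "d", "d"),
                          label = c("Biology", "Health", "Physics",
                                    "Chemistry", "Biology", "Chemistry"),
                          stringsAsFactors = FALSE)

# Step 1: number of labels for each tweet
per_tweet <- dplyr::count(fake_output, text_md5, name = "nlabels")

# Step 2: number of tweets for each number of labels
label_dist <- dplyr::count(per_tweet, nlabels, name = "n_tweets")
label_dist
```

Run on the real `output`, the same two steps would give one row per distinct number of labels.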

Looking at the results

I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter has more biomedical stuff. Now, even if my results were to support this assumption, note that this could as well be because it’s easier to classify biomedical tweets.

I’ll first show a few examples of tweets for given labels.

dplyr::filter(output, label == "Chemistry") %>%
  head(n = 5) %>%
  knitr::kable()

| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64701 | 0.530 | Chemistry | e82fc920b07ea9d08850928218529ca9 | Hi @billnye I started off running BLAST for other ppl but now I have all the money I make them do my DNA extractions #BillMeetScienceTwitter |
| 64701 | 0.656 | Chemistry | d21ce4386512aae5458565fc2e36b686 | .@uw’s biochemistry dept - home to Nobel Laureate Eddy Fischer & ZymoGenetics co founder Earl Davie… https://t.co/0nsZW3b3xu |
| 64701 | 0.552 | Chemistry | 1d5be9d1e169dfbe2453b6cbe07a4b34 | Yo @BillNye - I’m a chemist who plays w lasers & builds to study protein interactions w materials #BillMeetScienceTwitter |
| 64701 | 0.730 | Chemistry | 1b6a25fcb66deebf35246d7eeea34b1f | Meow @BillNye! I’m Zee and I study quantum physics and working on a Nobel prize. #BillMeetScienceTwitter https://t.co/oxAZO5Y6kI |
| 64701 | 0.873 | Chemistry | 701d8c53e3494961ee7f7146b28b9c8c | Hi @BillNye, I’m a organic chemist studying how molecules form materials like the liquid crystal shown below.… https://t.co/QNG2hSG8Fw |

dplyr::filter(output, label == "Aquatic Mammals") %>%
  head(n = 5) %>%
  knitr::kable()

| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64609 | 0.515 | Aquatic Mammals | f070a05b09d2ccc85b4b1650139b6cd0 | Hi Bill, I am Anusuya. I am a palaeo-biologist working at the University of Cape Town. @BillNye #BillMeetScienceTwitter |
| 64609 | 0.807 | Aquatic Mammals | bb06d18a1580c28c255e14e15a176a0f | Hi @BillNye! I worked with people at APL to show that California blue whales are nearly recovered #BillMeetScienceTwitter |
| 64609 | 0.748 | Aquatic Mammals | 1ca07aad8bc1abe54836df8dd1ff1a9d | Hi @BillNye! I’m researching marine ecological indicators to improve Arctic marine monitoring and management… https://t.co/pJv8Om4IeI |
| 64609 | 0.568 | Aquatic Mammals | a140320fcf948701cfc9e7b01309ef8b | More like as opposed to vaginitis in dolphins or chimpanzees or sharks #BillMeetScienceTwitter https://t.co/gFCQIASty1 |
| 64609 | 0.520 | Aquatic Mammals | 06d1e8423a7d928ea31fd6db3c5fee05 | Hi @BillNye I study visual function in ppl born w/o largest connection between brain hemispheres #callosalagenesis… https://t.co/WSz8xsP38R |

dplyr::filter(output, label == "Internet") %>%
  head(n = 5) %>%
  knitr::kable()

| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64640 | 0.739 | Internet | f7b28f45ea379b4ca6f34284ce0dc4b7 | @BillNye #AskBillNye @BillNye join me @AllendaleCFD. More details at https://t.co/nJPwWARSsa #BillMeetScienceTwitter |
| 64640 | 0.725 | Internet | b2b7843dc9fcd9cd959c828beb72182d | @120Stat you could also use #actuallivingscientist #womeninSTEM or #BillMeetScienceTwitter to spread the word about your survey as well |
| 64640 | 0.542 | Internet | a357e1216c5e366d7f9130c7124df316 | Thank you so much for the retweet, @BillNye! I’m excited for our next generation of science-lovers!… https://t.co/B3iz3KVCOQ |
| 64640 | 0.839 | Internet | 61712f61e877f3873b69fed01486d073 | @ParkerMolloy Hi @BillNye, Im an elem school admin who wants 2 bring in STEM/STEAM initiatives 2 get my students EX… https://t.co/VMLO3WKVRv |
| 64640 | 0.924 | Internet | 4c7f961acfa2cdd17c9af655c2e81684 | I just filled my twitter-feed with brilliance. #BIllMeetScienceTwitter |

Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.

library("ggplot2")
library("viridis")

label_counts <- output %>% 
  dplyr::filter(label != "Internet") %>%
  dplyr::group_by(label) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::arrange(desc(n))

label_counts <- label_counts %>%
  dplyr::mutate(label = ifelse(n < 5, "others", label)) %>%
  dplyr::group_by(label) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::arrange(desc(n))

label_counts <- dplyr::mutate(label_counts,
                        label = factor(label,
                                        ordered = TRUE,
                                        levels = unique(label)))

ggplot(label_counts) +
  geom_bar(aes(label, n, fill = label), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "plasma") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1,
                                   vjust = 1),
        text = element_text(size = 25),
        legend.position = "none")

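[Figure: bar plot of the number of occurrences of each label, with "Internet" excluded and labels occurring fewer than 5 times grouped as "others"]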

In the end, I’m always skeptical when looking at the results of such classifiers, and at the quality of my sample to begin with, but then I doubt there ever was a hashtag that was used perfectly, only to answer the question and not to spam it or comment on it (which is what I’m doing). Still, I’d say the plot seems to support my husband’s hypothesis about biomedical stuff.

I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet, in order to be able to look through them next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with “performing arts”, perfect way to end this post.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.
