Your and my 2019 R goals
Here we go again, using a Twitter trend as blog fodder! Colin Fay launched an inspiring movement by sharing his R goals for 2019.
My #RStats goals for 2019:

1️⃣ Becoming entirely fluent with {data.table}
2️⃣ Getting at ease with {Rcpp}

What are yours? #rdatatable #rcpp

— Colin Fay (@_ColinFay) December 29, 2018
It’s been quite interesting reading the objectives of other tweeps: what they want to learn or make, how they want to get involved in the community, etc. As Mike Kearney, rtweet’s maintainer, underlined, it is excellent reading material!
Excellent reading material – tweets about 2019 #rstats goals: https://t.co/6wrGeqsWbm

— Mike Kearney (@kearneymw) December 31, 2018
… but also blogging material! Let me fetch and tokenize these tweets to summarize them!
Disclaimer: I later saw that Jason Baik had the same idea and was faster than I was; you can find his analysis here.
Collect Twitter data
If you’re using rtweet for the first time, check out its website for information about setup and use, and also refer to the Twitter API docs themselves to learn more about rate limits, e.g. that the search endpoint won’t return tweets older than 6-9 days.
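If you need to set up authentication first, here is a minimal sketch, assuming you have registered a Twitter app beforehand; the app name and key strings below are placeholders of mine, not real values.

# Minimal authentication sketch; the app name and keys are placeholders
token <- rtweet::create_token(
  app             = "my_r_app",            # hypothetical app name
  consumer_key    = "YOUR_CONSUMER_KEY",   # placeholder, not a real key
  consumer_secret = "YOUR_CONSUMER_SECRET" # placeholder, not a real secret
)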
# Search recent tweets matching the query, excluding retweets
tweets <- rtweet::search_tweets("Rstats goals 2019", include_rts = FALSE)
I obtained 87 tweets from 85 unique users. Definitely not big data, but not bad!
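For the record, here is how one could compute those numbers, a small sketch assuming the flat screen_name column that rtweet returned at the time of writing:

# Number of tweets collected and number of distinct tweet authors;
# assumes rtweet's data frame has a screen_name column (as it did then)
nrow(tweets)
dplyr::n_distinct(tweets$screen_name)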
Tokenize tweets
I then set out to tokenize the tweets into words using the specific tokenizers::tokenize_tweets() tokenizer via the tidytext package. If you’re new to tidytext, I’d recommend reading the book written by its authors. A token in natural language processing can be a word, a line, etc., which is a totally different concept from a token for rtweet functions (your API credentials).
The tweet tokenization is a “tokenization by word that preserves usernames, hashtags, and URLs”. So awesome, and today is the first time I’ve found an occasion to use it! I also removed stopwords.
library("magrittr") stopwords <- rcorpora::corpora("words/stopwords/en")$stopWords tokens <- tweets %>% dplyr::select(text) %>% tidytext::unnest_tokens(token, text, token = "tweets", drop = FALSE) %>% dplyr::filter(!token %in% stopwords)
Analyze tweets
Most mentioned topics
I first drew a figure similar to Jason Baik’s, showing the most common tokens. Like him, I removed digits.
library("ggalt") tokens %>% dplyr::mutate(token = stringr::str_remove_all(token, "[^\x01-\x7F]")) %>% dplyr::mutate(token = stringr::str_remove_all(token, "[[:digit:]]")) %>% dplyr::filter(! token %in% c("", "#rstats", "goals")) %>% dplyr::count(token, sort = TRUE) %>% dplyr::mutate(token = reorder(token, n)) %>% head(n = 18) %>% ggplot() + geom_lollipop(aes(token, n), size = 1.5, col = "salmon") + hrbrthemes::theme_ipsum(base_size = 12, axis_title_size = 12) + coord_flip()
What actions?
In this figure I identify verbs like learn, finish, write, build and contribute. Let me look at a few sample lines for each of them.
lines <- tweets %>%
  dplyr::select(text) %>%
  tidytext::unnest_tokens(line, text, token = "lines")

# Sample three goal lines containing a given verb
sample_verb <- function(verb, lines){
  set.seed(42)
  dplyr::filter(lines, stringr::str_detect(line, paste0(verb, " "))) %>%
    dplyr::sample_n(3)
}

samples <- purrr::map_df(c("learn", "finish", "write", "build", "contribute"),
                         sample_verb, lines)
knitr::kable(samples)
| line |
|---|
| 3️⃣ learn how to make r packages and write my code so it could be made into an r package more easily |
| 1. learn how to do spatial analysis in r |
| 2️⃣ learn better way to automate feature engineering (neural nets) for text |
| 1️⃣ finally finish all the courses and certifications i started last year on #coursera and #datacamp |
| 2️⃣ finish my track and field r package |
| 3). finish that text mining project i started in october |
| 4️⃣ write an advanced shiny book with bookdown |
| 3️⃣ learn how to make r packages and write my code so it could be made into an r package more easily |
| 1️⃣ write the htmlwidgets book |
| - build my first #rstats package (aiming for 2 but 1 would be great :d) |
| 2⃣ build a shiny web app to explore tx staar data |
| - use f(x) regularly & build own package. cease patching. |
| 5⃣ contribute to foss https://t.co/oh7mwcq50r |
| 2 contribute more to #rstats community through #scicomm, #stackoverflow, etc |
| 3) contribute to #swdchallenge (with r, duh) |
These actions are quite varied: writing, for instance, is applied to software as well as to reading material. My goal was to summarize the tweets, but I keep thinking that reading all of them is interesting!
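Out of curiosity, one can also count how many goal lines mention each of these verbs; a quick sketch reusing the lines table from above, not part of the original figure:

# Count the lines containing each verb, reusing `lines` from the chunk above
verbs <- c("learn", "finish", "write", "build", "contribute")
purrr::set_names(verbs) %>%
  purrr::map_int(~ sum(stringr::str_detect(lines$line, paste0(.x, " "))))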
Packages?
I wondered how many of the tokens correspond to a package name. I limited myself to CRAN packages, using the available.packages() function, but one could have a look at the source code of the available package to get an idea of how to find names of packages from Bioconductor and GitHub.
cran_pkgs <- as.character(
  available.packages(contrib.url('https://cran.r-project.org', 'source'))[, "Package"])

pkg_tokens <- dplyr::mutate(tokens, token = gsub("#", "", token)) %>%
  dplyr::filter(token %in% cran_pkgs)
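For completeness, here is one possible way to also cover Bioconductor package names, assuming the BiocManager package is installed; this only mirrors the idea, it is not the available package’s actual code:

# List package names from the Bioconductor (and CRAN) repositories;
# assumes BiocManager is installed. Illustrative sketch only.
bioc_pkgs <- rownames(
  available.packages(repos = BiocManager::repositories()))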
Using this data, I’ll look at the tweets mentioning the most packages, and at the most frequently mentioned packages.
pkg_tokens %>%
  dplyr::group_by(text) %>%
  dplyr::mutate(pkg_text = paste(toString(token), text)) %>%
  dplyr::count(pkg_text, sort = TRUE) %>%
  head(n = 3) %>%
  dplyr::pull(pkg_text)

## [1] "portfolio, blogdown, rmarkdown, knitr, shiny, maps I really like seeing all these #rstats 2019 goals. My own, in order of urgency:\n1) Finish my personal website and online portfolio using blogdown\n2) Get rolling with project workflows, rmarkdown, and knitr \n3) Create shiny apps for custom interactive maps"
## [2] "inference, projects, import, rvest, httr, xml2 My #rstats 2019 goals:\n1. Improve my statistical modeling and inference skills\n2. Develop business literacy and apply it in data analysis projects\n3. Continue to post on my blog (1 post every 2 months)\n4. Learn to import data using DBI, rvest, httr, and xml2"
## [3] "shiny, templates, shiny, shiny, bookdown, shiny My #RStats goals for 2019: \n\n1<U+FE0F><U+20E3> Improve shinydashboardPlus, bs4Dash and argonDash ..<U+0001F973> \n\n2<U+FE0F><U+20E3> Release new shiny templates \n3<U+FE0F><U+20E3> Open a consulting service for https://t.co/k3PAbxyVMa about shiny \n4<U+FE0F><U+20E3> Write an advanced shiny book with bookdown <U+0001F388>\n#rstats #shiny #consulting https://t.co/Fyc7MhaeW8"
There are false positives, e.g. projects was meant here as a word, not a package name. What about the most popular packages among the tweets?
dplyr::count(pkg_tokens, token, sort = TRUE)

## # A tibble: 66 x 2
##    token         n
##    <chr>     <int>
##  1 shiny        16
##  2 blogdown      9
##  3 projects      9
##  4 rmarkdown     9
##  5 tidyverse     6
##  6 bookdown      5
##  7 purrr         4
##  8 track         4
##  9 caret         3
## 10 markdown      3
## # ... with 56 more rows
In this table we get a glimpse of currently popular packages, leaving aside the false positives “projects”, “track” and “markdown”. If I’m reading the list correctly, they’re all developed at RStudio!
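One could prune such false positives with a small hand-made stoplist before counting; the stoplist below is illustrative only:

# Tokens that are CRAN package names but were clearly used as plain
# words in these tweets (hand-made, illustrative list)
false_positives <- c("projects", "track", "markdown")
pkg_tokens %>%
  dplyr::filter(!token %in% false_positives) %>%
  dplyr::count(token, sort = TRUE)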
Conclusion
In this post I followed an approach similar to Jason Baik’s to summarize tweets about 2019 R goals announced on Twitter: I collected tweets with rtweet and then used tidytext and the tidyverse to summarize them. Goals often included learning about stuff, building packages (find my list of resources and don’t miss this offer by Steph de Silva), and mentions of RStudio packages.
What about my own R goals, which I haven’t tweeted? I haven’t made a list, but I have exciting projects at work and hope to keep posting semi-consistently on this blog. In January I’ll also get to start 2019 by giving two R talks: one at R-Ladies Paris and a remote one at ConectaR 2019! Happy 2019, I hope you can meet your own R goals!