Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Scraping and transforming the list of blogs

The list of contributing blogs is not an html table which makes me things more interesting. I had to look a bit at the source of the html page to come up with a solution to transform it. I wasn’t too slow because I’ve done something similar with Wikipedia data in the past, something that’s still an unfinished building, of course, but whose code I could re-use.

I used rvest for first scraping the data but also stringr for manipulating strings. What I like the most about stringr compared to manipulating strings with base R are the names of functions, that reflect what they do so well, and the order of arguments, with the character vector first. Nick, who has a great R blog by the way, told me in the comments of one post I had “some pretty ninja grep skills”. Although I think he exaggerates a little bit, I’ll try not to disappoint him!

First I wanted to get a character vector for each blog.

library("rvest")
library("purrr")
library("tibble")
library("stringr")
blogs_list

## {xml_document}
##
## [1] \n

blogs_list <- html_nodes(blogs_list, "ul")
blogs_list

## {xml_nodeset (10)}
##  [1] git\n\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t\n  \n

blogs_list <- blogs_list[str_detect(blogs_list, "xoxo blogroll")][2]
blogs_list

## {xml_nodeset (1)}
## [1] \n

blogs_list <- toString(blogs_list)
blogs_list <- str_split(blogs_list, "

## [1] " href=\"http://dmlc.ml\" onclick=\"_gaq.push(['_trackEvent', 'outbound-widget', 'http://dmlc.ml', ' R stats']);\"> R stats\n"
## [2] " href=\"http://lionel-.github.io\" onclick=\"_gaq.push(['_trackEvent', 'outbound-widget', 'http://lionel-.github.io', ' rstats']);\"> rstats\n"
## [5] " href=\"https://rveryday.wordpress.com/\" onclick=\"_gaq.push(['_trackEvent', 'outbound-widget', 'https://rveryday.wordpress.com/', '(R)very Day']);\">(R)very Day\n"
## [6] " href=\"http://www.talyarkoni.org/blog\" onclick=\"_gaq.push(['_trackEvent', 'outbound-widget', 'http://www.talyarkoni.org/blog', '[citation needed] » R']);\">[citation needed] » R\n"


OK so now from each element of blogs_list I want to extract the name of the blog, e.g. ‘– R stats’, and its address, e.g. ‘http://dmlc.ml’. I’ll first extract the links because using the link it’ll be easier to extract the blog name. Note the use of map_chr: using this variant of purrr function map, you directly get a character vector instead of a list.

extract_link <- function(x){
x <- str_replace(x,
pattern = "href=\\\"",
replacement = "")
x <- str_replace(x,
pattern = "\\\" onclick=.*\\n",
replacement = "")
x <- str_trim(x)
return(x)
}

# the last one is a special snowflake
"\\\n\\\t<\\/ul>", "")


## [1] "http://dmlc.ml"
## [2] "http://lionel-.github.io"
## [5] "https://rveryday.wordpress.com/"
## [6] "http://www.talyarkoni.org/blog"


So let’s now extract names!

extract_names <- function(x, address){
x <- str_replace_all(x, paste0(".*", address, "', '"), "")
x <- str_replace_all(x, "]\\)\\;.*\\\n", "")
x <- str_replace(x, "'", "")
return(x)
}

# last one is a special snowflake again
"\\\n\\\t<\\/ul>", "")

## [1] " R stats"             " rstats"              "Jean Arreola"
## [4] "R you ready?"        "(R)very Day"           "[citation needed] » R"


Now I can build a tibble of names and addresses of blogs!

blogs_info <- tibble(name = blogs_names,


There are 746 blogs. The list also includes blogs that are no longer updated apparently, but that contributed content to R-bloggers at one point. My own blog isn’t in the list yet which is fine because “Maëlle” is a pretty unique blog name but neither funny nor witty, except if testing encoding issues is funny or witty.

Before starting to have fun looking at blog names and addresses, I’ll extract the extension from the addresses.

extract_extension <- function(df){
x <- str_replace(df$address, "feed\\.r-bloggers\\.xml", "") x <- str_replace(x, "\\.xml", "") x <- str_replace(x, "\\.html", "") x <- str_replace(x, "\\.php", "") x <- str_split(x, "\\.", simplify = TRUE) x <- x[length(x)] x <- str_replace_all(x, "\\/.*", "") return(x) } blogs_info <- by_row(blogs_info, extract_extension, .to = "extension", .collate = "cols") knitr::kable(head(blogs_info, n = 20))  name address extension – R stats http://dmlc.ml ml – rstats http://lionel-.github.io io –Jean Arreola– https://jean9208.github.io/rss-R.xml io “R” you ready? http://ryouready.wordpress.com com (R)very Day https://rveryday.wordpress.com/ com [citation needed] » R http://www.talyarkoni.org/blog org [R] tricks http://rtricks.wordpress.com com /mgritts_ http://mgritts.com/feed.r.xml r #untitled http://blog.ambodi.com/ com 0xCAFEBABE http://xcafebabe.blogspot.com/search/label/R com 4D Pie Charts » R http://4dpiecharts.com com 56north | Skræddersyet dataanalyse » Renglish http://www.56n.dk dk A blog from Sydney http://a-blog-from-sydney.blogspot.com/search/label/R com A Blog On Data Analytics http://aliarsalankazmi.github.io/blog_DA/ io A HopStat and Jump Away » Rbloggers http://hopstat.wordpress.com/ com a Physicist in Wall Street http://aphysicistinwallstreet.blogspot.com/search/label/R com A Pint of R http://rpint.wordpress.com com A Statistics Blog – R http://statistics.rainandrhino.org/R.xml org Actuarially (Matt Malin) http://www.actuarially.co.uk/ uk Adventures in Analytics and Visualization http://analyticsandvisualization.blogspot.com/search/label/R com # Analysing blog info ## How and where do people blog? Out of the 746 blogs there are 156 blogspot.com blogs, 181 wordpress.com blogs and 53 github.io blogs. I guess most github.io blogs are knitr-jekyll-based like mine, or more modern blogdown blogs. For blogs with a specific domain name, e.g. Julia Silge’s excellent “data science ish” blog who has “juliasilge” as domain name, it’s more difficult to know how people blog. Well I know Julia uses knitr-jekyll. What about extensions? library("dplyr") blogs_info %>% group_by(extension) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n = 16) %>% knitr::kable()  extension n com 533 io 53 org 39 net 27 de 11 info 7 me 7 uk 7 nl 6 co 4 eu 4 name 4 ca 3 edu 3 es 3 fr 3 I totally stopped at “fr” on purpose. Without surprise given the number of blogspot/wordpress blogs “.com” is a favourite, and then “.io”. I’m quite happy to see 4 fellow “.eu” bloggers. But I’ll admit regretting I didn’t choose “.guru” like this blog because I’d have loved to become a R guru. ## So, what’s in blog names? The first thing I noticed when looking at the names of blogs was the high frequency of puns! There’s the “Hallway Mathlete”, “Once Upon a Data”, “i’m a chordata! urochordata! » R” (too bad this website doesn’t seem to exist any longer), etc. And well these were regular data / statistics puns… then we have the R puns! “Shirin’s playgRound”, ““R” you ready?”, “(R)very Day”… R Bloggers are a bit like hair dressers, said my husband when I told him about some names. Another not surprising feature of some blog names is to include something about statistics, like David Robinson’s “Variance Explained”. Maybe we could resolve the Bayesians vs. frequentist feud by looking at blog names? Well 4 have a name including “Bayes”, 0 a “frequentist” name. That said, I can have missed a pun indicating the author’s position on the subject. Names of blogs can also indicate the field where the author applies R, e.g. 15 blogs include something “bio”-ish and 3 contain the word “finance”. Also, in case you want a bit of action in your life, there are 6 “adventures” blogs, and 3 “journey” blogs. By the way, my PhD advisor’s blog, “Theory meets practice…” can also help you in your daily adventures, from choosing a Tinder profile to knowing which Kinder Surprise eggs contain a figure. After this random exploration of blog names, I realized you’d all be waiting for a wordcloud. Or so I think. So let’s make a wordcloud of R blog names! library("wordcloud") words <- toString(blogs_info$name)
words <- str_split(words, pattern = " ", simplify = TRUE)
set.seed(1)
wordcloud(words, colors = viridis::viridis_pal(end = 0.8)(10),
min.freq = 3, random.color = TRUE)


# Conclusion

You might be disappointed that I don’t have any good advice for naming your blog. Sure, looking at existing names you can choose one for fitting in (including words like “data”, “rstats”, etc.) or a very original one, but does it predict success? I guess it’s better, however, to have a really cool blog name if you start printing t-shirts for your readers.

Another thing you might need to name during your R journey/adventures is a package. If you’re interested in this issue, you might want to read Alex Whant’s post “What’s in a [package] name?”.

And last but not least, thanks again to Tal for creating and maintaining R-bloggers!