
How many R related books have been published so far? Who is the most popular publisher? How many other manuals, tutorials and books have been published online? Let’s find out.

A few years ago I used the publication list on r-project.org as an argument that R is an established statistical programming language. I believe at the time there were about 20 R related books available.

A recent post on Recology pointed me to a talk given by Ed Goodwin at the Houston R user group meeting about regular expressions in R, something I always wanted to learn properly, but never got around to doing.

So let’s see if we can manage to extract the information about published R books and texts from r-project.org, using what we learned from Ed about regular expressions in R.

We will start analysing the bib-file on r-project.org by publisher and then move on to look more closely at the number of titles published over time, including the self-published PDF-files on CRAN.

We read the bib-file into R using the readLines function and start analysing the data with regular expressions. The function regexpr finds, for each line, the character position at which “publisher =” starts, or returns -1 if no match is found. We then use the R function strsplit to cut the matching strings into sub-components for further analysis:
bibfile <- readLines("https://www.r-project.org/doc/bib/R-books.bib")
pub.start.pos <- regexpr("publisher =", bibfile, perl=TRUE)
pub.lines <- which(pub.start.pos > 0)
pub.split <- strsplit(bibfile[pub.lines], "[ =,]", perl=TRUE)
publishers <- sapply(pub.split, function(x) paste(x[-c(1:5)]))
publishers <- gsub("[{}\"),\\]", "", publishers)
publishers <- gsub("c\\(", "", publishers)
## Consolidate variant publisher names
s <- c("Springer", "Wiley", "Sage", "Chapman & Hall", "CRC press", "Servicio")
r <- c("Springer", "Wiley", "Sage", "Chapman & Hall/CRC",
       "Chapman & Hall/CRC", "Universidad de La Rioja")
for(i in seq(along=s)){
  publishers[regexpr(s[i], publishers, ignore.case=TRUE) > 0] <- r[i]
}
pubs <- table(publishers)
pubs <- pubs[order(pubs)]
opar <- par(mar = c(4, 15, 4, 2))
barplot(pubs, horiz=TRUE, las=1,
        xlab=format(Sys.time(), "As at %d %b %Y"),
        border=NA, main="Number of R books\nby publisher")
par(opar)
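As a quick illustration of the regexpr/strsplit idiom used above, here is a toy example with made-up bib lines (not taken from the actual R-books.bib file):

```r
## Toy lines standing in for the real bib-file content
lines <- c("  publisher = {Springer},",
           "  year = 2010,",
           "  title = {Some R Book},")

## regexpr returns the match position per line, or -1 where there is no match
pos <- regexpr("publisher =", lines)

## Only the first line matches, so which() picks it out
pub.lines <- which(pos > 0)

## Splitting on spaces, '=' and ',' breaks that line into tokens,
## with the publisher name appearing after a few empty tokens
tokens <- strsplit(lines[pub.lines], "[ =,]")[[1]]
```

Running this shows that `pos` is -1 for the non-matching lines, which is why the `which(pos > 0)` step in the main code reliably selects only the “publisher =” entries.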

We note that Springer is by far the most popular publisher for R related books. Thus, if you are looking for a specific topic around R your safest bet would be to check out Springer's portfolio.

However, although Springer is currently the publisher with the highest appetite for R, you may be able to find the information free online on r-project.org, in particular if you are looking for a tutorial-like document or book.

Hence we want to compare the number of R books published traditionally versus the PDF-files contributed online on CRAN: http://cran.r-project.org/doc/contrib/. CRAN is in this respect also a bit of a publisher, as I assume that the people behind CRAN apply some kind of filtering and QA process.

We use the XML package to read the online directory content of the contributed books into R to get a better understanding of the published PDF-files. In a similar approach to above, we analyse the R books and PDF files published by year. Please find the R code below the charts.

From the two charts below we can see that over the years more and more R texts have been made available, illustrating the increased interest in R. The first chart shows the number of books/documents published in each year, while the second chart shows the same data in a cumulative way.

Today there are 206 R related books available, either on CRAN or via your bookshop. Of the 206 texts, 113 are published in the traditional sense with a publishing house, and the number is still growing. However, the growth has slowed down a bit over recent years, after a peak of 26 new books in 2009.

From the second chart I can see that I must have had the discussion about R I mentioned earlier around 2004 - 2005.

## Continue from the above R code
year.start.pos <- regexpr("year =", bibfile, perl=TRUE)
year.lines <- which(year.start.pos > 0)
year.split <- strsplit(bibfile[year.lines], "[ =,]", perl=TRUE)
Pub.year <- as.numeric(sapply(year.split, "[[", 6))
tradPub <- as.data.frame(table(Pub.year))
names(tradPub) <- c("year", "traditional book")

## Get information about the online published PDF files
library(XML)
webPub <- readHTMLTable(readLines("http://cran.r-project.org/doc/contrib/"))[[1]]
## Look only at the PDF files
pdfs <- regexpr(".pdf", as.character(webPub[,2]))
webPub.modified <- webPub[pdfs > 0, 3]
webPub.modified <- strsplit(na.omit(as.character(webPub.modified)), "[ -]")
webPub.year <- as.numeric(sapply(webPub.modified, "[[", 3))
webPub <- as.data.frame(table(webPub.year))
names(webPub) <- c("year", "online PDF")

## Merge information about the online and traditional books
totalPub <- merge(tradPub, webPub, all=TRUE)
totalPub[is.na(totalPub)] <- 0

## Calculate the cumulative statistics
cumPub <- data.frame(year=totalPub$year, apply(totalPub[,2:3], 2, cumsum))
names(cumPub) <- c("year", "traditional book", "online PDF")

## We use the googleVis package to create area charts
library(googleVis)
cumPlot <- gvisAreaChart(cumPub, "year",
                         options=list(title="Number of R related books available",
                                      isStacked=TRUE))
incPlot <- gvisAreaChart(totalPub, "year",
                         options=list(title="Number of R related books published",
                                      isStacked=FALSE))
inccumPlot <- gvisMerge(incPlot, cumPlot)
plot(inccumPlot)