Popular topics at the BioStar Q&A site

Posted on August 23, 2011 by nsaunders in R bloggers

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.

1. Fetch the tags
Fortunately, I enjoy sufficient privileges at BioStar to obtain a dump of the database. It contains a file named “Tags.xml”, with this simple structure:

<Tags>
  <row>
    <Id>3</Id>
    <Name>bed</Name>
    <Count>20</Count>
    <UserId>2</UserId>
    <CreationDate>2009-09-30T14:55:00.167</CreationDate>
  </row>
  ...
</Tags>

A hint for people who write XML parsing documentation. Most of us just want to get the values from between the tags. Just tell us how to do that. OK?

Thanks to this StackOverflow thread, I discovered the incredibly useful xmlToDataFrame() function in the R XML package:

library(XML)
tags <- xmlToDataFrame("Tags.xml")
head(tags)
#   Id       Name Count UserId            CreationDate
# 1  3        bed    20      2 2009-09-30T14:55:00.167
# 2  4        gff    12      2 2009-09-30T14:55:00.167
# 3  5     galaxy    11      2 2009-09-30T15:09:43.417
# 4  6      yeast     5      3 2009-09-30T16:09:06.723
# 5  7      motif    19      3  2009-09-30T16:09:06.74
# 6  8 microarray    96      2 2009-09-30T16:44:22.677
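For anyone who really does just want the values from between the tags, here is a minimal sketch using XPath queries from the same XML package. The inline document and the names xml_text, tag_names and tag_counts are my own illustration (the two rows are copied from the output above), not part of the BioStar dump.

```r
library(XML)

# A tiny inline document with the same layout as Tags.xml
# (illustrative only; values copied from the head(tags) output)
xml_text <- paste0(
  "<Tags>",
  "<row><Id>3</Id><Name>bed</Name><Count>20</Count></row>",
  "<row><Id>4</Id><Name>gff</Name><Count>12</Count></row>",
  "</Tags>"
)

# Parse from a string rather than a file
doc <- xmlParse(xml_text, asText = TRUE)

# Pull the text from between the tags, one XPath query per column
tag_names  <- xpathSApply(doc, "//row/Name", xmlValue)
tag_counts <- as.numeric(xpathSApply(doc, "//row/Count", xmlValue))

data.frame(Name = tag_names, Count = tag_counts)
```

This is more typing than xmlToDataFrame(), but it helps when you only want one or two columns from a large file.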

Too easy. However, class(tags$Count) is “character”, which is not quite what we want. So let’s convert it to numeric, then sort the data frame on Count in decreasing order:

tags$Count <- as.numeric(tags$Count)
tags <- tags[order(tags$Count, decreasing = TRUE), ]
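Why bother with the conversion? A small sketch on the six rows shown earlier, rebuilt by hand as a data frame I've called tags_small (my name, not from the dump). It also covers the case where the column arrives as a factor rather than character, which can happen depending on how the data frame was built.

```r
# First six rows of the tag table, copied from the head(tags) output
tags_small <- data.frame(
  Name  = c("bed", "gff", "galaxy", "yeast", "motif", "microarray"),
  Count = c("20", "12", "11", "5", "19", "96"),
  stringsAsFactors = FALSE
)

# Character counts sort as text, so "5" lands above "20"
sort(tags_small$Count, decreasing = TRUE)

# If the column were a factor, as.numeric() alone would return the
# internal level codes; going via as.character() recovers the counts
count_factor <- factor(tags_small$Count)
codes  <- as.numeric(count_factor)
counts <- as.numeric(as.character(count_factor))

# Convert, then sort the data frame on Count, largest first
tags_small$Count <- as.numeric(tags_small$Count)
tags_small <- tags_small[order(tags_small$Count, decreasing = TRUE), ]
tags_small
```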

2. For those who like a “top N” plot
Next, we’ll grab the top 20 tags by Count. To plot them in decreasing order, we need to reorder the levels of the tag Name by Count (with thanks, again, to a StackOverflow thread).

library(ggplot2)
tags.20 <- head(tags, 20)
tags.20 <- transform(tags.20, Name = reorder(Name, Count))
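# geom_col() and ggtitle() stand in for geom_bar() with a y aesthetic and opts(), which no longer work in current ggplot2
ggplot(tags.20) + geom_col(aes(Name, Count), fill = "coral") + coord_flip() +
  theme_bw() + ggtitle("Top 20 BioStar Tags")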
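What does reorder() actually do to Name? A quick base R sketch, again using the six rows from the earlier output in a hand-built data frame (tags_small is my own name for it). ggplot2 draws a discrete axis in factor level order, so changing the levels changes the bar order.

```r
# Six tags and their counts, copied from the head(tags) output
tags_small <- data.frame(
  Name  = c("bed", "gff", "galaxy", "yeast", "motif", "microarray"),
  Count = c(20, 12, 11, 5, 19, 96),
  stringsAsFactors = FALSE
)

# Default factor levels are alphabetical
alphabetical <- levels(factor(tags_small$Name))

# reorder() sorts the levels by Count instead, smallest first
by_count <- levels(reorder(tags_small$Name, tags_small$Count))

alphabetical
by_count
```

The levels come out smallest first, and coord_flip() draws the first level at the bottom, so the most-used tag ends up at the top of the chart.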

[Figure: Top 20 BioStar Tags]

3. For those who like word/tag clouds

Here, we look at tags which occur 10 or more times and display a maximum of 1000 tags in the cloud.

library(wordcloud)
library(RColorBrewer)

png(file = "tags.png", width = 1024, height = 1024)
wordcloud(tags$Name, tags$Count, scale = c(8, .2), min.freq = 10, max.words = 1000,
          random.order = FALSE, rot.per = .15, colors = brewer.pal(8, "Dark2"))
dev.off()
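To see roughly which tags make it into the cloud, here is a base R sketch that approximates the effect of min.freq and max.words. It is my own approximation of the filtering, not the wordcloud package's internal code, and it reuses the six rows from the earlier output (tags_small, min_freq and max_words are my names).

```r
# Six tags and their counts, copied from the head(tags) output
tags_small <- data.frame(
  Name  = c("bed", "gff", "galaxy", "yeast", "motif", "microarray"),
  Count = c(20, 12, 11, 5, 19, 96),
  stringsAsFactors = FALSE
)

min_freq  <- 10    # same threshold as min.freq in the call above
max_words <- 1000  # same cap as max.words in the call above

# Rank by Count, drop tags below the threshold, keep at most max_words
ranked <- tags_small[order(tags_small$Count, decreasing = TRUE), ]
kept   <- head(ranked[ranked$Count >= min_freq, ], max_words)

kept$Name
```

With these six rows, only yeast (5 uses) falls below the threshold.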

[Figure: BioStar tag cloud]

Conclusions? XML, ggplot2 and wordcloud are all great packages. And whilst so-called “next-generation-sequencing” might be all the rage, it’s good to see the old stalwarts of bioinformatics hanging in there: BLAST, alignment, phylogenetics, Python and Perl. It will be interesting to see how tags change over time.