Popular topics at the BioStar Q&A site

August 23, 2011

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.


1. Fetch the tags
Fortunately, I enjoy sufficient privileges at BioStar to obtain a dump of the database. It contains a file named “Tags.xml”, with this simple structure:

<Tags>
  <row>
    <Id>3</Id>
    <Name>bed</Name>
    <Count>20</Count>
    <UserId>2</UserId>
    <CreationDate>2009-09-30T14:55:00.167</CreationDate>
  </row>
  ...
</Tags>

A hint for people who write XML parsing documentation. Most of us just want to get the values from between the tags. Just tell us how to do that. OK?

Thanks to this StackOverflow thread, I discovered the incredibly useful xmlToDataFrame() function in the R XML package:

library(XML)
tags <- xmlToDataFrame("Tags.xml")
head(tags)
#   Id       Name Count UserId            CreationDate
# 1  3        bed    20      2 2009-09-30T14:55:00.167
# 2  4        gff    12      2 2009-09-30T14:55:00.167
# 3  5     galaxy    11      2 2009-09-30T15:09:43.417
# 4  6      yeast     5      3 2009-09-30T16:09:06.723
# 5  7      motif    19      3  2009-09-30T16:09:06.74
# 6  8 microarray    96      2 2009-09-30T16:44:22.677
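
For readers who only want one or two fields rather than the whole table, an XPath query is another way to get the values from between the tags. This is a minimal sketch, assuming the same row layout as Tags.xml; the inline XML string and the variable names are made up for illustration. Note that xmlValue() returns character strings, so numeric fields need converting.

```r
library(XML)

# Toy document with the same layout as Tags.xml (values copied from the output above)
xml_text <- "<Tags>
  <row><Id>3</Id><Name>bed</Name><Count>20</Count></row>
  <row><Id>4</Id><Name>gff</Name><Count>12</Count></row>
</Tags>"

# Parse the string into an in-memory XML document
doc <- xmlParse(xml_text, asText = TRUE)

# Pull the text between every <Name> and <Count> tag
tag_names  <- xpathSApply(doc, "//row/Name", xmlValue)
tag_counts <- as.numeric(xpathSApply(doc, "//row/Count", xmlValue))

print(data.frame(Name = tag_names, Count = tag_counts))
```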

Too easy. However, class(tags$Count) is “character”, which is not quite what we want. So let’s convert it to numeric, then sort the data frame on Count, in decreasing order:

tags$Count <- as.numeric(tags$Count)
tags <- tags[order(tags$Count, decreasing = TRUE), ]
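
The convert-then-sort step can be tried on a toy data frame. This is a sketch only: the toy values are illustrative, and the as.character() guard is an assumption for the case where the column arrives as a factor rather than as character (as it could under older R defaults).

```r
# Toy data frame standing in for the parsed tags (Count stored as text)
toy <- data.frame(Name  = c("bed", "gff", "microarray"),
                  Count = c("20", "12", "96"),
                  stringsAsFactors = FALSE)

# as.character() first is a harmless guard: if Count were a factor,
# as.numeric() alone would return the internal level codes instead
toy$Count <- as.numeric(as.character(toy$Count))

# order() gives the row permutation that sorts Count from high to low
toy <- toy[order(toy$Count, decreasing = TRUE), ]
print(toy)
```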

2. For those who like a “top N” plot
Next, we’ll grab the top 20 tags by Count. To plot them in decreasing order, we need to reorder the tag Name by Count. With thanks again to a StackOverflow thread.

library(ggplot2)
tags.20 <- head(tags, 20)
tags.20 <- transform(tags.20, Name = reorder(Name, Count))
# stat = "identity" uses Count as the bar height; ggtitle() sets the plot title
ggplot(tags.20) + geom_bar(aes(Name, Count), stat = "identity", fill = "coral") +
  coord_flip() + theme_bw() + ggtitle("Top 20 BioStar Tags")
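
To see what reorder() is doing to the Name column, here is a small sketch on toy data (the four tags and counts are taken from the head(tags) output earlier; the variable names are illustrative). By default factor levels are alphabetical; reorder() sorts them by the second argument instead.

```r
# Toy counts taken from the head(tags) output earlier in the post
toy <- data.frame(Name  = c("bed", "gff", "galaxy", "microarray"),
                  Count = c(20, 12, 11, 96))

# Default factor levels are alphabetical
alphabetical <- levels(factor(toy$Name))

# reorder() sorts the levels by increasing Count instead
by_count <- levels(reorder(toy$Name, toy$Count))

print(alphabetical)  # "bed" "galaxy" "gff" "microarray"
print(by_count)      # "galaxy" "gff" "bed" "microarray"
```

Because coord_flip() draws the last level at the top, the tag with the highest Count ends up as the top bar.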

Figure: Top 20 BioStar Tags

3. For those who like word/tag clouds

Here, we look at tags which occur 10 or more times and display a maximum of 1000 tags in the cloud.

library(wordcloud)
library(RColorBrewer)

png(file = "tags.png", width = 1024, height = 1024)
# min.freq drops rare tags; random.order = FALSE plots the most frequent tags first
wordcloud(tags$Name, tags$Count, scale = c(8, .2), min.freq = 10, max.words = 1000,
          random.order = FALSE, rot.per = .15, colors = brewer.pal(8, "Dark2"))
dev.off()

Figure: BioStar tag cloud
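
To get a rough idea of which tags the min.freq and max.words settings let through, the same filter can be approximated in base R. This is a sketch under assumptions, not wordcloud's internal code: the toy table reuses values from the head(tags) output, and min_freq and max_words are illustrative names.

```r
# Toy tag table; Name and Count values come from the head(tags) output above
toy <- data.frame(Name  = c("bed", "gff", "galaxy", "yeast", "motif", "microarray"),
                  Count = c(20, 12, 11, 5, 19, 96),
                  stringsAsFactors = FALSE)

min_freq  <- 10    # same threshold as min.freq in the call above
max_words <- 1000  # same cap as max.words

# Keep tags at or above the threshold, most frequent first, capped at max_words
eligible <- toy[toy$Count >= min_freq, ]
eligible <- head(eligible[order(eligible$Count, decreasing = TRUE), ], max_words)
print(eligible$Name)
```

Here “yeast”, with a Count of 5, is the only toy tag that falls below the threshold.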

Conclusions? XML, ggplot2 and wordcloud are all great packages. And whilst so-called “next-generation-sequencing” might be all the rage, it’s good to see the old stalwarts of bioinformatics hanging in there: BLAST, alignment, phylogenetics, Python and Perl. It will be interesting to see how tags change over time.


