Which topics are the most popular at the BioStar bioinformatics Q&A site?
One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.
OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.
1. Fetch the tags
Fortunately, I enjoy sufficient privileges at BioStar to obtain a dump of the database. It contains a file named “Tags.xml”, with this simple structure:
<Tags>
  <row>
    <Id>3</Id>
    <Name>bed</Name>
    <Count>20</Count>
    <UserId>2</UserId>
    <CreationDate>2009-09-30T14:55:00.167</CreationDate>
  </row>
  ...
</Tags>
A hint for people who write XML parsing documentation. Most of us just want to get the values from between the tags. Just tell us how to do that. OK?
Thanks to this StackOverflow thread, I discovered the incredibly-useful xmlToDataFrame() function in the R XML package:
library(XML)
tags <- xmlToDataFrame("Tags.xml")
head(tags)
#   Id       Name Count UserId            CreationDate
# 1  3        bed    20      2 2009-09-30T14:55:00.167
# 2  4        gff    12      2 2009-09-30T14:55:00.167
# 3  5     galaxy    11      2 2009-09-30T15:09:43.417
# 4  6      yeast     5      3 2009-09-30T16:09:06.723
# 5  7      motif    19      3  2009-09-30T16:09:06.74
# 6  8 microarray    96      2 2009-09-30T16:44:22.677
Too easy. However, class(tags$Count) = “character”, which is not quite what we want. So let’s change that to numeric, then sort the data frame on Count, decreasing:
tags$Count <- as.numeric(tags$Count)
tags <- tags[order(tags$Count, decreasing = TRUE), ]
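If you don't have the database dump, the parse, convert and sort steps can be tried on a tiny inline document. This is only a sketch: the three rows are copied from the head(tags) output above, and the names xml_text and demo are made up for the illustration. It assumes xmlToDataFrame() accepts a parsed document as well as a file name, which is what the XML package documentation describes.

```r
library(XML)

# Three rows copied from the head(tags) output, inlined as a string
# (xml_text and demo are illustrative names, not part of the original analysis)
xml_text <- '<Tags>
  <row><Id>3</Id><Name>bed</Name><Count>20</Count></row>
  <row><Id>4</Id><Name>gff</Name><Count>12</Count></row>
  <row><Id>8</Id><Name>microarray</Name><Count>96</Count></row>
</Tags>'

# Parse the string, then flatten each <row> into one data frame row
doc  <- xmlParse(xml_text, asText = TRUE)
demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column arrives as text, so convert Count before sorting on it
demo$Count <- as.numeric(demo$Count)
demo <- demo[order(demo$Count, decreasing = TRUE), ]

demo$Name
```

Without the as.numeric() step the sort would be alphabetical on the text, so "96" would still come first here but "5" would outrank "20" in the full data.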
2. For those who like a “top N” plot
Next, we’ll grab the top 20 tags by Count. To plot them in decreasing order, we need to reorder the tag Name by Count. With thanks again to a StackOverflow thread.
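library(ggplot2)
tags.20 <- head(tags, 20)
# reorder() sets the factor levels of Name by Count, so bars are drawn in sorted order
tags.20 <- transform(tags.20, Name = reorder(Name, Count))
# stat = "identity" plots Count as given; ggtitle() replaces opts(title = ...),
# which was removed from later ggplot2 releases
ggplot(tags.20) +
  geom_bar(aes(Name, Count), stat = "identity", fill = "coral") +
  coord_flip() +
  theme_bw() +
  ggtitle("Top 20 BioStar Tags")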
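Why the reorder() step matters: ggplot2 draws a discrete axis in factor-level order, and by default the levels are alphabetical. A small base R sketch shows what reorder() does to the levels. The data frame demo is a made-up stand-in for tags.20, using three counts from the output earlier in the post.

```r
# Toy stand-in for tags.20 (demo is an illustrative name)
demo <- data.frame(
  Name  = c("bed", "gff", "microarray"),
  Count = c(20, 12, 96),
  stringsAsFactors = FALSE
)

# Default factor levels are alphabetical
alphabetical <- levels(factor(demo$Name))

# reorder() sorts the levels by the accompanying score, ascending
demo$Name <- reorder(demo$Name, demo$Count)
by_count  <- levels(demo$Name)

alphabetical
by_count
```

The levels come out in ascending order of Count. With coord_flip() the first level sits at the bottom of the plot, so the most-used tag ends up at the top.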
Click image, right, for full-size version.
3. For those who like word/tag clouds
Here, we look at tags which occur 10 or more times and display a maximum of 1000 tags in the cloud. Again, click image for the full-size version.
library(wordcloud)
library(RColorBrewer)
png(filename = "tags.png", width = 1024, height = 1024)
wordcloud(tags$Name, tags$Count, scale = c(8, .2), min.freq = 10,
          max.words = 1000, random.order = FALSE, rot.per = .15,
          colors = brewer.pal(8, "Dark2"))
dev.off()
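To see roughly which tags survive the min.freq and max.words settings, the same selection can be approximated in base R. This is a sketch of the idea, not wordcloud's internal code: the toy counts come from the head(tags) output earlier, the names demo, min_freq, max_words and kept are invented for the example, and max_words is set to 3 rather than 1000 so that the cap has a visible effect.

```r
# Toy data: counts taken from the head(tags) output earlier in the post
demo <- data.frame(
  Name  = c("bed", "gff", "galaxy", "yeast", "motif", "microarray"),
  Count = c(20, 12, 11, 5, 19, 96),
  stringsAsFactors = FALSE
)

min_freq  <- 10  # same threshold as min.freq in the wordcloud() call
max_words <- 3   # deliberately small here; the post uses 1000

kept <- demo[demo$Count >= min_freq, ]                # drop tags used fewer than 10 times
kept <- kept[order(kept$Count, decreasing = TRUE), ]  # most frequent first
kept <- head(kept, max_words)                         # cap the number of tags shown

kept$Name
```

Running nrow(tags[tags$Count >= 10, ]) on the real data is a quick way to check whether the 1000-word cap comes into play at all.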
Conclusions? XML, ggplot2 and wordcloud are all great packages. And whilst so-called “next-generation-sequencing” might be all the rage, it’s good to see the old stalwarts of bioinformatics hanging in there: BLAST, alignment, phylogenetics, Python and Perl. It will be interesting to see how tags change over time.