Just how many (bad) -omics are there anyway? Let’s find out.
1. Get the raw data
It would be nice if we could search PubMed for titles containing all -omics:
*omics[TITL]
However, we cannot since leading wildcards don’t work in PubMed search. So let’s just grab all articles from 2013:
2013[PDAT]
and save them in a format which includes titles. I went with “Send to…File”, “Format…CSV”, which returns 575 068 records in pubmed_result.csv, around 227 MB in size.
2. Extract the -omics
Titles are in column 1 and we only want the -omics, so:
cut -f1 -d "," pubmed_result.csv | grep -i omics > omics.txt
wc -l omics.txt
# 1770 omics.txt
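One caveat with the cut approach: cut splits on every comma and knows nothing about CSV quoting, so a title that itself contains a comma is cut off at the first one (assuming the export wraps such titles in quotes). Here is a minimal sketch of the same filter done in R with a proper CSV parser. The rows and the column names are made up for illustration and stand in for pubmed_result.csv.

```r
# Hypothetical rows standing in for pubmed_result.csv; the real export
# has more columns, and its header names are not assumed here.
csv_text <- c(
  'Title,URL',
  '"Genomics, proteomics and beyond",/pubmed/1',
  '"A study of mice",/pubmed/2',
  '"Advances in Metabolomics",/pubmed/3'
)

# read.csv respects quoted fields, so the comma inside a title is kept
records <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# titles are in column 1, as in the cut command
titles <- records[[1]]

# case-insensitive filter, equivalent to grep -i omics
omics_titles <- titles[grepl("omics", titles, ignore.case = TRUE)]
print(omics_titles)
```

For the real file, replace the text argument with the file name and write the result out with writeLines(omics_titles, "omics.txt").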
3. Clean, rinse, repeat…
We want just a list of -omics words. Time to break out the R. After much trial and error, I ended up with this. Ugly and far from optimized, but it (mostly) works. I say mostly, because I know of at least one case which is not detected: stain-omics.
library(stringr)
omics <- readLines("omics.txt")
omics <- strsplit(omics, " ") # split titles on space
omics <- unlist(omics) # convert to vector of words
omics <- omics[grep("omics", omics)] # just the -omics words
omics <- gsub("[\"\'\\.:\\?\\[\\]]", "", omics, perl = T) # remove symbols, punctuation
omics <- tolower(omics)
m <- data.frame(a = omics, b = str_match(omics, "^(.*?omics)-")[, 2]) # matches e.g. "genomics-based"
omics <- ifelse(is.na(m$b), as.character(m$a), as.character(m$b))
m <- data.frame(a = omics, b = str_match(omics, "-{1,}(.*?omics)$")[, 2]) # matches e.g. "phospho-proteomics"
omics <- ifelse(is.na(m$b), as.character(m$a), as.character(m$b))
omics <- unlist(strsplit(omics, "\\/")) # split e.g. "genomics/proteomics"
omics <- omics[grep("omics", omics)] # just the -omics words again
# OK we're down to the edge cases now :)
omics <- gsub("applications", "", omics)
omics <- gsub("\\(meta\\)", "meta", omics)
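To see what the two str_match steps and the split actually do, here is a small base R sketch that applies the same patterns to a handful of made-up words. The helpers first_group and keep_match are hypothetical names introduced for this example, not part of stringr. It also suggests why stain-omics slips through: the second pattern appears to keep only the part after the hyphen.

```r
# Return the first capture group of `pattern` for each element of x,
# or NA where there is no match (a base R stand-in for str_match(...)[, 2])
first_group <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Keep the captured group where the pattern matched, else the original word
keep_match <- function(x, pattern) {
  hit <- first_group(x, pattern)
  ifelse(is.na(hit), x, hit)
}

words <- c("genomics-based", "phospho-proteomics", "stain-omics", "genomics/proteomics")

step1 <- keep_match(words, "^(.*?omics)-")       # drop a trailing "-suffix"
step2 <- keep_match(step1, "-{1,}(.*?omics)$")   # drop a leading "prefix-"
step3 <- unlist(strsplit(step2, "/", fixed = TRUE))  # split "a/b" pairs

print(step3)
```

In this sketch stain-omics ends up as plain "omics", so the stem is lost rather than the word being dropped entirely.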
4. Visualize
The top 20 -omics in 2013 and the less popular:
omics.freq <- as.data.frame(table(omics))
omics.freq <- omics.freq[ order(omics.freq$Freq, decreasing = T),]
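library(ggplot2)
# the "+" must end the line, otherwise R treats the first line as a complete expression
ggplot(head(omics.freq, 20)) + geom_bar(aes(omics, Freq), stat = "identity", fill = "darkblue") +
  coord_flip() + theme_bw()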
# and the less popular
subset(omics.freq, Freq == 1)
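One thing to be aware of: ggplot2 draws bars in the order of the factor levels, which is alphabetical here, not in the row order of the sorted data frame. If you want the bars sorted by frequency, one option is reorder(). A minimal sketch on a made-up vector of -omics words (the plot itself is only built if ggplot2 is installed):

```r
# Made-up input standing in for the cleaned omics vector
omics <- c("proteomics", "genomics", "proteomics",
           "metabolomics", "proteomics", "genomics")

omics.freq <- as.data.frame(table(omics))
omics.freq <- omics.freq[order(omics.freq$Freq, decreasing = TRUE), ]

# Re-level the factor by frequency (ascending), so that after
# coord_flip() the most frequent term ends up at the top
omics.freq$omics <- reorder(omics.freq$omics, omics.freq$Freq)
print(levels(omics.freq$omics))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(head(omics.freq, 20)) +
    ggplot2::geom_bar(ggplot2::aes(omics, Freq), stat = "identity", fill = "darkblue") +
    ggplot2::coord_flip() +
    ggplot2::theme_bw()
}
```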
The plot shows the top 20. Top of the list so far for 2013 is proteomics, followed by genomics and metabolomics.
Listed below, those -omics found only once in titles from 2013. Some shockers, I think you’ll agree (paging Jonathan Eisen).
omics Freq
aquaphotomics 1
biointeractomics 1
calciomics 1
cholanomics 1
cytogenomics 1
cytokinomics 1
econogenomics 1
glcnacomics 1
glycosaminoglycanomics 1
interactomics 1
ionomics 1
macroeconomics 1
materiomics 1
metalloproteomics 1
metaomics 1
metaproteogenomics 1
microbiomics 1
microeconomics 1
microgenomics 1
microproteomics 1
miromics 1
mitoproteomics 1
mobilomics 1
morphomics 1
museomics 1
neuromics 1
neuropeptidomics 1
nitroproteomics 1
nutrimetabonomics 1
oncogenomics 1
orthoproteomics 1
pangenomics 1
petroleomics 1
pharmacometabolomics 1
pharmacoproteomics 1
phylotranscriptomics 1
phytomics 1
postgenomics 1
pyteomics 1
radiogenomics 1
rehabilomics 1
retrophylogenomics 1
romics 1
secretomics 1
sensomics 1
speleogenomics 1
surfaceomics 1
surfomics 1
toxicometabolomics 1
vaccinomics 1
variomics 1
Never heard of romics? That’s OK. It’s a surname.
