GEO database: curation lagging behind submission?

August 30, 2010
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

geometadb

GSE and GDS records in GEOmetadb by date

I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.

Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?

library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = F)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = F)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date,count)) + geom_line(aes(color = type)))
dev.off()

Filed under: bioinformatics, R, statistics, web resources Tagged: bioconductor, database, geo, geometadb, ggplot2, microarray

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , , , , , ,

Comments are closed.

Sponsors

Mango solutions



RStudio homepage



Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training



http://www.eoda.de







ODSC

ODSC

CRC R books series





Six Sigma Online Training





Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)