I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.
Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?
library(GEOmetadb) library(ggplot2) # update database if required using getSQLiteFile() # connect to database; assumed to be in user $HOME con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite") # fetch "last updated" dates for GDS and GSE gds <- dbGetQuery(con, "select update_date from gds") gse <- dbGetQuery(con, "select last_update_date from gse") # cumulative sums by date; no factor variables gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = F) gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = F) # make GDS and GSE data frames comparable colnames(gds.count) <- "count" colnames(gse.count) <- "count" # row names (dates) to real dates gds.count$date <- as.POSIXct(rownames(gds.count)) gse.count$date <- as.POSIXct(rownames(gse.count)) # add type for plotting gds.count$type <- "gds" gse.count$type <- "gse" # combine GDS and GSE data frames gds.gse <- rbind(gds.count, gse.count) # and plot records over time by type png(filename = "geometadb.png", width = 800, height = 600) print(ggplot(gds.gse, aes(date,count)) + geom_line(aes(color = type))) dev.off()
Filed under: bioinformatics, R, statistics, web resources Tagged: bioconductor, database, geo, geometadb, ggplot2, microarray