Scraping organism metadata for Treebase repositories from GOLD using Python and R

(This article was first published on What is this? David Springate's personal blog :: R, and kindly contributed to R-bloggers)

Scraping organism metadata for Treebase repositories from GOLD using Python and R

I recently wanted to get hold of habitat/phenotype/sequencing metadata for the individual organisms of an archived Treebase project.)

The GOLD database holds more than 18000 full genomes. For many of these it provides pretty good metadata (GOLDcards) which are indirectly linked to Treebase via NCBI taxa IDs.

Unfortunately GOLD does not seem to have any kind of API for systematic downloads, so I hacked together a very quick-and-dirty scraper in Python that reads in taxa from a Treebase repo, follows the links to each species NCBI page and downloads the linked GOLDcard, if it exists.

Here is the code. You will need the external BeautifulSoup and lxml libraries for this to work – both are fantastic. (The Treebase repo here is from Wu et al. 2009**, just change the url string for a different repo…):

Once you have downloaded all of the available files, It would be great to have your metadata in a nice flatfile with one line per taxa, right? I did this with a little R script using the rather wonderful readHTMLtable() function in the XML (install.packages('XML')) package.

The output is a semicolon separated file with taxa in the rows and the different categories of metadata in columns. The metadata is often fairly incomplete, and there are plenty of omissions, but hopefully it will become more useful as more deposits are made to GOLD.

** Wu D., Hugenholtz P., Mavromatis K., Pukall R., Dalin E., Ivanova N.N., Kunin V., Goodwin L., Wu M., Tindall B.J., Hooper S.D., Pati A., Lykidis A., Spring S., Anderson I.J., D’haeseleer P., Zemla A., Singer M., Lapidus A., Nolan M., Copeland A., Chen F., Cheng J., Lucas S., Kerfeld C., Lang E., Gronow S., Chain P., Bruce D., Rubin E.M., Kyrpides N.C., Klenk H., & Eisen J.A. 2009. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462(7276): 1056-1060.

To leave a comment for the author, please follow the link and comment on their blog: What is this? David Springate's personal blog :: R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)