finch has just been released to CRAN (binaries should be up soon).
finch is a package to parse Darwin Core files. Darwin Core is:
a body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information. … The Simple Darwin Core [SIMPLEDWC] is a specification for one particular way to use the terms – to share data about taxa and their occurrences in a simply structured way – and is probably what is meant if someone suggests to "format your data according to the Darwin Core".
I'll call it DwC for short from here on.
GBIF (Global Biodiversity Information Facility) is the biggest holder of biodiversity data. When you request
data in bulk format from GBIF, they give it to you in what's called a Darwin Core Archive, or
DwC-A. GBIF has a validator for DwC-A files as well: http://tools.gbif.org/dwca-validator/
One of our most used packages is probably
rgbif, a client to interact with GBIF's web services.
There's a series of functions in
rgbif to request data in bulk format (see functions starting
occ_download), and from this you get a DwC-A file. This is where
finch comes in:
it can parse these DwC-A files into something useable inside R.
install.packages("finch")
# or from source if binary not available yet
install.packages("finch", type = "source")
To parse a simple Darwin Core file like
urn:catalog:YPM:VP.057488 PhysicalObject 2009-02-12T12:43:31 en FossilSpecimen YPM VP VP.057488 1 North America United States US Montana Garfield Tyrannosourus rex Tyrannosourus rex Creataceous Creataceous Late Cretaceous Late Cretaceous
This file ships with the package as an example. Get the path to the file, then read it with simple_read():
file <- system.file("examples", "example_simple_fossil.xml", package = "finch")
out <- simple_read(file)
out$dc
#> [[1]]
#> [[1]]$type
#> [1] "PhysicalObject"
#>
#>
#> [[2]]
#> [[2]]$modified
#> [1] "2009-02-12T12:43:31"
#>
#>
#> [[3]]
#> [[3]]$language
#> [1] "en"
Parse Darwin Core Archive
To parse a Darwin Core Archive, such as one you can get from GBIF, use dwca_read().
dwca_read() can parse a DwC-A given as a directory, a zipped file, or a URL.
There's an example Darwin Core Archive:
file <- system.file("examples", "0000154-150116162929234.zip", package = "finch")
(out <- dwca_read(file, read = TRUE))
#>
#> Package ID: 6cfaaf9c-d518-4ca3-8dc5-f5aadddc0390
#> No. data sources: 10
#> No. datasets: 3
#> Dataset occurrence.txt: [225 X 443]
#> Dataset multimedia.txt: [15 X 1]
#> Dataset verbatim.txt: [209 X 443]
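As noted above, dwca_read() also takes a directory or a URL. Here is a minimal sketch of those two forms. It assumes finch is installed and that a directory made by unzipping the archive is a valid input. The temporary directory name and the URL are hypothetical placeholders, so the URL call is left commented out.

```r
library(finch)

# Path to the example archive bundled with finch (same file as above).
zip <- system.file("examples", "0000154-150116162929234.zip", package = "finch")

# Directory input: unzip a copy into a temporary directory (name is an
# arbitrary choice for this sketch) and point dwca_read() at the directory.
dwca_dir <- file.path(tempdir(), "dwca-example")
utils::unzip(zip, exdir = dwca_dir)
from_dir <- dwca_read(dwca_dir, read = TRUE)

# URL input: placeholder address, replace with a real DwC-A URL.
# from_url <- dwca_read("https://example.org/path/to/archive.zip", read = TRUE)
```

If you already have an unzipped archive on disk, the directory form saves unzipping it again.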
List files in the archive
out$files
#> $xml_files
#> [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/meta.xml"
#> [2] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/metadata.xml"
#>
#> $txt_files
#> [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/citations.txt"
#> [2] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/multimedia.txt"
#> [3] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/occurrence.txt"
#> [4] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/rights.txt"
#> [5] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/finch/examples/0000154-150116162929234/verbatim.txt"
#> ...
High-level metadata for the whole archive (printing a subset for brevity)
#> GBIF Occurrence Download 0000154-150116162929234
#> GBIF Download Service
#> GBIF Download Service
#> OZCAM (Online Zoological Collections of Australian Museums) Provider
#> http://www.ozcam.org.au/
#> CONTENT_PROVIDER
#> ...
High-level metadata for each data file. There are many files, so we'll look at just one
hm <- out$highmeta
head(hm$occurrence.txt)
#> index term delimitedBy
#> 1 0 http://rs.gbif.org/terms/1.0/gbifID
#> 2 1 http://purl.org/dc/terms/abstract
#> 3 2 http://purl.org/dc/terms/accessRights
#> 4 3 http://purl.org/dc/terms/accrualMethod
#> 5 4 http://purl.org/dc/terms/accrualPeriodicity
#> 6 5 http://purl.org/dc/terms/accrualPolicy
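The term column holds full URIs, and the last path segment is the short term name. Here is a small base R sketch that splits a few of these URIs (copied from the output above) into namespace and short name. The variable names are just for illustration and are not part of finch.

```r
# A few term URIs copied from the hm$occurrence.txt output above.
terms <- c(
  "http://rs.gbif.org/terms/1.0/gbifID",
  "http://purl.org/dc/terms/abstract",
  "http://purl.org/dc/terms/accessRights"
)

# basename() keeps everything after the last "/", i.e. the short term name.
short <- basename(terms)
short
#> [1] "gbifID"       "abstract"     "accessRights"

# Drop the last path segment to get the vocabulary (namespace) of each term.
namespace <- sub("/[^/]*$", "", terms)

# Count how many terms come from each vocabulary.
table(namespace)
```

This can be handy for separating, say, Dublin Core terms from GBIF-specific ones when an archive has several hundred columns.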
You can get the same metadata as above for each dataset that went into the tabular dataset downloaded
View one of the datasets, brief overview.
head(out$data[[1]][, 1:5])
#> gbifID abstract accessRights accrualMethod accrualPeriodicity
#> 1 50280003 NA NA NA
#> 2 477550574 NA NA NA
#> 3 239703844 NA NA NA
#> 4 239703843 NA NA NA
#> 5 239703833 NA NA NA
#> 6 477550692 NA NA NA
names(out$data[[1]])[1:20]
#>  [1] "gbifID" "abstract"
#>  [3] "accessRights" "accrualMethod"
#>  [5] "accrualPeriodicity" "accrualPolicy"
#>  [7] "alternative" "audience"
#>  [9] "available" "bibliographicCitation"
#> [11] "conformsTo" "contributor"
#> [13] "coverage" "created"
#> [15] "creator" "date"
#> [17] "dateAccepted" "dateCopyrighted"
#> [19] "dateSubmitted" "description"
DwC-A files can be very large, and that is sure to be a pain point for some users.
We'll continue to test and refine on big data files.
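In the meantime, one possible workaround is to read only the columns you need straight from a data file, using a path like those listed in out$files$txt_files. This is not a finch feature, just a base R sketch. The temporary file is a tiny stand-in for a real occurrence.txt, and the choice of columns to keep is arbitrary.

```r
# Stand-in for a large occurrence.txt: a small tab-delimited file with a few
# of the columns shown earlier. With a real archive, `path` would be one of
# the entries in out$files$txt_files.
path <- tempfile(fileext = ".txt")
toy <- data.frame(
  gbifID = c(50280003, 477550574, 239703844),
  abstract = NA,
  accessRights = NA,
  accrualMethod = NA
)
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE, na = "")

# Read the header plus one row to learn the column names cheaply.
cols <- names(read.delim(path, nrows = 1, check.names = FALSE))

# Keep only the columns of interest; a colClasses entry of "NULL" tells
# read.delim() to skip that column entirely.
keep <- c("gbifID", "accessRights")
classes <- ifelse(cols %in% keep, "character", "NULL")
occ <- read.delim(path, colClasses = classes, quote = "", check.names = FALSE)

names(occ)
#> [1] "gbifID"       "accessRights"
```

Reading identifiers such as gbifID as character avoids any numeric formatting surprises with long IDs.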
We'd love to know what people think about this package.
Documentation can be better, e.g., there's no vignette yet (but adding