Scholarly metadata from R

September 17, 2012
By

(This article was first published on Recology - R, and kindly contributed to R-bloggers)

Metadata! Metadata is very cool. It's super hot right now - everybody is talking about it. Okay, maybe not everyone, but it's an important part of archiving scholarly work.

We are working on a repo on GitHub rmetadata to be a one stop shop for querying metadata from around the web. Various repos on GitHub we have started - rpmc, rdatacite, rdryad, rpensoft, rhindawi - will at least in part be folded into rmetadata.

As a start we are writing functions to hit any metadata services that use the OAI-PMH: "Open Archives Initiative Protocol for Metadata Harvesting" framework. OAI-PMH has six methods (or verbs as they are called) for data harvesting that are the same across different metadata providers:

  • GetRecord
  • Identify
  • ListIdentifiers
  • ListMetadataFormats
  • ListRecords
  • ListSets

OAI-PMH provides an updating list of data providers, which we can easily use to get the base URLs for their data. Then we just use one of the six above methods to query their metadata.

Let's install rmetadata first.

install_github("rmetadata", "ropensci")
library(rmetadata)

The most basic thing you can do with OAI-PMH is identify the data provider, getting their basic information. The Identify verb.

# one provider
md_identify(provider = "datacite")
  repositoryName                     baseURL protocolVersion
1   DataCite MDS http://oai.datacite.org/oai             2.0
          adminEmail    earliestDatestamp deletedRecord
1 [email protected] 2011-01-01T00:00:00Z            no
           granularity compression compression.1
1 YYYY-MM-DDThh:mm:ssZ        gzip       deflate
                                                                                                                                                      description
1 oai, oai.datacite.org, :, oai:oai.datacite.org:12425, http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd
# many providers
md_identify(provider = c("datacite", "pensoft"))
      repositoryName                     baseURL protocolVersion
1       DataCite MDS http://oai.datacite.org/oai             2.0
2 Pensoft Publishers       http://oai.pensoft.eu             2.0
          adminEmail    earliestDatestamp deletedRecord
1 [email protected] 2011-01-01T00:00:00Z            no
2               NULL           2008-07-04            no
           granularity compression compression.1
1 YYYY-MM-DDThh:mm:ssZ        gzip       deflate
2           YYYY-MM-DD        NULL          NULL
                                                                                                                                                      description
1 oai, oai.datacite.org, :, oai:oai.datacite.org:12425, http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd
2                                                                                                                                                            NULL
# no match for one, two matches for other
md_identify(provider = c("harvard", "journal"))
$harvard
               x
1 no match found

$journal
                                               repo_name
1       Hrcak - Portal of scientific journals of Croatia
2 International journal of Power Electronics Engineering
# let's pick one from the second
md_identify(provider = "Hrcak")
                                    repositoryName
1 Hrcak - Portal of scientific journals of Croatia
                    baseURL protocolVersion    adminEmail
1 http://hrcak.srce.hr/oai/             2.0 [email protected]
  earliestDatestamp deletedRecord granularity
1        2005-12-01            no  YYYY-MM-DD
                                                                                                                                                                          description
1 oai, hrcak.srce.hr, :, oai:hrcak.srce.hr:anIdentifier, http://www.openarchives.org/OAI/2.0/oai-identifier                    http://www.openarchives.org/OAI/2.0/oai-identifier.xsd

There are a variety of metadata formats, depending on the data provider - list them with the ListMetadataFormats verb.

# List metadata formats for a provider
md_listmetadataformats(provider = "dryad")
  metadataPrefix
1         oai_dc
2            rdf
3            ore
4           mets
                                                       schema
1              http://www.openarchives.org/OAI/2.0/oai_dc.xsd
2                 http://www.openarchives.org/OAI/2.0/rdf.xsd
3 http://tweety.lanl.gov/public/schemas/2008-06/atom-tron.sch
4                  http://www.loc.gov/standards/mets/mets.xsd
                            metadataNamespace
1 http://www.openarchives.org/OAI/2.0/oai_dc/
2    http://www.openarchives.org/OAI/2.0/rdf/
3                 http://www.w3.org/2005/Atom
4                    http://www.loc.gov/METS/
# List metadata formats for a specific identifier for a provider
md_listmetadataformats(provider = "pensoft", identifier = "10.3897/zookeys.1.10")
            identifier metadataPrefix
1 10.3897/zookeys.1.10         oai_dc
2 10.3897/zookeys.1.10           mods
                                             schema
1    http://www.openarchives.org/OAI/2.0/oai_dc.xsd
2 http://www.loc.gov/standards/mods/v3/mods-3-1.xsd
                            metadataNamespace
1 http://www.openarchives.org/OAI/2.0/oai_dc/
2                  http://www.loc.gov/mods/v3

The ListRecords verb is used to harvest records from a repository

head(md_listrecords(provider = "datacite")[[1]][, 2:4])
                  identifier            datestamp setSpec
1 oai:oai.datacite.org:32153 2011-06-08T08:57:11Z     TIB
2 oai:oai.datacite.org:32200 2011-06-20T08:11:08Z     TIB
3 oai:oai.datacite.org:32220 2011-06-28T14:11:08Z     TIB
4 oai:oai.datacite.org:32241 2011-06-30T13:24:45Z     TIB
5 oai:oai.datacite.org:32255 2011-07-01T12:09:24Z     TIB
6 oai:oai.datacite.org:32282 2011-07-05T09:08:10Z     TIB

ListIdentifiers is an abbreviated form of ListRecords, retrieving only headers rather than records.

# Single provider
md_listidentifiers(provider = "datacite", set = "REFQUALITY")[[1]][1:10]
 [1] "oai:oai.datacite.org:32426" "oai:oai.datacite.org:32152"
 [3] "oai:oai.datacite.org:25453" "oai:oai.datacite.org:25452"
 [5] "oai:oai.datacite.org:25451" "oai:oai.datacite.org:25450"
 [7] "oai:oai.datacite.org:25449" "oai:oai.datacite.org:25407"
 [9] "oai:oai.datacite.org:48328" "oai:oai.datacite.org:48439"
md_listidentifiers(provider = "dryad", from = "2012-07-15")[[1]][1:10]
 [1] "oai:datadryad.org:10255/dryad.9106" 
 [2] "oai:datadryad.org:10255/dryad.33780"
 [3] "oai:datadryad.org:10255/dryad.33901"
 [4] "oai:datadryad.org:10255/dryad.33902"
 [5] "oai:datadryad.org:10255/dryad.34472"
 [6] "oai:datadryad.org:10255/dryad.34558"
 [7] "oai:datadryad.org:10255/dryad.39975"
 [8] "oai:datadryad.org:10255/dryad.35065"
 [9] "oai:datadryad.org:10255/dryad.35081"
[10] "oai:datadryad.org:10255/dryad.35082"
# Many providers
out <- md_listidentifiers(provider = c("datacite", "pensoft"), from = "2012-08-21")
llply(out, function(x) x[1:10])  # display just a few of them
[[1]]
 [1] "oai:oai.datacite.org:1099317" "oai:oai.datacite.org:1099572"
 [3] "oai:oai.datacite.org:1099824" "oai:oai.datacite.org:1099695"
 [5] "oai:oai.datacite.org:1088239" "oai:oai.datacite.org:1088122"
 [7] "oai:oai.datacite.org:1088190" "oai:oai.datacite.org:1175749"
 [9] "oai:oai.datacite.org:1175288" "oai:oai.datacite.org:1115603"

[[2]]
 [1] "10.3897/phytokeys.16.2884" "10.3897/phytokeys.16.3602"
 [3] "10.3897/phytokeys.16.3186" "10.3897/zookeys.216.3407" 
 [5] "10.3897/zookeys.216.3332"  "10.3897/zookeys.216.3224" 
 [7] "10.3897/zookeys.216.3769"  "10.3897/zookeys.216.3360" 
 [9] "10.3897/zookeys.216.3646"  "10.3897/neobiota.14.3140" 

With ListSets you can retrieve the set structure of a repository.

# arXiv, returns a data.frame
head(md_listsets(provider = "arXiv")[[1]])
             setName          setSpec
1   Computer Science               cs
2        Mathematics             math
3 Nonlinear Sciences             nlin
4            Physics          physics
5       Astrophysics physics:astro-ph
6   Condensed Matter physics:cond-mat
# many providers, returns a list
md_listsets(provider = c("pensoft", "arXiv"))
[[1]]
                                  setName            setSpec
1                                 ZooKeys            zookeys
2                                 BioRisk            biorisk
3                               PhytoKeys          phytokeys
4                                NeoBiota           neobiota
5         Journal of Hymenoptera Research                jhr
6  International Journal of Myriapodology                ijm
7                Comparative Cytogenetics        compcytogen
8                    Subterranean Biology           subtbiol
9                     Nature Conservation natureconservation
10                               MycoKeys           mycokeys

[[2]]
                                    setName          setSpec
1                          Computer Science               cs
2                               Mathematics             math
3                        Nonlinear Sciences             nlin
4                                   Physics          physics
5                              Astrophysics physics:astro-ph
6                          Condensed Matter physics:cond-mat
7  General Relativity and Quantum Cosmology    physics:gr-qc
8          High Energy Physics - Experiment   physics:hep-ex
9             High Energy Physics - Lattice  physics:hep-lat
10      High Energy Physics - Phenomenology   physics:hep-ph
11             High Energy Physics - Theory   physics:hep-th
12                     Mathematical Physics  physics:math-ph
13                       Nuclear Experiment  physics:nucl-ex
14                           Nuclear Theory  physics:nucl-th
15                          Physics (Other)  physics:physics
16                          Quantum Physics physics:quant-ph
17                     Quantitative Biology            q-bio
18                     Quantitative Finance            q-fin
19                               Statistics             stat

Retrieve an individual metadata record from a repository using the GetRecord verb.

# Single provider, one identifier
md_getrecord(provider = "pensoft", identifier = "10.3897/zookeys.1.10")
            identifier  datestamp
1 10.3897/zookeys.1.10 2008-07-04
                                                                                             dc.title
1 A new candidate for a Gondwanaland distribution in the Zodariidae (Araneae): Australutica in Africa
   dc.creator  dc.subject dc.subject.1 dc.subject.2 dc.subject.3
1 Jocqué,Rudy new species Gondwanaland Soutpansberg      Araneae
         dc.source
1 ZooKeys 1: 59-66
                                                                                                                                                                                                                                                                                                                     dc.description
1 Two new species of Australutica Jocqué, 1995, a genus formerly only known from Australia, are described from South Africa: A. africana n. sp. from Soutpansberg and A. normanlarseni n. sp. from the Cape Peninsula. The taxonomic position of the new species is discussed and a key to the species of Australutica is provided.
        dc.publisher dc.date          dc.type dc.format
1 Pensoft Publishers    2008 Research Article text/html
                           dc.identifier
1 http://dx.doi.org/10.3897/zookeys.1.10
                                      dc.identifier.1 dc.language
1 http://www.pensoft.net/journals/zookeys/article/10/          en
                                                                                   dc..attrs
1 http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd
# Single provider, multiple identifiers
md_getrecord(provider = "pensoft", identifier = c("10.3897/zookeys.1.10", "10.3897/zookeys.4.57"))
            identifier  datestamp
1 10.3897/zookeys.1.10 2008-07-04
2 10.3897/zookeys.4.57 2008-12-17
                                                                                                dc.title
1    A new candidate for a Gondwanaland distribution in the Zodariidae (Araneae): Australutica in Africa
2 Studies of Tiger Beetles. CLXXVIII. A new Lophyra (Lophyra) from Somaliland (Coleoptera, Cicindelidae)
     dc.creator    dc.subject dc.subject.1 dc.subject.2 dc.subject.3
1   Jocqué,Rudy   new species Gondwanaland Soutpansberg      Araneae
2 Cassola,Fabio Tiger Beetles Cicindelidae      Lophyra   Somaliland
         dc.source
1 ZooKeys 1: 59-66
2 ZooKeys 4: 65-69
                                                                                                                                                                                                                                                                                                                                                                                                                   dc.description
1                                                                                               Two new species of Australutica Jocqué, 1995, a genus formerly only known from Australia, are described from South Africa: A. africana n. sp. from Soutpansberg and A. normanlarseni n. sp. from the Cape Peninsula. The taxonomic position of the new species is discussed and a key to the species of Australutica is provided.
2 A new tiger beetle species, Lophyra (Lophyra) praetermissa n. sp. (Coleoptera, Cicindelidae), obviously a close relative of L. (L.) histrio (Tschitschérine, 1903), is described from the environs of Erigavo, Somaliland (northern Somalia). Its discovery thus brings up to 73 the number of the species of this genus presently known worldwide (39 species of which - 29 from Africa - belong to the typonominal subgenus).
        dc.publisher dc.date          dc.type dc.format
1 Pensoft Publishers    2008 Research Article text/html
2 Pensoft Publishers    2008 Research Article text/html
                           dc.identifier
1 http://dx.doi.org/10.3897/zookeys.1.10
2 http://dx.doi.org/10.3897/zookeys.4.57
                                      dc.identifier.1 dc.language
1 http://www.pensoft.net/journals/zookeys/article/10/          en
2 http://www.pensoft.net/journals/zookeys/article/57/          en
                                                                                   dc..attrs
1 http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd
2 http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd

Cool, so I hope people find this post and package useful. Let me know what you think in comments below, or if you have code specific comments or additions, go to the GitHub repo for rmetadata. In a upcoming post I will show an example of what you can do with rmetadata in terms of an actual research question.


Get the .Rmd file used to create this post at my github account - or .md file.

Written in Markdown, with help from knitr, and nice knitr highlighting/etc. in in RStudio.

To leave a comment for the author, please follow the link and comment on his blog: Recology - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.