solr – an R interface to Solr
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A number of the APIs we interact with (e.g., PLOS full text API, and USGS's BISON API in rplos and rbison, respectively) expose Solr endpoints. Solr is an Apache hosted project – it is a powerful search server. Given that at least two, and possibly more in the future, of the data providers we interact with provide Solr endpoints, it made sense to create an R package to make robust functions to interact with Solr that work across any Solr endpoint. This is then useful to us, and hopefully others.
The following are a few examples covering some of things you can do in Solr that fall in to six categories:
- Search: via
solr_search
- Grouping: via
solr_group
- Faceting: via
solr_facet
- Highlighting: via
solr_highlight
- Stats: via
solr_stats
- More like this: via
solr_mlt
The solr
package generally has two steps for any query: a) send the request given your inputs, and b) parse the output into a useful R data structure. Part a) is quite easy. However, part b) is harder. We are working hard on making parsers that are as general as possible for each of the data formats that are returned by group, facet, highlight, etc., but of course we will still definitely fail in many cases. Please do submit bug reports to our issue tracker so we can make the parsers work better.
Installation
solr
is on CRAN, so you can install the more stable version there, and some dependencies.
install.packages("solr") install.packages(c("rjson", "plyr", "httr", "XML", "data.table", "assertthat"))
You can install the development version from Github as follows. Below we'll use the Github version – most of below is available in the CRAN version too, except solr_group
.
install.packages("devtools") library(devtools) install_github("solr", "ropensci")
Load the library
library(solr)
Define url endpoint and key
As solr
is a general interface to Solr endpoints, you need to define the url. Here, we'll work with the Public Library of Science full text search API (docs here). Some Solr endpoints will require authentication – I should note that we don't yet handle authentication schemes other than passing in a key in the url, but that's on the to do list.
url <- "http://api.plos.org/search"
Search
solr_search(q = "*:*", rows = 2, fl = "id", url = url) ## http://api.plos.org/search?q=*:*&start=0&rows=2&fl=id&wt=json ## id ## 1 10.1371/journal.pone.0040927/title ## 2 10.1371/journal.pone.0040927/abstract
Search for words "sports" and "alcohol" within seven words of each other
solr_search(q = "everything:\"sports alcohol\"~7", fl = "title", rows = 3, url = url) ## http://api.plos.org/search?q=everything:"sports alcohol"~7&start=0&rows=3&fl=title&wt=json ## title ## 1 “Like Throwing a Bowling Ball at a Battle Ship” Audience Responses to Australian News Stories about Alcohol Pricing and Promotion Policies: A Qualitative Focus Group Study ## 2 Development and Validation of a Risk Score Predicting Substantial Weight Gain over 5 Years in Middle-Aged European Men and Women ## 3 Adolescent Lifestyle and Behaviour: A Survey from a Developing Country
Groups
Most recent publication by journal
solr_group(q = "*:*", group.field = "journal", rows = 5, group.limit = 1, group.sort = "publication_date desc", fl = "publication_date, score", url = url) ## http://api.plos.org/search?group.field=journal&q=*:*&start=0&rows=5&wt=json&group.limit=1&group.sort=publication_date desc&group.sort=publication_date desc&group=true&fl=publication_date, score ## groupValue numFound start publication_date ## 1 plos one 687745 0 2014-01-27T00:00:00Z ## 2 plos genetics 33801 0 2014-01-23T00:00:00Z ## 3 plos neglected tropical diseases 19140 0 2014-01-23T00:00:00Z ## 4 plos biology 24165 0 2014-01-21T00:00:00Z ## 5 none 62556 0 2012-10-23T00:00:00Z ## score ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1
First publication by journal
solr_group(q = "*:*", group.field = "journal", group.limit = 1, group.sort = "publication_date asc", fl = "publication_date, score", fq = "publication_date:[1900-01-01T00:00:00Z TO *]", url = url) ## http://api.plos.org/search?group.field=journal&q=*:*&start=0&fq=publication_date:[1900-01-01T00:00:00Z TO *]&wt=json&group.limit=1&group.sort=publication_date asc&group.sort=publication_date asc&group=true&fl=publication_date, score ## groupValue numFound start publication_date ## 1 plos one 687745 0 2006-12-01T00:00:00Z ## 2 plos genetics 33801 0 2005-06-17T00:00:00Z ## 3 plos neglected tropical diseases 19140 0 2007-08-30T00:00:00Z ## 4 plos biology 24165 0 2003-08-18T00:00:00Z ## 5 none 57574 0 2012-07-17T00:00:00Z ## 6 plos computational biology 25196 0 2005-06-24T00:00:00Z ## 7 plos medicine 17149 0 2004-09-07T00:00:00Z ## 8 plos pathogens 29691 0 2005-07-22T00:00:00Z ## 9 plos collections 5 0 2013-12-04T00:00:00Z ## 10 plos clinical trials 521 0 2006-04-21T00:00:00Z ## score ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 1 ## 9 1 ## 10 1
Facet
solr_facet(q = "*:*", facet.field = "journal", facet.query = "cell,bird", url = url) ## http://api.plos.org/search?q=*:*&facet.query=cell&facet.query=bird&facet.field=journal&wt=json&fl=DOES_NOT_EXIST&facet=true ## $facet_queries ## term value ## 1 cell 81733 ## 2 bird 8194 ## ## $facet_fields ## $facet_fields$journal ## X1 X2 ## 1 plos one 687745 ## 2 plos genetics 33801 ## 3 plos pathogens 29691 ## 4 plos computational biology 25196 ## 5 plos biology 24165 ## 6 plos neglected tropical diseases 19140 ## 7 plos medicine 17149 ## 8 plos clinical trials 521 ## 9 plos medicin 9 ## 10 plos collections 5 ## ## ## $facet_dates ## NULL ## ## $facet_ranges ## NULL
Range faceting with > 1 field
out <- solr_facet(q = "*:*", url = url, facet.range = "counter_total_all,alm_twitterCount", facet.range.start = 5, facet.range.end = 1000, facet.range.gap = 10) ## http://api.plos.org/search?q=*:*&facet.range=counter_total_all&facet.range=alm_twitterCount&facet.range.start=5&facet.range.end=1000&facet.range.gap=10&wt=json&fl=DOES_NOT_EXIST&facet=true lapply(out$facet_ranges, head) ## $counter_total_all ## X1 X2 ## 1 5 0 ## 2 15 421 ## 3 25 800 ## 4 35 867 ## 5 45 690 ## 6 55 789 ## ## $alm_twitterCount ## X1 X2 ## 1 5 37537 ## 2 15 7940 ## 3 25 3559 ## 4 35 1627 ## 5 45 1416 ## 6 55 795
Highlight
solr_highlight(q = "alcohol", hl.fl = "abstract", rows = 2, url = url) ## http://api.plos.org/search?wt=json&q=alcohol&start=0&rows=2&hl=true&hl.fl=abstract&fl=DOES_NOT_EXIST ## $`10.1371/journal.pmed.0040151` ## $`10.1371/journal.pmed.0040151`$abstract ## [1] "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting" ## ## ## $`10.1371/journal.pone.0027752` ## $`10.1371/journal.pone.0027752`$abstract ## [1] "Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking"
Stats
out <- solr_stats(q = "ecology", stats.field = "counter_total_all,alm_twitterCount", stats.facet = "journal,volume", url = url) ## http://api.plos.org/search?q=ecology&stats.field=counter_total_all&stats.field=alm_twitterCount&stats.facet=journal&stats.facet=volume&start=0&rows=0&wt=json&stats=true out$data ## min max count missing sum sumOfSquares mean ## counter_total_all 0 294080 18696 0 61103092 1.020e+12 3268.244 ## alm_twitterCount 0 1345 18696 0 63470 8.682e+06 3.395 ## stddev ## counter_total_all 6623.95 ## alm_twitterCount 21.28 out$facet ## $counter_total_all ## $counter_total_all$journal ## min max count missing sum sumOfSquares mean stddev ## 1 669 38118 414 0 2160463 1.870e+10 5219 4241 ## 2 582 42775 543 0 3170356 2.951e+10 5839 4505 ## 3 0 294080 14463 0 37291578 5.716e+11 2578 5734 ## 4 4401 8325 2 0 12726 8.867e+07 6363 2775 ## 5 157 54715 373 0 1971606 2.074e+10 5286 5268 ## 6 507 83913 209 0 2288958 5.031e+10 10952 11016 ## 7 893 161403 751 0 8647100 2.228e+11 11514 12820 ## 8 211 160180 690 0 2252119 3.722e+10 3264 6584 ## facet_field ## 1 plos pathogens ## 2 plos genetics ## 3 plos one ## 4 plos clinical trials ## 5 plos computational biology ## 6 plos medicine ## 7 plos biology ## 8 plos neglected tropical diseases ## ## $counter_total_all$volume ## min max count missing sum sumOfSquares mean stddev ## 1 840 108024 741 0 5138959 9.339e+10 6935 8834 ## 2 1144 85924 482 0 3999437 7.887e+10 8298 9746 ## 3 157 78129 104 0 870304 2.037e+10 8368 11270 ## 4 1385 109254 81 0 1075033 3.657e+10 13272 16698 ## 5 362 179012 4823 0 12626395 1.785e+11 2618 5492 ## 6 539 160180 2946 0 10172036 1.292e+11 3453 5652 ## 7 491 74271 1538 0 7411041 8.495e+10 4819 5660 ## 8 507 294080 1010 0 6330604 1.848e+11 6268 11994 ## 9 0 161403 812 0 2260500 4.591e+10 2784 6989 ## 10 78 166682 6093 0 10519943 1.515e+11 1727 4678 ## 11 1979 70376 62 0 670098 1.534e+10 10808 11521 ## 12 893 24278 4 0 28742 5.969e+08 7186 11407 ## facet_field ## 1 3 ## 2 2 ## 3 10 ## 4 1 ## 5 7 ## 6 6 ## 7 5 ## 8 4 ## 9 9 ## 10 8 ## 11 11 ## 12 12 ## ## ## $alm_twitterCount ## $alm_twitterCount$journal ## min max count missing sum sumOfSquares mean stddev ## 1 0 73 414 0 1255 31525 3.031 8.193 ## 2 0 99 543 0 1444 36060 2.659 7.710 ## 3 0 741 14463 0 43477 4699037 3.006 17.773 ## 4 0 3 2 0 3 9 1.500 2.121 ## 5 0 102 373 0 1140 36082 3.056 9.361 ## 6 0 456 209 0 2098 357306 10.038 40.207 ## 7 0 1345 751 0 5871 2518003 7.818 57.412 ## 8 0 785 690 0 1803 628569 2.613 30.091 ## facet_field ## 1 plos pathogens ## 2 plos genetics ## 3 plos one ## 4 plos clinical trials ## 5 plos computational biology ## 6 plos medicine ## 7 plos biology ## 8 plos neglected tropical diseases ## ## $alm_twitterCount$volume ## min max count missing sum sumOfSquares mean stddev facet_field ## 1 0 18 741 0 308 2380 0.4157 1.744 3 ## 2 0 36 482 0 276 4426 0.5726 2.979 2 ## 3 0 456 104 0 2580 370684 24.8077 54.566 10 ## 4 0 28 81 0 82 1630 1.0123 4.397 1 ## 5 0 734 4823 0 17106 1585234 3.5468 17.781 7 ## 6 0 785 2946 0 2710 761766 0.9199 16.057 6 ## 7 0 110 1538 0 1058 39912 0.6879 5.049 5 ## 8 0 147 1010 0 505 27497 0.5000 5.196 4 ## 9 0 206 812 0 5509 297295 6.7845 17.902 9 ## 10 0 741 6093 0 28449 3110039 4.6691 22.107 8 ## 11 1 1345 62 0 4292 2185830 69.2258 175.962 11 ## 12 14 543 4 0 595 295767 148.7500 262.844 12
More like this
solr_mlt
is a function to return similar documents to the ones searched for.
out <- solr_mlt(q = "title:\"ecology\" AND body:\"cell\"", mlt.fl = "title", mlt.mindf = 1, mlt.mintf = 1, fl = "counter_total_all", rows = 5, url = url) ## http://api.plos.org/search?q=title:"ecology" AND body:"cell"&mlt=true&fl=id,counter_total_all&mlt.fl=title&mlt.mintf=1&mlt.mindf=1&start=0&rows=5&wt=json out$docs ## id counter_total_all ## 1 10.1371/journal.pbio.0020440 16052 ## 2 10.1371/journal.pone.0040117 1654 ## 3 10.1371/journal.pone.0072525 681 ## 4 10.1371/journal.ppat.1002320 4708 ## 5 10.1371/journal.pone.0015143 11161
Raw data?
You can optionally get back raw json
or xml
from all functions by setting parameter raw=TRUE
. You can then parse after the fact with solr_parse
, or just process as you wish. For example:
(out <- solr_highlight(q = "alcohol", hl.fl = "abstract", rows = 2, url = url, raw = TRUE)) ## http://api.plos.org/search?wt=json&q=alcohol&start=0&rows=2&hl=true&hl.fl=abstract&fl=DOES_NOT_EXIST ## [1] "{\"response\":{\"numFound\":11628,\"start\":0,\"docs\":[{},{}]},\"highlighting\":{\"10.1371/journal.pmed.0040151\":{\"abstract\":[\"Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting\"]},\"10.1371/journal.pone.0027752\":{\"abstract\":[\"Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking\"]}}}\n" ## attr(,"class") ## [1] "sr_high" ## attr(,"wt") ## [1] "json"
Then parse
solr_parse(out, "df") ## names ## 1 10.1371/journal.pmed.0040151 ## 2 10.1371/journal.pone.0027752 ## abstract ## 1 Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting ## 2 Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking
Verbosity
As you have noticed, we include in each function the acutal call to the Solr endpoint made so you know exactly what was submitted to the remote or local Solr instance. You can suppress the message with verbose=FALSE
. This message isn't in the CRAN version.
Advanced: Function Queries
Function Queries allow you to query on actual numeric fields in the SOLR database, and do addition, multiplication, etc on one or many fields to stort results. For example, here, we search on the product of counter_total_all and alm_twitterCount, using a new temporary field "_val_"
solr_search(q = "_val_:\"product(counter_total_all,alm_twitterCount)\"", rows = 5, fl = "id,title", fq = "doc_type:full", url = url) ## http://api.plos.org/search?q=_val_:"product(counter_total_all,alm_twitterCount)"&start=0&rows=5&fq=doc_type:full&fl=id,title&wt=json ## id ## 1 10.1371/journal.pmed.0020124 ## 2 10.1371/journal.pone.0046362 ## 3 10.1371/journal.pntd.0001969 ## 4 10.1371/journal.pone.0069841 ## 5 10.1371/journal.pbio.1001535 ## title ## 1 Why Most Published Research Findings Are False ## 2 The Power of Kawaii: Viewing Cute Images Promotes a Careful Behavior and Narrows Attentional Focus ## 3 An In-Depth Analysis of a Piece of Shit: Distribution of Schistosoma mansoni and Hookworm Eggs in Human Stool ## 4 Facebook Use Predicts Declines in Subjective Well-Being in Young Adults ## 5 An Introduction to Social Media for Scientists
Here, we search for the papers with the most citations
solr_search(q = "_val_:\"max(counter_total_all)\"", rows = 5, fl = "id,counter_total_all", fq = "doc_type:full", url = url) ## http://api.plos.org/search?q=_val_:"max(counter_total_all)"&start=0&rows=5&fq=doc_type:full&fl=id,counter_total_all&wt=json ## id counter_total_all ## 1 10.1371/journal.pmed.0020124 802095 ## 2 10.1371/journal.pmed.0050045 301752 ## 3 10.1371/journal.pone.0007595 294080 ## 4 10.1371/journal.pone.0044864 234574 ## 5 10.1371/journal.pone.0033288 200899
Or with the most tweets
solr_search(q = "_val_:\"max(alm_twitterCount)\"", rows = 5, fl = "id,alm_twitterCount", fq = "doc_type:full", url = url) ## http://api.plos.org/search?q=_val_:"max(alm_twitterCount)"&start=0&rows=5&fq=doc_type:full&fl=id,alm_twitterCount&wt=json ## id alm_twitterCount ## 1 10.1371/journal.pbio.1001535 1345 ## 2 10.1371/journal.pone.0046362 1127 ## 3 10.1371/journal.pmed.0020124 1016 ## 4 10.1371/journal.pntd.0001969 785 ## 5 10.1371/journal.pone.0065263 741
Further reading on Solr
- Solr home page
- Highlighting help
- Faceting help
- Solr stats
- 'More like this' searches
- Grouping/Feild collapsing
- Installing Solr on Mac using homebrew
- Install and Setup SOLR in OSX, including running Solr
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.