MongoDB – State of the R
There are two typical reasons for accessing MongoDB from R:
- MongoDB is already in use for whatever reason and you want to analyze the data stored in it
- You decide you want to store your data in MongoDB instead of using native R technology like data.table or data.frame
In-memory data storage like data.table is very fast, especially for numerical data, provided the data actually fits into your RAM. But even then MongoDB comes with a bag of goodies that makes it a tempting choice for a number of use cases:
- flexible schema-less data structures
- spatial and textual indexing
- spatial queries
- persistence of data
- easily accessible from other languages and systems
If you would like to learn more about MongoDB, I have good news for you: MongoDB Inc. provides a number of very well made online courses catering to various languages. You may find an overview here.
rmongodb versus RMongo
The good news is that there are two packages available for making R talk to MongoDB:
- rmongodb by MongoDB Inc. and Markus Schmidberger
- RMongo by Thommy Chheng
For a larger project at work I decided to go with rmongodb because, in contrast to RMongo, it does not require Java and it seems to be more actively developed. On top of that, MongoDB Inc. itself apparently has a finger in the pie. I can say I did not regret that choice. It’s a great package, and along these lines a big thank you to Markus Schmidberger for investing his time in its development. Having said that, there are a few quirks and one rather uncomfortable, not yet resolved issue that you are going to face with non-trivial queries. But before I get to those subjects let me first give you a quick introduction to its use.
Storing Data with rmongodb
I assume that a MongoDB daemon is running on your localhost. By the way, installing MongoDB and using it from R is pretty much effortless on Ubuntu as well as Windows. First we are going to establish a connection and check whether we were successful.
    > library(rmongodb)
    >
    > m <- mongo.create()
    > ns <- "database.collection"
    >
    > mongo.is.connected(m)
    [1] TRUE
Now let’s insert an object which we define using JSON notation:
    json <- '{"a":1, "b":2, "c": {"d":3, "e":4}}'
    bson <- mongo.bson.from.JSON(json)
    mongo.insert(m, ns, bson)
Maybe this intermediate step is a bit surprising. After all, doesn’t MongoDB store JSON? Well, it doesn’t; it works with BSON.
BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type. [bsonspec.org]
The data structure closest to JSON in R is a list, so naturally we can use that too for specifying a document to be inserted:
    list <- list(a=2, b=3, c=list(d=4, e=5))
    bson <- mongo.bson.from.list(list)
    mongo.insert(m, ns, bson)
Retrieving Data from MongoDB
Now let me show you how to retrieve those two documents and print them. For illustrative purposes I query for documents whose field “a” holds a value greater than or equal to 1:
    > json <- '{"a":{"$gte":1}}'
    > bson <- mongo.bson.from.JSON(json)
    > cursor <- mongo.find(m, ns, bson)
    > while(mongo.cursor.next(cursor)) {
    +     value <- mongo.cursor.value(cursor)
    +     list <- mongo.bson.to.list(value)
    +     str(list)
    + }
    List of 4
     $ _id:Class 'mongo.oid'  atomic [1:1] 1
      .. ..- attr(*, "mongo.oid")=<externalptr>
     $ a  : num 1
     $ b  : num 2
     $ c  : Named num [1:2] 3 4
      ..- attr(*, "names")= chr [1:2] "d" "e"
    List of 4
     $ _id:Class 'mongo.oid'  atomic [1:1] 0
      .. ..- attr(*, "mongo.oid")=<externalptr>
     $ a  : num 2
     $ b  : num 3
     $ c  : Named num [1:2] 4 5
      ..- attr(*, "names")= chr [1:2] "d" "e"
The search query, too, is only superficially JSON and hence has to be cast to BSON before it is applied. The result is a cursor which has to be iterated to obtain the resulting documents, as BSON of course. This little ritual of converting to and from BSON feels a bit clumsy at times, but one gets used to it eventually, and of course nothing keeps you from writing a more comfortable wrapper.
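Such a wrapper could look roughly like the following. This is only a minimal sketch: the function name find_as_lists is made up for illustration, and it relies solely on the rmongodb calls already used above plus mongo.cursor.destroy() for cleanup. It needs a live connection to actually return anything.

```r
# Hypothetical convenience wrapper (not part of rmongodb):
# takes a JSON query string and returns the matching documents as R lists.
find_as_lists <- function(m, ns, json_query) {
  query  <- rmongodb::mongo.bson.from.JSON(json_query)
  cursor <- rmongodb::mongo.find(m, ns, query)
  # release the cursor even if something fails while iterating
  on.exit(rmongodb::mongo.cursor.destroy(cursor))

  docs <- list()
  while (rmongodb::mongo.cursor.next(cursor)) {
    value <- rmongodb::mongo.cursor.value(cursor)
    docs[[length(docs) + 1L]] <- rmongodb::mongo.bson.to.list(value)
  }
  docs
}
```

With the connection m and namespace ns from above, find_as_lists(m, ns, '{"a":{"$gte":1}}') would then replace the whole cursor loop.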
Implicit Conversion of Sub-Document to Vector
To illustrate my point I added a document to the collection from the MongoDB shell:
db.collection.insert({_id:1, a:{b:2, c:3, d:4, e:5}})
What you will see now is that rmongodb implicitly casts the lowest sub-document to a named R vector:
    > json <- '{"_id":1}'
    > bson <- mongo.bson.from.JSON(json)
    > cursor <- mongo.find(m, ns, bson)
    > mongo.cursor.next(cursor)
    [1] TRUE
    > value <- mongo.cursor.value(cursor)
    > list <- mongo.bson.to.list(value)
    > print(list)
    $`_id`
    [1] 1

    $a
    b c d e
    2 3 4 5

    > str(list)
    List of 2
     $ _id: num 1
     $ a  : Named num [1:4] 2 3 4 5
      ..- attr(*, "names")= chr [1:4] "b" "c" "d" "e"

    # it is not possible to access the sub-document as a list()
    > list$a$b
    Error in list$a$b : $ operator is invalid for atomic vectors
    > list$a["b"]
    b
    2
This is something you have to keep in mind. Personally I find this a tad uncomfortable, as a list() would work too and would allow for more homogeneous processing. But that is just my two cents.
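If you prefer lists throughout, the conversion can be undone after the fact. The following is a small sketch in base R; the helper name vectors_to_lists is hypothetical, and the doc object merely mimics the shape that mongo.bson.to.list() returned in the session above. Note that it only touches named vectors, so unnamed vectors stay as they are.

```r
# Same shape as the result shown above: the lowest sub-document "a"
# has arrived as a named numeric vector instead of a list.
doc <- list(`_id` = 1, a = c(b = 2, c = 3, d = 4, e = 5))

# Hypothetical helper: recursively turn named atomic vectors into lists.
vectors_to_lists <- function(x) {
  if (is.list(x)) {
    return(lapply(x, vectors_to_lists))
  }
  if (is.atomic(x) && !is.null(names(x))) {
    return(as.list(x))
  }
  x
}

doc2 <- vectors_to_lists(doc)
doc2$a$b  # list-style access now works
```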
Formulating Queries Using Arrays is Problematic
At present it is not possible to easily formulate a BSON containing an array. This is primarily a problem when you want to query documents and the query expression needs an array. For example:
{"$or": [{"a":1}, {"a":3}]}
Let’s assume three documents in database.collection:
    > db.collection.find()
    { "_id" : ObjectId("53fe48857f953e7c617eea04"), "a" : 1 }
    { "_id" : ObjectId("53fe48877f953e7c617eea05"), "a" : 2 }
    { "_id" : ObjectId("53fe488a7f953e7c617eea06"), "a" : 3 }
We would expect the query to return the two documents with a = 1 and a = 3. But passing it to rmongodb as a list currently yields an error:
    > library(rmongodb)
    > M <- mongo.create("localhost")
    > mongo.is.connected(M)
    [1] TRUE
    >
    > qry1 <- list(
    +     "a" = 1
    + )
    >
    > qry2 <- list(
    +     "$or" = list(
    +         list("a" = 1),
    +         list("a" = 3)
    +     )
    + )
    >
    > qry1 <- mongo.bson.from.list(qry1)
    > qry2 <- mongo.bson.from.list(qry2)
    >
    > mongo.count(M, "test.xxx", qry1)
    [1] 1
    > mongo.count(M, "test.xxx", qry2)
    [1] -1
    > mongo.get.last.err(M, "test")
        connectionId : 16    24
        err : 2      $or needs an array
        code : 16    2
        n : 16       0
        ok : 1       1.000000
Building Up the BSON Buffer from Scratch
Internally rmongodb constructs a BSON from a JSON by creating an initial BSON buffer and then adding the elements of the JSON document to it. For the suggested query this would be done as follows:
    buf <- mongo.bson.buffer.create()

    # "$or":[ ...
    mongo.bson.buffer.start.array(buf, "$or")

    # dummy name "0" for object in array
    # "0": { ...
    mongo.bson.buffer.start.object(buf, "0")

    # "a":1
    mongo.bson.buffer.append.int(buf, "a", 1)

    # ... }
    mongo.bson.buffer.finish.object(buf)

    mongo.bson.buffer.start.object(buf, "1")
    mongo.bson.buffer.append.int(buf, "a", 3)
    mongo.bson.buffer.finish.object(buf)

    # ...]
    mongo.bson.buffer.finish.object(buf)

    bson <- mongo.bson.from.buffer(buf)
Obviously this approach is going to get quite complicated and error-prone for even mildly complex queries or documents. But then again this method of building up a BSON is also quite generic and straightforward. So, to simplify this task I composed a little package that recursively traverses the list representing the document and invokes those functions appropriately.
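To make the idea of the recursive traversal concrete, here is a rough sketch in base R. It is not the package's actual code: the function trace_bson_calls is made up for illustration, and instead of calling rmongodb it only records which kind of buffer call would be issued at each step, so it runs without a database. Lists without names are treated as arrays and get the dummy names "0", "1", and so on.

```r
# Hypothetical sketch: walk a nested list and record the buffer calls
# that would be needed, in the order they would be issued.
trace_bson_calls <- function(x, calls = character()) {
  keys <- names(x)
  if (is.null(keys)) {
    # array: use dummy names "0", "1", ... as in the manual example
    keys <- as.character(seq_along(x) - 1L)
  }
  for (i in seq_along(x)) {
    value <- x[[i]]
    if (is.list(value)) {
      opener <- if (is.null(names(value))) "start.array" else "start.object"
      calls <- c(calls, sprintf('%s("%s")', opener, keys[i]))
      calls <- trace_bson_calls(value, calls)
      # arrays and objects are both closed with finish.object
      calls <- c(calls, "finish.object()")
    } else {
      calls <- c(calls, sprintf('append("%s", %s)', keys[i], format(value)))
    }
  }
  calls
}

qry <- list("$or" = list(list(a = 1), list(a = 3)))
trace <- trace_bson_calls(qry)
cat(trace, sep = "\n")
```

The printed sequence follows the same order as the hand-written buffer code above: one start.array for "$or", two start.object/append/finish.object groups, and a final finish.object for the array.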
rmongodbHelper at Your Service
The following code will turn the or-query from a JSON into a BSON:
    # install rmongodbHelper package from GitHub
    # install.packages("devtools")
    library(devtools)
    devtools::install_github("joyofdata/rmongodbHelper")

    # note: R package names are case-sensitive; if library() or the
    # namespace prefix below fails, check the spelling of the installed package
    library(rmongodbHelper)

    json_qry <- '{
        "$or": [
            {"a":1},
            {"a":3}
        ]
    }'

    bson <- rmongodbhelper::json_to_bson(json_qry)
    cur <- mongo.find(M, "dbx.collx", bson)

    while(mongo.cursor.next(cur)) {
        print(mongo.cursor.value(cur))
    }
And its result:
    _id : 7      53fa14315aed8483db4ae794
    a : 16       1
    _id : 7      53fa14315aed8483db4ae796
    a : 16       3
Why Oh Why?
The reason why I took the time to craft a solution for this issue is twofold:
- I don’t program C, so I cannot contribute to rmongodb by fixing the issue myself
- The issue seems to have been around for almost a year already
I really hope my solution will become superfluous very soon because R deserves an awesome MongoDB connector and rmongodb is pretty damn awesome already.
Some Details about rmongodbHelper
I answered a couple of questions on stackoverflow.com which will provide more code on how to use the package:
- Query MongoDB from R with mongo.bson.from.list() and $or expression
- Using $or array in query
- rmongodb: using $or in query
Keep a few rules in mind in case you would like to use it:
- It is version 0.0.0, so there will be bugs; you are welcome to contribute or complain
- If you would like to feed it a JSON then keep in mind that all keys have to be placed within double quotes
- '"x":3' will be cast as double
- '"x":__int(3)' will be cast as integer
- Internally arrays and sub-documents are implemented as list()s. They are differentiated by the non-existence of names() for arrays and the presence of names for sub-documents. An empty array has to contain one list() element .ARR with an arbitrary value.
To give an example for the last point:
    L <- list(
        obj = list(
            obj = list(),
            array = list(.ARR=1)
        ),
        arr = list(
            list(a = 1, b = 1),
            list("c","d")
        )
    )

    bson <- rmongodbhelper::list_to_bson(L)
    mongo.insert(M, "database.collection", bson)
Pretty-printed on the MongoDB shell, the resulting document looks as follows:
    > use database
    > db.collection.find().pretty()
    {
        "_id" : ObjectId("540085c4d915dc2b16c1327a"),
        "obj" : {
            "obj" : { },
            "array" : [ ]
        },
        "arr" : [
            {
                "a" : 1,
                "b" : 1
            },
            [
                "c",
                "d"
            ]
        ]
    }
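The rule for telling arrays and sub-documents apart can be spelled out in a few lines of base R. The function node_kind below is a hypothetical illustration of the rules as stated above, not code taken from the package, and it is applied to the same list L as in the example.

```r
# Hypothetical classifier mirroring the stated rules:
# no names -> array, names -> sub-document, single element .ARR -> empty array.
node_kind <- function(x) {
  if (!is.list(x)) return("value")
  keys <- names(x)
  if (identical(keys, ".ARR")) return("empty array")
  if (is.null(keys)) {
    # an empty list() carries no names; in the example it ends up as { }
    if (length(x) == 0L) return("empty sub-document")
    return("array")
  }
  "sub-document"
}

L <- list(
  obj = list(obj = list(), array = list(.ARR = 1)),
  arr = list(list(a = 1, b = 1), list("c", "d"))
)

kinds <- c(
  obj       = node_kind(L$obj),
  obj.obj   = node_kind(L$obj$obj),
  obj.array = node_kind(L$obj$array),
  arr       = node_kind(L$arr),
  arr.1     = node_kind(L$arr[[1]]),
  arr.2     = node_kind(L$arr[[2]])
)
print(kinds)
```

This also shows why the .ARR marker is needed: without it, an empty list() could not be told apart from an empty array.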
(original article published on www.joyofdata.de)