MongoDB – State of the R

Posted on August 31, 2014 by Raffael Vogler

[This article was first published on joy of data » R, and kindly contributed to R-bloggers].

Naturally there are two reasons why you might need to access MongoDB from R:

MongoDB is already used for whatever reason and you want to analyze the data stored therein
You decide you want to store your data in MongoDB instead of using native R technology like data.table or data.frame

In-memory data storage like data.table is very fast especially for numerical data, provided the data actually fits into your RAM – but even then MongoDB comes along with a bag of goodies making it a tempting choice for a number of use cases:

Flexible schema-less data structures
spatial and textual indexing
spatial queries
persistence of data
easily accessible from other languages and systems

In case you would like to learn more about MongoDB then I have good news for you – MongoDB Inc. provides a number of very well made online courses catering to various languages. An overview you may find here.

rmongodb versus RMongo

The good news is – there are two packages available for making R talk to MongoDB.

rmongodb by MongoDB Inc. and Markus Schmidberger
RMongo by Thommy Chheng

For a larger project at work I decided to go with rmongodb because, in contrast to RMongo, it does not require Java and it seems to be more actively developed, not to mention that MongoDB Inc. itself apparently has a finger in the pie. I did not regret that choice. It’s a great package, and along these lines a big thank you to Markus Schmidberger for investing his time in its development. Having said that, there are a few quirks and one rather uncomfortable, not yet resolved issue you are going to face with non-trivial queries. But before I get to those subjects let me first give you a quick introduction to its use.

Storing Data with rmongodb

I assume that a MongoDB daemon is running on your localhost. By the way installing and using MongoDB from R is pretty much effortless for Ubuntu as well as Windows. First we are going to establish a connection and check whether we were successful.

> library(rmongodb)
> 
> m <- mongo.create()
> ns <- "database.collection"
> 
> mongo.is.connected(m)
[1] TRUE

Now let’s insert an object which we define using JSON notation:

json <- '{"a":1, "b":2, "c": {"d":3, "e":4}}'
bson <- mongo.bson.from.JSON(json)
mongo.insert(m, ns, bson)

Maybe this intermediate step is a bit surprising – after all MongoDB stores JSONs!? Well, it doesn’t – it works with BSONs.

BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type. [bsonspec.org]

The data structure closest to a JSON in R is a list – so naturally we can use that too for specifying a document to be inserted:

list <- list(a=2, b=3, c=list(d=4, e=5))
bson <- mongo.bson.from.list(list)
mongo.insert(m, ns, bson)

Retrieving Data from MongoDB

Now let me show you how to retrieve those two documents and print them. For illustrative purposes I query for documents whose field “a” holds a value greater than or equal to 1:

> json <- '{"a":{"$gte":1}}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> while(mongo.cursor.next(cursor)) {
+   value <- mongo.cursor.value(cursor)
+   list <- mongo.bson.to.list(value)
+   str(list)
+ }

List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 1
  .. ..- attr(*, "mongo.oid")=<externalptr> 
 $ a  : num 1
 $ b  : num 2
 $ c  : Named num [1:2] 3 4
  ..- attr(*, "names")= chr [1:2] "d" "e"

List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 0
  .. ..- attr(*, "mongo.oid")=<externalptr> 
 $ a  : num 2
 $ b  : num 3
 $ c  : Named num [1:2] 4 5
  ..- attr(*, "names")= chr [1:2] "d" "e"

Also the search query is only superficially a JSON and hence has to be cast to BSON before it is applied. The result is a cursor which has to be iterated to reach the resulting documents, as BSONs of course. This little ritual of converting to and from BSON feels a bit clumsy at times, but one gets used to it eventually, and of course nothing keeps you from writing a more comfortable wrapper.
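Such a wrapper could look roughly like the following sketch. The function name collect_cursor and its arguments are my own invention for illustration; only the rmongodb functions used as defaults come from the package. Passing them in as arguments is merely a convenience that lets you try the loop with a stand-in cursor when no MongoDB daemon is at hand.

```r
# Hypothetical helper (not part of rmongodb): drain a cursor into an R list.
# The defaults are only looked up when they are actually used, so the
# function can also be exercised with stubs.
collect_cursor <- function(cursor,
                           step    = rmongodb::mongo.cursor.next,
                           current = rmongodb::mongo.cursor.value,
                           convert = rmongodb::mongo.bson.to.list) {
  docs <- list()
  while (step(cursor)) {
    docs[[length(docs) + 1]] <- convert(current(cursor))
  }
  docs
}

# Stand-in cursor for a dry run: an environment holding rows and a position.
fake <- new.env()
fake$rows <- list(list(a = 1, b = 2), list(a = 2, b = 3))
fake$pos  <- 0
fake_step  <- function(cur) { cur$pos <- cur$pos + 1; cur$pos <= length(cur$rows) }
fake_value <- function(cur) cur$rows[[cur$pos]]

docs <- collect_cursor(fake, fake_step, fake_value, identity)
str(docs)
```

With a live connection the call would presumably shrink to collect_cursor(mongo.find(m, ns, bson)), leaving the defaults untouched.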

Implicit Conversion of Sub-Document to Vector

For the purpose of illustrating my point I added a document into the collection from MongoDB shell:

db.collection.insert({_id:1, a:{b:2, c:3, d:4, e:5}})

Now what you will see is that rmongodb implicitly casts the lowest sub-document to a named R vector:

> json <- '{"_id":1}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> mongo.cursor.next(cursor)
[1] TRUE

> value <- mongo.cursor.value(cursor)
> list <- mongo.bson.to.list(value)

> print(list)
$`_id`
[1] 1

$a
b c d e 
2 3 4 5 

> str(list)
List of 2
 $ _id: num 1
 $ a  : Named num [1:4] 2 3 4 5
  ..- attr(*, "names")= chr [1:4] "b" "c" "d" "e"

# it is not possible to access the sub-document as a list()
> list$a$b
Error in list$a$b : $ operator is invalid for atomic vectors

> list$a["b"]
b 
2

This is something you have to keep in mind. Personally I find it a tad uncomfortable, as a list() would work too and would allow for more homogeneous processing. But that is just my two cents.
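If you prefer list-style access, base R can convert the named vector back. The snippet below is a small sketch that rebuilds a vector shaped like the one shown above by hand (no database involved) and uses only base R functions.

```r
# Named numeric vector, shaped like the value returned for field "a"
a <- c(b = 2, c = 3, d = 4, e = 5)

# as.list() keeps the names, so `$` access works again
a_list <- as.list(a)
print(a_list$b)

# [[ ]] with a name works on the vector and on the list alike
print(a[["b"]])
print(a_list[["b"]])
```

Using [[ ]] throughout is one way to write code that does not care whether a sub-document arrived as a vector or as a list.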

Formulating Queries Using Arrays is Problematic

At present it is not possible to easily formulate a BSON containing an array. This is primarily a problem when you want to query documents and the query expression needs an array. For example:

{"$or": [{"a":1}, {"a":3}]}

Let’s assume three documents in database.collection:

> db.collection.find()
{ "_id" : ObjectId("53fe48857f953e7c617eea04"), "a" : 1 }
{ "_id" : ObjectId("53fe48877f953e7c617eea05"), "a" : 2 }
{ "_id" : ObjectId("53fe488a7f953e7c617eea06"), "a" : 3 }

We would expect to receive the two obvious documents, those with a equal to 1 and 3. But running this query through rmongodb currently yields an error:

> library(rmongodb)
> M <- mongo.create("localhost")
> mongo.is.connected(M)
[1] TRUE
> 
> qry1 <- list(
+     "a" = 1
+ )
> 
> qry2 <- list(
+     "$or" = list(
+         list("a" = 1),
+         list("a" = 3)
+     )
+ )
> 
> qry1 <- mongo.bson.from.list(qry1)
> qry2 <- mongo.bson.from.list(qry2)
> 
> mongo.count(M, "test.xxx", qry1)
[1] 1
> mongo.count(M, "test.xxx", qry2)
[1] -1
> mongo.get.last.err(M, "test")
    connectionId : 16    24
    err : 2      $or needs an array
    code : 16    2
    n : 16   0
    ok : 1   1.000000

Building Up the BSON Buffer from Scratch

Internally rmongodb constructs a BSON for a JSON by constructing an initial BSON buffer and then adding to it the elements of the JSON-document. For the suggested query this would be done as follows:

buf <- mongo.bson.buffer.create()

# "$or":[ ...
mongo.bson.buffer.start.array(buf, "$or")

# dummy name "0" for object in array
# "0": { ...
mongo.bson.buffer.start.object(buf, "0")
# "a":1
mongo.bson.buffer.append.int(buf, "a", 1)
# ... }
mongo.bson.buffer.finish.object(buf)

mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.append.int(buf, "a", 3)
mongo.bson.buffer.finish.object(buf)

# ...]
mongo.bson.buffer.finish.object(buf)

bson <- mongo.bson.from.buffer(buf)

Obviously this approach is going to get quite complicated and error-prone for even mildly complex queries/documents. But then again this method of building up a BSON is also quite generic and straightforward. So, to simplify this task I composed a little package that recursively traverses the representing list and invokes those functions appropriately.
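To give an idea of what such a recursive traversal involves, here is a simplified sketch in plain R. It is not the code of my package: trace_buffer_calls is a hypothetical function that only records which kind of buffer call would be issued at each step, it sends nothing to rmongodb, and it ignores edge cases such as empty lists.

```r
# Hypothetical tracer: walk a nested list and note the buffer call
# that would be needed for each element.
trace_buffer_calls <- function(x) {
  calls <- character(0)
  walk <- function(node) {
    keys <- names(node)
    # Unnamed lists are treated as arrays; elements get dummy names "0", "1", ...
    if (is.null(keys)) keys <- as.character(seq_along(node) - 1)
    for (i in seq_along(node)) {
      child <- node[[i]]
      if (is.list(child)) {
        opener <- if (is.null(names(child))) "start.array" else "start.object"
        calls <<- c(calls, sprintf("%s(%s)", opener, keys[i]))
        walk(child)
        calls <<- c(calls, "finish.object()")
      } else {
        calls <<- c(calls, sprintf("append(%s)", keys[i]))
      }
    }
  }
  walk(x)
  calls
}

qry <- list("$or" = list(list(a = 1), list(a = 3)))
steps <- trace_buffer_calls(qry)
print(steps)
```

For the or-query the recorded steps follow the same order as the hand-written buffer code above: open the array, open and close one object per alternative, close the array.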

rmongodbHelper at Your Service

The following code will turn the or-query from a JSON into a BSON:

# install rmongodbHelper package from GitHub

# install.packages("devtools")
library(devtools)

devtools::install_github("joyofdata/rmongodbHelper")
library(rmongodbHelper)

json_qry <- 
'{
  "$or": [
    {"a":1},
    {"a":3}
  ]
}'

bson <- rmongodbhelper::json_to_bson(json_qry)
cur <- mongo.find(M, "dbx.collx", bson)

while(mongo.cursor.next(cur)) {
    print(mongo.cursor.value(cur))
}

And its result:

_id : 7 53fa14315aed8483db4ae794 a : 16 1 
_id : 7 53fa14315aed8483db4ae796 a : 16 3

Why Oh Why?

The reason why I took the time to craft a solution for this issue is twofold:

I don’t program C, so I cannot contribute to rmongodb by fixing the issue myself
The issue seems to have been around for almost a year already

I really hope my solution will become superfluous very soon because R deserves an awesome MongoDB connector and rmongodb is pretty damn awesome already.

Some Details about rmongodbHelper

I answered a couple of questions on stackoverflow.com which provide more code on how to use the package.

Keep a few rules in mind in case you would like to use it:

It is version 0.0.0, so there will be bugs; you are welcome to contribute or complain
If you would like to feed it a JSON then keep in mind that all keys have to be placed within double quotes
'"x":3' will be cast as double
'"x":__int(3)' will be cast as integer
Internally arrays and sub-documents are implemented as list()s. They are differentiated by the non-existence of names() for arrays and the presence of names for sub-documents. An empty array has to contain one list()-element .ARR with an arbitrary value.

To give an example for the last point:

L <- list(
  obj = list(
    obj = list(), 
    array = list(.ARR=1)
  ), 
  arr = list(
    list(a = 1, b = 1), 
    list("c","d")
  )
) 

bson <- rmongodbhelper::list_to_bson(L)
mongo.insert(M, "database.collection", bson)

The resulting document, pretty-printed in the MongoDB shell, looks as follows:

> use database
> db.collection.find().pretty()
{
    "_id" : ObjectId("540085c4d915dc2b16c1327a"),
    "obj" : {
        "obj" : {

        },
        "array" : [ ]
    },
    "arr" : [
        {
            "a" : 1,
            "b" : 1
        },
        [
            "c",
            "d"
        ]
    ]
}
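The names()-based rule can be spelled out in a few lines of plain R. This is a sketch for illustration, not an excerpt from the package: classify_node is a hypothetical function, and treating an empty list() as an empty sub-document is inferred from the output shown above.

```r
# Hypothetical classifier for one node of the representing list
classify_node <- function(x) {
  if (!is.list(x)) return("value")
  if (length(x) == 0) return("sub-document")               # list() -> {}
  if (identical(names(x), ".ARR")) return("empty array")   # list(.ARR = 1) -> []
  if (is.null(names(x))) return("array")                   # no names -> [...]
  "sub-document"                                           # names -> {...}
}

L <- list(
  obj = list(obj = list(), array = list(.ARR = 1)),
  arr = list(list(a = 1, b = 1), list("c", "d"))
)

kinds <- c(
  obj_obj   = classify_node(L$obj$obj),
  obj_array = classify_node(L$obj$array),
  arr       = classify_node(L$arr),
  arr_1     = classify_node(L$arr[[1]]),
  arr_2     = classify_node(L$arr[[2]])
)
print(kinds)
```

The five results line up with the braces and brackets of the pretty-printed document.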

(original article published on www.joyofdata.de)
