MongoDB – State of the R

August 31, 2014

(This article was first published on joy of data » R, and kindly contributed to R-bloggers)

Naturally there are two reasons why you might need to access MongoDB from R:

  1. MongoDB is already used for whatever reason and you want to analyze the data stored therein
  2. You decide you want to store your data in MongoDB instead of using native R technology like data.table or data.frame

In-memory data storage like data.table is very fast especially for numerical data, provided the data actually fits into your RAM – but even then MongoDB comes along with a bag of goodies making it a tempting choice for a number of use cases:

  • Flexible schema-less data structures
  • spatial and textual indexing
  • spatial queries
  • persistence of data
  • easily accessible from other languages and systems

In case you would like to learn more about MongoDB, I have good news for you: MongoDB Inc. provides a number of very well-made online courses catering to various languages.

rmongodb versus RMongo

The good news is – there are two packages available for making R talk to MongoDB.

For a larger project at work I decided to go with rmongodb because, in contrast to RMongo, it does not require Java, and it seems to be more actively developed. MongoDB Inc. itself apparently has a finger in the pie, too. I did not regret that choice. It’s a great package, and along these lines a big thank you to Markus Schmidberger for investing his time in its development. Having said that, there are a few quirks and one rather uncomfortable, not-yet-resolved issue that you are going to face with non-trivial queries. But before I get to those subjects, let me first give you a quick introduction to its use.

Storing Data with rmongodb

I assume that a MongoDB daemon is running on your localhost. By the way, installing MongoDB and using it from R is pretty much effortless on Ubuntu as well as on Windows. First we are going to establish a connection and check whether we were successful.

> library(rmongodb)
> 
> m <- mongo.create()
> ns <- "database.collection"
> 
> mongo.is.connected(m)
[1] TRUE

Now let’s insert an object which we define using JSON notation:

json <- '{"a":1, "b":2, "c": {"d":3, "e":4}}'
bson <- mongo.bson.from.JSON(json)
mongo.insert(m, ns, bson)

Maybe this intermediate step is a bit surprising – after all MongoDB stores JSONs!? Well, it doesn’t – it works with BSONs.

BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type. [bsonspec.org]
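
To make the "binary-encoded" part concrete, here is a small base R sketch that assembles the BSON bytes for the document {"a": 1} by hand, following the layout given at bsonspec.org. It stores the value as a 32-bit integer purely for illustration, and the helper name int32_le is my own and not part of rmongodb.

```r
# Hand-assemble the BSON bytes for {"a": 1} with the value stored as int32.
# Layout per bsonspec.org: <int32 total length> <elements> <0x00 terminator>.

# Hypothetical helper: a 32-bit little-endian integer as raw bytes.
int32_le <- function(x) {
  writeBin(as.integer(x), raw(), size = 4, endian = "little")
}

element <- c(
  as.raw(0x10),     # element type 0x10 (decimal 16) = 32-bit integer
  charToRaw("a"),   # key name ...
  as.raw(0x00),     # ... terminated by a null byte
  int32_le(1)       # the value itself
)

# The length prefix counts itself (4 bytes), the elements and the terminator.
doc <- c(int32_le(4 + length(element) + 1), element, as.raw(0x00))

print(doc)
# [1] 0c 00 00 00 10 61 00 01 00 00 00 00
```

The type code 16 is the same number that rmongodb prints next to integer fields, as in the "a : 16 1" output further down.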

 

The data structure closest to a JSON in R is a list – so naturally we can use that too for specifying a document to be inserted:

list <- list(a=2, b=3, c=list(d=4, e=5))
bson <- mongo.bson.from.list(list)
mongo.insert(m, ns, bson)

Retrieving Data from MongoDB

Now let me show you how to retrieve those two documents and print them. For illustrative purposes I query for documents whose field “a” holds a value greater than or equal to 1:

> json <- '{"a":{"$gte":1}}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> while(mongo.cursor.next(cursor)) {
+   value <- mongo.cursor.value(cursor)
+   list <- mongo.bson.to.list(value)
+   str(list)
+ }

List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 1
  .. ..- attr(*, "mongo.oid")=<externalptr> 
 $ a  : num 1
 $ b  : num 2
 $ c  : Named num [1:2] 3 4
  ..- attr(*, "names")= chr [1:2] "d" "e"

List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 0
  .. ..- attr(*, "mongo.oid")=<externalptr> 
 $ a  : num 2
 $ b  : num 3
 $ c  : Named num [1:2] 4 5
  ..- attr(*, "names")= chr [1:2] "d" "e"

The search query, too, is only superficially a JSON and hence has to be cast to BSON before it is applied. The result is a cursor which has to be iterated to reach the resulting documents, which are BSONs as well, of course. This little ritual of converting to and from BSON feels a bit clumsy at times, but one gets used to it eventually, and of course nothing keeps you from writing a more comfortable wrapper.
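
Such a wrapper could look roughly like the following. This is a minimal sketch, not something shipped with rmongodb: the function name find_as_lists and its arguments are made up for illustration, and it only chains the rmongodb calls already used above, plus mongo.cursor.destroy() for cleanup.

```r
# Hypothetical convenience wrapper (not part of rmongodb): run a JSON query
# and return all matching documents as a list of R lists.
find_as_lists <- function(mongo, ns, json_query) {
  query  <- rmongodb::mongo.bson.from.JSON(json_query)
  cursor <- rmongodb::mongo.find(mongo, ns, query)
  # Release the cursor even if the conversion below fails.
  on.exit(rmongodb::mongo.cursor.destroy(cursor))

  docs <- list()
  while (rmongodb::mongo.cursor.next(cursor)) {
    value <- rmongodb::mongo.cursor.value(cursor)
    docs[[length(docs) + 1]] <- rmongodb::mongo.bson.to.list(value)
  }
  docs
}

# Usage, assuming the connection m and the namespace ns from above:
# docs <- find_as_lists(m, ns, '{"a":{"$gte":1}}')
# str(docs)
```

Collecting everything into one list is convenient for small result sets. For large collections, iterating the cursor directly is probably the better choice.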

Implicit Conversion of Sub-Document to Vector

For the purpose of illustrating my point I added a document into the collection from MongoDB shell:

db.collection.insert({_id:1, a:{b:2, c:3, d:4, e:5}})

Now what you will see is that rmongodb implicitly casts the lowest sub-document to a named R vector:

> json <- '{"_id":1}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> mongo.cursor.next(cursor)
[1] TRUE

> value <- mongo.cursor.value(cursor)
> list <- mongo.bson.to.list(value)

> print(list)
$`_id`
[1] 1

$a
b c d e 
2 3 4 5 

> str(list)
List of 2
 $ _id: num 1
 $ a  : Named num [1:4] 2 3 4 5
  ..- attr(*, "names")= chr [1:4] "b" "c" "d" "e"

# it is not possible to access the sub-document as a list()
> list$a$b
Error in list$a$b : $ operator is invalid for atomic vectors

> list$a["b"]
b 
2

This is something you have to keep in mind. Personally I find it a tad uncomfortable, as a list() would work too and would allow for more homogeneous processing. But that is just my two cents.
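
If you prefer uniform access via $, one possible workaround is to post-process the converted document in plain R. The following is a sketch under the assumption that every named atomic vector in the result stems from a sub-document; subdocs_as_lists is a hypothetical helper of my own, and doc merely mimics what mongo.bson.to.list() returned above.

```r
# Stand-in for the result of mongo.bson.to.list() in the example above:
# the sub-document "a" arrives as a named numeric vector, not as a list.
doc <- list(`_id` = 1, a = c(b = 2, c = 3, d = 4, e = 5))

# Hypothetical helper (not part of rmongodb): turn named atomic vectors
# into lists, recursing into nested lists. Unnamed vectors stay untouched.
subdocs_as_lists <- function(x) {
  if (is.list(x)) {
    lapply(x, subdocs_as_lists)
  } else if (is.atomic(x) && !is.null(names(x))) {
    as.list(x)
  } else {
    x
  }
}

doc2 <- subdocs_as_lists(doc)
doc2$a$b
# [1] 2
```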

Formulating Queries Using Arrays is Problematic

At present it is not possible to easily formulate a BSON containing an array. This is primarily a problem when you want to query documents and the query expression needs an array. For example:

{"$or": [{"a":1}, {"a":3}]}

Let’s assume three documents in database.collection:

> db.collection.find()
{ "_id" : ObjectId("53fe48857f953e7c617eea04"), "a" : 1 }
{ "_id" : ObjectId("53fe48877f953e7c617eea05"), "a" : 2 }
{ "_id" : ObjectId("53fe488a7f953e7c617eea06"), "a" : 3 }

We would expect to receive the two documents with a = 1 and a = 3. But running that query through rmongodb currently yields an error:

> library(rmongodb)
> M <- mongo.create("localhost")
> mongo.is.connected(M)
[1] TRUE
> 
> qry1 <- list(
+     "a" = 1
+ )
> 
> qry2 <- list(
+     "$or" = list(
+         list("a" = 1),
+         list("a" = 3)
+     )
+ )
> 
> qry1 <- mongo.bson.from.list(qry1)
> qry2 <- mongo.bson.from.list(qry2)
> 
> mongo.count(M, "test.xxx", qry1)
[1] 1
> mongo.count(M, "test.xxx", qry2)
[1] -1
> mongo.get.last.err(M, "test")
    connectionId : 16    24
    err : 2      $or needs an array
    code : 16    2
    n : 16   0
    ok : 1   1.000000

Building Up the BSON Buffer from Scratch

Internally rmongodb constructs a BSON for a JSON by constructing an initial BSON buffer and then adding to it the elements of the JSON-document. For the suggested query this would be done as follows:

buf <- mongo.bson.buffer.create()

# "$or":[ ...
mongo.bson.buffer.start.array(buf, "$or")

# dummy name "0" for object in array
# "0": { ...
mongo.bson.buffer.start.object(buf, "0")
# "a":1
mongo.bson.buffer.append.int(buf, "a", 1)
# ... }
mongo.bson.buffer.finish.object(buf)

mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.append.int(buf, "a", 3)
mongo.bson.buffer.finish.object(buf)

# ...]
mongo.bson.buffer.finish.object(buf)

bson <- mongo.bson.from.buffer(buf)

Obviously this approach is going to get quite complicated and error-prone for even mildly complex queries and documents. But then again, this method of building up a BSON is also quite generic and straightforward. So, to simplify the task, I composed a little package that recursively traverses the list representing the document and invokes those functions as appropriate.
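
To illustrate the idea of such a recursive traversal without needing a database, here is a base R sketch that walks a nested list and merely prints the sequence of buffer calls it would make. It is not the package's actual code: plan_calls is a made-up name, the call names are abbreviated, and a single generic append stands in for the type-specific append functions.

```r
# Sketch: walk a nested list and emit the buffer calls as plain strings.
# Unnamed lists are treated as arrays, named lists as sub-documents.
plan_calls <- function(node, key = NULL) {
  if (!is.list(node)) {
    return(sprintf('append(buf, "%s", %s)', key, format(node)))
  }
  is_array <- is.null(names(node)) && length(node) > 0
  # Array elements get the dummy names "0", "1", ... as in the manual example.
  keys <- if (is_array) as.character(seq_along(node) - 1) else names(node)
  body <- unlist(Map(plan_calls, node, keys), use.names = FALSE)
  if (is.null(key)) return(body)  # top-level document: no start/finish pair
  start <- if (is_array) "start.array" else "start.object"
  c(sprintf('%s(buf, "%s")', start, key), body, "finish.object(buf)")
}

qry <- list("$or" = list(list(a = 1), list(a = 3)))
writeLines(plan_calls(qry))
# start.array(buf, "$or")
# start.object(buf, "0")
# append(buf, "a", 1)
# finish.object(buf)
# start.object(buf, "1")
# append(buf, "a", 3)
# finish.object(buf)
# finish.object(buf)
```

The printed plan mirrors the hand-written buffer code above line by line, which is the whole point: once the list structure is known, the calls follow mechanically.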

rmongodbHelper at Your Service

The following code will turn the or-query from a JSON into a BSON:

# install rmongodbHelper package from GitHub

# install.packages("devtools")
library(devtools)

devtools::install_github("joyofdata/rmongodbHelper")
library(rmongodbHelper)

json_qry <- 
'{
  "$or": [
    {"a":1},
    {"a":3}
  ]
}'

# R namespaces are case-sensitive: use the same spelling as in library() above
bson <- rmongodbHelper::json_to_bson(json_qry)
cur <- mongo.find(M, "dbx.collx", bson)

while(mongo.cursor.next(cur)) {
    print(mongo.cursor.value(cur))
}

And its result:

_id : 7 53fa14315aed8483db4ae794 a : 16 1 
_id : 7 53fa14315aed8483db4ae796 a : 16 3

Why Oh Why?

The reason why I took the time to craft a solution for this issue is twofold:

  1. I don’t program C, so I cannot contribute to rmongodb by fixing the issue myself
  2. The issue seems to have been around for almost a year already

I really hope my solution will become superfluous very soon because R deserves an awesome MongoDB connector and rmongodb is pretty damn awesome already.

Some Details about rmongodbHelper

I answered a couple of questions on stackoverflow.com which provide more code showing how to use the package.

Keep a few rules in mind in case you would like to use it:

  • It is version 0.0.0, so there will be bugs. You are welcome to contribute or complain
  • If you would like to feed it a JSON, keep in mind that all keys have to be placed within double quotes
  • '"x":3' will be cast as double
  • '"x":__int(3)' will be cast as integer
  • Internally, arrays and sub-documents are implemented as list()s. They are differentiated by the absence of names() for arrays and the presence of names for sub-documents. An empty array has to contain one list() element named .ARR with an arbitrary value.

To give an example for the last point:

L <- list(
  obj = list(
    obj = list(), 
    array = list(.ARR=1)
  ), 
  arr = list(
    list(a = 1, b = 1), 
    list("c","d")
  )
) 

# R namespaces are case-sensitive: match the spelling used in library(rmongodbHelper)
bson <- rmongodbHelper::list_to_bson(L)
mongo.insert(M, "database.collection", bson)

Pretty-printed in the MongoDB shell, the resulting document looks as follows:

> use database
> db.collection.find().pretty()
{
    "_id" : ObjectId("540085c4d915dc2b16c1327a"),
    "obj" : {
        "obj" : {

        },
        "array" : [ ]
    },
    "arr" : [
        {
            "a" : 1,
            "b" : 1
        },
        [
            "c",
            "d"
        ]
    ]
}
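
The naming rule that separates arrays from sub-documents can be expressed in a few lines of base R. This is only a sketch of the rule as stated above, not the code rmongodbHelper actually runs, and classify_node is a hypothetical name.

```r
# Sketch of the rule: unnamed list -> array, named list -> sub-document,
# empty list -> empty sub-document, single element named .ARR -> empty array.
classify_node <- function(x) {
  if (!is.list(x)) return("value")
  if (length(x) == 0) return("sub-document")        # list() becomes {}
  if (identical(names(x), ".ARR")) return("array")  # list(.ARR = 1) becomes []
  if (is.null(names(x))) "array" else "sub-document"
}

L <- list(
  obj = list(obj = list(), array = list(.ARR = 1)),
  arr = list(list(a = 1, b = 1), list("c", "d"))
)

classify_node(L$obj$obj)    # "sub-document"
classify_node(L$obj$array)  # "array"
classify_node(L$arr)        # "array"
classify_node(L$arr[[1]])   # "sub-document"
classify_node(L$arr[[2]])   # "array"
```

The five results line up with the braces and brackets in the shell output above.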


(original article published on www.joyofdata.de)
