MongoDB – State of the R

[This article was first published on joy of data » R, and kindly contributed to R-bloggers.]

Naturally, there are two reasons why you might need to access MongoDB from R:

  1. MongoDB is already used for whatever reason and you want to analyze the data stored therein
  2. You decide you want to store your data in MongoDB instead of using native R technology like data.table or data.frame

In-memory data storage like data.table is very fast, especially for numerical data, provided the data actually fits into your RAM. But even then MongoDB comes with a bag of goodies that make it a tempting choice for a number of use cases:

  • flexible schema-less data structures
  • spatial and textual indexing
  • spatial queries
  • persistence of data
  • easily accessible from other languages and systems

If you would like to learn more about MongoDB, I have good news for you: MongoDB Inc. provides a number of very well-made online courses catering to various languages. You can find an overview here.

rmongodb versus RMongo

The good news is – there are two packages available for making R talk to MongoDB.

For a larger project at work I decided to go with rmongodb because, in contrast to RMongo, it does not require Java, and it seems to be more actively developed, not to mention that MongoDB Inc. itself apparently has a finger in the pie. I did not regret that choice. It’s a great package, and along these lines a big thank you to Markus Schmidberger for investing his time in its development. Having said that, there are a few quirks and a rather uncomfortable, not yet resolved issue that you are going to face for non-trivial queries. But before I get to those subjects, let me first give you a quick introduction to its use.

Storing Data with rmongodb

I assume that a MongoDB daemon is running on your localhost. By the way, installing MongoDB and using it from R is pretty much effortless on Ubuntu as well as Windows. First we are going to establish a connection and check whether we were successful.

> library(rmongodb)
> 
> m <- mongo.create()
> ns <- "database.collection"
> 
> mongo.is.connected(m)
[1] TRUE

Now let’s insert an object which we define using JSON notation:

json <- '{"a":1, "b":2, "c": {"d":3, "e":4}}'
bson <- mongo.bson.from.JSON(json)
mongo.insert(m, ns, bson)

Maybe this intermediate step is a bit surprising – after all MongoDB stores JSONs!? Well, it doesn’t – it works with BSONs.

BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type. [bsonspec.org]
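
To make "binary-encoded" a bit more tangible, here is a minimal sketch in base R that assembles the bytes of the document {"a": 1} by hand, following the layout described at bsonspec.org: a little-endian int32 length, then one element (type byte, NUL-terminated field name, value), then a closing NUL byte. The function name bson_int32_doc is made up for this illustration and is not part of rmongodb; it only handles a single 32-bit integer field.

```r
# Hypothetical helper: encode a one-field document {key: <int32 value>} as BSON bytes.
bson_int32_doc <- function(key, value) {
  element <- c(
    as.raw(0x10),                  # type byte: 0x10 = 32-bit integer
    charToRaw(key), as.raw(0x00),  # field name as NUL-terminated C string
    writeBin(as.integer(value), raw(), size = 4, endian = "little")
  )
  # Total size = 4-byte length prefix + element + 1 terminating NUL byte.
  total <- 4L + length(element) + 1L
  c(writeBin(total, raw(), size = 4, endian = "little"), element, as.raw(0x00))
}

bytes <- bson_int32_doc("a", 1)
print(bytes)
# 12 bytes: 0c 00 00 00 10 61 00 01 00 00 00 00
```

The type byte 0x10 (decimal 16) appears to be the same number that rmongodb prints next to integer fields when it displays a BSON object.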

 

The data structure closest to a JSON in R is a list – so naturally we can use that too for specifying a document to be inserted:

list <- list(a=2, b=3, c=list(d=4, e=5))
bson <- mongo.bson.from.list(list)
mongo.insert(m, ns, bson)

Retrieving Data from MongoDB

Now let me show you how to retrieve those two documents and print them. For illustrative purposes I query for documents whose field “a” holds a value greater than or equal to 1:

> json <- '{"a":{"$gte":1}}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> while(mongo.cursor.next(cursor)) {
+   value <- mongo.cursor.value(cursor)
+   list <- mongo.bson.to.list(value)
+   str(list)
+ }

List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 1
  .. ..- attr(*, "mongo.oid")= 
 $ a  : num 1
 $ b  : num 2
 $ c  : Named num [1:2] 3 4
  ..- attr(*, "names")= chr [1:2] "d" "e"

List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 0
  .. ..- attr(*, "mongo.oid")= 
 $ a  : num 2
 $ b  : num 3
 $ c  : Named num [1:2] 4 5
  ..- attr(*, "names")= chr [1:2] "d" "e"

The search query, too, is only superficially a JSON and hence has to be cast to BSON before it is applied. The result is a cursor which has to be iterated and which yields the resulting documents, as BSONs of course. This little ritual of converting to and from BSON feels a bit clumsy at times, but one gets used to it eventually, and of course nothing keeps you from writing a more comfortable wrapper.
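
Such a wrapper could look roughly like the following sketch. The name find_as_lists is hypothetical; it only strings together the rmongodb calls used above, plus mongo.cursor.destroy for cleanup, and assumes an open connection like m. The functions are addressed with the rmongodb:: prefix, so the package has to be installed for the wrapper to actually run.

```r
# Hypothetical convenience wrapper: JSON query in, list of R lists out.
find_as_lists <- function(mongo, ns, json_query) {
  query  <- rmongodb::mongo.bson.from.JSON(json_query)
  cursor <- rmongodb::mongo.find(mongo, ns, query)
  # Release the cursor when the function exits, even on error.
  on.exit(rmongodb::mongo.cursor.destroy(cursor))

  docs <- list()
  while (rmongodb::mongo.cursor.next(cursor)) {
    value <- rmongodb::mongo.cursor.value(cursor)
    docs[[length(docs) + 1]] <- rmongodb::mongo.bson.to.list(value)
  }
  docs
}

# Usage, assuming the connection m and namespace ns from above:
# docs <- find_as_lists(m, ns, '{"a":{"$gte":1}}')
# str(docs)
```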

Implicit Conversion of Sub-Document to Vector

To illustrate my point I added a document to the collection from the MongoDB shell:

db.collection.insert({_id:1, a:{b:2, c:3, d:4, e:5}})

Now what you will see is that rmongodb implicitly casts the lowest sub-document to a named R vector:

> json <- '{"_id":1}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> mongo.cursor.next(cursor)
[1] TRUE

> value <- mongo.cursor.value(cursor)
> list <- mongo.bson.to.list(value)

> print(list)
$`_id`
[1] 1

$a
b c d e 
2 3 4 5 

> str(list)
List of 2
 $ _id: num 1
 $ a  : Named num [1:4] 2 3 4 5
  ..- attr(*, "names")= chr [1:4] "b" "c" "d" "e"

# it is not possible to access the sub-document as a list()
> list$a$b
Error in list$a$b : $ operator is invalid for atomic vectors

> list$a["b"]
b 
2

This is something you have to keep in mind. Personally I find this to be a tad uncomfortable, as a list() would work, too, and would allow for more homogeneous processing. But that is just my two cents.
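
If you prefer lists throughout, one possible workaround is to post-process the result yourself. The sketch below uses only base R; named_vectors_to_lists is a hypothetical helper, not an rmongodb function, and the document is rebuilt by hand so that the example runs without a database. It assumes that every named atomic vector in the result stems from a sub-document.

```r
# Recursively convert named atomic vectors into lists so that `$` works
# at every level. Unnamed vectors and scalars are returned unchanged.
named_vectors_to_lists <- function(x) {
  if (is.list(x)) {
    return(lapply(x, named_vectors_to_lists))
  }
  if (is.atomic(x) && !is.null(names(x))) {
    return(as.list(x))
  }
  x
}

# The document from above, in the shape mongo.bson.to.list() returned it.
doc <- list(`_id` = 1, a = c(b = 2, c = 3, d = 4, e = 5))

fixed <- named_vectors_to_lists(doc)
print(fixed$a$b)  # the sub-document is now a list, so `$` works
```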

Formulating Queries Using Arrays is Problematic

At present it is not possible to easily formulate a BSON containing an array. This is primarily a problem when you want to query documents and the query expression needs an array. For example:

{"$or": [{"a":1}, {"a":3}]}

Let’s assume three documents in database.collection:

> db.collection.find()
{ "_id" : ObjectId("53fe48857f953e7c617eea04"), "a" : 1 }
{ "_id" : ObjectId("53fe48877f953e7c617eea05"), "a" : 2 }
{ "_id" : ObjectId("53fe488a7f953e7c617eea06"), "a" : 3 }

We would expect to receive the two documents whose field a is 1 or 3. But running this query through rmongodb currently yields an error:

> library(rmongodb)
> M <- mongo.create("localhost")
> mongo.is.connected(M)
[1] TRUE
> 
> qry1 <- list(
+     "a" = 1
+ )
> 
> qry2 <- list(
+     "$or" = list(
+         list("a" = 1),
+         list("a" = 3)
+     )
+ )
> 
> qry1 <- mongo.bson.from.list(qry1)
> qry2 <- mongo.bson.from.list(qry2)
> 
> mongo.count(M, "test.xxx", qry1)
[1] 1
> mongo.count(M, "test.xxx", qry2)
[1] -1
> mongo.get.last.err(M, "test")
    connectionId : 16    24
    err : 2      $or needs an array
    code : 16    2
    n : 16   0
    ok : 1   1.000000

Building Up the BSON Buffer from Scratch

Internally rmongodb constructs a BSON for a JSON by constructing an initial BSON buffer and then adding to it the elements of the JSON-document. For the suggested query this would be done as follows:

buf <- mongo.bson.buffer.create()

# "$or":[ ...
mongo.bson.buffer.start.array(buf, "$or")

# dummy name "0" for object in array
# "0": { ...
mongo.bson.buffer.start.object(buf, "0")
# "a":1
mongo.bson.buffer.append.int(buf, "a", 1)
# ... }
mongo.bson.buffer.finish.object(buf)

mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.append.int(buf, "a", 3)
mongo.bson.buffer.finish.object(buf)

# ...] (finish.object closes arrays as well as sub-documents)
mongo.bson.buffer.finish.object(buf)

bson <- mongo.bson.from.buffer(buf)

Obviously this approach is going to get quite complicated and error-prone for even mildly complex queries or documents. But then again, this method of building up a BSON is also quite generic and straightforward. So, to simplify this task, I composed a little package that recursively traverses the list representing the document and invokes those functions appropriately.
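
To illustrate the idea of such a traversal without touching a database, here is a rough base R sketch that walks a nested list and records which buffer operations would be issued, in order. It is not the code of the package; plan_buffer_calls and the shortened call labels are made up for this example. It treats an unnamed list as an array and a named list as a sub-document, and it does not handle empty lists.

```r
# Hypothetical dry run: return the sequence of buffer operations as strings.
plan_buffer_calls <- function(x, calls = character(0)) {
  keys <- names(x)
  is_array <- is.null(keys)
  for (i in seq_along(x)) {
    # Array elements get their zero-based position as a dummy name.
    key <- if (is_array) as.character(i - 1) else keys[i]
    value <- x[[i]]
    if (is.list(value)) {
      opener <- if (is.null(names(value))) "start.array" else "start.object"
      calls <- c(calls, sprintf('%s("%s")', opener, key))
      calls <- plan_buffer_calls(value, calls)
      calls <- c(calls, "finish.object()")
    } else {
      calls <- c(calls, sprintf('append("%s", %s)', key, format(value)))
    }
  }
  calls
}

qry <- list("$or" = list(list(a = 1), list(a = 3)))
plan <- plan_buffer_calls(qry)
writeLines(plan)
```

For the or-query this yields the same order of operations as the hand-written buffer code: start the array, open and close one object per array element, then close the array.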

rmongodbHelper at Your Service

The following code will turn the or-query from a JSON into a BSON:

# install rmongodbHelper package from GitHub

# install.packages("devtools")
library(devtools)

devtools::install_github("joyofdata/rmongodbHelper")
library(rmongodbHelper)

json_qry <- 
'{
  "$or": [
    {"a":1},
    {"a":3}
  ]
}'

# package name spelled as in the library() call above (R is case-sensitive)
bson <- rmongodbHelper::json_to_bson(json_qry)
cur <- mongo.find(M, "dbx.collx", bson)

while(mongo.cursor.next(cur)) {
    print(mongo.cursor.value(cur))
}

And its result:

_id : 7 53fa14315aed8483db4ae794 a : 16 1 
_id : 7 53fa14315aed8483db4ae796 a : 16 3

Why Oh Why?

The reason why I took the time to craft a solution for this issue is twofold:

  1. I don’t program C, so I cannot contribute to rmongodb by fixing the issue myself
  2. The issue seems to have been around for almost a year already

I really hope my solution will become superfluous very soon because R deserves an awesome MongoDB connector and rmongodb is pretty damn awesome already.

Some Details about rmongodbHelper

I answered a couple of questions on stackoverflow.com which provide more code on how to use the package.

Keep a few rules in mind in case you would like to use it:

  • It is version 0.0.0 – so there will be bugs – you are welcome to contribute or complain
  • If you would like to feed it a JSON then keep in mind that all keys have to be placed within double quotes
  • '"x":3' will be cast as double
  • '"x":__int(3)' will be cast as integer
  • Internally arrays and sub-documents are implemented as list()s. They are differentiated by the absence of names() for arrays and the presence of names for sub-documents. An empty array has to contain one list() element named .ARR with an arbitrary value.
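
The distinction between double and integer mirrors R's own: a plain 3 is a double, while 3L is an integer. As a small illustration, the hypothetical helper below maps an R scalar to the BSON type code from bsonspec.org that it naturally corresponds to. These codes (1 for double, 2 for string, 16 for 32-bit integer) seem to be the numbers shown next to the field names in the printed BSON output earlier in this post. The helper is not part of rmongodb or rmongodbHelper.

```r
# Hypothetical helper: BSON type code for a few R scalar types.
bson_type_code <- function(x) {
  switch(typeof(x),
    double    = 1L,   # 0x01: 64-bit floating point
    character = 2L,   # 0x02: UTF-8 string
    integer   = 16L,  # 0x10: 32-bit integer
    stop("type not covered by this sketch: ", typeof(x))
  )
}

print(bson_type_code(3))    # a plain number literal is a double in R
print(bson_type_code(3L))   # the L suffix makes it an integer
print(bson_type_code("x"))
```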

To give an example for the last point:

L <- list(
  obj = list(
    obj = list(), 
    array = list(.ARR=1)
  ), 
  arr = list(
    list(a = 1, b = 1), 
    list("c","d")
  )
) 

# package name spelled as in library(rmongodbHelper) (R is case-sensitive)
bson <- rmongodbHelper::list_to_bson(L)
mongo.insert(M, "database.collection", bson)

Pretty-printed in the MongoDB shell, the resulting document looks as follows:

> use database
> db.collection.find().pretty()
{
    "_id" : ObjectId("540085c4d915dc2b16c1327a"),
    "obj" : {
        "obj" : {

        },
        "array" : [ ]
    },
    "arr" : [
        {
            "a" : 1,
            "b" : 1
        },
        [
            "c",
            "d"
        ]
    ]
}
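
To check your understanding of these conventions without a database at hand, the following base R sketch renders such a list as a JSON string using the same rules: an unnamed list is an array, a named list is a sub-document, an empty list() is an empty sub-document, and a list whose only element is named .ARR is an empty array. The function to_json_sketch is hypothetical and is not how rmongodbHelper is implemented; it assumes scalar leaf values and does no string escaping.

```r
# Hypothetical renderer following the list conventions described above.
to_json_sketch <- function(x) {
  if (is.list(x)) {
    if (length(x) == 0) return("{}")            # empty list(): empty sub-document
    keys <- names(x)
    if (identical(keys, ".ARR")) return("[]")   # marker element: empty array
    items <- vapply(x, to_json_sketch, character(1), USE.NAMES = FALSE)
    if (is.null(keys)) {
      # No names: render as an array.
      return(paste0("[", paste(items, collapse = ","), "]"))
    }
    # Names present: render as a sub-document with quoted keys.
    return(paste0("{", paste0('"', keys, '":', items, collapse = ","), "}"))
  }
  if (is.character(x)) return(paste0('"', x, '"'))
  as.character(x)
}

L <- list(
  obj = list(
    obj = list(),
    array = list(.ARR = 1)
  ),
  arr = list(
    list(a = 1, b = 1),
    list("c", "d")
  )
)

cat(to_json_sketch(L), "\n")
# {"obj":{"obj":{},"array":[]},"arr":[{"a":1,"b":1},["c","d"]]}
```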


(original article published on www.joyofdata.de)
