Vertica doesn’t suit ML or ‘Why I stopped using the Vertica R package’

[This article was first published on R – My contRibution, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

2 Years, are a long time. For a blogger wannabe for sure. 2 years since my last post, a lot has happened and my skills have broadened beyond R, so this R dedicated blog, might not speak about R so much, maybe some other things.

I’ve received some good comments from readers and friends on my contribution to the use of the Vertica R package, I even had an interviewee that shook my hand and said my posts were great help to his team. After using the package extensively to solve a few ML use cases, I must conclude that it is not the tool you would want to use. I’d dedicate this post to explain why in a few points I believe are important.

1 – Lack of community

Product communities are important, I never paid attention to it in the past, but today I suggest to anyone considering a new technology, look into the community of that product, who are the users, and how many are there. How well do they manage Braistorming and tickets. HP’s Vertica is not an open source product, so you would expect a smaller community compared to, say, MySQL, but still there is not much buzz around the search term ‘R and Vertica’.

As of writing this post, on incognito mode, my post from June 2014 is still ranked #1 on google
baby-eme.

As flattering as that may be, it is not a good sign for the product, you’d want to see stackoverflow questions and linkedin groups and a multitude of HP owned documents on the subject. Without a big community of developers, bugs don’t get attention, packages and solutions are not created and you find yourself stuck.

The number of tags in Stackoverflow for Vertica alone, without, filtering for R related issues is 439, vs. ~100K for Spark and ~400K for MySQL, get the picture?

2 -Lack of non-SQL API

Vertica’s great, Vertica’s cool, but, it only has a SQL API. Now, I can say that a LOT can be done with a SQL, and packages like sqlalchemy or frameworks like Apache Spark allow connections via JDBC and then a layer of their own sophistication and programmability, but then Vertica doesn’t really smoothly comply with sqlalchemy and there is no simple connection to Spark.

Also, SQL means that Vertica is stateless, you cannot cache your data, you cannot loop over it, and basically you cannot develop. So you’ll need some wrapping application to call Vertica for you with a JDBC connection – aha! brilliant! But, if you need to write an application that calls it, why not use open source framework like Apache Spark, and have your data in Hadoop instead of paying for Vertica’s license.

3 – Vertica is parallel, but, it doesn’t apply so much to R

Vertica segments or copies it’s data to many nodes and uses novel algorithms to read records quickly from zip, it can do some real magic with most complex queries I’ve ever seen. But when you call R through Vertica, the data gets unzipped then piped into an R process which is:

  1. Limited by local resources.
  2. Processes cannot communicate between or within nodes. You cannot broadcast any data. As you can do with spark.

So what is parallel here? Let’s see an example, assume you have a table with some grouping column ‘G’ and then features columns {x1, x2, ….xp} and an output column ‘y’. You’d like to run some, very general, R model that has the formula:

y~x1+x2+...+xp

for every group in ‘G’.

Vertica will split/shuffle the data by group to all nodes and then pipe it to R for analysis, the R instances (1 per node) will work in a queue until finished then each will return results back to Vertica. Why is this bad? You cannot run a huge regression that is truly parallel with something like BFGS or SGD this way, you are limited by the size of data that the R instance can handle, which will not be in the 100s of GB, sadly it would be much smaller.

To Conclude…

Vertica is still my first choice DBMS for analytics, used and maintained correctly it’s amazing. If your team has need for scalar functions that are dependent on R and are not readily available in Vertica then I recommend to use the R package, otherwise, don’t.

 


To leave a comment for the author, please follow the link and comment on their blog: R – My contRibution.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)