R and Vertica

June 14, 2014

I’ve spent the last few months working my way through the integration of R and Vertica, and I will try to collect here the things I find handy. I’m a little disappointed that there is not much about this Vertica feature on the web. Still, that didn’t stop us from building a scalable statistical model learning machine on top of it (I will write about it in later posts…).

For those of you who don’t know HP Vertica, it is a powerful columnar DBMS. I’ve worked with two installations of it, and my colleagues and I are very impressed with it. If you are familiar with PostgreSQL, you will find many things similar, as both products trace back to the same person, Michael Stonebraker.

That’s all the intro I’m going to give since we pay them and not the other way around 🙂

Before going further, I suggest reading the chapter called “Developing a User Defined Function in R” in the “HP Vertica 6.1.x (or higher) Programmer’s Guide”. I will assume you are familiar with vsql and R.

Short example – normal inverse function in Vertica

Vertica and R communicate via User Defined Functions (UDFs) that are written in the R language. The columns selected in Vertica are passed to R as a data.frame, so the function must treat its main argument as a data.frame. Here is an example that creates an inverse normal CDF function. The argument x is the data.frame passed from Vertica; in this case it is a 3-column data.frame holding the percentile, mean and standard deviation.

```
# normal inverse
# x is the data.frame passed from Vertica: column 1 - percentile, 2 - mean, 3 - sd
norm_inv <- function(x) {
  # apply the inverse normal CDF row by row
  apply(x, 1, function(i) qnorm(p = i[1L], mean = i[2L], sd = i[3L]))
}

# this is the factory function that 'links' between R and Vertica
norm_invFactory <- function() {
  list(
    name = norm_inv,                        # the main function
    udxtype = c("scalar"),
    intype = c("float", "float", "float"),  # percentile, mean, sd
    outtype = c("float")
  )
}
```
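Before loading the file into Vertica, the main function can be exercised in a plain R session. The sketch below assumes Vertica hands over the three selected columns as a numeric data.frame, as described above. The column names and the vectorized variant `norm_inv_vec` are illustrative additions, not part of the original UDF.

```r
# Same main function as above, repeated so this snippet runs on its own.
norm_inv <- function(x) {
  apply(x, 1, function(i) qnorm(p = i[1L], mean = i[2L], sd = i[3L]))
}

# Hypothetical alternative: qnorm() is vectorized, so the columns can be
# passed directly instead of looping over rows.
norm_inv_vec <- function(x) {
  qnorm(p = x[[1L]], mean = x[[2L]], sd = x[[3L]])
}

# Stand-in for the data.frame Vertica would pass (column names are assumptions).
input <- data.frame(
  pct = c(0.25, 0.50, 0.975),
  mu  = c(0, 10, 0),
  sd  = c(1, 2, 1)
)

res_apply <- unname(norm_inv(input))
res_vec   <- norm_inv_vec(input)
print(res_apply)
```

The first row uses the same arguments (.25, 0, 1) as the vsql call in this post, so it is a quick way to confirm that both variants agree before creating the library.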

Save the code above to a file somewhere on your machine (here, /home/dbadmin/norm_inverse.R), then load the function into the Vertica database:

```
MyDB=> CREATE LIBRARY r_func AS '/home/dbadmin/norm_inverse.R' LANGUAGE 'R';

MyDB=> CREATE FUNCTION norm_inv AS NAME 'norm_invFactory' LIBRARY r_func;

MyDB=> select norm_inv(.25,0,1);
norm_inv
--------------------
-0.674489750196082
(1 row)
```

Imagine being able to implicitly parallelize an R function across any number of segments and any amount of data. This is basically the promise behind the R language package for Vertica.

What is parallelized exactly? Vertica allows you to partition the data sent into R, and it implicitly works out how to divide the load between the nodes and the nodes’ cores, so you don’t have to write elaborate code for that. Each R instance runs independently of the others, so you cannot parallelize, say, a single lm() call, but you can run many of them at once, one per partition.
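To make the "many lm() calls at once" idea concrete, here is a minimal local sketch in plain R. It does not use Vertica at all: split() merely stands in for the partitioning Vertica would do, and the data, the column names and the helper `fit_one` are all made up for illustration.

```r
# Toy data: two partitions, each with its own exact linear relationship.
sales <- data.frame(
  grp = rep(c("a", "b"), each = 5),
  x   = rep(1:5, times = 2)
)
sales$y <- ifelse(sales$grp == "a", 1 + 2 * sales$x, 3 - sales$x)

# Hypothetical per-partition worker: fit one lm() and return its coefficients.
fit_one <- function(part) {
  coef(lm(y ~ x, data = part))
}

# split() stands in for the partitioning step; each piece is fitted
# independently, so the pieces could just as well run in separate R instances.
fits <- lapply(split(sales, sales$grp), fit_one)
print(fits)
```

Each element of `fits` depends only on its own partition, which is the property that lets the work be spread across instances without any coordination between them.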

User defined functions are loaded into Vertica with the CREATE LIBRARY syntax, from R code that has two main parts, as you might have noticed above:

1. A main function – does the actual work.
2. A factory function – tells the Vertica engine which main function to call and what input and output types to expect.
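As a rough sanity check before running CREATE LIBRARY, the shape of a factory can be verified in plain R. The helper `check_factory` below is hypothetical and not part of Vertica; it only checks for the four fields used in this post (name, udxtype, intype, outtype), and `add_one` is a made-up main function.

```r
# Hypothetical helper: checks that a factory returns the four fields used in this post.
check_factory <- function(factory) {
  spec <- factory()
  required <- c("name", "udxtype", "intype", "outtype")
  missing_fields <- setdiff(required, names(spec))
  if (length(missing_fields) > 0) {
    stop("factory is missing: ", paste(missing_fields, collapse = ", "))
  }
  # In the examples here, 'name' holds the function object, not its name as a string.
  if (!is.function(spec$name)) {
    stop("'name' must be the main function itself, not a string")
  }
  invisible(TRUE)
}

# Minimal main/factory pair in the same shape as the examples in this post.
add_one <- function(x) x[[1L]] + 1
add_oneFactory <- function() {
  list(
    name = add_one,
    udxtype = c("scalar"),
    intype = c("float"),
    outtype = c("float")
  )
}

check_factory(add_oneFactory)
```

This catches typos in field names locally; it says nothing about whether Vertica accepts the declared types.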

source() example

R’s source() function can be used inside an R UDF in Vertica. This is very useful for large projects with a lot of code, because you can change the sourced code without touching the main functions. Here is how.

Create an R file called ‘foo.R’ that defines a variable called ‘bar’.

```
## Start of R code
bar <- "we've done it!"
## End of R code
```

Create the main and factory functions in a second R file (sourceTest.r, the name used when creating the library):

```
# Main function
sourceTest <- function(x)
{
  # a relative path is resolved against the working directory of the R process
  source('foo.R')
  return(bar)
}

# Factory function
sourceTestFactory <- function()
{
  list(
    name = sourceTest,
    udxtype = c("scalar"),
    intype = c("char"),
    outtype = c("char")
  )
}
```

In Vertica, run:

```
MyDB=> CREATE LIBRARY RsourceTest AS 'sourceTest.r' LANGUAGE 'R';
MyDB=> CREATE FUNCTION sourceTest AS NAME 'sourceTestFactory' LIBRARY RsourceTest;
MyDB=> SELECT sourceTest('');
sourceTest
--------------------
we've done it!
(1 row)
```
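The same pattern can be dry-run in a plain R session, outside Vertica. The sketch below is an assumption-laden variant: it writes a throwaway foo.R into a temporary directory, and the hypothetical `source_test_local` takes the script path as an argument and sources it with `local = TRUE`, so that `bar` stays inside the function instead of landing in the global environment.

```r
# Write a throwaway foo.R into a temporary directory (paths are assumptions).
work_dir <- tempfile("source_test_")
dir.create(work_dir)
writeLines("bar <- \"we've done it!\"", file.path(work_dir, "foo.R"))

# Variant of sourceTest that takes the script path explicitly and sources it
# into the function's own environment rather than the global one.
source_test_local <- function(x, script) {
  source(script, local = TRUE)
  bar
}

# The data.frame argument mimics the single char column passed in the vsql call.
result <- source_test_local(data.frame(dummy = ""), file.path(work_dir, "foo.R"))
print(result)

# Clean up the temporary directory.
unlink(work_dir, recursive = TRUE)
```

Passing an explicit path also sidesteps the question of which working directory the R process happens to run in.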

Pre-requisites for the R language pack

The Vertica Programmer’s Guide instructs you to install a version of libgfortran. Beyond that, you may notice that packages that rely on gfortran, such as “xts”, require some additional compilers:

```
yum install gcc-gfortran
yum install gcc-c++
```

Installing the ‘forecast’ package for Vertica

The ‘forecast’ package, by Rob J Hyndman, is great for many kinds of time series analysis. Connecting it with Vertica is a very powerful way to create forecasts from your data. The trouble is that the R version used in Vertica is 3.0.0, so you will have to get an older version of forecast, 4.8. In short:

```
wget https://stat.ethz.ch/CRAN/contrib/main/Archive/forecast/forecast_4.8.tar.gz
```

And then in R:

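```
# dependencies from CRAN
install.packages("tseries")
install.packages("fracdiff")
# local source tarball: repos = NULL tells R not to look for it on CRAN
install.packages("/root/forecast/forecast_4.8.tar.gz", repos = NULL, type = "source")
```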
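Whether an archived release is needed comes down to comparing the running R version with the minimum a package release declares in its DESCRIPTION file. The helper below is a hypothetical sketch of that comparison using base R’s compareVersion(); the "3.0.2" is a placeholder minimum, not forecast’s actual requirement.

```r
# Hypothetical helper: TRUE when the running R is older than a package's stated minimum.
needs_archived_release <- function(required, running = as.character(getRversion())) {
  compareVersion(running, required) < 0
}

# "3.0.2" is a placeholder minimum, not forecast's actual requirement;
# "3.0.0" is the R version this post reports for Vertica.
too_old <- needs_archived_release(required = "3.0.2", running = "3.0.0")
print(too_old)
```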

Voila. Next time I’ll show some more elaborate examples of transform functions that I use.
