Converting Existing R Scripts to ORE – Getting Started

May 28, 2013

(This article was first published on Oracle R Enterprise, and kindly contributed to R-bloggers)

Oracle R Enterprise provides a comprehensive,
database-centric environment for end-to-end analytical processes in R, with
immediate deployment to production environments. This message really resonates
with our customers who are interested in executing R functions on
database-resident data while seamlessly leveraging Oracle Database as a high-performance
computing (HPC) environment. The ability to develop and operationalize R
scripts for analytical applications in one step is quite appealing.

One frequently asked question is how to convert existing R code that accesses data in flat files or the database to use Oracle R Enterprise. In this blog post, we discuss a few scenarios and how to begin converting existing R code to use Oracle R Enterprise.

Consider the following scenarios:

Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where data is dynamically generated, or perhaps accessed from a URL on the internet.

Scenario 2: An R script that loads data from a flat file
such as a CSV file, performs some computations in R, and then writes the result
back to a file.

Scenario 3: An R script that loads data from a database table via one of the database connector packages, such as RODBC, RJDBC, or ROracle, and writes a result back to the database using SQL statements or package functions.

Scenario 1

A stand-alone R script might normally be run on a user’s
desktop, invoked as a cron job, or even via Java to spawn an R engine and
retrieve the result, but we’d like to operationalize its execution as part of a
database application, invoked from SQL. Here’s a simple script to illustrate
the concept of converting such a script to be executed at the database server
using ORE’s embedded R execution. The script generates a data.frame with some random columns, runs summary on that data, and returns the summary statistics, which are represented as an R table object.

# generate data
set.seed(1)
n <- 1000
df <- 3
x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis
res <- summary(x)

# return the result
res

To convert this to use ORE, create a function with
appropriate arguments and body, for example:

myFunction1 <- function(n = 1000, df = 3, seed = 1) {
  set.seed(seed)
  x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))
  res <- summary(x)
  res
}

Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Setting the all argument to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository and invoke it via ore.doEval.

# load ORE packages and connect to Oracle Database
library(ORE)
ore.connect("schema","sid","hostname","password",port=1521, all=TRUE)

# load function into R script repository
ore.scriptDrop("myFunction-1")
ore.scriptCreate("myFunction-1", myFunction1)

# invoke using embedded R execution at the database server
ore.doEval(FUN.NAME="myFunction-1")

> ore.doEval(FUN.NAME="myFunction-1")
       a                b                  c           
 Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449  
 1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386  
 Median : 500.5   Median :-0.03532   Median : 2.36454  
 Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924  
 3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994  
 Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720  

Of course, we’re using default values here. To provide different arguments, change the invocation as follows:

ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
       a               b                  c          
 Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621  
 1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144  
 Median :250.5   Median : 0.04392   Median : 4.4592  
 Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386  
 3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467  
 Max.   :500.0   Max.   : 2.88842   Max.   :17.0367  

Having successfully invoked this from the R client (my
laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which
is an R table, as an XML string.

select *
from table(rqEval(NULL, 'XML', 'myFunction-1'));

The result can be viewed from SQL Developer, where the XML output can also be displayed in a more structured manner.

What if we wanted to get the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert it to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above, most notably converting the table object in res to a data.frame. There are a variety of ways to do this; one alternative is sketched below, after the function is registered.

myFunction2 <- function(n = 1000, df = 3, seed = 1) {
  # generate data
  set.seed(seed)
  x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

  # perform some analysis
  res <- summary(x)

  # convert the table result to a data.frame
  res.df <- as.matrix(res)
  res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
  names(res.sum) <- c('a','b','c')
  res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
  res.sum <- res.sum[,c(4,1:3)]
  res.sum
}

# load function into R script repository
ore.scriptDrop("myFunction-2")
ore.scriptCreate("myFunction-2", myFunction2)
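As an aside, the fixed-width substr parsing above is only one way to make the conversion. A hypothetical alternative, a sketch not taken from the original post, computes each statistic directly on the numeric columns of x inside the function body:

# sketch: compute the statistics directly instead of parsing summary()'s
# fixed-width character output
stats <- function(v) unname(c(min(v), quantile(v, 0.25), median(v),
                              mean(v), quantile(v, 0.75), max(v)))
res.sum <- data.frame(statname = c("min","1stQ","median","mean","3rdQ","max"),
                      sapply(x, stats))

This avoids any dependence on the character positions summary uses when formatting its output.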

We’ll now modify the SQL statement to specify the format of
the result.

select *
from table(rqEval(NULL,
                  'select cast(''a'' as VARCHAR2(12)) as "statname",
                          1 "a", 1 "b", 1 "c" from dual',
                  'myFunction-2'));

Here’s the result as viewed from SQL Developer.

This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particularly useful in combination with OBIEE dashboards via an RPD.

Scenario 2

If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following, which builds a model and writes that model to a file for future use, perhaps in scoring. It also generates a graph of the clusters highlighting the individual points, colored by their cluster id, with the centroids indicated by a star.

# read data
setwd("D:/datasets")
dat <- read.csv("myDataFile.csv")

# build a clustering model
cl <- kmeans(dat, 2)

# write model to file
save(cl, file="myClusterModel.dat")

# create a graph and write it to a file
pdf("myGraphFile.pdf")
plot(dat, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)
dev.off()

The resulting PDF file contains the cluster plot described above.

To convert this script for use in ORE, there are several options.
We’ll explore two: the first involving minimal change to use embedded R
execution, and the second leveraging in-database techniques. First, we’ll want
the data we used above in variable dat
to be loaded into the database.

# create a row id to enable ordered results (if a key doesn't already exist)
dat$ID <- 1:nrow(dat)

# remove the table if it exists
ore.drop("MY_DATA")

# create the table using the R data.frame, resulting in an ore.frame named MY_DATA
ore.create(dat, "MY_DATA")

# assign the ID column as the row.names of the ore.frame
row.names(MY_DATA) <- MY_DATA$ID

In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worrying about backup and recovery, we store the model in the database R datastore using ore.save. All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name, both from R and SQL. In this example, we simply provide the function itself as an argument.

myClusterFunction1 <- function(x) {
  cl <- kmeans(x, 2)
  ore.save(cl, name="myClusterModel", overwrite=TRUE)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:2, pch = 8, cex=2)
  TRUE
}

ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1,
               ore.connect=TRUE, ore.png.height=700, ore.png.width=700)

The ore.tableApply
function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using
the R datastore, which requires a database connection. Optionally, we can
specify control arguments to the PNG output. In this example, these are the height
and width of the image.
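Because the model now lives in the database R datastore, a later R session connected to the same schema can restore it by name with ore.load. A minimal sketch:

# restore the saved model from the R datastore in a new session (sketch)
ore.load(name="myClusterModel")   # recreates the object cl in the workspace
cl$centers                        # the centroids are available for reuse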

For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusterModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (the data may be very large, or the assignments may not be required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the nodes requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroid points are obtained from cl$centers2 of the cluster model.

# build a clustering model in-database
cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)

# save model in database R datastore
ore.save(cl, name="myClusterModel", overwrite=TRUE)

# generate predictions to assign each row a cluster id, supplemented with original data
pred <- predict(cl, MY_DATA, supp=c('x','y','ID'), type="class")

# assign row names to ensure ordering of results
row.names(pred) <- pred$ID

# create the graph
plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))
points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2)

We can also combine the two approaches, using the transparency layer within an embedded R function. We’ll leave that as an exercise to the reader.
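For readers wanting a head start on that exercise, one possible outline is sketched below; it is not code from the original post, and it assumes ore.sync and ore.get are used to make the table visible inside the embedded session:

# sketch: transparency layer calls inside an embedded R function
myClusterFunction2 <- function() {
  ore.sync(table="MY_DATA")      # expose the database table in this session
  dat <- ore.get("MY_DATA")      # obtain it as an ore.frame
  cl <- ore.odmKMeans(~., dat, 2, auto.data.prep=FALSE)
  ore.save(cl, name="myClusterModel", overwrite=TRUE)
  TRUE
}
ore.doEval(myClusterFunction2, ore.connect=TRUE)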

Scenario 3

In this last scenario, the data already exists in the database, and one of the database interface packages, such as RODBC, RJDBC, or ROracle, is used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same holds for the other two packages.

# connect to the database
library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, "mySchema", "myPassword")

# retrieve the data specifying a SQL query
dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100')

# perform some analysis
res <- summary(dat)

# convert the table result to a data.frame for output as a table
res.df <- as.matrix(res)
res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
names(res.sum) <- c('a','b','c')
res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
res.sum <- res.sum[,c(4,1:3)]
res.sum

# write the result back to the database
dbWriteTable(con, "SUMMARY_STATS", res.sum)

Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from the previous scenarios, so the existing table MY_RANDOM_DATA was already loaded into the environment as an ore.frame. Executing ore.ls shows this table among the results, so we can just start using it.

> ore.ls(pattern="MY_RAND")
[1] "MY_RANDOM_DATA"

# no need to retrieve the data; use the transparency layer to compute the summary
res <- with(MY_RANDOM_DATA, summary(MY_RANDOM_DATA[a > 100,]))

# convert the table result to a data.frame for output as a table
res.df <- as.matrix(res)
res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
names(res.sum) <- c('a','b','c')
res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
res.sum <- res.sum[,c(4,1:3)]

# create the database table
ore.create(res.sum, "SUMMARY_STATS")
SUMMARY_STATS

As we did in previous scenarios, this script can also be
wrapped in a function and used in embedded R execution. This too is left as an
exercise to the reader.
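As a hint, a rough, hypothetical sketch of that wrapping might look like the following, again assuming ore.sync and ore.get for table access inside the embedded session:

# sketch: the Scenario 3 conversion wrapped for embedded R execution
mySummaryFunction <- function(threshold=100) {
  ore.sync(table="MY_RANDOM_DATA")
  dat <- ore.get("MY_RANDOM_DATA")
  res.df <- as.matrix(summary(dat[dat$a > threshold,]))
  res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
  names(res.sum) <- c('a','b','c')
  res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
  res.sum[,c(4,1:3)]
}
ore.doEval(mySummaryFunction, threshold=100, ore.connect=TRUE)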

Summary

As you can see from the three scenarios discussed here, converting
a script that accesses no external data, accesses and manipulates file data, or
accesses and manipulates database data can be accomplished with a few strategic
modifications. More involved scripts, of course, may require additional
manipulation. For example, if the SQL query performs complex joins and
filtering, along with derived column creation, the user may want to convert
this SQL to the corresponding ORE Transparency Layer code, thereby eliminating
reliance on SQL. But that’s a topic for another post.
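To give a flavor of that conversion, here is a tiny hedged illustration with two hypothetical ore.frames T1 and T2 sharing a key column ID; a SQL join with a filter and a derived column might become:

# hypothetical sketch: join, filter, and derive a column via the transparency layer
joined <- merge(T1, T2, by="ID")      # in-database join of two ore.frames
joined <- joined[joined$a > 100,]     # in-database filter
joined$ratio <- joined$b / joined$c   # derived column, computed in the database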
