# Converting Existing R Scripts to ORE – Getting Started

May 28, 2013
By

(This article was first published on Oracle R Enterprise, and kindly contributed to R-bloggers)

Oracle R Enterprise provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. This message really resonates with our customers who are interested in executing R functions on database-resident data while seamlessly leveraging Oracle Database as a high-performance computing (HPC) environment. The ability to develop and operationalize R scripts for analytical applications in one step is quite appealing.

One frequently asked question is how to convert existing R code that access data in flat files or the database to use Oracle R Enterprise. In this blog post, we talk about a few scenarios and how to begin a conversion from existing R code to using Oracle R Enterprise.

Consider the following scenarios:

Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where dadta is dynamically generated, or perhaps access from a URL on the internet.

Scenario 2: An R script that loads data from a flat file such as a CSV file, performs some computations in R, and then writes the result back to a file.

Scenario 3: An R script that loads data from a database table, via one of the database connector packages like RODBC, RJDBC, or ROracle, and writes a result back to the database –using SQL statements or package functions.

Scenario 1

A stand-alone R script might normally be run on a user’s desktop, invoked as a cron job, or even via Java to spawn an R engine and retrieve the result, but we’d like to operationalize its execution as part of a database application, invoked from SQL. Here’s a simple script to illustrate the concept of converting such a script to be executed at the database server using ORE’s embedded R execution. The script generates a data.frame with some random columns, performs summary on that data and returns the summary statistics, which are represented as an R table.

# generate data

set.seed(1)

n <- 1000

df <- 3

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

#return the result

res

To convert this to use ORE, create a function with appropriate arguments and body, for example:

myFunction1 <- function (n = 1000, df = 3,seed=1) {

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

res <- summary(x)

res

}

Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Using the all argument set to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository, invoking it via ore.doEval.

# load ORE packages and connect to Oracle Database

library(ORE)

# load function into R script repository

ore.scriptDrop("myFunction-1")

ore.scriptCreate("myFunction-1", myFunction1)

# invoke using embedded R execution at the database server

ore.doEval(FUN.NAME="myFunction-1")

> ore.doEval(FUN.NAME="myFunction-1")
       a                b                  c
 Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449
 1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386
 Median : 500.5   Median :-0.03532   Median : 2.36454
 Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924
 3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994
 Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720

Of course, we’re using default values here. To provide different arguments, change the invocation with arguments as follows:

ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
       a               b                  c
 Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621
 1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144
 Median :250.5   Median : 0.04392   Median : 4.4592
 Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386
 3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467
 Max.   :500.0   Max.   : 2.88842   Max.   :17.0367

Having successfully invoked this from the R client (my laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which is an R table, as an XML string.

select *

from table(rqEval( NULL,'XML','myFunction-1'));

The result can be viewed from SQL Developer.

The following shows the XML output in a more structured manner.

What if we wanted to get the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert this to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above. Most notably is the need to convert the table object in res to a data.frame. There are a variety of ways to do this.

myFunction2 <- function (n = 1000, df = 3,seed=1) {

# generate data

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

# convert the table result to a data.frame

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max") res.sum <- res.sum[,c(4,1:3)] res.sum } # load function into R script repository ore.scriptDrop("myFunction-2") ore.scriptCreate("myFunction-2", myFunction2) We’ll now modify the SQL statement to specify the format of the result. select * from table(rqEval( NULL,'select cast(''a'' as VARCHAR2(12)) as "statname", 1 "a", 1 "b", 1 "c" from dual ','myFunction-2')); Here’s the result as viewed from SQL Developer. This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particular useful in combination with OBIEE dashboards via an RPD. Scenario 2 If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following, where it specifies to builds a model and write hat model to a file for future use, perhaps in scoring. It also generates a graph of the clusters highlighting the individual points, colored by their cluster id, with the centroids indicated with a star. # read data setwd("D:/datasets") dat <- read.csv("myDataFile.csv") # build a clustering model cl <- kmeans(x, 2) # write model to file save(cl, file="myClusterModel.dat") # create a graph and write it to a file pdf("myGraphFile.pdf") plot(x, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2) dev.off() The resulting PDF file contains the following image. To convert this script for use in ORE, there are several options. We’ll explore two: the first involving minimal change to use embedded R execution, and the second leveraging in-database techniques. First, we’ll want the data we used above in variable dat to be loaded into the database. # create a row id to enable ordered results (if a key doesn’t already exist) dat$ID <- 1:nrow(dat)

# remove the table if it exists

ore.drop("MY_DATA")

# create the table using the R data.frame, resulting in an ore.frame named MY_DATA

ore.create(dat,"MY_DATA")

# assign the ID column as the row.names of the ore.frame

row.names(MY_DATA) <- MY_DATA$ID In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worring about backup and recovery, we store the model in the database R datastore using ore.save. All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name – both from R and SQL. In this example, we simply provide the function itself as argument. myClusterFunction1 <- function(x) { cl <- kmeans(x, 2) ore.save(cl, name="myClusterModel",overwrite=TRUE) plot(x, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2) TRUE } ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1, ore.connect=TRUE,ore.png.height=700,ore.png.width=700) The ore.tableApply function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using the R datastore, which requires a database connection. Optionally, we can specify control arguments to the PNG output. In this example, these are the height and width of the image. For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusteringModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (since the data may be very large or are not required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the nodes requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroids points are obtained from cl$centers2 of the cluster model.

# build a clustering model in-database

cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)

# save model in database R datastore

ore.save(cl,name="myClusterModel",overwrite=TRUE)

# generate predictions to assign each row a cluster id, supplement with original data

pred <- predict(cl,MY_DATA,supp=c('x','y','ID'),type="class")

# assign row names to ensure ordering of results

row.names(pred) <- pred$ID # create the graph plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))

points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2) We can also combine using the transparency layer within an embedded R function. But we’ll leave that as an exercise to the reader. Scenario 3 In this last scenario, the data already exists in the database and one of the database interface packages, such as RODBC, RJDBC, and ROracle is be used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same holds for the other two packages. # connect to the database drv <- dbDriver("Oracle") con <- dbConnect(drv, "mySchema", "myPassword") # retrieve the data specifying a SQL query dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100') # perform some analysis res <- summary(dat) # convert the table result to a data.frame for output as table res.df <- as.matrix(res) res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3)) names(res.sum) <- c('a','b','c') res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

dbWriteTable(con, "SUMMARY_STATS", res.sum)

Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from previous scenarios, so the existing table MY_RANDOM_DATA was already loaded in the environment as an ore.frame. Executing ore.ls lists this table is the result, so we can just start using it.

> ore.ls(pattern="MY_RAND")

[1] "MY_RANDOM_DATA"

# no need to retrieve the data, use the transparency layer to compute summary

res <- with(MY_RANDOM_DATA , summary(MY_RANDOM_DATA[a > 100,]))

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum\$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

# create the database table

ore.create(res.sum, "SUMMARY_STATS")

SUMMARY_STATS

As we did in previous scenarios, this script can also be wrapped in a function and used in embedded R execution. This too is left as an exercise to the reader.

Summary

As you can see from the three scenarios discussed here, converting a script that accesses no external data, accesses and manipulates file data, or accesses and manipulates database data can be accomplished with a few strategic modifications. More involved scripts, of course, may require additional manipulation. For example, if the SQL query performs complex joins and filtering, along with derived column creation, the user may want to convert this SQL to the corresponding ORE Transparency Layer code, thereby eliminating reliance on SQL. But that’s a topic for another post.