Introduction to Oracle R Connector for Hadoop


MapReduce, the heart of Hadoop, is a programming framework that enables massive scalability across servers using data stored in the Hadoop Distributed File System (HDFS). The Oracle R Connector for Hadoop (ORCH) provides access to a Hadoop cluster from R, enabling manipulation of HDFS-resident data and the execution of MapReduce jobs.

Conceptually, MapReduce is similar to a combination of apply operations in R, or GROUP BY in Oracle Database: transform the elements of a list or table, compute a grouping key, and apply a function to each group. The value of MapReduce in ORCH is that it extends this pattern beyond a single process to parallel processing on modern architectures: multiple cores, processes, machines, clusters, data appliances, or clouds.
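As a rough single-process analogy, here is what the same group-and-summarize pattern looks like in plain base R (illustration only, not ORCH code):

# "Map" each row to a (key, value) pair by destination, then "reduce"
# each key's values with a summary function, all within one R process.
flights <- data.frame(DEST = c("SFO", "LAX", "SFO", "LAX"),
                      ARRDELAY = c(12, 5, 30, -3))
tapply(flights$ARRDELAY, flights$DEST, mean)   # mean arrival delay per destination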

ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters. R users write mapper and reducer functions in R and execute MapReduce jobs from the R environment using a high-level interface. As such, R users are not required to learn a new language (e.g., Java) or a new environment (e.g., cluster software and hardware) to work with Hadoop. Moreover, functionality from open source R packages can be used when writing mapper and reducer functions. ORCH also gives R users the ability to test their MapReduce programs locally, using the same function call, before deploying them on the Hadoop cluster; a sketch of such a session appears below.
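For example, a development session might toggle local execution before submitting to the cluster. This is only a sketch, assuming the orch.dryrun() switch described in the ORCH documentation; verify the exact call in your release:

# Assumed ORCH dry-run switch: run hadoop.run() locally while developing
orch.dryrun(TRUE)     # subsequent jobs execute on the local host
# ... develop and test mapper/reducer functions here ...
orch.dryrun(FALSE)    # subsequent jobs execute on the Hadoop cluster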


In the following example, we use the ONTIME_S data set, which is typically installed in Oracle Database along with Oracle R Enterprise. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.
We're providing a relatively large sample data set (220K rows), but this example could also be performed in ORCH on the full data set, which contains 123 million rows and requires 12 GB of disk space. That is significantly larger than R can process on its own on a typical laptop with 8 GB of RAM.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
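Because ONTIME_S is an ore.frame, ordinary R functions can be used to inspect it through the Oracle R Enterprise transparency layer. A minimal sketch, with placeholder connection details:

library(ORE)
ore.connect(user = "rquser", sid = "orcl", host = "dbhost",
            password = "welcome1", all = TRUE)   # placeholder credentials
dim(ONTIME_S)                              # row and column counts, computed in the database
head(ONTIME_S[, c("DEST", "ARRDELAY")])    # pull only the first few rows into R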

ORCH includes functions for manipulating HDFS data. Users can move data between HDFS and the file system, R data frames, and Oracle Database tables and views. This next example shows one such function, hdfs.push, which accepts an ore.frame object as its first argument, followed by the name of the key column, and then the name of the file to be used within HDFS.

ontime.dfs_DB <- hdfs.push(ONTIME_S,
                           key = 'DEST',
                           dfs.name = 'ontime_DB')
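The same pattern applies to in-memory R data frames. A sketch, assuming hdfs.put, hdfs.ls, and hdfs.get behave as documented for the installed ORCH release:

# Assumed ORCH calls: copy an R data frame to HDFS and read it back
df <- data.frame(DEST = c("SFO", "LAX"), ARRDELAY = c(10, 2))
ontime.dfs_R <- hdfs.put(df, key = 'DEST', dfs.name = 'ontime_R')
hdfs.ls()                      # list objects in the current HDFS directory
head(hdfs.get(ontime.dfs_R))   # retrieve the HDFS data as an R data frame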

The following R script example illustrates how users can attach to an existing HDFS file object, essentially getting a handle to the HDFS file.
Then, using the hadoop.run function in ORCH, we specify the HDFS file handle, followed by the mapper and reducer functions. The mapper function takes the key and value as arguments, which correspond to one row of data at a time from the HDFS block assigned to the mapper. The function keyval in the mapper returns data to Hadoop for further processing by the reducer.

The reducer function receives all the values associated with one key (resulting from the “shuffle and sort” of Hadoop processing). The result of the reducer is also returned to Hadoop using the keyval function. The results of the reducers are consolidated in an HDFS file, which can be obtained using the hdfs.get function.

The following example computes the average arrival delay for flights whose destination is San Francisco International Airport (SFO). The mapper selects the rows for SFO, and the reducer computes the mean arrival delay.

dfs <- hdfs.attach("ontime_DB")

res <- hadoop.run(
        dfs,
        mapper = function(key, value) {
          # Emit only rows whose destination is SFO and whose arrival delay is known
          if (key == 'SFO' && !is.na(value$ARRDELAY)) {
            keyval(key, value)
          } else {
            NULL
          }
        },
        reducer = function(key, values) {
          # Accumulate the arrival delays for this key and return their mean
          sumAD <- 0
          count <- 0
          for (x in values) {
            sumAD <- sumAD + x$ARRDELAY
            count <- count + 1
          }
          res <- sumAD / count
          keyval(key, res)
        })

> hdfs.get(res)
   key     val1
1  SFO   17.44828
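As a sanity check, the same statistic can be computed directly against the database-resident table through the Oracle R Enterprise transparency layer (a sketch, assuming ONTIME_S is still attached as an ore.frame); the result should agree with the MapReduce output above:

# In-database computation of the mean arrival delay for SFO
sfo <- ONTIME_S[ONTIME_S$DEST == 'SFO' & !is.na(ONTIME_S$ARRDELAY), ]
mean(sfo$ARRDELAY)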

Oracle R Connector for Hadoop is part of the Oracle Big Data Connectors software suite and is supported for Oracle Big Data Appliance and Oracle R Enterprise customers. We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, and Oracle R Connector for Hadoop. We welcome comments and questions on the Oracle R Forum.




