Real-time model scoring for streaming data – a prototype based on Oracle Stream Explorer and Oracle R Enterprise

Posted on March 30, 2016 by Alexandru Ardel-Oracle in R bloggers

[This article was first published on Oracle R Technologies, and kindly contributed to R-bloggers].

Whether applied to manufacturing, financial services, energy, transportation, retail, government, security or other domains, real-time analytics is an umbrella term which covers a broad spectrum of capabilities (data integration, analytics, business intelligence) built on streaming input from multiple channels. Examples of such channels are: sensor data, log data, market data, click streams, social media and monitoring imagery.

Key metrics separating real-time analytics from more traditional, batch, off-line analytics are latency and availability. At one end of the analytics spectrum are complex, long-running batch analyses with slow response times and low availability requirements. At the other end are real-time, lightweight analytic applications with fast response times (on the order of milliseconds) and high availability (99.99..%). Another distinction is between the capability to respond to individual events and/or ordered sequences of events and the capability to handle only collections of events in micro-batches, without preserving their order. The complexity of the analysis performed on the real-time data is also a big differentiator: capabilities range from simple filtering and aggregations to complex predictive procedures. The level of integration between model generation and model scoring also needs to be considered for real-time applications. Machine learning algorithms specially designed for online model building exist and are offered by some streaming data platforms, but their number is small. Practical solutions can be built by combining an off-line model generation platform with a data streaming platform augmented with scoring capabilities.

In this blog we describe a new prototype for real-time analytics that integrates two components: Oracle Stream Explorer (OSX) and Oracle R Enterprise (ORE). Examples of target applications for this type of integration are equipment monitoring through sensors, and anomaly detection and failure prediction for large systems made of a high number of components.

The basic architecture is illustrated below:

[Figure: basic architecture]

ORE is used for model building in batch mode, at low frequency. OSX handles the high-frequency streams: it pushes data toward a scoring application, performs predictions in real time, and returns results to consumer applications connected to the output channels.

OSX is a middleware platform for developing streaming data applications. These applications monitor and process large amounts of streaming data in real time, from a multitude of sources like sensors, social media, financial feeds, etc. Readers unfamiliar with OSX should visit Getting Started with Event Processing for OSX.

In OSX, streaming data flows into, through, and out of an application. The applications can be created, configured and deployed with pre-built components provided with the platform or built from customized adapters and event beans. The application in this case is a custom scoring application for real time data. A thorough description of the application building process can be found in the following guide: Developing Applications for Event Processing with Oracle Stream Explorer.

In our solution prototype for streaming analytics, the model exchange between ORE and OSX is realized by converting the R models to a PMML representation. After that, JPMML – the Java Evaluator API for PMML – is leveraged for reading the model and building a custom OSX scoring application.
The end-to-end workflow is represented below:

[Figure: end-to-end workflow]

The subsequent sections of this blog summarize the essential aspects.

Model Generation

As previously stated, the use cases targeted by this OSX-ORE integration prototype application consist of systems made of a large number of different components. Each component type is abstracted by a different model. We leverage ORE’s Embedded R Execution capability for data and task parallelism to generate a large number of models concurrently. This is accomplished for example with ore.groupApply():

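# Build one model per group, with up to np groups processed in parallel
res <- ore.groupApply(
   X=…,                                 # input data
   INDEX=…,                             # column(s) defining the groups
   function(dat,frml) {mdl<-...},       # builds the model for one group
   …,
   parallel=np)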

Model representation in PMML

The model transfer between the model generator and the scoring engine is enabled by conversion to a PMML representation. PMML is a mature, XML-based standard for model exchange. A model in PMML format is represented by a collection of XML elements, or PMML components, which completely describe the modeling flow. For example, the Data Dictionary component contains the definitions for all fields used by the model (attribute types, value ranges, etc.), the Data Transformations component describes the mapping functions between the raw data and its desired form for the modeling algorithms, and the Mining Schema component assigns the active and target variables and enumerates the policies for missing data, outliers, and so on. Besides the specifications for the data mining algorithms and their accompanying pre- and post-processing steps, PMML can also describe more complex modeling concepts like model composition, model hierarchies, model verification and field scoping. To find out more about PMML’s structure and functionality, go to General Structure. PMML representations have been standardized for several classes of data mining algorithms. Details are available at the same location.
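To make the component structure concrete, here is a minimal sketch in plain Java that reads the Data Dictionary and Mining Schema entries from a PMML document using only the JDK's XML parser. The PMML fragment, its field names (temperature, failure) and the class name PmmlStructureSketch are made up for illustration; the fragment is deliberately incomplete and is not a full, valid model.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PmmlStructureSketch {

    // Hypothetical, hand-written PMML fragment (not a complete model).
    static final String SAMPLE_PMML =
        "<PMML version=\"4.2\" xmlns=\"http://www.dmg.org/PMML-4_2\">"
      + "<DataDictionary numberOfFields=\"2\">"
      + "<DataField name=\"temperature\" optype=\"continuous\" dataType=\"double\"/>"
      + "<DataField name=\"failure\" optype=\"categorical\" dataType=\"string\"/>"
      + "</DataDictionary>"
      + "<RegressionModel functionName=\"classification\">"
      + "<MiningSchema>"
      + "<MiningField name=\"temperature\" usageType=\"active\"/>"
      + "<MiningField name=\"failure\" usageType=\"target\"/>"
      + "</MiningSchema>"
      + "</RegressionModel>"
      + "</PMML>";

    // For every element with the given tag, map its "name" attribute
    // to the value of the requested attribute.
    static Map<String, String> attributeByFieldName(String xml, String tag, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tag);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                result.put(element.getAttribute("name"), element.getAttribute(attribute));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the PMML document", e);
        }
    }

    public static void main(String[] args) {
        // Data Dictionary: field name -> data type
        System.out.println(attributeByFieldName(SAMPLE_PMML, "DataField", "dataType"));
        // Mining Schema: field name -> role in the model
        System.out.println(attributeByFieldName(SAMPLE_PMML, "MiningField", "usageType"));
    }
}
```

A scoring engine needs exactly this kind of information (which fields exist, their types, and which ones are inputs or targets) before it can evaluate a model. In practice a PMML library would do this parsing.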

PMML in R

In R the conversion/translation to PMML formats is enabled through the pmml package. The following algorithms are supported:

ada (ada)
arules (arules)
coxph (survival)
glm (stats)
glmnet (glmnet)
hclust (stats)
kmeans (stats)
ksvm (kernlab)
lm (stats)
multinom (nnet)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)
rfsrc (randomForestSRC)
rpart (rpart)
svm (e1071)

The r2pmml package offers complementary support for

gbm (gbm)
train (caret)

and a much better (performance-wise) conversion to PMML for randomForest. Check the details at converting_randomforest.

The conversion to PMML is done via the pmml() generic function, which dispatches the appropriate method for the supplied model depending on its class.

library(pmml)
library(randomForest)
mdl <- randomForest(...)   # fit the model
pmmdl <- pmml(mdl)         # convert the fitted model to a PMML (XML) object

Exporting the PMML model

In the current prototype, the PMML model is exported to the streaming platform as a physical XML file. A better solution is to leverage R’s serialization interface, which supports a rich set of connections through pipes, URLs, sockets, etc.
The PMML objects can also be saved in ORE datastores within the database, and specific policies can be implemented to control their access and usage.

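write(toString(pmmdl), file="..")                  # export as a physical XML file
serialize(pmmdl, connection)                       # or send through an R connection
ore.save(pmmdl, name=dsname, grantable=TRUE)       # or save in an ORE datastore
ore.grant(name=dsname, type="datastore", user=…)   # and grant access to another user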

OSX Applications and the Event Processing Network (EPN)

The OSX workflow, implemented as an OSX application, consists of three logical steps: the PMML model is imported into OSX, a scoring application is created, and scoring is performed on the input streams.

In OSX, applications are modeled as data flow graphs named Event Processing Networks (EPN). Data flows into, through, and out of EPNs. When raw data flows into an OSX application, it is first converted into events. Events flow through the different stages of the application, where they are processed according to the specifics of the application. At the end, events are converted back to data in a format suitable for consumption by downstream applications.

The EPN for our prototype is basic:

[Figure: EPN diagram of the prototype]

Streaming data flows from the Input Adapters through Input Channels, reaches the Scoring Processor where the prediction is performed, flows through the Output Channel to an Output Adapter, and exits the application in the desired form. In our demo application the data is streamed out of a CSV file into the Input Adapter. The top adapters (left and right) on the EPN diagram represent connections to the Stream Explorer User Interface (UI). Their purpose is to demonstrate options for controlling the scoring process (for example, changing the model while the application is still running) and for visualizing the predictions.
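The flow described above (raw data in, events through a processor, data out) can be sketched in a few lines of plain Java. This is only an illustration of the pattern and does not use the OSX APIs: the class EpnFlowSketch, its method names, the field names and the toy "model" that adds two fields are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class EpnFlowSketch {

    // Input adapter: converts one raw CSV line into an event (field name -> value).
    static Map<String, Double> toEvent(String[] header, String csvLine) {
        String[] cells = csvLine.split(",");
        Map<String, Double> event = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            event.put(header[i], Double.parseDouble(cells[i].trim()));
        }
        return event;
    }

    // Scoring processor: copies the event and appends a "score" field.
    static Map<String, Double> score(Map<String, Double> event,
                                     ToDoubleFunction<Map<String, Double>> model) {
        Map<String, Double> scored = new LinkedHashMap<>(event);
        scored.put("score", model.applyAsDouble(event));
        return scored;
    }

    // Output adapter: converts a scored event back to a CSV line.
    static String toCsv(Map<String, Double> event) {
        List<String> cells = new ArrayList<>();
        for (Double value : event.values()) {
            cells.add(String.valueOf(value));
        }
        return String.join(",", cells);
    }

    // The whole network: adapter -> processor -> adapter, one line at a time.
    static List<String> run(String[] header, List<String> lines,
                            ToDoubleFunction<Map<String, Double>> model) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            output.add(toCsv(score(toEvent(header, line), model)));
        }
        return output;
    }

    public static void main(String[] args) {
        String[] header = {"temperature", "vibration"};
        // Toy stand-in for a real model: the sum of the two inputs.
        ToDoubleFunction<Map<String, Double>> model =
            e -> e.get("temperature") + e.get("vibration");
        for (String line : run(header, Arrays.asList("80, 2", "60, 1"), model)) {
            System.out.println(line);
        }
    }
}
```

In the real application the channels between stages are managed by the platform, and the scoring step is driven by the PMML model rather than by a hard-coded function.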

The JPMML-Evaluator

The Scoring Processor was implemented by leveraging the open source JPMML library, the Java Evaluator API for PMML. Among other things, the methods of this API make it possible to pre-process the active and target fields according to the DataDictionary and MiningSchema elements, evaluate the model for several classes of algorithms, and post-process the results according to the Targets element.

JPMML offers support for:

Association Rules
Cluster Models
Regression
General Regression
k-Nearest Neighbors
Naïve Bayes
Neural Networks
Tree Models
Support Vector Machines
Ensemble Models

which covers most of the models that can be converted to PMML from R using the pmml() method, except for time series, sequence rules and text models.

The Scoring Processor

The Scoring Processor (see EPN) is implemented as a Java class with methods that automate scoring based on the PMML model. The important steps of this automation are enumerated below:

The PMML model is loaded from the XML document:

pmml = pmmlUtil.loadModel(pmmlFileName);

An instance of the Model Evaluator is created. In the example below we assume that we don’t know what type of model we are dealing with so the instantiation is delegated to an instance of a ModelEvaluatorFactory class.

    ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
    ModelEvaluator evaluator = modelEvaluatorFactory.newModelManager(pmml);

This Model Evaluator instance is queried for the field definitions. For the active fields:

List<FieldName> activeModelFields = evaluator.getActiveFields();

The subsequent data preparation performs several tasks: value conversions between the Java type system and the PMML type system, validation of these values according to the specifications in the Data Field element, handling of invalid, missing values and outliers as per the Mining Field element.

FieldValue activeValue = evaluator.prepare(activeField, inputValue);
pmmlArguments.put(activeField, activeValue);
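To give a feel for what this preparation step involves, here is a simplified, self-contained sketch of the three tasks for a single numeric field: conversion from a raw string, replacement of missing values, and treatment of outliers by clamping to a valid range. It is not the JPMML implementation; the class PrepareSketch, its parameters and the chosen policies are assumptions picked for illustration, whereas in the real application they are read from the PMML document.

```java
class PrepareSketch {

    // Prepares one raw input value for a numeric field.
    //   missingReplacement: value used when the input is absent or blank
    //   low, high:          valid range; values outside it are clamped
    static double prepare(String raw, double missingReplacement, double low, double high) {
        // Missing value: substitute the configured replacement.
        if (raw == null || raw.trim().isEmpty()) {
            return missingReplacement;
        }
        // Type conversion: string -> double; anything unparsable is invalid.
        double value;
        try {
            value = Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid value: " + raw, e);
        }
        // Outliers: clamp to the nearest bound of the valid range.
        return Math.max(low, Math.min(high, value));
    }

    public static void main(String[] args) {
        System.out.println(prepare(" 42.5 ", 20.0, 0.0, 100.0)); // regular value
        System.out.println(prepare("", 20.0, 0.0, 100.0));       // missing value
        System.out.println(prepare("250", 20.0, 0.0, 100.0));    // outlier
    }
}
```

The benefit of delegating this to the evaluator is that none of these policies has to be hard-coded in the scoring application.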

The prediction is executed next:

Map<FieldName, ?> results = evaluator.evaluate(pmmlArguments);

Once this is done, the scoring results and other fields are mapped to output events. This step needs to differentiate between the cases where the target values are Java primitive values and the cases where they are something else.

    FieldName targetName = evaluator.getTargetField();
    Object targetValue = results.get(targetName);
    if (targetValue instanceof Computable){ ….
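The branching above can be illustrated with a small self-contained sketch. The Computable interface below is a local stand-in defined only for this example (assumed to expose a single getResult() method); it is not the JPMML type, and the class name TargetUnwrapSketch is hypothetical.

```java
class TargetUnwrapSketch {

    // Local stand-in for a wrapped result; NOT the JPMML interface.
    interface Computable {
        Object getResult();
    }

    // Returns a plain value that can be placed in an output event:
    // wrapped results are unwrapped, plain values are passed through.
    static Object unwrap(Object targetValue) {
        if (targetValue instanceof Computable) {
            return ((Computable) targetValue).getResult();
        }
        return targetValue;
    }

    public static void main(String[] args) {
        Object plain = 3.5;                            // e.g. a regression output
        Computable wrapped = () -> "failure";          // e.g. a classification result
        System.out.println(unwrap(plain));
        System.out.println(unwrap(wrapped));
    }
}
```

Keeping this unwrapping in one place means the code that builds output events does not need to know which kind of model produced the result.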

More details about this approach can be found at JPMML-Evaluator: Preparing arguments for evaluation and Java (JPMML) Prediction using R PMML model.

The key aspect is that the JPMML Evaluator API provides the functionality for implementing the Scoring Processor independently of the actual model being used. The active variables, mappings, assignments and predictor invocations are figured out automatically from the PMML representation. This approach gives the scoring application flexibility. Suppose that several PMML models have been generated off-line for the same system component, equipment part, etc. Then, for example, an n-variable logistic model could be replaced by an m-variable decision tree model via the UI control by just pointing a Scoring Processor variable to the new PMML object. Moreover, the substitution can be executed via signal events sent through the UI Application Control (upper left of the EPN) without stopping and restarting the scoring application. This is practical because the real-time data keeps flowing in!
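One common way to replace a model while events keep arriving is to hold the current scorer behind an atomically updated reference. The sketch below shows that general technique in plain Java; it is not taken from the prototype, and the class SwappableScorer, its methods and the two toy models are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.ToDoubleFunction;

class SwappableScorer {

    // The model currently in use; readers always see a complete model.
    private final AtomicReference<ToDoubleFunction<Map<String, Double>>> current;

    SwappableScorer(ToDoubleFunction<Map<String, Double>> initial) {
        current = new AtomicReference<>(initial);
    }

    // Would be triggered by a control (signal) event: later events use the new model.
    void swap(ToDoubleFunction<Map<String, Double>> replacement) {
        current.set(replacement);
    }

    // Called for every incoming data event.
    double score(Map<String, Double> event) {
        return current.get().applyAsDouble(event);
    }

    public static void main(String[] args) {
        Map<String, Double> event = Collections.singletonMap("x", 3.0);

        // Toy stand-ins for two models built off-line for the same component.
        SwappableScorer scorer = new SwappableScorer(e -> e.get("x") * 2.0);
        System.out.println(scorer.score(event)); // scored with the first model

        scorer.swap(e -> e.get("x") + 10.0);     // no restart needed
        System.out.println(scorer.score(event)); // scored with the second model
    }
}
```

Because the scoring path only reads the reference, the swap does not block the event stream; events already being scored finish with the old model and later ones use the new one.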

Tested models

The R algorithms listed below were tested and identical results were obtained for predictions based on the OSX PMML/JPMML scoring application and predictions in R.

lm (stats)
glm (stats)
rpart (rpart)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)

The prototype is new and other algorithms are currently being tested. The details will follow in a subsequent post.

Acknowledgment

The OSX-ORE PMML/JPMML-based prototype for real-time scoring was developed together with Mauricio Arango | A-Team Cloud Solutions Architects. The work was presented at BIWA 2016.
