At the New York R User Group* last night, 100 R users heard Ni Wang and Max Lin explain how "R is one of the important tools used by analysts and engineers at Google for analyzing data". During the talk, Lin revealed that Google plans to make "R more integrated with internal machine learning algorithms and infrastructure", and one component of that plan was announced at the meeting: a new library for R to build and score models using the Google Prediction API.
The Google Prediction API is a black-box system for building predictive models. Given a set of training data (a set of continuous and/or categorical explanatory variables and a dependent variable), the Google algorithms automatically select from several available machine learning techniques to create a model from the training data. Later, given a new set of explanatory variables, you can predict the value of the dependent variable under this model.
Now with the googlepredictionapi R package (which you can download from Google Code), you can create such models based on data stored in a local CSV file or in the Google Storage system. The model is represented as an object in R, which you can then use to make predictions using the standard predict function, as illustrated in the following code:
## Make a training call to the Prediction API against data in the Google Storage.
## Replace MYBUCKET and MYDATA with your data.
my.model <- PredictionApiTrain(data="gs://MYBUCKET/MYDATA")

## Alternatively, make a training call against training data stored locally as a CSV file.
## Replace MYPATH and MYFILE with your data.
my.model <- PredictionApiTrain(data="MYPATH/MYFILE.csv")

## Read the summary of the trained model
summary(my.model)

## Make a prediction call for text data using the trained model
predict(my.model, "This is a new piece of text")

## Similarly, predict() works for numeric features
predict(my.model, c(6, 3, 5, 2))
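If your training data starts out as an R data frame rather than a file, the local CSV route needs one extra step. Here is a minimal sketch of that step using only base R. The column names, the toy values, and the file name are all made up for illustration, and the layout (dependent variable in the first column, no header row) is an assumption about what the service expects, so check the Prediction API documentation before relying on it.

```r
# Sketch: write a small training set to a CSV file with base R.
# ASSUMPTION (not from the post): the dependent variable goes in the
# first column and the file has no header row.
training <- data.frame(
  label       = c("spam", "ham", "spam", "ham"),  # dependent variable
  text_length = c(120, 45, 300, 60),              # explanatory variable
  num_links   = c(5, 0, 9, 1)                     # explanatory variable
)

# Hypothetical file name, written to a temporary directory.
csv.path <- file.path(tempdir(), "training.csv")

# Character values are quoted by default; row and column names are dropped.
write.table(training, file = csv.path, sep = ",",
            row.names = FALSE, col.names = FALSE, qmethod = "double")

# Inspect the lines that were written.
written <- readLines(csv.path)
print(written)
```

The resulting path could then be passed as the data argument in the local-file training call shown in the listing.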
You need to request access to the Google Prediction API to use this package (instructions on how to request access are here). Has anyone tried this out yet? Given that all the standard statistical (as distinct from machine learning) models are in R, this package would make it easy to compare the performance of the automated Prediction API with more traditional statistical techniques.
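As a rough idea of what the "traditional" side of such a comparison might look like, here is a sketch that fits a logistic regression with base R's glm and scores it on held-out rows. The built-in iris data, the choice of predictors, and the every-fourth-row split are arbitrary choices for illustration, not anything from the talk. The Prediction API side is left as a comment because it needs API access.

```r
# Baseline sketch: logistic regression in base R, scored on held-out rows.
# Two iris species are kept so that the dependent variable is binary.
flowers <- droplevels(subset(iris, Species != "setosa"))

# Deterministic split: every fourth row is held out for testing.
test.rows <- seq(4, nrow(flowers), by = 4)
train <- flowers[-test.rows, ]
test  <- flowers[test.rows, ]

baseline <- glm(Species ~ Sepal.Length + Sepal.Width,
                data = train, family = binomial)

# With a factor response, glm models the probability of the second
# factor level, which here is "virginica".
prob <- predict(baseline, newdata = test, type = "response")
predicted <- ifelse(prob > 0.5, "virginica", "versicolor")

# Share of held-out rows classified correctly.
baseline.accuracy <- mean(predicted == as.character(test$Species))
print(baseline.accuracy)

# For the comparison, the same held-out rows would be scored with the
# model object returned by PredictionApiTrain(), using predict() as in
# the listing in this post, and that accuracy set beside baseline.accuracy.
```

Scoring both models on the same held-out rows keeps the comparison fair, since neither model has seen those observations during training.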
[*] The New York R User Group is proudly sponsored by Revolution Analytics.