Using your R models for in-database scoring

August 11, 2016
By

(This article was first published on R – Bora Beran, and kindly contributed to R-bloggers)

Let’s say you trained a predictive model using R integration in Tableau or your favorite R authoring environment.

If your data is very large you probably used a subset of your data for training and want to see how well it will do over the entire database.Or maybe you’re satisfied with the model but you want it to return you only the interesting results from the database e.g. customers who are most likely to churn.

In both scenarios, being able to push the model to the database means being able to rely on the database’s scalability and avoiding delays caused by data movement.

You may be asking, “but what if our database systems can’t run predictive models?” Luckily once trained, some of the very commonly used predictive methods can be expressed in SQL hence also in Tableau’s calculation language. All you need is a way to convert the model into a Tableau calculated field.

If you created a decision tree in R, you can use the following function to convert it to a Tableau formula:

decisionTreeToFormula <- function (tree) {

if (class(tree) != "rpart") {
warning(class(tree), " is not supported.Please use an rpart model.")
stop()
}

capture.output({rpartRules <- path.rpart(tree,rownames(tree$frame)[tree$frame$var==""])})
ruleTexts <- "IF "
operators <- c("<=",">=","<",">","=")
i <- 1

levelToCategory<-data.frame(labels=attr(tree,"ylevels"),catkey=rep(1:length(attr(tree,"ylevels"))))
nodeToLevel<-data.frame(catkey=tree$frame$yval,node=rownames(tree$frame))
mapping_table<-merge(levelToCategory,nodeToLevel,by='catkey')[,2:3]

for (rule in rpartRules) {
ruleText <- character(0)
for (component in rule) {
whichSeparator <- lapply(operators, function(x) length(unlist(strsplit(component,x)))) > 1
parts <- unlist(strsplit(component,(operators[whichSeparator])[1]))
if(!(parts[1]=="root")) {
if (is.finite(suppressWarnings(as.numeric(parts[2])))) {
ruleText <- c(ruleText,paste("[",parts[1],"]",(operators[whichSeparator])[1],parts[2],sep=""))
} else {
ruleText <- c(ruleText,paste0(" (",paste0("[",parts[1],"]","='",unlist(strsplit(parts[2],",")),"'",collapse=" OR "),")"))
}
}
}

ruleTexts <- c(ruleTexts, paste0(if(length(ruleTexts)>1){"\nELSEIF "}, paste(ruleText,collapse=" AND ")," THEN '" ,mapping_table[mapping_table$node==as.numeric(names
(rpartRules)[i]),]$labels,"'"))
if(i==length(rpartRules)) ruleTexts <- c(ruleTexts,"\nEND")
i <- i +1
}
tableauFormula <- paste(ruleTexts,collapse=" ")
return(tableauFormula)
}

You can add this to your Rserve.cfg so you can call the decisionTreeToFormula function  in your SCRIPT_ calls from Tableau or you can use it in your R environment. Either way it will give you some text which you can copy-paste into Tableau and use it as a formula that can run on the database.

In the example below, I fit a decision tree on a database of cars trying to predict the gas mileage (Bad|OK|Good). I used only a small portion of variables (just maker and year) to get a compact result back. Applying the function gives me a nice IF..THEN..ELSE formula that handles both numeric and categorical items. Now I can use this classifier on any database. As a dimension, as a filter… by just copy-pasting the formula into a calculated field in Tableau.

> myTree<-rpart(mpg~modelyear+maker,data=cars)
> decisionTreeToFormula(myTree)
IF ([maker]='America') AND [modelyear] < 79 THEN 'Bad'
ELSEIF ([maker]='America') AND [modelyear] >= 79 THEN 'Good'
ELSEIF ([maker]='Asia' OR [maker]='Europe') THEN 'OK'
END

What about a regression model? This post has that covered, too. You can achieve similar results with the following function :

regressionModelToFormula <- function(model, decimalPlaces) {
if ((class(model)[1] != "glm" && model$family$family != "binomial") & class(model)[1] != "lm") {
warning("Function is not applicable to this model type.")
stop()
}
factors=c("",gsub("[-^0-9]", "", names(unlist(model$xlevels))))[-1]
levels=c("",as.character(unlist(model$xlevels)))[-1]
coeff <- coefficients(model);
names=data.frame(factors,levels,varNames=paste(factors,levels,sep=""))
varCoeffMap=data.frame(varNames=names(coeff)[-1],coefficients=as.numeric(coeff[-1]))
catVars <-merge(names,varCoeffMap,all.x=TRUE)

if (nrow(catVars) !=0) {
contVars <- round(coeff[-which(names(coeff) %in% catVars$varNames)],decimalPlaces)
} else {
contVars <- round(coeff,decimalPlaces)
}
catFormula <- " "

for(catVar in unique(catVars$factors)){
curVar <- catVars[catVars$factors==catVar & !is.na(catVars$coefficients) ,]
catFormula <- paste("WHEN",curVar$levels, "THEN",round(curVar$coefficients,decimalPlaces), sep=" ", collapse="\n")
catFormula <- paste("+ CASE [",catVar,"] \n",catFormula," ELSE 0 \n END \n", sep="")
}

numericFormula <- paste(contVars[-1], "*" ,paste("[",names(contVars[-1]),"]",sep=""), "+", collapse = " ")
if (class(model)[1]=="lm") {
combinedFormula <- paste(numericFormula,contVars[1],catFormula)
}else{
combinedFormula <- paste("1 / (1 + exp(", numericFormula, contVars[1],catFormula, "))")
}
tableauFormula <- gsub(pattern=":", replacement="*", x=combinedFormula)
return(tableauFormula)
}

If you call this function from R on the logistic regression model I used in one of my earlier blog posts, below is what you will get. Copy-paste this into a Tableau calculated field and you’re ready to make predictions with your logistic regression model in your database.

> regressionModelToFormula(lrmodel,5)
1 / (1 + exp( 0.00628 * [gre] + 1.60393 * [gpa] -8.82164 +
CASE [rank]
WHEN 2 THEN -1.10904
WHEN 3 THEN -1.31259
WHEN 4 THEN -2.01912
ELSE 0
END))

And if you apply this function on the multiple linear regression model in the last sheet of the example workbook from my last blog post, what you get will resemble the following:

> regressionModelToFormula(mrmodel, 4)
1.6224 * [Economic Indicator X] + 0.597 * [Economic Indicator Y] -37.1103

I hope you find this useful.

To leave a comment for the author, please follow the link and comment on their blog: R – Bora Beran.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)