Sanitizing data in SAP HANA with R

(This article was first published on Blag's bag of rants, and kindly contributed to R-bloggers)

From April 10 to April 11, my team (Anne, Juergen and myself) host an InnoJam in Boston. It was a really great event, but the data provided by the City of Boston wasn't exactly in the best shape, so we took a lot of efforts (with a help of the SAP Guru's that helped us) to sanitize the data.

At some point, as I was asked to use my Regular Expressions skills (everybody knows I'm crazy about RegEx) to sanitize some data that came into a really weird format inside a field...something like this (Can't post the real data for obvious reasons):

Type: [This is a test] Area: [My Area] Description: [This data is not right]

What I need to do was basically, took each record and generate a new table with those 3 fields and end with something like this...

This is a testMy AreaThis data is not right

I began to think on which language could be the best for the job...I thought on ABAP, Python...and then of course...R...so I choose R.

The problems that immediately arise were simple, how to pull data from HANA and how to send it back, also, even when I believe myself a RegEx Hero, R uses a not very standard RegEx schema. I thought on downloading the data from HANA as .CSV, clean it up and then upload the .CSV back to HANA...but then I thought that the extra work was not worth it...so then...my good old friend RODBC came into the show...even when it's not supported by SAP, I decided that for this particular case, it would be just fine...I could read the data back and forth and have everything back into HANA in a very fast way.

Let's create a table and call it BAD_DATA...and just create 10 dummy records (I know...I'm lazy)...

So this is the script:

library("RODBC")
ch<-odbcConnect("HANA_SERVER",uid="P075400",pwd="***")
CRM_TAB<-sqlQuery(ch,query)

SR_Type<-c()
SR_Area<-c()
Description<-c()

for(i in 1:nrow(CRM_TAB)){
mypattern = '\$([^\\[]*)\$'
datalines = grep(mypattern,CRM_TAB\$CASE_DESCRIPTION[i],value=T)
getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
g_list = gregexpr(mypattern,datalines)
matches = mapply(getexpr,datalines,g_list)
result = gsub(mypattern,'\\1',matches)
var<-0
i<-0
for(i in 1:length(result)){
var<-var+1
if(var==4){
break
}
if(var==1){
SR_Type<-append(SR_Type,result[i])
}
if(var==2){
SR_Area<-append(SR_Area,result[i])
}
if(var==3){
Description<-append(Description,result[i])
}
}
if(length(SR_Type)>length(Description)){
Description<-append(Description,NA)
}
if(length(SR_Type)>length(SR_Area)){
SR_Area<-append(SR_Area,NA)
}
}

GOOD_DATA<-data.frame(SR_Type,SR_Area,Description,stringsAsFactors=FALSE)
sqlDrop(ch,"GOOD_DATA",errors=FALSE)
sqlSave(ch,GOOD_DATA,rownames="id")
odbcClose(ch)

So basically, but we're doing is to grab all the information inside the brackets, then pass it to vectors to finally create a Data.Frame and send it back to HANA. If you wonder why I'm comparing the length of the different vector to add NA values, it's very simple...we can have something like this...

Type: [This is a test] Area: [My Area] Description: [This data is not right |

The last bracket...it's not a bracket! It's a pipe, so the RegEx is going to fail and this will provoke the vector to be empty and that would be messy...if that happens with can add an NA and at least have a value on it...

So...when we run our program...a new data called GOOD_DATA is going to be created with all the data clean and sanitized.

Nice, right? -;)

See you in my next blog! -:)