RevoScaleR’s Naive Bayes Classifier rxNaiveBayes()

May 28, 2015

by Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Because of its simplicity and good performance over a wide spectrum of classification problems, the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4, we have a high-performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes() is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters, and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on a moderately sized data set in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

#-----------------------------------------------
# Set up the data location information
bigDataDir <- "C:/Data/Mortgage" 
mortCsvDataName <- file.path(bigDataDir,"mortDefault") 
trainingDataFileName <- "mortDefaultTraining" 
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "") 
targetDataFileName <- "mortDefault2009.xdf"
 
 
#--------------------------------------- 
# Import the data from multiple .csv files into two .XDF files:
# one training file containing data from the years 2000 through 2008,
# and one test file containing data from the year 2009.

defaultLevels <- as.character(c(0, 1))
ageLevels <- as.character(0:40)
yearLevels <- as.character(2000:2009)
colInfo <- list(list(name = "default", type = "factor", levels = defaultLevels),
                list(name = "houseAge", type = "factor", levels = ageLevels),
                list(name = "year", type = "factor", levels = yearLevels))

append <- FALSE
for (i in 2000:2008) {
    importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
    rxImport(inData = importFile, outFile = trainingDataFileName,
             colInfo = colInfo, append = append, overwrite = TRUE)
    append <- TRUE
}

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables, and the test file contains 1 million observations. The binary factor variable default, which indicates whether or not an individual defaulted on the mortgage, will be the target variable in the classification exercise.

rxGetInfo(trainingDataFileName, getVarInfo=TRUE)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf 
#Number of observations: 9e+06 
#Number of variables: 6 
#Number of blocks: 18 
#Compression type: zlib 
#Variable information: 
#Var 1: creditScore, Type: integer, Low/High: (432, 955)
#Var 2: houseAge
       #41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
#Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
#Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
#Var 5: year
       #10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
#Var 6: default
       #2 factor levels: 0 1
 
rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
rxGetInfo(targetDataFileName)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf 
#Number of observations: 1e+06 
#Number of variables: 6 
#Number of blocks: 2 
#Compression type: zlib 

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable, and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are multiplied together in the model, adding a small number to zero counts keeps unseen categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo ThinkPad, which is powered by an Intel i7-5600U processor and equipped with 8GB of RAM.

# Build the classifier on the training data
mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,  
	                   data = trainingDataFileName, smoothingFactor = 1) 
 
 
#Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
#Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds 
#Computation time: 1.875 seconds.		
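To see what the smoothing does, here is a small base-R sketch (not RevoScaleR code; the category counts are made up for illustration) of Laplace smoothing for one factor predictor within one class:

```r
# Hypothetical category counts for a factor predictor within one class
counts <- c(a = 5, b = 0, c = 3)

# Unsmoothed conditional probabilities: category "b" gets probability 0,
# which would zero out any product it appears in
unsmoothed <- counts / sum(counts)

# Laplace smoothing with smoothingFactor = 1: add 1 to every cell count
smoothed <- (counts + 1) / (sum(counts) + length(counts))

round(smoothed, 4)
#      a      b      c 
# 0.5455 0.0909 0.3636
```

Every category now has a strictly positive probability, and the probabilities still sum to one.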

Looking at the model object, we see that conditional probabilities are calculated for all of the factor (categorical) variables, while means and standard deviations are calculated for the numeric variables. rxNaiveBayes() follows the standard practice of assuming that numeric variables follow Gaussian distributions.

#> mortNB 
#
#Naive Bayes Classifier
#
#Call:
#rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy + 
    #ccDebt, data = trainingDataFileName, smoothingFactor = 1)
#
#A priori probabilities:
#default
          #0           1 
#0.997242889 0.002757111 
#
#Predictor types:
     #Variable    Type
#1        year  factor
#2 creditScore numeric
#3 yearsEmploy numeric
#4      ccDebt numeric
#
#Conditional probabilities:
#$year
       #year
#default         2000         2001         2002         2003         2004
      #0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
      #1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
       #year
#default         2005         2006         2007         2008         2009
      #0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
      #1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
#
#$creditScore
     #Means   StdDev
#0 700.0839 50.00289
#1 686.5243 49.71074
#
#$yearsEmploy
     #Means   StdDev
#0 5.006873 2.009446
#1 4.133030 1.969213
#
#$ccDebt
     #Means   StdDev
#0 4991.582 1976.716
#1 9349.423 1459.797				
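To make the Gaussian assumption concrete, this base-R sketch (not RevoScaleR code; the debt figure is hypothetical) shows how the fitted means and standard deviations above turn a numeric predictor into class-conditional likelihoods:

```r
# A hypothetical applicant's credit card debt
ccDebt <- 9000

# Class-conditional likelihoods using the ccDebt means/StdDevs printed above
lik0 <- dnorm(ccDebt, mean = 4991.582, sd = 1976.716)  # P(ccDebt | default = 0)
lik1 <- dnorm(ccDebt, mean = 9349.423, sd = 1459.797)  # P(ccDebt | default = 1)

# Combine with the a priori probabilities from the model output
post <- c("0" = 0.997242889 * lik0, "1" = 0.002757111 * lik1)
post / sum(post)  # posterior based on this single predictor
```

Even though a $9,000 debt is far more likely under the default class, the very low prior probability of default keeps the posterior for class 1 small; the other predictors shift the balance further.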
 

Next, we use the rxPredict() function to predict default values for the test data set. Setting type = "prob" produces the table of probabilities below; using the default for type would have produced only the default_Pred column of forecasts. In a multi-valued forecast, the probability table would contain entries for all possible values.

# use the model to predict whether a loan will default on the test data
mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type = "prob") 
#Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
#Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds 
names(mortNBPred) <- c("prob_0", "prob_1")
mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))  # 0.5 cutoff
 
#head(mortNBPred)
     #prob_0      prob_1 default_Pred
#1 0.9968860 0.003114038            0
#2 0.9569425 0.043057472            0
#3 0.5725627 0.427437291            0
#4 0.9989603 0.001039729            0
#5 0.7372746 0.262725382            0
#6 0.4142266 0.585773432            1
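Note that round(prob_1) is equivalent to a 0.5 probability cutoff. With a class this rare, a lower cutoff may be worth exploring; here is a quick self-contained sketch using the six probabilities from head() above (the 0.25 cutoff is arbitrary):

```r
# The six prob_1 values shown in head(mortNBPred) above
prob_1 <- c(0.003114038, 0.043057472, 0.427437291,
            0.001039729, 0.262725382, 0.585773432)

# A hypothetical lower cutoff to catch more potential defaults
cutoff <- 0.25
pred_alt <- factor(as.integer(prob_1 >= cutoff), levels = c(0, 1))
pred_alt
# [1] 0 0 1 0 1 1
# The default 0.5 cutoff would have flagged only the last loan
```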

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

# Tabulate the actual and predicted values
actual_value <- rxDataStep(targetDataFileName,maxRowsByCols=6000000)[["default"]]
predicted_value <- mortNBPred[["default_Pred"]]
results <- table(predicted_value,actual_value) 
#> results
               #actual_value
#predicted_value      0      1
              #0 877272   3792
              #1  97987  20949
 
pctMisclassified <- (1 - sum(diag(results)) / sum(results)) * 100 
pctMisclassified 
#[1] 10.1779

Since the results object produced above is an ordinary table we can use the confusionMatrix() from the caret package to produce additional performance measures.

# Use confusionMatrix from the caret package to look at the results
library(caret)
library(e1071)
confusionMatrix(results,positive="1")
 
#Confusion Matrix and Statistics
#
               #actual_value
#predicted_value      0      1
              #0 877272   3792
              #1  97987  20949
                                          #
               #Accuracy : 0.8982          
                 #95% CI : (0.8976, 0.8988)
    #No Information Rate : 0.9753          
    #P-Value [Acc > NIR] : 1               
                                          #
                  #Kappa : NA              
 #Mcnemar's Test P-Value : <2e-16          
                                          #
            #Sensitivity : 0.84673         
            #Specificity : 0.89953         
         #Pos Pred Value : 0.17614         
         #Neg Pred Value : 0.99570         
             #Prevalence : 0.02474         
         #Detection Rate : 0.02095         
   #Detection Prevalence : 0.11894         
      #Balanced Accuracy : 0.87313         
                                          #
       #'Positive' Class : 1               
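The caret statistics can be cross-checked directly from the 2x2 table in base R (the counts below are copied from the output above):

```r
# Confusion matrix counts (rows = predicted, columns = actual)
tab <- matrix(c(877272, 97987, 3792, 20949), nrow = 2,
              dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

sensitivity <- tab["1", "1"] / sum(tab[, "1"])  # TP / (TP + FN)
specificity <- tab["0", "0"] / sum(tab[, "0"])  # TN / (TN + FP)
accuracy    <- sum(diag(tab)) / sum(tab)

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy), 5)
# sensitivity specificity    accuracy 
#     0.84673     0.89953     0.89822
```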

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC curve.

roc_data <- data.frame(mortNBPred$prob_1,as.integer(actual_value)-1)
names(roc_data) <- c("predicted_value","actual_value")
head(roc_data)
hist(roc_data$actual_value)
rxRocCurve("actual_value","predicted_value",roc_data,title="ROC Curve for Naive Bayes Mortgage Defaults Model")
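For readers without RevoScaleR, here is a minimal base-R sketch of what an ROC curve computation involves, using synthetic scores (the score distributions are made up); rxRocCurve() does this, plus nicer plotting, on the real predictions:

```r
set.seed(42)

# Synthetic data: a rare positive class with higher scores on average
actual <- rbinom(1000, 1, 0.1)
score  <- ifelse(actual == 1, rnorm(1000, 0.7, 0.2), rnorm(1000, 0.3, 0.2))

# Sweep thresholds from high to low, computing TPR and FPR at each one
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(score[actual == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[actual == 0] >= t))

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve (synthetic data)")

# Area under the curve by the trapezoidal rule, anchored at (0, 0)
x <- c(0, fpr); y <- c(0, tpr)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
```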

 

[Figure: ROC curve for the Naive Bayes mortgage defaults model]

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.

For more on the Naïve Bayes classification algorithm, have a look at these two papers referenced in the Wikipedia article on Naive Bayes classifiers.

  1. Minsky paper
  2. Domingos and Pazzani paper

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes classifier is not accidental.

 
