Classifying Emails as Spam or Ham using RTextTools

[This article was first published on Strategic Thinking - Automated In R, and kindly contributed to R-bloggers.]

Recently, I read an article on R-bloggers titled Classifying Breast Cancer as Benign or Malignant using RTextTools by Timothy P. Jurka, who is the author of both that article and the RTextTools package. Having successfully reproduced the results using the author’s R code, I was motivated to explore the usefulness of this package.

Also, there is an excellent book by Conway & White (2012), Machine Learning for Hackers, that shows the reader how to build a Bayesian spam classifier (let’s call it the Benchmark). I was interested to find out how a spam classifier built using RTextTools would compare with the Benchmark. However, this is largely unexplored territory, as there are NOT many example models built using the RTextTools package, so I decided to explore the feasibility of building a model that classifies large text, i.e. raw text without ANY features.

(1) Obtaining the data and loading it into R

suppressPackageStartupMessages(require(RTextTools))
suppressPackageStartupMessages(require(tm))
# Author's own helper file; RegGetRNonSourceDir() used below is presumably defined there
source("C:/Users/denbrige/100 FxOption/103 FxOptionVerBack/080 Fx Git/R-source/PlusReg.R", echo=FALSE)
spam.dir <- paste0(RegGetRNonSourceDir(), "spamassassin/")
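get.msg <- function(path.dir)
{
  # Read one email file and return its body as a single string
  con <- file(path.dir, open="rt", encoding="latin1")
  text <- readLines(con)
  # The body starts after the first blank line, which ends the header
  msg <- text[seq(which(text=="")[1]+1,length(text),1)]
  close(con)
  return(paste(msg, collapse="\n"))
}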
get.msg.try <- function(path.dir)
{
  # Like get.msg(), but returns "Error" when no body can be extracted,
  # e.g. when the file has no blank line separating header and body
  con <- file(path.dir, open="rt", encoding="latin1")
  text <- readLines(con)
  close(con)
  msg <- tryCatch( text[seq(which(text=="")[1]+1,length(text),1)],
                      error=function(e) { NULL } )
  if( is.null(msg) )
  {
    return("Error")
  }
  else
  {
    return(paste(msg, collapse="\n"))
  }
}
get.all <- function(path.dir)
{
  all.file <- dir(path.dir)
  all.file <- all.file[which(all.file!="cmds")]
  msg.all <- sapply(all.file, function(p) get.msg(paste0(path.dir,p)))
}
get.all.try <- function(path.dir)
{
  all.file <- dir(path.dir)
  all.file <- all.file[which(all.file!="cmds")]
  msg.all <- sapply(all.file, function(p) get.msg.try(paste0(path.dir,p)))
}
easy_ham.all    <- get.all(paste0(spam.dir, "easy_ham/"))
easy_ham_2.all  <- get.all(paste0(spam.dir, "easy_ham_2/"))
hard_ham.all    <- get.all(paste0(spam.dir, "hard_ham/"))
hard_ham_2.all  <- get.all(paste0(spam.dir, "hard_ham_2/"))
spam.all        <- get.all.try(paste0(spam.dir, "spam/"))
spam_2.all      <- get.all(paste0(spam.dir, "spam_2/"))


First, we download the email data from the SpamAssassin public corpus. EACH classification has TWO (2) sub-folders, e.g. “easy_ham” and “easy_ham_2”. This is convenient, as the first set is used as training data and the second set (with “_2”) is used as testing data. Remember to change the above code to work with your own folders, i.e. “spam.dir”.

Chapter 3 of Conway & White (2012) explains what the structure of an email looks like, why we are focusing on only the email message body and how to extract this text from the message files using the function get.msg() above. We use the get.all() function to apply the get.msg() function to ALL of the filenames, except “cmds”, for EACH folder and construct a vector of messages from the returned text. The “spam” folder is read with get.all.try() instead, which wraps the same extraction in tryCatch() so that a file whose body cannot be extracted does not stop the whole run.
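The extraction rule can be tried on an in-memory message without touching the file system. The following is only a minimal sketch: the sample message and the helper name extract.body are made up for this illustration, and unlike get.msg() it returns NA instead of raising an error when no body is found.

```r
# Hypothetical helper (not from the original post): returns the body of a
# message given as a character vector of lines, or NA if no body is found.
extract.body <- function(text)
{
  first.blank <- which(text == "")[1]
  # No blank line, or the blank line is the last line: there is no body
  if (is.na(first.blank) || first.blank >= length(text)) return(NA_character_)
  paste(text[(first.blank + 1):length(text)], collapse = "\n")
}

# Made-up sample message: header lines, a blank line, then the body
sample.msg <- c("From: someone@example.com",
                "Subject: Hello",
                "",
                "First line of the body.",
                "Second line of the body.")

body.txt <- extract.body(sample.msg)
no.body  <- extract.body(c("From: someone@example.com", "Subject: Hello"))
```

Here body.txt holds the two body lines joined by a newline, while no.body is NA because the second message has no blank line after its header.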

(2) Split the data into train/test sets

easy_ham.dfr    <- as.data.frame(easy_ham.all)
easy_ham_2.dfr  <- as.data.frame(easy_ham_2.all)
hard_ham.dfr    <- as.data.frame(hard_ham.all)
hard_ham_2.dfr  <- as.data.frame(hard_ham_2.all)
spam.dfr        <- as.data.frame(spam.all)
spam_2.dfr      <- as.data.frame(spam_2.all)
rownames(easy_ham.dfr)    <- NULL
rownames(easy_ham_2.dfr)  <- NULL
rownames(hard_ham.dfr)    <- NULL
rownames(hard_ham_2.dfr)  <- NULL
rownames(spam.dfr)        <- NULL
rownames(spam_2.dfr)      <- NULL
easy_ham.dfr$outcome    <- 2
easy_ham_2.dfr$outcome  <- 2
hard_ham.dfr$outcome    <- 2
hard_ham_2.dfr$outcome  <- 2
spam.dfr$outcome        <- 4
spam_2.dfr$outcome      <- 4
names(easy_ham.dfr)   <- c("text", "outcome")
names(easy_ham_2.dfr) <- c("text", "outcome")
names(hard_ham.dfr)   <- c("text", "outcome")
names(hard_ham_2.dfr) <- c("text", "outcome")
names(spam.dfr)       <- c("text", "outcome")
names(spam_2.dfr)     <- c("text", "outcome")
train.data  <- rbind(easy_ham.dfr, hard_ham.dfr, spam.dfr)
train.num   <- nrow(train.data)
train.data  <- rbind(train.data, easy_ham_2.dfr, hard_ham_2.dfr, spam_2.dfr)
names(train.data) <- c("text", "outcome")
spam.str <- paste0(RegGetRNonSourceDir(),"Jurka_03_spam.rda")
if( !file.exists(spam.str) )
{
  save(train.data, train.num, file=spam.str)
}


Note: There is PROBABLY a more elegant way of writing ALL the above code, but it was late at night and I had a brain freeze, hence NON-elegant code was produced as a result…

Basically, for EACH data frame, we add a new column “outcome”. This contains a numeric code that labels both easy_ham and hard_ham as TWO (2), and spam as FOUR (4). We stack these data frames into “train.data”, training rows first and testing rows after them, and we rename the columns to “text” and “outcome”.
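One possible way to shorten the labelling step is a small helper that builds the labelled data frame in one call. This is only a sketch under assumptions: label.msgs and the toy vectors are hypothetical names invented here, standing in for vectors such as easy_ham.all and spam.all.

```r
# Hypothetical helper (not from the original post): wraps a character vector
# of messages in a data frame with "text" and "outcome" columns.
label.msgs <- function(msgs, outcome)
{
  data.frame(text = unname(msgs), outcome = outcome,
             stringsAsFactors = FALSE)
}

# Made-up stand-ins for vectors such as easy_ham.all and spam.all
ham.toy  <- c(a = "see you at the meeting", b = "lunch tomorrow?")
spam.toy <- c(c = "win free money now")

# Stack the labelled pieces, just as rbind() is used in the post
toy.data <- rbind(label.msgs(ham.toy, 2), label.msgs(spam.toy, 4))
```

Dropping the vector names with unname() plays the same role as setting rownames to NULL in the code above.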

We can increase execution speed by saving train.data as a native R data file (.rda) and loading that file, instead of reading the raw data EACH time. The saved variable “train.num” contains the number of rows of training data. Again, remember to change the above code to work with your own folders, i.e. “spam.str”.
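The save-once, load-afterwards pattern might look like the sketch below. The file location and the toy objects are assumptions made for this illustration (a temporary directory is used so that the example runs anywhere); in the post the file is spam.str and the objects are train.data and train.num.

```r
# Hypothetical cache location; the original post uses spam.str instead
cache.file <- file.path(tempdir(), "toy_spam.rda")

if (!file.exists(cache.file)) {
  # Stand-in for the expensive step of reading the raw emails
  toy.data <- data.frame(text = c("lunch tomorrow?", "win free money now"),
                         outcome = c(2, 4), stringsAsFactors = FALSE)
  toy.num <- 1
  save(toy.data, toy.num, file = cache.file)
} else {
  # load() restores the objects under the names they were saved with
  load(cache.file)
}
```

Either branch leaves toy.data and toy.num in the workspace, so the rest of a script does not need to know which one ran.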

(3) Build the model

set.seed(2012)
train_out.data <- train.data$outcome
train_txt.data <- train.data$text

matrix <- create_matrix(train_txt.data, language="english", minWordLength=3, removeNumbers=TRUE, stemWords=FALSE, removePunctuation=TRUE, weighting=weightTfIdf)
container <- create_container(matrix,t(train_out.data), trainSize=1:train.num, testSize=(train.num+1):nrow(train.data), virgin=FALSE)
maxent.model    <- train_model(container, "MAXENT")
svm.model       <- train_model(container, "SVM")


The steps for training a model are as follows:
  1. Create a document-term matrix; 
  2. Create a container; 
  3. Create a model by feeding the container to a machine learning algorithm.
I used the default parameters in the function create_matrix(), except for the following:
  • removeNumbers=TRUE – numbers are removed; 
  • weighting=weightTfIdf – a term frequency-inverse document frequency weighting function from the package tm;
  • stemWords=TRUE – I tried this first, but the stemmer has a LIMIT of 255 characters per word and raised an error, so I had to set this parameter back to its default: FALSE.
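To give a feel for what a tf-idf weighted document-term matrix contains, here is a minimal base R sketch on a made-up corpus of three short documents. It uses one common formulation (term frequency normalised by document length, multiplied by log2 of the inverse document frequency); this is an assumption for illustration, and the weightTfIdf function in tm may differ in details such as its normalisation options.

```r
# Made-up toy corpus
docs <- c("free money now", "meeting at noon", "free meeting")

# Split each document into words and list the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix of raw counts: one row per document, one column per term
dtm <- t(sapply(tokens, function(tk) tabulate(match(tk, vocab),
                                              nbins = length(vocab))))
colnames(dtm) <- vocab

# Term frequency: counts divided by the length of each document
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log2(number of documents / documents with the term)
idf <- log2(nrow(dtm) / colSums(dtm > 0))

# tf-idf: scale each column of tf by the idf of its term
tfidf <- sweep(tf, 2, idf, `*`)
```

In the first document, “money” receives a higher weight than “free”, because “free” also appears in another document and is therefore less distinctive.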

To create a container, you need to pass it BOTH a document-term matrix AND an outcome vector, i.e. train_out.data, which is the reason why I had to split the “train.data” into its text and outcome columns. I used the default parameters in the function create_container(), except for the following:
  • trainSize=1:train.num – a range specifying the row numbers in the data to use for training the model;
  • testSize=(train.num+1):nrow(train.data) – a range specifying the row numbers in the data to use for out-of-sample testing;
  • virgin=FALSE – specifies that the testing set is NOT unclassified data, i.e. it has true values that the predictions can be compared against.
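The two row ranges follow directly from stacking the training rows on top of the testing rows. The sketch below shows this with made-up data frames; all names in it are hypothetical, with n.train playing the role of train.num.

```r
# Made-up stand-ins for the stacked training and testing data frames
train.toy <- data.frame(text = c("a", "b", "c"), outcome = c(2, 2, 4))
test.toy  <- data.frame(text = c("d", "e"),      outcome = c(2, 4))

n.train <- nrow(train.toy)          # plays the role of train.num
all.toy <- rbind(train.toy, test.toy)

train.rows <- 1:n.train                     # rows passed as trainSize
test.rows  <- (n.train + 1):nrow(all.toy)   # rows passed as testSize
```

The parentheses around n.train + 1 matter: in R the colon operator binds more tightly than addition, so n.train + 1:nrow(all.toy) would produce a different sequence.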

To create a model, you need to pass it a container. I initially tried ALL NINE (9) algorithms, but due to the memory limitations of my 32-bit R installation, I was forced to create models for ONLY TWO (2) algorithms:
  • SVM – Support Vector Machines; and
  • MAXENT – Maximum Entropy.

(4) Comparing the model to the Benchmark

svm.result    <- classify_model(container, svm.model)
svm.analytic  <- create_analytics(container, svm.result)
svm.doc       <- svm.analytic@document_summary
svm_spam.doc  <- svm.doc[svm.doc$MANUAL_CODE==4, ]
svm_ham.doc   <- svm.doc[svm.doc$MANUAL_CODE==2, ]
svm.true.pos  <- nrow(svm_spam.doc[svm_spam.doc$CONSENSUS_CODE==4,]) / nrow(svm_spam.doc)
svm.false.neg <- nrow(svm_spam.doc[svm_spam.doc$CONSENSUS_CODE==2,]) / nrow(svm_spam.doc)
svm.true.neg  <- nrow(svm_ham.doc[svm_ham.doc$CONSENSUS_CODE==2,]) / nrow(svm_ham.doc)
svm.false.pos <- nrow(svm_ham.doc[svm_ham.doc$CONSENSUS_CODE==4,]) / nrow(svm_ham.doc)
maxent.result   <- classify_model(container, maxent.model)
maxent.analytic <- create_analytics(container, maxent.result)
maxent.doc      <- maxent.analytic@document_summary
maxent_spam.doc <- maxent.doc[maxent.doc$MANUAL_CODE==4, ]
maxent_ham.doc  <- maxent.doc[maxent.doc$MANUAL_CODE==2, ]
maxent.true.pos <- nrow(maxent_spam.doc[maxent_spam.doc$CONSENSUS_CODE==4,]) / nrow(maxent_spam.doc)
maxent.false.neg<- nrow(maxent_spam.doc[maxent_spam.doc$CONSENSUS_CODE==2,]) / nrow(maxent_spam.doc)
maxent.true.neg <- nrow(maxent_ham.doc[maxent_ham.doc$CONSENSUS_CODE==2,]) / nrow(maxent_ham.doc)
maxent.false.pos<- nrow(maxent_ham.doc[maxent_ham.doc$CONSENSUS_CODE==4,]) / nrow(maxent_ham.doc)


We compare the results of our model with the Benchmark, which is evaluated based on the FALSE-positive (Type I error) and FALSE-negative (Type II error) rates.
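The two error rates can be computed from a pair of label vectors as in the sketch below. The labels are made up and the helper name error.rates is hypothetical; spam (4) is treated as the positive class and ham (2) as the negative class, matching the coding used in this post.

```r
# Made-up labels: 4 = spam, 2 = ham (same coding as in the post)
actual    <- c(4, 4, 4, 4, 2, 2, 2, 2, 2, 2)
predicted <- c(4, 4, 4, 2, 2, 2, 2, 2, 2, 4)

# Hypothetical helper: error rates with spam treated as the positive class
error.rates <- function(actual, predicted, pos = 4, neg = 2)
{
  c(false.pos = mean(predicted[actual == neg] == pos),  # ham flagged as spam
    false.neg = mean(predicted[actual == pos] == neg))  # spam let through as ham
}

rates <- error.rates(actual, predicted)
```

In this toy data, one of the six ham messages is flagged as spam and one of the four spam messages is missed, so the false-positive rate is 1/6 and the false-negative rate is 1/4.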

For the Benchmark, we had a false-positive rate of about 25%, with the classifier doing slightly better on easy_ham (22%) than the hard stuff (27%). On the other hand, the false-negative rate is much lower at only 15%.

For the SVM algorithm, we beat the Benchmark on TWO (2) counts:
  1. 3.2% false-positive rate (far lower than the Benchmark);
  2. 13.2% false-negative rate (slightly lower than the Benchmark).
For the MAXENT algorithm, we again beat the Benchmark on BOTH counts:
  1. 0.4% false-positive rate (the lowest of ALL three models);
  2. 14.7% false-negative rate (only marginally lower than the Benchmark).

The Benchmark

Email Type   TRUE            FALSE
spam         T-Pos: 85%      F-Neg: 15%
easy_ham     T-Neg: 78%      F-Pos: 22%
hard_ham     T-Neg: 73%      F-Pos: 27%

Our results using SVM algorithm

Email Type   TRUE            FALSE
spam         T-Pos: 86.8%    F-Neg: 13.2%
ham          T-Neg: 96.8%    F-Pos: 3.2%

Our results using MAXENT algorithm

Email Type   TRUE            FALSE
spam         T-Pos: 85.3%    F-Neg: 14.7%
ham          T-Neg: 99.6%    F-Pos: 0.4%

Conclusion


I have shown you how to build a spam classifier model using RTextTools, and explored the feasibility of building a model that classifies large text. Alternatively, you could build a spam classifier model using a dataset that has features. You are NOT limited to only these TWO (2) algorithms; however, the other algorithms require a 64-bit R installation, or maybe some big data packages. Although the model is quite simple (only ONE predictor), the results did beat the Benchmark. Hence, RTextTools could be a useful tool for building classifier models to automate your work or tasks.

The entire code can be viewed here.
