MICAD: A new algorithm/R package for anomaly detection


Overview

Anomaly detection algorithms are core to many fraud and security applications and business solutions. Identifying cases where specific values fall outside the norm is useful both for outlier detection (as a preliminary step before predictive modeling) and for surfacing cases of interest when labeled data is not available for supervised learning. For example, an insurance company might run anomaly detection against a claims database in the hope of identifying potentially fraudulent (anomalous) claims. If the medical bills for a personal injury claim are anomalously high, given the other characteristics of the claim, the case can be routed to a claims adjuster for further review. Finally, these newly human-labeled claims could be used to train a supervised model to predict fraud.

The Most Common Approach to Anomaly Detection

Probably the most common approach to identifying anomalies is a case-wise comparison (value by value) against peer group averages. For example, we might take a personal injury claim and compare its medical bill total to the average medical bill total of its peer claims. If our claim has one or more extreme variable values relative to the cluster distribution for the same variable, it can be considered an outlier. Here's some pseudocode for a naive modeling algorithm based on this approach:

cluster cases;
for each cluster
  for each variable
    calculate cluster average for variable;
    calculate cluster standard deviation for variable;
  endfor;
endfor;

for each case
  assign case to its cluster using the cluster model;
  for each variable
    if abs(case value - cluster average for variable) > 4.0 * cluster stddev
      anomaly score += variable weight;
    endif;
  endfor;
endfor;

High scores indicate anomalies.  Supplying variable weights to the above algorithm allows you to tune the overall score so that [subjectively] important variables contribute more heavily to the total.  Once a supervised model is built, these weights can be tuned using the model's variable importance measures.
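
For concreteness, here is a hedged R sketch of that naive approach. It uses k-means for the clustering step; the function and argument names (naive_anomaly_scores, num_clusters, threshold) are illustrative only and not part of any package.

# A minimal sketch of the naive cluster-based approach, assuming numeric input
naive_anomaly_scores <- function(df, weights, num_clusters = 5, threshold = 4.0) {
  km <- kmeans(scale(df), centers = num_clusters)

  # per-cluster mean and standard deviation for each variable
  cl_means <- aggregate(df, by = list(cluster = km$cluster), FUN = mean)
  cl_sds   <- aggregate(df, by = list(cluster = km$cluster), FUN = sd)

  scores <- numeric(nrow(df))
  for (i in seq_len(nrow(df))) {
    cl <- km$cluster[i]
    for (j in seq_len(ncol(df))) {
      mu    <- cl_means[cl_means$cluster == cl, j + 1]
      sigma <- cl_sds[cl_sds$cluster == cl, j + 1]
      # an extreme value relative to the case's own cluster adds the variable weight
      if (!is.na(sigma) && sigma > 0 && abs(df[i, j] - mu) > threshold * sigma) {
        scores[i] <- scores[i] + weights[j]
      }
    }
  }
  scores
}

# e.g. score the numeric iris columns with equal weights
summary(naive_anomaly_scores(iris[, 1:4], weights = c(10, 10, 10, 10)))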

Again, this approach is pretty naive, and it comes with several challenges:

  • How do we handle nominal/unordered (factor) variables?
  • What if the data distributions are strongly skewed?
  • What about variable interactions?  Might outlier values be perfectly predictable (and normal) if we included variable interactions?

MICAD

MICAD is an attempt to improve upon the above naive approach.  The simplest explanation of MICAD is:

Multiple imputation comparison anomaly detection (MICAD) is an algorithm that compares the imputed (or predicted) value of each variable to the actual value. If the predicted value != the actual value, the anomaly score is incremented by the variable weight.

Imputation of values is done using a random forest (or a similar predictive model).  The predictors are the remaining variables in the case.  For example, using the iris data set we can impute Sepal Length from Species, Petal Length, Petal Width and Sepal Width.
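
As a hedged illustration of the imputation step (not the package internals), a random forest can predict Sepal Length from the remaining iris variables:

library(randomForest)

# predict (impute) Sepal.Length from the other variables in iris
rf <- randomForest(Sepal.Length ~ Sepal.Width + Petal.Length +
                     Petal.Width + Species, data = iris)
imputed <- predict(rf, iris)
head(cbind(actual = iris$Sepal.Length, imputed = round(imputed, 2)))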

Here is the pseudocode for MICAD:

# data preparation
for each variable
  if (type of variable is numeric)
    convert variable to quartile;
  endif;
endfor;

# model building
for each variable
   build randomForest classification model to predict variable using remaining variables;
   store randomForest model;
endfor;

# model scoring
for each variable
   retrieve randomForest model;
   score randomForest model for all cases;
   if (predicted class != actual class)
     anomaly score += variable weight
   endif;
endfor;
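
The following is a minimal R sketch of that pseudocode, assuming quartile binning via quantile()/cut() and one randomForest classifier per variable; it is not the actual micad implementation, and the names to_quartile and micad_sketch are invented for illustration.

library(randomForest)

# bin a numeric variable into quartiles
to_quartile <- function(x) {
  cut(x, breaks = unique(quantile(x, probs = seq(0, 1, 0.25))),
      include.lowest = TRUE)
}

micad_sketch <- function(df, weights = rep(1, ncol(df))) {
  # data preparation: quartile-bin numeric variables, keep factors as-is
  binned <- as.data.frame(lapply(df, function(col) {
    if (is.numeric(col)) to_quartile(col) else as.factor(col)
  }))

  scores <- numeric(nrow(binned))
  for (v in names(binned)) {
    # model building: predict this variable from all remaining variables
    rf <- randomForest(reformulate(setdiff(names(binned), v), response = v),
                       data = binned)
    # model scoring: a mismatch between predicted and actual class adds the weight
    pred <- predict(rf, binned)
    scores <- scores + (pred != binned[[v]]) * weights[match(v, names(binned))]
  }
  scores
}

# e.g. equal weights on the four numeric iris columns, a larger weight on Species
head(micad_sketch(iris, weights = c(10, 10, 10, 10, 20)))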

An Example Using an Appended Iris Data Set

Downloading & Installing MICAD

install.packages("devtools")
library("devtools")
install_github("smutchler/micad/micad")
library("micad")

Loading the Appended Iris Data Set

data(iris_anomaly)

Building the MICAD S3 Model

micad.model <- micad(x=iris_anomaly[iris_anomaly$ANOMALY==0,], 
  vars=c("SEPAL_LENGTH","SEPAL_WIDTH",
         "PETAL_LENGTH","PETAL_WIDTH",
         "SPECIES"),
  weights=c(10,10,10,10,20))
 
print(micad.model)

We build the model while excluding the anomaly records.  We do this because iris is a small data set, and a few anomaly records would have a large impact on the models being built.  In production data sets, the effects of a few anomaly records will [likely] not be as pronounced.

The weights are driven by subject matter expertise initially.  Once a supervised model can be built, the weights could be adjusted using the model's variable importance measures.
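
As a hedged sketch of that adjustment (assuming the ANOMALY label column from the appended data set), the variable importances of a supervised random forest could be used to rescale the weights:

library(randomForest)

# supervised model on the labeled data; importances can inform the MICAD weights
rf <- randomForest(as.factor(ANOMALY) ~ ., data = iris_anomaly)
importance(rf)   # MeanDecreaseGini per variable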

Scoring the Iris Anomaly Data Set

scored.data <- predict(micad.model, iris_anomaly)
tail(scored.data)

The output is:

[Screenshot: tail of the scored data frame, showing the appended A$_SCORE column]

The last 4 cases are labeled ANOMALY = 1.  The appended A$_SCORE column shows high aggregate scores for these anomaly cases.

