Update to autoencoders and anomaly detection with machine learning in fraud analytics

[This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers].

This is a reply to Wojciech Indyk’s comment on yesterday’s post on autoencoders and anomaly detection with machine learning in fraud analytics:

“I think you can improve the detection of anomalies if you change the training set to the deep-autoencoder. As I understand the train_unsupervised contains both class 0 and class 1. If you put only class 0 as the input of the autoencoder, the network should learn the “normal” pattern. Then the MSE should be higher for 1 class in the test set (of course if anomaly==fraud).”

To test this, I follow the same workflow as in yesterday’s post, but this time I am moving all fraud instances from the first training set (for unsupervised learning) to the second training set (for supervised learning). Now, the autoencoder learns a pattern solely from non-fraud cases.

library(tidyverse)
library(h2o)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         20 minutes 11 seconds 
##     H2O cluster version:        3.10.4.4 
##     H2O cluster version age:    17 days  
##     H2O cluster name:           H2O_started_from_R_Shirin_erp741 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.54 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)
# convert data to H2OFrame
creditcard_hf <- as.h2o(creditcard)

splits <- h2o.splitFrame(creditcard_hf,
                         ratios = c(0.4, 0.4),
                         seed = 42)

train_unsupervised <- splits[[1]]
train_supervised <- splits[[2]]
test <- splits[[3]]

# move class 1 instances to second training set...
train_supervised <- rbind(as.data.frame(train_supervised),
                          as.data.frame(train_unsupervised[train_unsupervised$Class == "1", ])) %>%
  as.h2o()

# ... and remove from first training set
train_unsupervised <- train_unsupervised[train_unsupervised$Class == "0", ]

response <- "Class"
features <- setdiff(colnames(train_unsupervised), response)
model_nn <- h2o.deeplearning(x = features,
                             training_frame = train_unsupervised,
                             model_id = "model_nn",
                             autoencoder = TRUE,
                             reproducible = TRUE, # slow - turn off for real problems
                             ignore_const_cols = FALSE,
                             seed = 42,
                             hidden = c(10, 2, 10),
                             epochs = 100,
                             activation = "Tanh")

anomaly <- h2o.anomaly(model_nn, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))

mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))

ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = as.factor(Class))) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "instance number",
       color = "Class")

Compared to the results from yesterday’s post, this model seems to have learned a pattern that found two major cases. The mean reconstruction MSE was slightly higher for class 0 and definitely higher for class 1.
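
To make the idea of a reconstruction MSE more tangible, here is a minimal sketch in base R. It does not use the credit card data or H2O: the data are simulated, and PCA with a single component stands in for the autoencoder's bottleneck (a linear simplification, not the network trained above). The names `sim_normal`, `sim_fraud` and `reconstruction_mse()` are made up for this illustration. The model is fit on "normal" cases only, so points that break the learned pattern should be reconstructed poorly.

```r
# Toy illustration only: simulated data, PCA as a linear stand-in for the autoencoder
set.seed(42)

# "Normal" points lie close to the line V2 = V1; the simulated fraud points do not
n_normal <- 500
x <- rnorm(n_normal)
sim_normal <- cbind(V1 = x, V2 = x + rnorm(n_normal, sd = 0.1))
sim_fraud  <- cbind(V1 = rnorm(20), V2 = rnorm(20, mean = 3))

# Fit on normal cases only
pca <- prcomp(sim_normal, center = TRUE, scale. = FALSE)

# Project onto the first k components, map back, and average the squared error per row
reconstruction_mse <- function(pca, newdata, k = 1) {
  centered <- sweep(newdata, 2, pca$center)
  rot <- pca$rotation[, seq_len(k), drop = FALSE]
  reconstructed <- centered %*% rot %*% t(rot)
  rowMeans((centered - reconstructed)^2)
}

mse_normal <- reconstruction_mse(pca, sim_normal)
mse_fraud  <- reconstruction_mse(pca, sim_fraud)

# Mean reconstruction MSE per group
c(normal = mean(mse_normal), fraud = mean(mse_fraud))
```

The second value should come out far larger than the first, which is the effect the comment quoted at the top predicts for a model trained on class 0 only.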

anomaly <- anomaly %>%
  mutate(outlier = ifelse(Reconstruction.MSE > 0.04, "outlier", "no_outlier"))

anomaly %>%
  group_by(Class, outlier) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## Source: local data frame [4 x 4]
## Groups: Class [2]
## 
##   Class    outlier     n         freq
##   <chr>      <chr> <int>        <dbl>
## 1     0 no_outlier 56608 0.9995762113
## 2     0    outlier    24 0.0004237887
## 3     1 no_outlier    60 0.6521739130
## 4     1    outlier    32 0.3478260870

Using a higher outlier threshold (0.04), chosen based on the plot above, did not improve anomaly detection compared to yesterday’s post.

With a lower threshold of 0.02 (not shown here), the autoencoder flagged many more of the fraud cases in the test set as outliers (65 vs. 27, compared to 32 vs. 60 in yesterday’s post). However, it also flagged many more non-fraud cases as outliers (2803 vs. 53829, compared to only 30 vs. 56602).
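
The trade-off described here can be sketched with a small threshold sweep in base R. The reconstruction errors below are simulated from exponential distributions (hypothetical values, not the output of `h2o.anomaly()`), and `flag_outliers()` is a helper made up for this example. It only shows the general pattern: lowering the threshold flags more fraud cases, but also more non-fraud cases.

```r
# Simulated reconstruction errors (hypothetical values, not the article's data)
set.seed(42)
scores <- data.frame(
  Class = rep(c("0", "1"), times = c(10000, 50)),
  mse   = c(rexp(10000, rate = 200), rexp(50, rate = 25))
)

# Count flagged and missed cases per class for a given threshold
flag_outliers <- function(scores, threshold) {
  flagged <- scores$mse > threshold
  data.frame(
    threshold        = threshold,
    fraud_flagged    = sum(flagged & scores$Class == "1"),
    fraud_missed     = sum(!flagged & scores$Class == "1"),
    nonfraud_flagged = sum(flagged & scores$Class == "0")
  )
}

# Sweep from a strict to a lenient threshold
res <- do.call(rbind, lapply(c(0.08, 0.04, 0.01), flag_outliers, scores = scores))
res
```

Reading the rows from top to bottom, both `fraud_flagged` and `nonfraud_flagged` can only grow as the threshold drops, so picking a threshold always means deciding how many false alarms are acceptable per detected fraud.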

Now, I am again using the autoencoder model as pre-training input for supervised learning.

model_nn_2 <- h2o.deeplearning(y = response,
                               x = features,
                               training_frame = train_supervised,
                               pretrained_autoencoder = "model_nn",
                               reproducible = TRUE, # slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10),
                               epochs = 100,
                               activation = "Tanh")

pred <- as.data.frame(h2o.predict(object = model_nn_2, newdata = test)) %>%
  mutate(actual = as.vector(test[, 31]))

pred %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n       freq
##    <chr>  <fctr> <int>      <dbl>
## 1      0       0 56347 0.99496751
## 2      0       1   285 0.00503249
## 3      1       0     9 0.09782609
## 4      1       1    83 0.90217391

This model is now much better at identifying fraud cases than the one in yesterday’s post (90% compared to 83%), but it is also slightly less accurate at predicting non-fraud cases (99.5% compared to 99.8%). Keep in mind that the two models can’t be compared directly, as they were trained on different training sets.
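
For reference, the percentages above can be recomputed from the counts in the output table with a few lines of base R. The sketch below copies the four counts by hand. The precision it adds is not reported in the post and is only derived from those same counts.

```r
# Counts copied from the output table above (rows: actual class, columns: predicted class)
confusion <- matrix(c(56347, 285,
                          9,  83),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tp <- confusion["1", "1"]  # fraud predicted as fraud
fn <- confusion["1", "0"]  # fraud predicted as non-fraud
tn <- confusion["0", "0"]  # non-fraud predicted as non-fraud
fp <- confusion["0", "1"]  # non-fraud predicted as fraud

metrics <- c(sensitivity = tp / (tp + fn),  # share of fraud cases caught
             specificity = tn / (tn + fp),  # share of non-fraud cases cleared
             precision   = tp / (tp + fp))  # share of fraud predictions that are fraud
round(metrics, 4)
```

Sensitivity and specificity correspond to the 90% and 99.5% quoted above. The precision shows what the per-class percentages hide: with 285 false alarms against 83 detected frauds, fewer than one in four predicted fraud cases in this test set is actually fraud.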


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] h2o_3.10.4.4    dplyr_0.5.0     purrr_0.2.2     readr_1.1.0    
## [5] tidyr_0.6.1     tibble_1.3.0    ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10       RColorBrewer_1.1-2 cellranger_1.1.0  
##  [4] compiler_3.4.0     plyr_1.8.4         bitops_1.0-6      
##  [7] forcats_0.2.0      tools_3.4.0        digest_0.6.12     
## [10] lubridate_1.6.0    jsonlite_1.4       evaluate_0.10     
## [13] nlme_3.1-131       gtable_0.2.0       lattice_0.20-35   
## [16] psych_1.7.3.21     DBI_0.6-1          yaml_2.1.14       
## [19] parallel_3.4.0     haven_1.0.0        xml2_1.1.1        
## [22] stringr_1.2.0      httr_1.2.1         knitr_1.15.1      
## [25] hms_0.3            rprojroot_1.2      grid_3.4.0        
## [28] R6_2.2.0           readxl_1.0.0       foreign_0.8-68    
## [31] rmarkdown_1.5      modelr_0.1.0       reshape2_1.4.2    
## [34] magrittr_1.5       backports_1.0.5    scales_0.4.1      
## [37] htmltools_0.3.6    rvest_0.3.2        assertthat_0.2.0  
## [40] mnormt_1.5-5       colorspace_1.3-2   labeling_0.3      
## [43] stringi_1.1.5      RCurl_1.95-4.8     lazyeval_0.2.0    
## [46] munsell_0.4.3      broom_0.4.2
