Predicting Fraud with Autoencoders and Keras
Overview
In this post we will train an autoencoder to detect credit card fraud. We will also demonstrate how to train Keras models in the cloud using CloudML.
The basis of our model will be the Kaggle Credit Card Fraud Detection dataset, which was collected during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.
The dataset contains credit card transactions made by European cardholders over a two-day period in September 2013. There are 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.
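As a quick sanity check on these figures, the base R sketch below recomputes the fraud rate from the two counts quoted above and shows why raw accuracy is a misleading metric for this dataset. The variable names are ours, chosen only for illustration.

```r
# Counts quoted in the dataset description
n_transactions <- 284807
n_frauds <- 492

# Share of the positive class, in percent
fraud_rate_pct <- 100 * n_frauds / n_transactions
fraud_rate_pct

# A trivial "classifier" that always predicts non-fraud is still right
# on every legitimate transaction
baseline_accuracy_pct <- 100 * (n_transactions - n_frauds) / n_transactions
baseline_accuracy_pct
```

The fraud rate comes out at about 0.17%, so a model that never flags anything already scores above 99.8% accuracy while catching no fraud at all.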
Reading the data
After downloading the data from Kaggle, you can read it into R with read_csv():
library(readr)
df <- read_csv("data-raw/creditcard.csv", col_types = list(Time = col_number()))
The input variables consist of only numerical values which are the result of a PCA transformation. In order to preserve confidentiality, no more information about the original features was provided. The features V1, …, V28 were obtained with PCA. There are however 2 features (Time and Amount) that were not transformed. Time is the seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount and could be used for cost-sensitive learning. The Class variable takes value 1 in case of fraud and 0 otherwise.
Autoencoders
Since only 0.172% of the observations are frauds, we have a highly unbalanced classification problem. With this kind of problem, traditional classification approaches usually don’t work very well because we have only a very small sample of the rarer class.
An autoencoder is a neural network that is used to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. For this problem we will train an autoencoder to encode non-fraud observations from our training set. Since frauds are supposed to have a different distribution than normal transactions, we expect that our autoencoder will have higher reconstruction errors on frauds than on normal transactions. This means that we can use the reconstruction error as a quantity that indicates whether a transaction is fraudulent or not.
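To make the reconstruction-error idea concrete before bringing in Keras, here is a minimal sketch that uses PCA as a linear stand-in for an autoencoder. This is a swapped-in technique, not the model trained in this post, and the data is synthetic; names such as reconstruction_mse are hypothetical. The "encoder" is fitted on normal observations only, and anomalies that do not share their structure reconstruct poorly.

```r
set.seed(42)

# Synthetic "normal" data: two latent factors drive five observed features
n <- 500
latent <- matrix(rnorm(n * 2), ncol = 2)
loadings <- matrix(rnorm(2 * 5), nrow = 2)
normal <- latent %*% loadings + matrix(rnorm(n * 5, sd = 0.1), ncol = 5)

# Synthetic "anomalies": features without that shared structure
anomalies <- matrix(rnorm(20 * 5, sd = 2), ncol = 5)

# Fit the linear "encoder" on normal data only, keeping a 2-dimensional code
pca <- prcomp(normal, center = TRUE, scale. = FALSE)
k <- 2
W <- pca$rotation[, 1:k, drop = FALSE]

# Encode, decode, and return the per-row mean squared reconstruction error
reconstruction_mse <- function(x) {
  centered <- sweep(x, 2, pca$center)
  reconstructed <- centered %*% W %*% t(W)
  rowMeans((centered - reconstructed)^2)
}

mse_normal <- reconstruction_mse(normal)
mse_anomalies <- reconstruction_mse(anomalies)

# Anomalies should reconstruct worse on average
c(normal = mean(mse_normal), anomalies = mean(mse_anomalies))
```

A non-linear autoencoder follows the same recipe, with the projection replaced by learned encoder and decoder layers.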
If you want to learn more about autoencoders, a good starting point is this video from Larochelle on YouTube and Chapter 14 from the Deep Learning book by Goodfellow et al.
Visualization
For an autoencoder to work well we have a strong initial assumption: that the distribution of variables for normal transactions is different from the distribution for fraudulent ones. Let’s make some plots to verify this. Variables were transformed to a [0,1] interval for plotting.
library(tidyr)
library(dplyr)
library(ggplot2)
library(ggridges)

df %>%
  gather(variable, value, -Class) %>%
  ggplot(aes(y = as.factor(variable),
             fill = as.factor(Class),
             x = percent_rank(value))) +
  geom_density_ridges()

We can see that the distributions of variables for fraudulent transactions are very different from those for normal ones, except for the Time variable, which seems to have exactly the same distribution.
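The [0,1] transformation in the plotting code is dplyr's percent_rank(), which replaces each value by its relative rank. A minimal base R sketch of the same calculation, assuming no missing values (percent_rank_base is a hypothetical helper name, not part of the post's code), looks like this:

```r
# Percent rank: (minimum rank - 1) / (n - 1), so values land in [0, 1]
percent_rank_base <- function(x) {
  (rank(x, ties.method = "min") - 1) / (length(x) - 1)
}

amounts <- c(5, 20, 20, 100, 2500)
percent_rank_base(amounts)
# 0.00 0.25 0.25 0.75 1.00
```

Because only the ordering matters, an extreme value such as 2500 does not squash the rest of the distribution, which keeps the ridge plots readable.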
Preprocessing
Before the modeling steps we need to do some preprocessing. We will split the dataset into train and test sets and then min-max normalize the data (this is done because neural networks work much better with small input values). We will also remove the Time variable, as it has the same distribution for normal and fraudulent transactions.
Based on the Time variable we will use the first 200,000 observations for training and the rest for testing. This is good practice because when using the model we want to predict future frauds based on transactions that happened before.
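The time-based split can be illustrated on a toy data frame in base R. This is only a sketch: the data and the cutoff of 4 rows are made up, and rank(..., ties.method = "first") is used here as the base R counterpart of dplyr's row_number(Time).

```r
# Toy data: six transactions, deliberately stored out of time order
toy <- data.frame(
  Time = c(30, 10, 50, 20, 60, 40),
  Amount = c(5, 12, 7, 80, 3, 41)
)

n_train <- 4  # stands in for the 200,000 used in the post

# Position of each row when the rows are ordered by Time
time_rank <- rank(toy$Time, ties.method = "first")

toy_train <- toy[time_rank <= n_train, ]
toy_test <- toy[time_rank > n_train, ]

# Every training transaction happens before every test transaction
max(toy_train$Time) < min(toy_test$Time)
```

Splitting by rank rather than by row position means the result does not depend on how the rows happen to be stored in the file.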
df_train <- df %>%
  filter(row_number(Time) <= 200000) %>%
  select(-Time)

df_test <- df %>%
  filter(row_number(Time) > 200000) %>%
  select(-Time)
Now let’s work on normalization of inputs. We created 2 functions to help us. The first one gets descriptive statistics about the dataset that are used for scaling. Then we have a function to perform the min-max scaling. It’s important to note that we applied the same normalization constants for training and test sets.
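To see what sharing the normalization constants means in practice, here is a minimal base R sketch on made-up vectors. The helper scale_minmax and the toy values are hypothetical and are not the functions used in this post.

```r
train <- c(10, 20, 30, 40, 50)
test <- c(5, 25, 60)

# The constants come from the training data only
train_min <- min(train)
train_max <- max(train)

scale_minmax <- function(x, lo, hi) (x - lo) / (hi - lo)

train_scaled <- scale_minmax(train, train_min, train_max)
test_scaled <- scale_minmax(test, train_min, train_max)

train_scaled
test_scaled
```

Test values below the training minimum or above the training maximum land outside [0, 1]. That is expected: recomputing the constants on the test set would leak information about future data into the preprocessing.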
library(purrr)

#' Gets descriptive statistics for every variable in the dataset.
get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
}

#' Given a dataset and normalization constants it will create a min-max normalized
#' version of the dataset.
normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min) / (.y$max - .y$min))
}

Now let’s create normalized versions of our datasets. We also convert the data frames to matrices, since this is the format expected by Keras.

desc <- df_train %>%
  select(-Class) %>%
  get_desc()

x_train <- df_train %>%
  select(-Class) %>%
  normalization_minmax(desc) %>%
  as.matrix()

x_test <- df_test %>%
  select(-Class) %>%
  normalization_minmax(desc) %>%
  as.matrix()

y_train <- df_train$Class
y_test <- df_test$Class

Model definition
We will now define our model in Keras, a symmetric autoencoder with 4 dense layers.

library(keras)
model <- keras_model_sequential()
model %>%
  layer_dense(units = 15, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

summary(model)

___________________________________________________________________________________
Layer (type)                         Output Shape                      Param #
===================================================================================
dense_1 (Dense)                      (None, 15)                        450
___________________________________________________________________________________
dense_2 (Dense)                      (None, 10)                        160
___________________________________________________________________________________
dense_3 (Dense)                      (None, 15)                        165
___________________________________________________________________________________
dense_4 (Dense)                      (None, 29)                        464
===================================================================================
Total params: 1,239
Trainable params: 1,239
Non-trainable params: 0
___________________________________________________________________________________

We will then compile our model, using the mean squared error loss and the Adam optimizer for training.

model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

Training the model
We can now train our model using the fit() function. Training the model is reasonably fast (~ 14s per epoch on my laptop). We will only feed to our model the observations of normal (non-fraudulent) transactions.

We will use callback_model_checkpoint() in order to save our model after each epoch. By passing the argument save_best_only = TRUE we will keep on disk only the epoch with the smallest loss value on the test set. We will also use callback_early_stopping() to stop training if the validation loss stops decreasing for 5 epochs.

checkpoint <- callback_model_checkpoint(
  filepath = "model.hdf5",
  save_best_only = TRUE,
  period = 1,
  verbose = 1
)

early_stopping <- callback_early_stopping(patience = 5)

model %>% fit(
  x = x_train[y_train == 0,],
  y = x_train[y_train == 0,],
  epochs = 100,
  batch_size = 32,
  validation_data = list(x_test[y_test == 0,], x_test[y_test == 0,]),
  callbacks = list(checkpoint, early_stopping)
)

Train on 199615 samples, validate on 84700 samples
Epoch 1/100
199615/199615 [==============================] - 17s 83us/step - loss: 0.0036 - val_loss: 6.8522e-04
val_loss improved from inf to 0.00069, saving model to model.hdf5
Epoch 2/100
199615/199615 [==============================] - 17s 86us/step - loss: 4.7817e-04 - val_loss: 4.7266e-04
val_loss improved from 0.00069 to 0.00047, saving model to model.hdf5
Epoch 3/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.7753e-04 - val_loss: 4.2430e-04
val_loss improved from 0.00047 to 0.00042, saving model to model.hdf5
Epoch 4/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.3937e-04 - val_loss: 4.0299e-04
val_loss improved from 0.00042 to 0.00040, saving model to model.hdf5
Epoch 5/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.2259e-04 - val_loss: 4.0852e-04
val_loss did not improve
Epoch 6/100
199615/199615 [==============================] - 18s 91us/step - loss: 3.1668e-04 - val_loss: 4.0746e-04
val_loss did not improve
...

After training we can get the final loss for the test set by using the evaluate() function.

loss <- evaluate(model, x = x_test[y_test == 0,], y = x_test[y_test == 0,])
loss

        loss
0.0003534254

Tuning the model with CloudML
We may be able to get better results by tuning our model hyperparameters. We can tune, for example, the normalization function, the learning rate, the activation functions and the size of hidden layers. CloudML uses Bayesian optimization to tune hyperparameters of models as described in this blog post (https://cloud.google.com/blog/big-data/2017/08/hyperparameter-tuning-in-cloud-machine-learning-engine-using-bayesian-optimization).

We can use the cloudml package (https://tensorflow.rstudio.com/tools/cloudml/) to tune our model, but first we need to prepare our project by creating a training flag (https://tensorflow.rstudio.com/tools/training_flags.html) for each hyperparameter and a tuning.yml file that will tell CloudML what parameters we want to tune and how.

The full script used for training on CloudML can be found at https://github.com/dfalbel/fraud-autoencoder-example.

The most important modifications to the code were adding the training flags:

FLAGS <- flags(
  flag_string("normalization", "minmax", "One of minmax, zscore"),
  flag_string("activation", "relu", "One of relu, selu, tanh, sigmoid"),
  flag_numeric("learning_rate", 0.001, "Optimizer Learning Rate"),
  flag_integer("hidden_size", 15, "The hidden layer size")
)

We then used the FLAGS variable inside the script to drive the hyperparameters of the model, for example:

model %>% compile(
  optimizer = optimizer_adam(lr = FLAGS$learning_rate),
  loss = 'mean_squared_error'
)

We also created a tuning.yml file describing how hyperparameters should be varied during training, as well as what metric we wanted to optimize (in this case it was the validation loss: val_loss).

tuning.yml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 5
    params:
      - parameterName: normalization
        type: CATEGORICAL
        categoricalValues: [zscore, minmax]
      - parameterName: activation
        type: CATEGORICAL
        categoricalValues: [relu, selu, tanh, sigmoid]
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.000001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden_size
        type: INTEGER
        minValue: 5
        maxValue: 50
        scaleType: UNIT_LINEAR_SCALE

We describe the type of machine we want to use (in this case a standard_gpu instance), the metric we want to minimize while tuning, and the maximum number of trials (i.e. the number of combinations of hyperparameters we want to test). We then specify how we want to vary each hyperparameter during tuning.

You can learn more about the tuning.yml file at the TensorFlow for R documentation and at Google’s official documentation on CloudML.
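The learning rate is the only parameter above that uses a log scale. The sketch below illustrates, with plain random sampling in base R, what that choice changes over the same range. It is not CloudML's tuning algorithm (which uses Bayesian optimization); it only shows how a log scale spreads candidates across orders of magnitude. All variable names are ours.

```r
set.seed(1)

min_value <- 1e-6
max_value <- 0.1
n_draws <- 10000

# Linear scale: uniform on the raw interval
linear_draws <- runif(n_draws, min_value, max_value)

# Log scale: uniform on log10 of the interval, then mapped back
log_draws <- 10^runif(n_draws, log10(min_value), log10(max_value))

# Share of draws that fall below 0.001
share_below <- c(
  linear = mean(linear_draws < 1e-3),
  log = mean(log_draws < 1e-3)
)
share_below
```

On a linear scale only about 1% of the candidates fall below 0.001, while on a log scale roughly 60% do, so small learning rates actually get explored.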
Now we are ready to send the job to Google CloudML. We can do this by running:
library(cloudml)
cloudml_train("train.R", config = "tuning.yml")

The cloudml package takes care of uploading the dataset and installing any R package dependencies required to run the script on CloudML. If you are using RStudio v1.1 or higher, it will also allow you to monitor your job in a background terminal. You can also monitor your job using the Google Cloud Console.
After the job is finished we can collect the job results with:
job_collect()

This will copy the files from the job with the best val_loss performance on CloudML to your local system and open a report summarizing the training run.

Since we used a callback to save model checkpoints during training, the model file was also copied from Google CloudML. Files created during training are copied to the “runs” subdirectory of the working directory from which cloudml_train() is called. You can determine this directory for the most recent run with:

latest_run()$run_dir

[...]

Suppose that verifying a transaction costs $1, but if we don’t verify a transaction and it’s a fraud we will lose this transaction amount. Let’s find for each threshold value how much money we would lose.
cost_per_verification <- 1

lost_money <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(cost_per_verification * predicted_class + (predicted_class == 0) * y_test * df_test$Amount)
})

[...] ~$13,000. Using our model we can reduce this to ~$2,500.
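The same cost calculation can be tried end to end on synthetic data. Everything below is made up for illustration: the fraud labels, amounts, reconstruction-error scores and the threshold grid are hypothetical stand-ins and do not reproduce the figures from this post.

```r
set.seed(123)

# Hypothetical test set: 1,000 transactions, about 2% fraud
n <- 1000
is_fraud <- rbinom(n, 1, 0.02)
amount <- round(rexp(n, rate = 1 / 80), 2)

# Hypothetical reconstruction errors: frauds tend to score higher
score <- ifelse(is_fraud == 1,
                rexp(n, rate = 1 / 0.02),
                rexp(n, rate = 1 / 0.002))

cost_per_verification <- 1
thresholds <- seq(0, 0.05, by = 0.001)

# Cost at a threshold: $1 per flagged transaction, plus the amount of
# every fraud that was not flagged
total_cost <- sapply(thresholds, function(k) {
  flagged <- as.numeric(score > k)
  sum(cost_per_verification * flagged + (flagged == 0) * is_fraud * amount)
})

best_threshold <- thresholds[which.min(total_cost)]
best_threshold
min(total_cost)
```

At a threshold of 0 every transaction is flagged, so the cost equals the number of transactions. Raising the threshold trades verification cost against missed fraud amounts, and which.min() picks the cheapest point on that curve.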