Using recurrent neural networks to segment customers

[This article was first published on R – Gradient Metrics, and kindly contributed to R-bloggers.]

Understanding consumer segments is key to any successful business. Analytically, a segmentation involves clustering a dataset to find groups of similar customers. What “similar” means is defined by the data that goes into the clustering: it could be demographic, attitudinal, or other characteristics. That data is often limited by the clustering algorithms themselves: most require some kind of tabular data structure, and common techniques like k-means require strictly numeric input. Breaking out of these restrictions has been one of our top priorities since starting the company.

So what do you do when you want to find segments of customers that are “similar” because they behave similarly, that is, because their experience with you, the brand, has been similar? How would you define that? Increasingly, companies are collecting sequence data, in which each entry is an interaction with a customer, be it a purchase, reading an email, or visiting the website. Given the popularity of deep learning techniques for sequence-related learning tasks, we thought applying neural networks to customer segmentation was a natural approach.

This post builds on our previous customer journey segmentation post and demonstrates a prototype of a deep learning approach to behavior sequence segmentation. We wanted to investigate whether we could leverage the internal state of a recurrent neural network (RNN) on complex sequences of data to identify distinctive customer segments. It turns out that we can, and it works well.

Data description

Our client recorded a row of behavioral data for each customer interaction, such as receiving an email, opening an email, or using the app, so a single user's “sequence” looks like this. Note that each sequence can have a variable number of rows.
User ID | Cancel | Sent email | Open email | Click email | App used | Site visited | Days since last interaction
1001 | 0 | 0 | 0 | 0 | 0 | 1 | 0
1001 | 0 | 0 | 0 | 0 | 0 | 1 | 2
1001 | 0 | 0 | 0 | 0 | 0 | 1 | 4
1001 | 0 | 1 | 0 | 0 | 0 | 0 | 5
1001 | 0 | 0 | 1 | 0 | 0 | 0 | 7
1001 | 0 | 0 | 0 | 0 | 0 | 1 | 1
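
To make the "variable number of rows" point concrete, here is a minimal Python sketch of how a flat interaction log like the one above could be grouped into one sequence per user. The column names and the single row for user 1002 are our own illustrative assumptions, not part of the client data.

```python
from collections import defaultdict

# Column order follows the example table; the snake_case names are assumptions.
COLUMNS = ["user_id", "cancel", "sent_email", "open_email", "click_email",
           "app_used", "site_visited", "days_since_last"]

# Rows for user 1001 are copied from the example table. The row for user 1002
# is hypothetical and only shows that sequences can differ in length.
rows = [
    (1001, 0, 0, 0, 0, 0, 1, 0),
    (1001, 0, 0, 0, 0, 0, 1, 2),
    (1001, 0, 0, 0, 0, 0, 1, 4),
    (1001, 0, 1, 0, 0, 0, 0, 5),
    (1001, 0, 0, 1, 0, 0, 0, 7),
    (1001, 0, 0, 0, 0, 0, 1, 1),
    (1002, 0, 0, 0, 0, 1, 0, 0),  # hypothetical
]

def build_sequences(rows):
    """Group rows by user ID, keeping row order and dropping the ID column."""
    sequences = defaultdict(list)
    for user_id, *features in rows:
        sequences[user_id].append(features)
    return dict(sequences)

sequences = build_sequences(rows)
for user_id, seq in sequences.items():
    print(user_id, "->", len(seq), "time steps x", len(seq[0]), "features")
```

Each user ends up with a list of time steps, and each time step holds the seven behavior features. This is the shape a recurrent layer expects.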

Developing the Neural Network

We developed a very simple neural network architecture, described below. For this sample of customers, we knew whether or not they had churned by the time the data was collected, so our “X’s” were the sequences of customer behavior, and our “Y’s” were 0/1s depending on whether the customer had churned. We therefore used a sigmoid output layer, which predicts a churn probability between 0 and 1, and a recurrent input layer, which is able to handle variable-length sequences. We included a dense layer to make the network more powerful, and to generate encodings.
Layer | Input dimension | Output dimension
Recurrent | Variable | 10
Dense | 10 | 10 (used for encoding)
Sigmoid | 10 | 1
We used Keras (from R) to specify and train the network. After training the network on the churn data, we used the trained weights of the Recurrent and Dense layers to produce an encoding for each user. Feeding in a user’s sequence yields a ten-dimensional numeric encoding; the first five dimensions look like this:
User ID | Encoding_1 | Encoding_2 | Encoding_3 | Encoding_4 | Encoding_5
1001 | 0 | 0 | 0.4 | 12.8 | 0.5
1002 | 0.1 | 1.3 | 0.9 | 14.7 | 141.0
1003 | 0.1 | 1.3 | 0.9 | 14.7 | 141.0
1004 | 0.1 | 1.3 | 0.9 | 14.7 | 141.0
1005 | 0.0 | 0.0 | 0.0 | 0.5 | 0
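
The sketch below illustrates, in plain Python, how a recurrent layer followed by a dense layer turns a sequence of any length into a fixed ten-dimensional encoding. It is not the Keras model described above. The simple tanh recurrence, the ReLU dense activation, the omission of bias terms, and the random weights are all assumptions made for illustration. In the real workflow the weights come from training on the churn labels.

```python
import math
import random

N_FEATURES = 7   # behavior columns per time step (from the data table)
N_HIDDEN = 10    # recurrent layer output size (from the architecture table)
N_ENCODING = 10  # dense layer output size, used as the encoding

def rand_matrix(n_rows, n_cols, rng):
    """Small random weights; stand-ins for weights learned during training."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

rng = random.Random(0)
W_in = rand_matrix(N_HIDDEN, N_FEATURES, rng)      # input -> hidden
W_rec = rand_matrix(N_HIDDEN, N_HIDDEN, rng)       # hidden -> hidden
W_dense = rand_matrix(N_ENCODING, N_HIDDEN, rng)   # hidden -> encoding
w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_ENCODING)]  # encoding -> churn

def encode(sequence):
    """Run the recurrence over every time step, then apply the dense layer."""
    hidden = [0.0] * N_HIDDEN
    for step in sequence:
        pre = [a + b for a, b in zip(matvec(W_in, step), matvec(W_rec, hidden))]
        hidden = [math.tanh(v) for v in pre]
    # Dense layer with ReLU (assumed) applied to the final hidden state.
    return [max(0.0, v) for v in matvec(W_dense, hidden)]

def churn_probability(encoding):
    """Sigmoid output layer: a probability strictly between 0 and 1."""
    z = sum(w * e for w, e in zip(w_out, encoding))
    return 1.0 / (1.0 + math.exp(-z))

short_seq = [[0, 0, 0, 0, 0, 1, 0]]
long_seq = [[0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 0, 1, 4],
            [0, 1, 0, 0, 0, 0, 5], [0, 0, 1, 0, 0, 0, 7], [0, 0, 0, 0, 0, 1, 1]]

for seq in (short_seq, long_seq):
    enc = encode(seq)
    print(len(seq), "steps ->", len(enc), "dims, p(churn) =", round(churn_probability(enc), 3))
```

Both the one-step and the six-step sequence produce a vector of length ten. That fixed length is what makes the encodings usable as input to a standard clustering algorithm.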

Clustering the RNN encodings

The encodings summarize what the network has learned about each user's sequence. Although they do not have any inherent meaning, we can use them in a clustering algorithm to identify distinct segments, which is exactly what we did. We decided to run DBSCAN on the encoded sequence data. DBSCAN had the advantage (in this case) of being able to find non-linearly shaped clusters and of not requiring the number of clusters to be specified in advance. K-means performed similarly.
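
For readers unfamiliar with DBSCAN, the following is a minimal from-scratch sketch of the algorithm in Python, not the implementation we ran. The toy 2-D points and the `eps` and `min_pts` values are made up for illustration. Real encodings would be ten-dimensional, and the parameters would need tuning.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels

# Toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in. It falls out of the density parameters, and isolated points are flagged as noise rather than forced into a segment.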


The DBSCAN algorithm identified five distinct clusters, with some significant and valuable differences between them.
Segment | Percentage of customers | Avg. E-mails Clicked | Avg. E-mails Opened | Avg. App Actions | Avg. Site Visits | Avg. Churn Date | Churn percentage
1 | 0.3% | 2.11 | 22.8 | 16.4 | 18.2 | 325 | 30.1%
2 | 34.5% | 1.13 | 11.5 | 3.6 | 8.1 | 308 | 16.7%
3 | 59.5% | 0.3 | 3.2 | 0.1 | 2.9 | 88 | 98%
4 | 5.5% | 4.0 | 27.0 | 89.5 | 16.5 | 337 | 0.1%
5 | 0.2% | 0.5 | 2.0 | 0.0 | 1.5 | 93 | 93%
Although the clusters are fairly imbalanced (likely an artifact of clustering encodings that were learned from a supervised churn model), the number of days since the first interaction is clearly a strong driver in defining segments. The key takeaway here is that the clusters with the highest churn rates have an interaction history of three months or less. This business absolutely must focus on getting customers through the first three months to decrease the likelihood of churning early.
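
A segment profile like the one above is a group-by over the cluster labels. The sketch below shows the idea in Python. The six customer records and the field names are hypothetical and are not drawn from the client data.

```python
from collections import defaultdict

# Hypothetical per-customer records: (segment label, churned flag, site visits).
customers = [
    (1, 0, 20), (1, 1, 16),
    (2, 1, 3), (2, 1, 2), (2, 1, 4), (2, 0, 3),
]

def profile_segments(customers):
    """Summarize each segment: share of customers, average visits, churn rate."""
    groups = defaultdict(list)
    for segment, churned, visits in customers:
        groups[segment].append((churned, visits))
    total = len(customers)
    profile = {}
    for segment, members in sorted(groups.items()):
        n = len(members)
        profile[segment] = {
            "pct_customers": 100.0 * n / total,
            "avg_site_visits": sum(v for _, v in members) / n,
            "churn_pct": 100.0 * sum(c for c, _ in members) / n,
        }
    return profile

profile = profile_segments(customers)
for segment, stats in profile.items():
    print(segment, {name: round(value, 1) for name, value in stats.items()})
```

The same pattern extends to the other columns (e-mails clicked, e-mails opened, app actions) by adding fields to each record.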


  • Sequence data is increasingly being captured by brands and methods for exploring it must be developed
  • Recurrent neural networks are an effective way of generating encodings for behavioral sequence data
  • Clustering the encodings (results of intermediate layers) of a neural network can be an effective way of peering inside the black box
We welcome any thoughts or comments you might have, and feel free to share this blog post with your friends and colleagues!
