# Using recurrent neural networks to segment customers

This article was first published on **R – Gradient Metrics** and kindly contributed to R-bloggers.


Understanding consumer segments is key to any successful business. Analytically, segmentations involve clustering a dataset to find groups of similar customers. What “similar” means is defined by the data that goes into the clustering — it could be demographic, attitudinal, or other characteristics. And the data that goes into the clustering is often limited by the clustering algorithms themselves — most require some kind of tabular data structure, and common techniques like k-Means require strictly numeric input. Breaking out of these restrictions has been one of our top priorities since starting the company.

So what do you do when you want to find segments of customers that are “similar” because they behave similarly, that is, because their experience with you, the brand, has been similar? How would you define that? Increasingly, companies are collecting sequence data, with each entry being an interaction with a customer, be it a purchase, reading an email, or visiting the website. Given the popularity of deep learning techniques for sequence-related learning tasks, we thought applying neural networks to customer segmentation was a natural approach.

This post builds on our previous customer journey segmentation post and demonstrates a prototype of a deep learning approach to behavior sequence segmentation. We wanted to investigate whether we could leverage the internal state of a recurrent neural network (RNN), trained on complex sequences of data, to identify distinctive customer segments.

Turns out that we can. And it works well.

**Data description**

Our client recorded a behavioral dataset with one row per customer interaction, such as receiving an email, opening an email, or using the app. A single user's “sequence” looks like this. Note that each sequence can have a variable number of rows.

| User ID | Cancel | Sent email | Open email | Click email | App used | Site visited | Days since last interaction |
|---|---|---|---|---|---|---|---|
| 1001 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1001 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| 1001 | 0 | 0 | 0 | 0 | 0 | 1 | 4 |
| 1001 | 0 | 1 | 0 | 0 | 0 | 0 | 5 |
| 1001 | 0 | 0 | 1 | 0 | 0 | 0 | 7 |
| 1001 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
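As a rough illustration of how an event log like this becomes model input, the sketch below groups rows by user and pads the shorter sequences with zero rows. It is not the code we ran: the helper names (`build_sequences`, `pad_to_longest`) and user 1002 are hypothetical, and pre-padding with zeros is just one common convention.

```python
from collections import defaultdict

# Rows follow the column order of the table above (user ID first).
# User 1002 is a made-up example, added only to show differing lengths.
rows = [
    (1001, 0, 0, 0, 0, 0, 1, 0),
    (1001, 0, 0, 0, 0, 0, 1, 2),
    (1001, 0, 0, 0, 0, 0, 1, 4),
    (1001, 0, 1, 0, 0, 0, 0, 5),
    (1001, 0, 0, 1, 0, 0, 0, 7),
    (1001, 0, 0, 0, 0, 0, 1, 1),
    (1002, 0, 1, 0, 0, 0, 0, 0),  # hypothetical
    (1002, 1, 0, 0, 0, 0, 0, 3),  # hypothetical
]


def build_sequences(rows):
    """Group event rows by user ID, preserving row order within each user."""
    sequences = defaultdict(list)
    for user_id, *features in rows:
        sequences[user_id].append([float(v) for v in features])
    return dict(sequences)


def pad_to_longest(sequences, pad_value=0.0):
    """Pre-pad every sequence with constant rows up to the longest length."""
    max_len = max(len(seq) for seq in sequences.values())
    n_features = len(next(iter(sequences.values()))[0])
    padded = {}
    for user_id, seq in sequences.items():
        padding = [[pad_value] * n_features for _ in range(max_len - len(seq))]
        padded[user_id] = padding + seq
    return padded


sequences = build_sequences(rows)
padded = pad_to_longest(sequences)
lengths = {user_id: len(seq) for user_id, seq in sequences.items()}
print(lengths)  # {1001: 6, 1002: 2}
```

Keeping the original `lengths` alongside the padded data lets a later step tell padding rows apart from real events.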

**Developing the Neural Network**

We developed a very simple neural network architecture which is described below. For this sample of customers, we knew whether or not they had churned by the time the data was collected, so our “X’s” were the sequences of customer behavior, and our “Y’s” were 0/1s depending on if the customer had churned.

We therefore used a recurrent input layer, which can handle variable-length sequences, and a sigmoid output layer, which predicts the probability of churn (a value between 0 and 1). Between them we included a dense layer to make the network more powerful and to generate the encodings.

| Layer | Input dimension | Output dimension |
|---|---|---|
| Recurrent | Variable | 10 |
| Dense | 10 | 10 (used for encoding) |
| Sigmoid | 10 | 1 |

We used Keras (on R) to specify and train the network.
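One way to write this architecture down, purely as a sketch: the equations below assume a plain (Elman-style) recurrent cell with a tanh activation and a ReLU dense layer. Neither the cell type nor the activations are specified in this post, so treat those choices, and all the symbol names, as assumptions.

$$
\begin{aligned}
h_t &= \tanh\left(W_x x_t + W_h h_{t-1} + b_h\right), \quad t = 1, \dots, T, \quad h_0 = 0 \\
e &= \operatorname{ReLU}\left(W_d h_T + b_d\right) \\
\hat{y} &= \sigma\left(w_o^\top e + b_o\right)
\end{aligned}
$$

Here $x_t$ is one row of a user's sequence, $T$ is that user's sequence length, $h_t \in \mathbb{R}^{10}$ is the recurrent state, $e \in \mathbb{R}^{10}$ is the encoding, and $\hat{y}$ is the predicted churn probability. Because the recurrence runs once per row, $T$ can differ from user to user.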

After training the network on the churn data, we used the weights from the Recurrent and Dense layers to produce a set of encodings for each user. After feeding in a user’s sequence, we get a ten-dimensional numeric encoding out:

| User ID | Encoding_1 | Encoding_2 | Encoding_3 | Encoding_4 | Encoding_5 | … |
|---|---|---|---|---|---|---|
| 1001 | 0 | 0 | 0.4 | 12.8 | 0.5 | |
| 1002 | 0.1 | 1.3 | 0.9 | 14.7 | 141.0 | |
| 1003 | 0.1 | 1.3 | 0.9 | 14.7 | 141.0 | |
| 1004 | 0.1 | 1.3 | 0.9 | 14.7 | 141.0 | |
| 1005 | 0.0 | 0.0 | 0.0 | 0.5 | 0 | |
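To make the encoding step concrete, here is a minimal pure-Python sketch of a forward pass through a recurrent layer and a dense layer. It is not our Keras model: the weights are random stand-ins rather than trained values, the plain tanh recurrent cell and ReLU dense activation are assumptions, biases are omitted for brevity, and every function name is hypothetical.

```python
import math
import random

N_FEATURES = 7   # columns per event row, excluding the user ID
N_HIDDEN = 10    # recurrent layer output size, from the layer table
N_ENCODING = 10  # dense layer output size, from the layer table


def random_matrix(n_rows, n_cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_weights(seed=0):
    """Random stand-in weights; a real model learns these from churn labels."""
    rng = random.Random(seed)
    return {
        "W_x": random_matrix(N_HIDDEN, N_FEATURES, rng),
        "W_h": random_matrix(N_HIDDEN, N_HIDDEN, rng),
        "W_d": random_matrix(N_ENCODING, N_HIDDEN, rng),
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(N_ENCODING)],
    }


def encode(sequence, weights):
    """Run the recurrent and dense layers; return the 10-dimensional encoding."""
    h = [0.0] * N_HIDDEN
    for x in sequence:  # one step per event row, so any sequence length works
        from_input = matvec(weights["W_x"], x)
        from_state = matvec(weights["W_h"], h)
        h = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    # Assumed ReLU activation on the dense layer
    return [max(0.0, z) for z in matvec(weights["W_d"], h)]


def churn_probability(sequence, weights):
    """Sigmoid output layer applied on top of the encoding."""
    z = sum(w * e for w, e in zip(weights["w_out"], encode(sequence, weights)))
    return 1.0 / (1.0 + math.exp(-z))


weights = init_weights()
user_1001 = [
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 2],
    [0, 0, 0, 0, 0, 1, 4],
    [0, 1, 0, 0, 0, 0, 5],
    [0, 0, 1, 0, 0, 0, 7],
    [0, 0, 0, 0, 0, 1, 1],
]
encoding = encode(user_1001, weights)
p = churn_probability(user_1001, weights)
print(len(encoding), round(p, 3))
```

The point of the sketch is the split: the sigmoid output is only needed for training against churn, while `encode` stops one layer earlier and returns the vector that gets clustered.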

**Clustering the RNN encodings**

The encodings are a compact summary of what the network has learned about each customer's sequence. Although they have no inherent meaning, we can feed them to a clustering algorithm to identify distinct segments. Which is exactly what we did.

We decided to run DBSCAN on the encoded sequence data. In this case DBSCAN had two advantages: it can handle non-linearities in the data, and it does not require the number of clusters to be specified in advance. K-means performed similarly.
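For readers unfamiliar with DBSCAN, the following is a compact, unoptimized sketch of the algorithm in plain Python. It is for illustration only, not the implementation we used. The toy 2-D points stand in for the ten-dimensional encodings, and the `eps` and `min_pts` values are arbitrary choices for this example.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical toy data: two tight groups and one isolated point
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (20.0, 20.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters falls out of the density parameters rather than being set up front, and isolated points are labeled as noise instead of being forced into a segment.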

**Results**

DBSCAN identified five distinct clusters, with some significant and valuable differences between them.

| Segment | Percentage of customers | Avg. e-mails clicked | Avg. e-mails opened | Avg. app actions | Avg. site visits | Avg. churn date (days) | Churn percentage |
|---|---|---|---|---|---|---|---|
| 1 | 0.3% | 2.11 | 22.8 | 16.4 | 18.2 | 325 | 30.1% |
| 2 | 34.5% | 1.13 | 11.5 | 3.6 | 8.1 | 308 | 16.7% |
| 3 | 59.5% | 0.3 | 3.2 | 0.1 | 2.9 | 88 | 98% |
| 4 | 5.5% | 4.0 | 27.0 | 89.5 | 16.5 | 337 | 0.1% |
| 5 | 0.2% | 0.5 | 2.0 | 0.0 | 1.5 | 93 | 93% |
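A profile table like this is an aggregation of per-customer metrics by cluster label. The sketch below shows the idea on made-up records. The field names and values are hypothetical, chosen only to make the arithmetic easy to follow, and are not drawn from the client data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-customer records: cluster label plus behavior totals
users = [
    {"segment": 0, "emails_opened": 10, "site_visits": 4, "churned": 0},
    {"segment": 0, "emails_opened": 20, "site_visits": 6, "churned": 1},
    {"segment": 1, "emails_opened": 2, "site_visits": 1, "churned": 1},
    {"segment": 1, "emails_opened": 4, "site_visits": 3, "churned": 1},
]


def profile_segments(users):
    """Summarize each segment: share of customers, mean behavior, churn rate."""
    by_segment = defaultdict(list)
    for user in users:
        by_segment[user["segment"]].append(user)
    total = len(users)
    profile = {}
    for segment, members in sorted(by_segment.items()):
        profile[segment] = {
            "pct_customers": 100.0 * len(members) / total,
            "avg_emails_opened": mean(m["emails_opened"] for m in members),
            "avg_site_visits": mean(m["site_visits"] for m in members),
            "churn_pct": 100.0 * sum(m["churned"] for m in members) / len(members),
        }
    return profile


profile = profile_segments(users)
for segment, stats in profile.items():
    print(segment, stats)
```

Since churn was the training target, the churn percentage per segment is also a quick sanity check that the clusters separate customers along the dimension the network was trained on.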

Although the clusters are fairly imbalanced (likely an artifact of clustering encodings that were trained on a supervised churn task), the number of days since the first interaction is clearly a strong driver in defining segments. The key takeaway here is that the clusters with the highest churn rates (segments 3 and 5) have an interaction history of about three months or less. This business absolutely *must* focus on getting customers through the first three months to decrease the likelihood of churning early.

**Takeaways**

- Brands are increasingly capturing sequence data, and methods for exploring it must be developed
- Recurrent neural networks are an effective way of generating encodings for behavioral sequence data
- Clustering the encodings (results of intermediate layers) of a neural network can be an effective way of peering inside the black box

We welcome any thoughts or comments you might have, and feel free to share this blog post with your friends and colleagues!
