Customer Segmentation using RFM Analysis

[This article was first published on Rsquared Academy Blog - Explore Discover Learn, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In a previous post, we had introduced our R package rfm but did not go into the conceptual details of RFM analysis. In this post, we will explore RFM in much more depth and work through a case study as well. RFM (Recency, Frequency & Monetary) analysis is a behavior based technique used to segment customers by examining their transaction history such as:

  • how recently a customer has purchased?
  • how often do they purchase?
  • how much the customer spends?

It is based on the marketing axiom that 80% of your business comes from 20% of your customers. RFM helps to identify customers who are more likely to respond to promotions by segmenting them into various categories.

Resources

Below are the links to all resources related to this post:

You can try our free online course Customer Segmentation using RFM Analysis if you like to learn through self paced online courses.

Case Study

We will work through a case study to better understand the underlying concepts of RFM analysis. To pique your curiosity, we will start with the results or the final outcome of the case study as shown below:

The table has the following details:

  • name or id of the customer
  • number of days since the last transaction of the customer
  • number of transactions of the customer
  • total value of the all the transactions of the customer
  • RFM score
  • customer segment

The rest of this post will focus on generating a similar result for our case study and along the way we will learn to:

  • structure data for RFM analysis
  • generate RFM score
  • and segment customers using RFM score

Applications

Let us talk about applications. Though largely identified with retail or ecommerce, RFM analysis can be applied in a lot of other domains or industry as well. In social media and apps, RFM can be used to segment users as well. The only difference is instead of using monetary value as the third metric, we will use the amount of time spent (or some other metric based on it) on the site/app. The more time we spend on the platform and consume the content, the more ads can be displayed by the platform. So in those cases, the amount of time we spend will be the third metric.

RFM Workflow

The typical workflow for RFM analysis is shown above. It can be broadly divided into the following:

  • In the first step of the workflow, we collect transaction data. This should include a unique customer id, transaction date and transaction amount. In case of ecommerce, we need to decide how to treat visits that did not result in a transaction. If data is aggregated and made available at the customer level, it must include a unique customer id, last transaction date and total revenue from the customer. The last transaction date may be replaced by days since last visit as well. The details available in data supplied depends on the data pipeline and the rfm package can handle any of the above 3 scenarios.

  • In the second step, we generate RFM table from the raw data available. The RFM table aggregates data at the customer level. It includes the unique customer id, days since last transaction/visit, frequency of transactions/visits and the total revenue from all the transactions of the customer.

  • In the third step, we generate scores for recency, frequency and monetary value, and use them to create the RFM score for each customer.

  • In the final step, we use the recency, frequency and monetary scores to define customer segments and design customised campaigns, promotions, offers & discounts to retain and reactivate customers.

RFM Table

Let us assume we have completed the first step in RFM analysis by collecting transaction data. Now, we have to generate the RFM table from the transaction data. In the transaction data, each row represents a transaction and we may get the transaction details in any of the following ways:

  • according to transaction date
  • sorted by customer id
  • or in a random order

The first two cases are more likely but if we get the transaction data in a random order, the first order of business is to sort them by customer id. In the below example, we have transaction details for 3 customers Lionel, Jaineel and Taj but they are not sorted by transaction date or customer id. Since we want to create the RFM table from this data, we sort it by customer id.

From the sorted data, we aggregate the transaction details at the customer level as shown below.

From the above step, we have created the RFM table which contains recency (days since last visit), frequency (frequency of visits) and monetary (total revenue from the customer) data for each customer.

Metrics

Time to talk about the key metrics Recency, Frequency and Monetary in more detail. In this section, we will understand how they are calculated, and in the next section, we will learn how they are used for generating the RFM score.

Recency

Let us begin with recency. Earlier, we defined it as the number of days since the last transaction of the customer. How do we calculate this metric? Apart from the date of the last transaction of the customer, what other information do we need? In all the discussions till now, we have missed out on a key point i.e. the time frame of the analysis.

The most crucial step in RFM analysis is to select a time frame from which we use the transaction data. How do we decide on this time frame? It depends on the domain to which we are applying this analysis. Customers visit a grocery store more often than they visit a consumer durables store. Similarly, people consume content from news & blogs more frequently while they may visit an e-commerce website only when they have to purchase something. Keeping in mind the domain to which the analysis is being applied, select an appropriate time frame. To calculate recency, compute the difference between the last transaction date and the analysis date i.e. the last date of the selected time frame.

In the above example, the analysis date is 2016-12-31. To compute recency, we first extract all the transaction date of customer Taj and then select the last transaction date, 2015-04-21, and subtract if from the analysis date to get the number of days since the last transaction date, 620.

Frequency

Frequency is the count of transactions. In the online/digital world, we need to decide whether we will consider all the visits to a website or app as the frequency or only those which resulted in a transaction/conversion. In the below example, we count the transactions for each customer and use it as frequency. Lionel has 6 transactions, Jaineel has 9 transactions and Taj has 4 transactions.

Monetary Value

Monetary value is the total revenue from each customer in the selected time frame. It is computed by summing up the transaction amount. In our case study, Jaineel has spent the highest amount of $843, followed by Lionel who has spent $472 and Taj has spent the lowest, $196. As you can observe, we have arrived at these figures by summing the values in the third column of the second table, Order Amount.

If we are applying this analysis to the digital world, we may think of using a metric such as time spent on the website/app instead of transaction/order amount.

RFM Score

As shown in the workflow, the third step in RFM analysis is to generate the individual score for each metric and then use them to generate the RFM score. In this section, we will explain in detail how the scores are computed for recency, frequency and monetary. This section is slightly complex (we received a lot of mails from readers after we published the previous post) and we have tried our best to break down the complexity as much as possible. Still, if you have any questions feel free to write to us at .

We follow the below steps to create the score:

  • use quantiles to generate cut off points
  • create intervals based on the cut off points
  • use the intervals to assign score

Monetary Score

Let us generate the monetary score in our case study. The first step is to compute the quantiles using the quantile() function. We use the revenue column from the RFM table to compute the quantiles. If you look at the example, it gives us the cut off below which a certain percentage of customers are present.

  • the bottom 20% of customers spend below $254.8.
  • the next 20% of customers spend between $254.8 and $381.0.
  • the top 20% of customers spend above $665.0.

Using these cut off points we have created intervals which can then be converted to if else statements. The intervals are then used to assign scores. For example, Lionel falls in the interval > 381.0 & <= 505.4 and hence is assigned the score 3. Similarly, Jaineel and Taj are assigned the scores 5 and 1. How do we interpret the scores? The score is more like a rank. A customer with a score of 3 is ranked higher than a customer with score of 1 as his transaction amount is higher. In the rfm package, we use the above method to assign the scores.

Some users reverse the order of the score i.e. top 20% customers by transaction amountare assigned the score 1 and the bottom 20% are assigned the score 5.

Frequency Score

The frequency score is computed in the same way as the monetary score. Instead of using the revenue column from the RFM table, we use the frequency column. Using quantiles, we arrive at the cut off points below which a certain percentage of customer are present. If you observe the example, the first table shows the quantiles and the associated cut off points.

  • the bottom 20% of customers visit/transact less than 3 times.
  • the next 20% of customers visit/transact around 4 times.
  • the top 20% of customers visit/transact more than 7 times.

The cut off points are then used to create the intervals and assign the scores as shown in the second table. We assign a higher score to those who visit more frequently and a lower score to those who visit less frequently.

In our case study, Jaineel has visited 9 times and hence assigned the score 5 where as Taj has visited only 4 times and hence the score 2.

Recency Score

The recency score follows the same methodology but uses a slightly different concept while assigning the score. If you look at the metrics, the higher the values of frequency and monetary, the better as we want customers to transact frequently and spend higher amount but it is not the case with recency. Since recency represents the number of days since the last transaction, the lower its value the better i.e. customers who visited in the recent past are more likely to visit again whereas customers who visited long back may be as good as lost. Hence in the case of recency, higher score is assigned to those with lower recency value and vice versa.

In the above example, we have used quantiles to compute the cut off point for recency. The first table shows the quantiles and the associated cut off points:

  • the bottom 20% of customers visited more than 481 days back.
  • the next 20% of customers visited between 296.4 and 481 days back.
  • the top 20% of customers visited less than 114 days back.

The above statements will become clear if you study the second table which includes the interval and the score. We have assigned a higher score to those who visited in the recent past (< 114 days) compared to those who visited way back (> 481 days). In our case study, Jaineel visited in the past 3 months and hence the score of 5 where as Taj visited almost 20 months back and has been assigned the score 1.

RFM Score

Now that we have calculated the individual scores, let us compute the RFM score using the below formula:

RFM Score = Recency Score * 100 + Frequency Score * 10 + Monetary Score

The below table shows the individal scores of recency, frequency and monetary as well as the RFM score. All of them are computed from the RFM table which in itself is based on the transaction data.

Segments

Great! We have finally computed the RFM score. Now what? How do we define the segments using this score? In this section, we will learn how to define customer segments using the RFM score. The below table is an example of how segments are defined. It has the following details:

  • the name of the segment
  • the definition of the segment
  • the intervals for the recency, frequency & monetary scores

We should be careful while creating the intervals for the scores in the segments table. Look at this Wkipedia article to understand how intervals work.

Let us apply the above rules to our case study.

Defining segments is another crucial step in RFM analysis. We need to ensure that there is no duplication or large number of customers get classified into Others segment. In the case study, we will show you some of the mistakes that can happen while defining the segments.

Case Study

It is time to work through the case study. Let us first load all the libraries we will use as shown below:

library(rfm)
library(dplyr)
library(magrittr)
library(lubridate)

Data

To calculate the RFM score for each customer we need transaction data which should include the following:

  • a unique customer id
  • date of transaction/order
  • transaction/order amount

rfm includes a sample data set rfm_data_orders which includes the above details:

rfm_data_orders
## # A tibble: 4,906 x 3
##    customer_id         order_date revenue
##    <chr>               <date>       <dbl>
##  1 Mr. Brion Stark Sr. 2004-12-20      32
##  2 Ethyl Botsford      2005-05-02      36
##  3 Hosteen Jacobi      2004-03-06     116
##  4 Mr. Edw Frami       2006-03-15      99
##  5 Josef Lemke         2006-08-14      76
##  6 Julisa Halvorson    2005-05-28      56
##  7 Judyth Lueilwitz    2005-03-09     108
##  8 Mr. Mekhi Goyette   2005-09-23     183
##  9 Hansford Moen PhD   2005-09-07      30
## 10 Fount Flatley       2006-04-12      13
## # ... with 4,896 more rows

RFM Score

Use rfm_table_order() to generate the score for each customer from the sample data set rfm_data_orders.

rfm_table_order() takes 8 inputs:

  • data: a data set with
    • unique customer id
    • date of transaction
    • and amount
  • customer_id: name of the customer id column
  • order_date: name of the transaction date column
  • revenue: name of the transaction amount column
  • analysis_date: date of analysis
  • recency_bins: number of rankings for recency score (default is 5)
  • frequency_bins: number of rankings for frequency score (default is 5)
  • monetary_bins: number of rankings for monetary score (default is 5)

RFM Table

analysis_date <- lubridate::as_date("2006-12-31", tz = "UTC")
rfm_result <- rfm_table_order(rfm_data_orders, customer_id, order_date, revenue, analysis_date)
rfm_result
## Warning: `tz` argument is ignored by `as_date()`
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
Abbey O’Reilly DVM 2006-06-09 205 6 472 3 4 3 343
Add Senger 2006-08-13 140 3 340 4 1 2 412
Aden Lesch Sr.  2006-06-20 194 4 405 3 2 3 323
Admiral Senger 2006-08-21 132 5 448 4 3 3 433
Agness O’Keefe 2006-10-02 90 9 843 5 5 5 555
Aileen Barton 2006-10-08 84 9 763 5 5 5 555
Ailene Hermann 2006-03-25 281 8 699 3 5 5 355
Aiyanna Bruen PhD 2006-04-29 246 4 157 3 2 1 321
Ala Schmidt DDS 2006-01-16 349 3 363 2 1 2 212
Alannah Borer 2005-04-21 619 4 196 1 2 1 121

rfm_table_order() will return the following columns as seen in the above table:

  • customer_id: unique customer id
  • date_most_recent: date of most recent visit
  • recency_days: days since the most recent visit
  • transaction_count: number of transactions of the customer
  • amount: total revenue generated by the customer
  • recency_score: recency score of the customer
  • frequency_score: frequency score of the customer
  • monetary_score: monetary score of the customer
  • rfm_score: RFM score of the customer

Segments

Let us classify our customers based on the individual recency, frequency and monetary scores.

Segment Description R F M
Champions Bought recently, buy often and spend the most 4 - 5 4 - 5 4 - 5
Loyal Customers Spend good money. Responsive to promotions 2 - 4 3 - 4 4 - 5
Potential Loyalist Recent customers, spent good amount, bought more than once 3 - 5 1 - 3 1 - 3
New Customers Bought more recently, but not often 4 - 5 < 2 < 2
Promising Recent shoppers, but haven’t spent much 3 - 4 < 2 < 2
Need Attention Above average recency, frequency & monetary values 3 - 4 3 - 4 3 - 4
About To Sleep Below average recency, frequency & monetary values 2 - 3 < 3 < 3
At Risk Spent big money, purchased often but long time ago < 3 2 - 5 2 - 5
Can’t Lose Them Made big purchases and often, but long time ago < 2 4 - 5 4 - 5
Hibernating Low spenders, low frequency, purchased long time ago 2 - 3 2 - 3 2 - 3
Lost Lowest recency, frequency & monetary scores < 2 < 2 < 2

Segmented Customer Data

We can use the segmented data to identify

  • champion customers
  • loyal customers
  • at risk customers
  • and lost customers

Once we have classified a customer into a particular segment, we can take appropriate action to increase his/her lifetime value.

segment_names <- 
  c("Champions", "Loyal Customers", "Potential Loyalist",
    "New Customers", "Promising", "Need Attention", 
    "About To Sleep", "At Risk", "Can't Lose Them", 
    "Hibernating", "Lost")

recency_lower   <- c(4, 2, 3, 4, 3, 3, 2, 1, 1, 2, 1)
recency_upper   <- c(5, 4, 5, 5, 4, 4, 3, 2, 1, 3, 1)
frequency_lower <- c(4, 3, 1, 1, 1, 3, 1, 2, 4, 2, 1)
frequency_upper <- c(5, 4, 3, 1, 1, 4, 2, 5, 5, 3, 1)
monetary_lower  <- c(4, 4, 1, 1, 1, 3, 1, 2, 4, 2, 1)
monetary_upper  <- c(5, 5, 3, 1, 1, 4, 2, 5, 5, 3, 1)

segments <- 
  rfm_segment(rfm_result, segment_names, recency_lower, 
              recency_upper, frequency_lower, frequency_upper, 
              monetary_lower, monetary_upper)

segments %>% 
  select(customer_id, segment, rfm_score)
## # A tibble: 995 x 3
##    customer_id        segment            rfm_score
##    <chr>              <chr>                  <dbl>
##  1 Abbey O'Reilly DVM Need Attention           343
##  2 Add Senger         Potential Loyalist       412
##  3 Aden Lesch Sr.     Hibernating              323
##  4 Admiral Senger     Need Attention           433
##  5 Agness O'Keefe     Champions                555
##  6 Aileen Barton      Champions                555
##  7 Ailene Hermann     Others                   355
##  8 Aiyanna Bruen PhD  About To Sleep           321
##  9 Ala Schmidt DDS    About To Sleep           212
## 10 Alannah Borer      Others                   121
## # ... with 985 more rows

Segment Size

Now that we have defined and segmented our customers, let us examine the distribution of customers across the segments. If our segmentation logic is good, few or no customer should be categorized as Others.

segments %>%
  count(segment) %>%
  arrange(desc(n)) %>%
  rename(Segment = segment, Count = n)
## # A tibble: 12 x 2
##    Segment            Count
##    <chr>              <int>
##  1 At Risk              157
##  2 Potential Loyalist   132
##  3 Others               128
##  4 Champions            116
##  5 Need Attention       100
##  6 Hibernating           97
##  7 About To Sleep        92
##  8 Lost                  75
##  9 Loyal Customers       43
## 10 Promising             21
## 11 Can't Lose Them       17
## 12 New Customers         17

We can also examine the median recency, frequency and monetary value across segments to ensure that the logic used for customer classification is sound and practical.

Median Recency

rfm_plot_median_recency(segments)

Median Frequency

rfm_plot_median_frequency(segments)

Median Monetary Value

rfm_plot_median_monetary(segments)

Heat Map

The heat map shows the average monetary value for different categories of recency and frequency scores. Higher scores of frequency and recency are characterized by higher average monetary value as indicated by the darker areas in the heatmap.

rfm_heatmap(rfm_result)

Bar Chart

Use rfm_bar_chart() to generate the distribution of monetary scores for the different combinations of frequency and recency scores.

rfm_bar_chart(rfm_result)

Histogram

Use rfm_histograms() to examine the relative distribution of

  • monetary value (total revenue generated by each customer)
  • recency days (days since the most recent visit for each customer)
  • frequency (transaction count for each customer)
rfm_histograms(rfm_result)

Customers by Orders

Visualize the distribution of customers across orders.

rfm_order_dist(rfm_result)

Scatter Plots

The best customers are those who:

  • bought most recently
  • most often
  • and spend the most

Now let us examine the relationship between the above.

Recency vs Monetary Value

Customers who visited more recently generated more revenue compared to those who visited in the distant past. The customers who visited in the recent past are more likely to return compared to those who visited long time ago as most of those would be lost customers. As such, higher revenue would be associated with most recent visits.

rfm_rm_plot(rfm_result)

Frequency vs Monetary Value

As the frequency of visits increases, the revenue generated also increases. Customers who visit more frquently are your champion customers, loyal customers or potential loyalists and they drive higher revenue.

rfm_fm_plot(rfm_result)

Recency vs Frequency

Customers with low frequency visited in the distant past while those with high frequency have visited in the recent past. Again, the customers who visited in the recent past are more likely to return compared to those who visited long time ago. As such, higher frequency would be associated with the most recent visits.

rfm_rf_plot(rfm_result)

Your Turn…

  • if you look at the distribution of segments, around 13% of the customers are in the Others segment For segmentation to be effective and optimal, the Others segment should be eliminated or should have few customers only. Redefine the segments and try to reduce the number of customers in the Others segment.

  • we have defined 11 segments. Try to combine some of the existing segments and bring down the total segments to around 6 or 8.

  • the RFM score we generated uses score between 1 and 5. Try to create segments by using a score between 1 and 3 i.e. the lowest RFM should be 111 and the highest should be 333 instead of 555.

  • reverse the scores i.e. so far we have assigned a score of 5 to customers who visited recently, frequently and had higher transaction amount and 1 to customers who visited way back, rarely and have low transaction amount. Reverse this score pattern and create the segments.

Learning More

The rfm website includes comprehensive documentation on using the package, including the following articles that cover various aspects of using rfm:

Feedback

If you see mistakes or want to suggest changes, please create an issue on the source repository or reach out to us at .

To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog - Explore Discover Learn.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)