For database marketing or direct marketing people, they are always concerned about two questions before they send out mails or make calls to their customers:-
- How can they segment the customers in the database to find out who are more likely to response to their mails or buy their products?
- Which type of customers they should send the mails to so that they can reach breakeven and make profit?
The RFM method is a very simple but effective customer analysis way to address the above questions. Where,
R, F, and M stand for^{1}
Recency – How recently did the customer purchase?
Frequency – How often do they purchase?
Monetary Value – How much do they spend (each time on average)?
(The above definitions are from Wikipeida.)
You can get more details on RFM from the following paper and case study besides Wikipeida:-
Segmentation and RFM Basics (Bill Ruppert)
Case Study: La Quinta narrows target to boost guest loyalty
Seems there is no out-of-box package on RFM analysis in R programming language area while other data analysis tools such as IBM SPSS have easy-to-use RFM functionalities. So in this article, we will demonstrate how to implement RFM analysis with R language.
Explore Data
We will use the CDNOW Sample data during this demonstration; you can download the data here. For more details about the dataset, read the paper of “Creating an RFM Summary Using Excel (Peter S. Fader, Bruce G. S. Hardie)” please.
Let’s explore the data.
> # read CDNOW_SAMPLE.txt
> df <- read.table(file.choose(),header=F)
> # construct a data frame with the necessary columns of customer ID, transaction date, and money amount paid by a customer per transaction
> df <- as.data.frame(cbind(df[,1],df[,3],df[,5]))
> # add appropriate column names for the above three column and
> names <- c(“ID”,”Date”,”Amount”)
> names(df) <- names
> #tranfer the the text column type to date type
> df[,2] <- as.Date(as.character(df[,2]),”%Y%m%d”)
> head(df)
ID Date Amount
1 4 1997-01-01 29.33
2 4 1997-01-18 29.73
3 4 1997-08-02 14.96
4 4 1997-12-12 26.48
5 21 1997-01-01 63.34
6 21 1997-01-13 11.77
> dim(df)
[1] 6919 3
#remove the rows with the duplicated IDs to see how many customers in total
> uid <- df[!duplicated(df[,”ID”]),]
> dim(uid)
[1] 2357 3
As shown in the output above, the data set contains some columns, especially three columns of customer ID, transaction date, and money amount paid by a customer per transaction. There are total 6919 transaction records in the sample dataset and 2357 unique customers.
Segment the Customers into RFM Cells
Since we have got the data ready now, we can segment the customers into RFM cells by the three dimensions of R, F, and M. Usually we rate the customers by 1 to 5 points in each dimension, the higher score the better.
According to the experiences, Recency is the most important factor (x100) for high response rate, Frequency is the second one(x10), and the Monetary is the last one(x1). The score of customers who are segmented into the same RFM cell is the same. A score of “542” means that the customer gets 5 points in Recency, 4 points in Frequency, and 2 points in Monetary.
There are usually two ways to segment the customers in each dimension, one is Nested, the other is Independent.
The nested method is to divide the customers into aliquots in the Recency dimension first, then for each aliquot of Recency, divide the customers into aliquots in the Frequency dimension nestedly, so does for the Monetary. The advantage of nested method is the quantity of customers in each RFM cell is roughly the same for all cells; the disadvantage is that the meaning of the Frequency and Monetary with the same score point might be different, e.g. one customer with a “543” RFM score and the other customers with a “443” RFM score, their Recency might be quite different though they have the same Recency score of “4”.
The independent method is to rate the customers into aliquots in the R, F, M dimensions independently; the advantage is that the meaning of the Frequency and Monetary with the same score point is the same; the disadvantage is that the quantities of customers in the RFM cells might be quite different.
You can read the discussion in Quora about the pros and cons of the above two methods.
In this article, we will implement the independent method in R language first, and then we will implement another method that allows users to input the parameters of the breaks for each dimension according to their own business requirement.
R Source Codes
The complete R source codes can be downloaded from RFM_Analysis_R_Source_Codes. Read the descriptions in the source codes please before you continue.
Calculate the Recency, Frequency, and Monetary
To implement the RFM analysis, we need to further process the data set in the CDNOW Sample by the following steps:-
- Remove the duplicate records with the same customer ID
- Find the most recent date for each ID and calculate the days to the now or some other date, to get the Recency data
- Calculate the quantity of translations of a customer, to get the Frequency data
- Sum the amount of money a customer spent and divide it by Frequency, to get the amount per transaction on average, that is the Monetary data.
We have already implemented the above procedures in the functions of “getDataFrame” in the source codes.
# set the startDate and endDate, we will only analysis the records in this date range
> startDate <- as.Date(“19970101″,”%Y%m%d”)
> endDate <- as.Date(“19980701″,”%Y%m%d”)
> df <- getDataFrame(df,startDate,endDate)
> head(df)
ID Date Amount Recency Frequency Monetary
4 1997-12-12 26.48 201 4 25.125
18 1997-01-04 14.96 543 1 14.960
21 1997-01-13 11.77 534 2 37.555
50 1997-01-01 6.79 546 1 6.790
60 1997-02-01 21.75 515 1 21.750
71 1997-01-01 13.97 546 1 13.970
After we calculate the Recency, Frequency, and Monetary, we can implement the RFM scoring.
Independent RFM Scoring
We have implemented the independent scoring in function of “getIndependentScore”. Below is the description of the function. For details, please read the source codes.
# Function
# getIndependentScore(df,r=5,f=5,m=5)
# Description
# Scoring the Recency, Frequency, and Monetary in r, f, and m in aliquots independently
# Arguments
# df – A data frame returned by the function of getDataFrame
# r – The highest point of Recency, it is 5 by default
# f – The highest point of Frequency, it is 5 by default
# m – The highest point of Monetary, it is 5 by default
# Return Value
# Returns a new data frame with four new columns of “R_Score”,”F_Score”,”M_Score”, and
# “Total_Score”.
Let’s execute the function and take a look of the output.
> df1 <-getIndepandentScore(df)
> head(df1[-(2:3)])
ID Recency Frequency Monetary R_Score F_Score M_Score Total_Score
11462 52 4 191.6425 5 5 5 555
8736 55 9 148.3944 5 5 5 555
7856 54 5 130.0980 5 5 5 555
15105 58 9 129.5256 5 5 5 555
22356 106 8 127.3650 5 5 5 555
8481 57 13 117.3492 5 5 5 555
Let’s explore the customer distribution in each RFM cell to see if the customers are distributed evenly.
#Draw the histograms in the R, F, and M dimensions so that we can see the distribution of customers in each RFM cell.
> drawHistograms(df1)
The distribution is not very even due to that the Frequency doesn’t have the scores of 1 and 2. Let’s further find out how many customers have a total score larger than 500 or 400.
> S500<-df1[df1$Total_Score>500,]
> dim(S500)
[1] 471 10
> S400<-df1[df1$Total_Score>400,]
> dim(S500)
[1] 945 10
We can consider those customers are more important to get a higher response rate.
RFM Scoring with Breaks
Sometimes users want to determine the breaks for each dimension by themselves according to their own business requirement. For example, a user can set 0 -30 days, 30 -90 days, 90 – 180 days, 180 – 360 days, and more than 360 days as the 5 breaks for Recency rather than just let the computer segment the customers into 5 aliquots without considering the specific business requirement.
We have implemented this scoring method in the function of “getScoreWithBreaks”. Below is the description of the function. For details, please read the source codes.
# Function
# getScoreWithBreaks(df,r,f,m)
# Description
# Scoring the Recency, Frequency, and Monetary in r, f, and m which are vector object containing a series of breaks
# Arguments
# df – A data frame returned by the function of getDataFrame
# r – A vector of Recency breaks
# f – A vector of Frequency breaks
# m – A vector of Monetary breaks
# Return Value
# Returns a new data frame with four new columns of “R_Score”,”F_Score”,”M_Score”, and “Total_Score”.
Before we execute the function, we can take a look at the distributions of Recency, Frequency, and Monetary.
> par(mfrow = c(1,3))
> hist(df$Recency)
> hist(df$Frequency)
> hist(df$Monetary)
We can find that the distributions are not even. Most of the customers have a Recency of more than 450 days, a Frequency of less than 5 times, and a Monetary of less than 50 dollars. That’s the reason why the customers are not distributed evenly in the RFM cells when using the Independent RFM scoring method in the previous part.
Based on the above observation and relevant business sense, we can construct the breaks for each RFM dimension as follows, so that we can get segmented RFM cells more evenly and reasonably.
# set the Recency ranges as 0-120 days, 120-240 days, 240-450 days, 450-500days, and more than 500days.
> r <-c(120,240,450,500)
# set the Frequency ranges as 0 – 2times, 2-5 times,5-8 times, 8-10 times, and more than 10 times.
> f <-c(2,5,8,10)
# set the Monetary ranges as 0-10 dollars, 10-20 dollars, and so on.
> m <-c(10,20,30,100)
Than we can execute the function of “getScoreWithBreaks” and see the customers distributions in the RFM cells.
> df2<-getScoreWithBreaks(df,r,f,m)
> drawHistograms(df2)
We can also calculate how many customers have a total score of more than 500 or 400.
> S500<-df2[df2$Total_Score>500,]
> dim(S500)
[1] 399 10
> drawHistograms(df2)
> S400<-df2[df2$Total_Score>400,]
> dim(S400)
[1] 641 10
There are 641 customers have a RFM score more than 400.
Estimate response rate for each RFM cell^{4}
After we segment the customers into RFM cells, we can assign the response rate to each RFM cell according to historical responding data. If it is the first time to use RFM analysis and there is no historical data, we can select some customers, say 10% percent, randomly from each RFM cells. Send mails to the selected customers as a trail and count the response rate for each cell. Below is an example of the response rate table.
Response Rate Table
RFM cell (Total Score) | … | Response Rate |
555 | 8.5% | |
… | … | |
441 | 5.6% | |
435 | 4.8% | |
… | … | … |
Calculate the breakeven point to select valuable customers^{4}
To be breakeven,
P – C/R= 0
Where,
P is the price or revenue per deal or per response
C is the cost per mail sent out, including production cost, mailing cost etc.
R is the response rate.
Suppose, P = 100 dollars, C = 5 dollars, to be breakeven:-
R = C/P = 5/100 = 5%
That means, we should chose the customers in the RFM cells, that have a response rate equal to or more than 5%, to send out the direct marketing mails, to make money.
Select the Target Customers
According to the Response Rate Table and the breakeven point, we need to select customers in the cells with a response rate more than 5%, that means we need to select the customers with a total score equal to or more than 441.
> target <- df2[df2$Total_Score>=441,]
> dim(target)
[1] 416 10
There are 416 customers we will send the direct marketing mails to.
Summary
We introduced the basic concept of RFM customer analysis and implemented the analysis procedures in R language by providing complete R source code.
References
- http://en.wikipedia.org/wiki/RFM
- http://brucehardie.com/datasets/CDNOW sample.zip
- http://www.brucehardie.com/notes/022/RFM_summary_in_Excel.pdf
- Recency, Frequency and Monetary (RFM) Analysis (Charlotte Mason)
- Case Study: La Quinta narrows target to boost guest loyalty
- http://www.responseb2b.com/files/rfm.ppt
Author: Jack Han. All rights reserved. 转载须以超链接形式标明文章原始出处和作者信息
2,257 total views, 7 views today
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...