[This article was first published on Data Apple » R Blogs in English, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For database marketing or direct marketing people, they are always concerned about two questions before they send out mails or make calls to their customers:-

1. How can they segment the customers in the database to find out who are more likely to response to their mails or buy their products?
2. Which type of customers they should send the mails to so that they can reach breakeven and make profit?

The RFM method is a very simple but effective customer analysis way to address the above questions. Where,

R, F, and M stand for1

Recency – How recently did the customer purchase?

Frequency – How often do they purchase?

Monetary Value – How much do they spend (each time on average)?

(The above definitions are from Wikipeida.)

You can get more details on RFM from the following paper and case study besides Wikipeida:-

Segmentation and RFM Basics (Bill Ruppert)

Case Study: La Quinta narrows target to boost guest loyalty

Seems there is no out-of-box package on RFM analysis in R programming language area while other data analysis tools such as IBM SPSS have easy-to-use RFM functionalities. So in this article, we will demonstrate how to implement RFM analysis with R language.

Explore Data

We will use the CDNOW Sample data during this demonstration; you can download the data here. For more details about the dataset, read the paper of “Creating an RFM Summary Using Excel (Peter S. Fader, Bruce G. S. Hardie)” please.

Let’s explore the data.

> # construct a data frame with the necessary columns of customer ID, transaction date, and money amount paid by a customer per transaction

> df <- as.data.frame(cbind(df[,1],df[,3],df[,5]))

> # add appropriate column names for the above three column and

> names <- c(“ID”,”Date”,”Amount”)

> names(df) <- names

> #tranfer the the text column type to date type

> df[,2] <- as.Date(as.character(df[,2]),”%Y%m%d”)

ID       Date              Amount

1   4       1997-01-01    29.33

2   4       1997-01-18    29.73

3   4       1997-08-02    14.96

4   4       1997-12-12    26.48

5  21       1997-01-01    63.34

6  21       1997-01-13    11.77

> dim(df)

[1] 6919    3

#remove the rows with the duplicated IDs to see how many customers in total

> uid <- df[!duplicated(df[,"ID"]),]

> dim(uid)

[1] 2357    3

As shown in the output above, the data set contains some columns, especially three columns of customer ID, transaction date, and money amount paid by a customer per transaction. There are total 6919 transaction records in the sample dataset and 2357 unique customers.

Segment the Customers into RFM Cells

Since we have got the data ready now, we can segment the customers into RFM cells by the three dimensions of R, F, and M. Usually we rate the customers by 1 to 5 points in each dimension, the higher score the better.

According to the experiences, Recency is the most important factor (x100) for high response rate, Frequency is the second one(x10), and the Monetary is the last one(x1).  The score of customers who are segmented into the same RFM cell is the same.  A score of “542” means that the customer gets 5 points in Recency, 4 points in Frequency, and 2 points in Monetary.

There are usually two ways to segment the customers in each dimension, one is Nested, the other is Independent.

The nested method is to divide the customers into aliquots in the Recency dimension first, then for each aliquot of Recency, divide the customers into aliquots in the Frequency dimension nestedly, so does for the Monetary. The advantage of nested method is the quantity of customers in each RFM cell is roughly the same for all cells; the disadvantage is that the meaning of the Frequency and Monetary with the same score point might be different, e.g. one customer with a “543” RFM score and the other customers with a “443” RFM score, their Recency might be quite different though they have the same Recency score of “4”.

The independent method is to rate the customers into aliquots in the R, F, M dimensions independently; the advantage is that the meaning of the Frequency and Monetary with the same score point is the same; the disadvantage is that the quantities of customers in the RFM cells might be quite different.

You can read the discussion in Quora about the pros and cons of the above two methods.

In this article, we will implement the independent method in R language first, and then we will implement another method that allows users to input the parameters of the breaks for each dimension according to their own business requirement.

R Source Codes

Calculate the Recency, Frequency, and Monetary

To implement the RFM analysis, we need to further process the data set in the CDNOW Sample by the following steps:-

1. Remove the duplicate records with the same customer ID
2. Find the most recent date for each ID and calculate the days to the now or some other date, to get the Recency data
3. Calculate the quantity of translations of a customer, to get the Frequency data
4. Sum the amount of money a customer spent and divide it by Frequency, to get the amount per transaction on average, that is the Monetary data.

We have already implemented the above procedures in the functions of “getDataFrame” in the source codes.

# set the startDate and endDate, we will only analysis the records in this date range

> startDate <- as.Date(“19970101″,”%Y%m%d”)

> endDate <- as.Date(“19980701″,”%Y%m%d”)

> df <- getDataFrame(df,startDate,endDate)

ID          Date            Amount   Recency     Frequency  Monetary

4           1997-12-12  26.48     201             4                 25.125

18          1997-01-04  14.96     543             1                 14.960

21         1997-01-13  11.77     534              2                 37.555

50         1997-01-01   6.79     546               1                 6.790

60         1997-02-01  21.75     515              1                 21.750

71         1997-01-01  13.97     546              1                 13.970

After we calculate the Recency, Frequency, and Monetary, we can implement the RFM scoring.

Independent RFM Scoring

We have implemented the independent scoring in function of “getIndependentScore”. Below is the description of the function. For details, please read the source codes.

# Function

#     getIndependentScore(df,r=5,f=5,m=5)

# Description

#       Scoring the Recency, Frequency, and Monetary in r, f, and m in aliquots independently

# Arguments

#       df – A data frame returned by the function of getDataFrame

#       r –  The highest point of Recency, it is 5 by default

#       f –  The highest point of Frequency, it is 5 by default

#       m –  The highest point of Monetary, it is 5 by default

# Return Value

#       Returns a new data frame with four new columns of “R_Score”,”F_Score”,”M_Score”, and

#  “Total_Score”.

Let’s execute the function and take a look of the output.

> df1 <-getIndepandentScore(df)

ID      Recency Frequency Monetary      R_Score  F_Score  M_Score  Total_Score

11462   52       4                191.6425       5             5              5              555

8736    55        9                148.3944       5             5              5              555

7856    54        5                130.0980       5             5              5              555

15105   58       9                129.5256       5             5              5              555

22356   106    8                127.3650        5             5              5              555

8481    57     13                117.3492        5             5              5              555

Let’s explore the customer distribution in each RFM cell to see if the customers are distributed evenly.

#Draw the histograms in the R, F, and M dimensions so that we can see the distribution of customers in each RFM cell.

> drawHistograms(df1)

The distribution is not very even due to that the Frequency doesn’t have the scores of 1 and 2. Let’s further find out how many customers have a total score larger than 500 or 400.

> S500<-df1[df1$Total_Score>500,] > dim(S500) [1] 471 10 > S400<-df1[df1$Total_Score>400,]

> dim(S500)

[1] 945  10

We can consider those customers are more important to get a higher response rate.

RFM Scoring with Breaks

Sometimes users want to determine the breaks for each dimension by themselves according to their own business requirement. For example, a user can set 0 -30 days, 30 -90 days, 90 – 180 days, 180 – 360 days, and more than 360 days as the 5 breaks for Recency rather than just let the computer segment the customers into 5 aliquots without considering the specific business requirement.

We have implemented this scoring method in the function of “getScoreWithBreaks”. Below is the description of the function. For details, please read the source codes.

# Function

#     getScoreWithBreaks(df,r,f,m)

# Description

#       Scoring the Recency, Frequency, and Monetary in r, f, and m which are vector object containing a series of breaks

# Arguments

#       df – A data frame returned by the function of getDataFrame

#       r –  A vector of Recency breaks

#       f –  A vector of Frequency breaks

#       m –  A vector of Monetary breaks

# Return Value

#       Returns a new data frame with four new columns of “R_Score”,”F_Score”,”M_Score”, and “Total_Score”.

Before we execute the function, we can take a look at the distributions of Recency, Frequency, and Monetary.

> par(mfrow = c(1,3))

> hist(df$Recency) > hist(df$Frequency)

> hist(df$Monetary) We can find that the distributions are not even. Most of the customers have a Recency of more than 450 days, a Frequency of less than 5 times, and a Monetary of less than 50 dollars. That’s the reason why the customers are not distributed evenly in the RFM cells when using the Independent RFM scoring method in the previous part. Based on the above observation and relevant business sense, we can construct the breaks for each RFM dimension as follows, so that we can get segmented RFM cells more evenly and reasonably. # set the Recency ranges as 0-120 days, 120-240 days, 240-450 days, 450-500days, and more than 500days. > r <-c(120,240,450,500) # set the Frequency ranges as 0 – 2times, 2-5 times,5-8 times, 8-10 times, and more than 10 times. > f <-c(2,5,8,10) # set the Monetary ranges as 0-10 dollars, 10-20 dollars, and so on. > m <-c(10,20,30,100) Than we can execute the function of “getScoreWithBreaks” and see the customers distributions in the RFM cells. > df2<-getScoreWithBreaks(df,r,f,m) > drawHistograms(df2) We can also calculate how many customers have a total score of more than 500 or 400. > S500<-df2[df2$Total_Score>500,]

> dim(S500)

[1] 399  10

> drawHistograms(df2)

> S400<-df2[df2$Total_Score>400,] > dim(S400) [1] 641 10 There are 641 customers have a RFM score more than 400. Estimate response rate for each RFM cell4 After we segment the customers into RFM cells, we can assign the response rate to each RFM cell according to historical responding data. If it is the first time to use RFM analysis and there is no historical data, we can select some customers, say 10% percent, randomly from each RFM cells. Send mails to the selected customers as a trail and count the response rate for each cell. Below is an example of the response rate table. Response Rate Table  RFM cell (Total Score) … Response Rate 555 8.5% … … 441 5.6% 435 4.8% … … … Calculate the breakeven point to select valuable customers4 To be breakeven, P – C/R= 0 Where, P is the price or revenue per deal or per response C is the cost per mail sent out, including production cost, mailing cost etc. R is the response rate. Suppose, P = 100 dollars, C = 5 dollars, to be breakeven:- R = C/P = 5/100 = 5% That means, we should chose the customers in the RFM cells, that have a response rate equal to or more than 5%, to send out the direct marketing mails, to make money. Select the Target Customers According to the Response Rate Table and the breakeven point, we need to select customers in the cells with a response rate more than 5%, that means we need to select the customers with a total score equal to or more than 441. > target <- df2[df2$Total_Score>=441,]

> dim(target)

[1] 416  10

There are 416 customers we will send the direct marketing mails to.

Summary

We introduced the basic concept of RFM customer analysis and implemented the analysis procedures in R language by providing complete R source code.

References