Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Allison Koenecke, Data Scientist, AI & Research Group at Microsoft, with acknowledgements to Amita Gajewar and John-Mark Agosta.

In this tutorial, Allison Koenecke demonstrates how Microsoft could recommend to customers the next set of services they should acquire as they expand their use of the Azure Cloud, by using a temporal extension to conventional Market Basket Analysis.

## Problem Statement

Market Basket Analysis (MBA) answers a standard business question: given a set of grocery store receipts, can we find bundles of products often purchased together (e.g., peanut butter and jelly)?  Suppose we instead want to model the evolution of a customer’s services, such as determining whether buying peanut butter in the past indicates a higher likelihood of buying bread in the future.  For this, we apply a sequential version of MBA, sometimes called “sequential itemset mining” or “sequential pattern mining”, to introduce a time component to the analysis [1]. Sequential itemset mining has been applied across many industries, from determining a patient’s sequence of medical prescriptions [2] to detecting misuse intrusions such as application layer attacks [3]. In this tutorial, when given a series of purchases over time, we determine whether we can (1) find product bundles that we expect to be consumed simultaneously, and also (2) examine how these bundles evolve over time.

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time.  We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix.  We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month.  Actual Azure Service names have been removed for confidentiality reasons.

## Step 1: Reformat Data into Transaction Matrix

First, let’s import standard transaction data and convert it to a sample matrix of the form shown in Figure 1.  If transactions are listed across columns instead of rows over time, simply reconfigure the data using relevant R functions from one of the reshape2, data.table, or tidyr libraries.  Note that to convert to the correct matrix format, the easiest method is to export the sequences to a temporary text file, and re-import using the read_baskets function within the arulesSequences package.

Figure 1: Sample conversion of data input to transaction matrix.  The stand-in letters in the “ServiceLevel” column represent specific Azure services, such as Compute, Networking, Data + Storage, Web + Mobile, etc.

## Step 2: Run SPADE Algorithm

We will use the SPADE (Sequential Pattern Discovery using Equivalence classes) algorithm for temporal market basket analysis, which is invoked with the cspade function.  This method, recursive by length of sequence, is detailed in Figure 2 below.  For example, in the first pass we look at sequences of length 1 (i.e., we determine which individual Azure Services appear in our transaction data).  Based on the most frequent single-length sequences (e.g., “A” appears more often than “D”), we observe two types of two-element sequences.  First, we observe two-element temporal sequences (“B → A” requires “B” to be purchased before “A”).  Next, we observe two-element item groupings (“AB” requires “A” and “B” to exist at a certain time simultaneously).  Then, based on the most frequent length-two outputs, we move on to finding three-element sequences (e.g., “D → B → A”) and three-element item groupings (e.g., “ABF”).  This continues until we reach a user-defined maximum length, or until we reach a length at which we can no longer find frequent outputs.

Going forward, we will refer to the term “itemset” as any collection of items purchased by the customer, so an itemset can consist of item groupings (which could be a single item) or temporal sequences.

Figure 2: SPADE Algorithm frequent sequence generation (original image from [5]).

We’ve used the word “frequent” several times in the SPADE algorithm description. What exactly does this mean? There are three different terms that quantify frequency for MBA generally, and a best-case scenario involves high values in all three metrics: support, confidence, and lift.  Here are their definitions:

Support(}): The fraction of transactions that contain an itemset {ɑ}.  High support values correspond to commonly-found itemsets that are applicable to many transactions.

Confidence(} → }): The likelihood that the sequential rule {ɑ} → {β} actually occurs among transactions containing itemset {ɑ}, under the constraint that itemset {ɑ} is transacted prior to {β}.  Confidence can be interpreted as a conditional probability given a pre-existing transaction with itemset {ɑ}; high confidence implies a high likelihood that {β} occurs in a future transaction.  Technically, confidence is defined as the ratio: Support({ɑ} and {b}) / Support({ɑ}).

Lift({ɑ} → {β}): The degree of improvement from applying the sequential rule {ɑ} → {β} over observing itemsets {ɑ} and {β} independently occurring in the wild.  The lift ratio is defined by Support({ɑ} and {β}) / (Support({ɑ}) * Support({β})). Note that if itemsets {ɑ} and {β} were independently occurring events, then the denominator would equate to the numerator, and the lift value would be 1.  High lift – greater than 1 – implies that the presence of pre-existing itemset {ɑ} has increased the probability that {β} will occur in a future transaction; this can be interpreted as a higher degree of dependence between itemsets {ɑ} and {β}.  Conversely, a low lift value – less than 1 – implies a negative dependence between itemsets {ɑ} and {β}. This metric tells you not just what is popular with all customers, but what is most useful for a particular customer given their history (e.g., a low lift value could be interpreted as {β} acting as a substitute good for {ɑ}).

Returning to our example, recall that our input (trans_matrix created in Step 1) is the month-over-month sequence of Azure services purchased.  With just one line of code, the cspade function returns frequent sequences mined in descending order of support.  Further, to cut down on runtimes, we can define a minimum support value to be output – in the below code, we use a minimum of 0.3, which in this case gives us 18 different sequences to observe.  We will see how confidence and lift play a role in Step 3 of this tutorial.

We can view results in the data frame s1.df thus created, and look at the following statistics by calling summary on s1.  Specifically, we see:

“(1) the list of the most frequent individual items (A,B, ..),

(2) the list of the most frequent set of items that occur in events (referred to as elements),

(3) the distribution of the sizes of the set of items,

(4) the distribution of the number of events in a sequence (referred to as sequence length),

(5) the minimum, maximum, mean and median support values, and

(6) the set of frequent sequences mined, ordered by its support value” [6].

Figure 3: Sample output from the cspade algorithm shows 18 frequent itemsets, as well as statistics on these values (e.g., item A appears 11 out of 18 times overall as part of various itemsets, but appears by itself in a transaction as {A} only 8 times).  Note that the comma delimiter between sets within each sequence implies a temporal sequence as seen in Figure 2.

## Step 3: Find and Interpret Temporal Rules

It takes just one more line of code to convert this set of sequences into a set of rules.  Specifically, the “strong association rules” generated must satisfy minimum support and confidence values, and the left-hand side of the rule must occur at a time before the rule’s right-hand side.  We have set the lower confidence bound to 0.5 in this example; a common default is a minimum of 0.8.

This returns the data frame shown in Figure 4.  We can interpret the “rule” column as showing which bundles of Azure services imply consumption of additional services in coming months.  Further, for all rules, we can compare the three previously-defined metrics: support, confidence, and lift.

Figure 4: Sample output from the ruleInduction method.

How do we interpret this?  We can say, for example, every single time that we have seen service D purchased by a customer, the same customer will always buy both B and F simultaneously at a later time.  This is what a confidence score of 1.0 implies in Row 2 of Figure 4.  Going back to Figure 1, we see that this is the case for user IDs 1 and 4.  Meanwhile, the lift value of 1.0 implies that the probability of D being purchased, and the probability of both B and F being purchased afterwards, are independent of each other.  So, we cannot claim that buying both B and F is dependent on having previously bought item D, but also, these events do not appear negatively dependent (i.e., it is also not the case that buying D previously makes it relatively unnecessary for the user to later buy B and F, as would be the case for the rules having a lift value of 0.5).  While this output is fairly straightforward, we can do some simple cleaning to parse out the “before” and “after” steps of the rule, and sort by the metrics deemed to be most important for the specific use case.

Figure 5: Readable output in Excel stored in TemporalRules.csv. The values stored in FreqItemGroupingsInRules.csv are a list containing items: D; B; F; B,F; and A.

Now, our two main results are stored in the csv files, FreqItemGroupingsInRules.csv and TemporalRules.csv. Taken together, all itemsets that we care about will be defined by either the frequent item groupings or temporal rules.  But, how do we apply these to our business case?

First, the frequent item groupings themselves are useful domain knowledge.  Even if the sequential ordering of B and F are ignored, knowing that items B and F are frequently purchased together raises some business opportunities: is there a reason that they are sold separately?  Do customers have some need for both items that would be served better by rolling them into one product?

Second, the temporal rules allow us to propose better-targeted recommendations over time to customers.  For example, assume that we see the rule stating that “<{D}> => <{B,F}>“ with high confidence and reasonable lift (both scoring 1.0 in our example).  For a customer who has previously purchased item D, we can recommend that they now buy the bundle of two items, B and F.  In a future timestep, suppose our customer loves this suggestion and has now purchased B and F simultaneously.  Now, we can consult the temporal rules again and find the rule “<{B,F}> => <{A}>” (which we choose due to the correct starting bundle of goods, as well as the high confidence).  Now, we can further recommend that our customer adopt item A at a future time.  In fact, we can parse these rules more easily using the cleaned Excel output shown in Figure 5 – our recommendation path is shown in Row 4.  This style of targeted personalized recommendations is helpful to both our customers and our salespeople, and future recommendations will become increasingly accurate as more data are collected to contribute to the confidence and lift calculations.

## Things to Consider

The above steps show a computationally efficient method to find temporal rules; this is far faster than a brute-force method, such as finding frequent itemsets in each timestep using the arules package, and then manually comparing them across time.  However, depending on the support, confidence, and lift cut-offs desired, it is possible for even the arulesSequences method to take significant amounts of computation time. Hence, if the main desired outcome is expected to be one of the highest-scoring rules, it may be prudent to start from a high threshold frequency cut-off before testing lower values.

Aside from frequency cut-off parameters, it is worth considering the product aggregation level at which results are measured.  Let us assume that we have products or services being sold that are organized in a hierarchy.  It will take less time to run the SPADE algorithm on the least-granular level since the algorithm will see more repeated services being sold to different customers.  If there is a desired minimum cut-off that takes too long to run on the specific product levels you desire, consider running the above code on a higher, more aggregate level of the product hierarchy, and then drill down in that set to determine which products are best to recommend in future timesteps. For example, at a higher product level within Azure Services (such as Analytics, Compute, Web + Mobile, etc.), suppose we find that we should recommend an Analytics product to a specific customer.  Upon further examination, perhaps we find that Analytics-based temporal rules on more granular products indicate to recommend Machine Learning tools as opposed to Data Lake products. Hence, we are still able to make a reasonable suggestion for lower-level products without spending excessive computation time on the entire dataset.

Lastly, the above sequential pattern mining code may not be directly applicable if you: (1) care about the quantity of items being bought at any given point in time (since we simply observe the presence or absence of an itemset in this tutorial), or (2) have data that are irregular over time, but aim to predict a recommendation for a specific time interval in the future. In the former case, encoding a repeated item with separate item names (i.e., one name for each unit purchased) can allow quantity to be expressed.  For the latter case, it is a best practice to use regular time intervals; this can be done by bucketing purchases by, e.g., month (if predictions are to be made over a monthly timeframe). Depending on sales structure, it may be necessary to either interpolate or report zero purchases in intermediate timesteps.

In summary, given the wide usage of Excel by finance teams tracking historical purchases, this approach offers an effective tool for providing insights on product or service adoption over time.  Sequential pattern mining as implemented here can potentially be harnessed to recommend deals to a company’s salespeople, find what customers are likely to consume next, and discover which product bundles remain popular over time.  Let us know what use cases you have found for this method!

### References

1. Srikant, R., Agrawal, R., Apers, P., Bouzeghoub, M., Gardarin, G. (1996) “Mining sequential patterns: Generalizations and performance improvements”, Advances in Database Technology — EDBT '96 vol. 1 no. 17.
2. Aileen P. Wright, Adam T. Wright, Allison B. McCoy, Dean F. Sittig, The use of sequential pattern mining to predict next prescribed medications, Journal of Biomedical Informatics, Volume 53, 2015, Pages 73-80, ISSN 1532-0464, https://doi.org/10.1016/j.jbi.2014.09.003.
3. Song SJ., Huang Z., Hu HP., Jin SY. (2004) A Sequential Pattern Mining Algorithm for Misuse Intrusion Detection. In: Jin H., Pan Y., Xiao N., Sun J. (eds) Grid and Cooperative Computing – GCC 2004 Workshops. GCC 2004. Lecture Notes in Computer Science, vol 3252. Springer, Berlin, Heidelberg.
4. Package ‘arulesSequences’ documentation: https://cran.r-project.org/web/packages/arulesSequences/arulesSequences.pdf.
5. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31–60.
6. Data Mining Algorithms in R: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Sequence_Mining/SPADE.
7. Analyzing Transaction Data like a Data Scientist: https://rpubs.com/Mahsa_A/Part4_AnalyzeTransactionData.