(This article was first published on **Pareto's Playground**, and kindly contributed to R-bloggers.)

Our behaviour is often highly variable, and reducing it to a single number such as an average may be comforting but is ultimately misleading. For instance, I generally use 500 MB of data on my cellphone each month, but this can be as little as 200 MB and on occasion well beyond 1 GB. This suggests my behaviour is better modelled as a distribution with a peak around 500 MB and a mean just above it. Latent Class Analysis attempts to recover these distributions of behaviour for groups of customers.

I am going to simulate a cellphone dataset in which the segments represent different distributions of cellphone usage. The advantage of simulation is that we know the true process that generated the data, so we can check whether the model recovers the right answer. Simulation also lets us test how sensitive the model is to sample size (though I will not cover that here), and it removes the gremlins of real data, which often take up 80% of the time to resolve and understand.

Figure 1 below shows the 5 segments hidden in the data we will simulate:

From Figure 1, you should see that 10% of customers fall into the ‘low usage’ segment. Our segmentation base consists of 6 variables, each with 3 levels denoting intensity of usage: Low, Medium and High. A ‘low usage’ customer is expected to have a ‘Low’ cellphone spend 90% of the time, a ‘Medium’ spend 8% of the time and a ‘High’ spend 2% of the time. It is important to note that the segments have overlapping distributions, which makes deciding which segment a customer belongs to a difficult exercise.

The analysis below was implemented using R. Snippets of code have been included for those familiar with the language to give a more concrete understanding.

I have set the seed to ensure the numbers are reproducible, although this is not strictly necessary. We generate a sample of 5000 cellphone customers with 10% in ‘low usage’, 20% in ‘medium usage – receive calls’, and so on.
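A minimal sketch of this step. Only the 10% and 20% shares are stated above; the splits I use for the remaining three segments (40%, 15%, 15%) are illustrative assumptions:

```r
set.seed(123)  # reproducible draws

n <- 5000
segments <- c("low usage", "medium usage - receive calls",
              "medium usage - made calls", "high usage", "data usage")
shares <- c(0.10, 0.20, 0.40, 0.15, 0.15)  # last three shares are assumed

# Assign each simulated customer to a segment
membership <- sample(segments, size = n, replace = TRUE, prob = shares)
head(membership)
```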

We tabulate our customer sample into a table of counts.
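For example, regenerating the membership vector described above (segment shares beyond the stated 10% and 20% are assumed) and tabulating it:

```r
set.seed(123)
segments <- c("low usage", "medium usage - receive calls",
              "medium usage - made calls", "high usage", "data usage")
membership <- sample(segments, 5000, replace = TRUE,
                     prob = c(0.10, 0.20, 0.40, 0.15, 0.15))  # shares partly assumed

table(membership)                        # counts per segment
round(prop.table(table(membership)), 3)  # sample shares, close to the targets
```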

In this section, we translate the distribution tables in Figure 1 into probability matrices for each segment.
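A sketch of one such matrix. The 90/8/2 split for the ‘low usage’ segment is quoted above; giving all 6 variables the same profile is a simplifying assumption for illustration:

```r
# 6 variables (rows) x 3 usage levels (columns) for the 'low usage' segment
p_low <- matrix(rep(c(0.90, 0.08, 0.02), times = 6),
                nrow = 6, byrow = TRUE,
                dimnames = list(paste0("var", 1:6),
                                c("Low", "Medium", "High")))
rowSums(p_low)  # each variable's probabilities sum to 1
```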

The above code creates our simulated dataset. To simplify the simulation, each question is simulated under conditions of independence. Simulating non-independence is possible, but we would have to specify every conditional distribution, that is, 4379 parameters rather than 95.
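Under independence, each of the 6 variables is drawn separately from its own multinomial distribution given the segment. A sketch, using a hypothetical 90/8/2 profile for every variable of a ‘low usage’ customer:

```r
set.seed(123)

# Hypothetical response probabilities: 6 variables x 3 levels
p_low <- matrix(rep(c(0.90, 0.08, 0.02), 6), nrow = 6, byrow = TRUE)

# Draw each variable independently for n customers in this segment
simulate_segment <- function(n, probs) {
  sapply(seq_len(nrow(probs)), function(j)
    sample(1:3, n, replace = TRUE, prob = probs[j, ]))
}

x <- simulate_segment(1000, p_low)
dim(x)  # 1000 customers x 6 variables, with levels coded 1-3
```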

A bit of housekeeping: converting the customer data into a data frame and adding labels.
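Something along these lines; the variable names here are hypothetical stand-ins, not the ones used in the original dataset:

```r
set.seed(123)
x <- matrix(sample(1:3, 5000 * 6, replace = TRUE), ncol = 6)  # stand-in data

df <- as.data.frame(x)
names(df) <- c("spend", "calls_made", "calls_received",
               "sms", "data_mb", "intl")  # hypothetical variable names
# Convert the 1/2/3 codes into labelled factors
df[] <- lapply(df, factor, levels = 1:3,
               labels = c("Low", "Medium", "High"))
str(df)
```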

We finally run our latent class model for 5 segments using the polytomous variable Latent Class Analysis package (poLCA). We repeat the run 10 times to ensure we converge to a global solution rather than a local one. The default number of iterations is not enough for convergence, so I have increased it to 100,000.
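The call looks roughly like this, assuming the hypothetical variable names used above (poLCA takes the manifest variables on the left-hand side of its formula and `~ 1` for a model with no covariates):

```r
library(poLCA)
set.seed(123)

# Stand-in data: 6 three-level factors with hypothetical names
df <- as.data.frame(matrix(sample(1:3, 5000 * 6, replace = TRUE), ncol = 6))
names(df) <- c("spend", "calls_made", "calls_received", "sms", "data_mb", "intl")
df[] <- lapply(df, factor, levels = 1:3, labels = c("Low", "Medium", "High"))

f <- cbind(spend, calls_made, calls_received, sms, data_mb, intl) ~ 1

# nrep = 10 restarts guard against local optima; iteration cap raised to 100,000
lca5 <- poLCA(f, data = df, nclass = 5, nrep = 10, maxiter = 100000,
              verbose = FALSE)
lca5$P  # estimated segment shares
```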

**So did poLCA recover our segments?**

Well, it is a bit of a mixed bag.

The estimated sizes of the segments with the least overlap, ‘Low usage’ and ‘High usage’, are within 1% of the truth. The model underestimates the size of the ‘data usage’ segment by 15% and overestimates the ‘medium usage – made calls’ segment by 20%.

The uncertainty plot indicates that for about 17% (800) of customers, the model was less than 50% confident that the customer belonged in the segment with the highest probability of membership. This helps explain why the penetration estimates were so far off.
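This uncertainty measure can be computed directly from the model's posterior membership probabilities. A sketch with a stand-in posterior matrix (in practice you would use the fitted object's `posterior` component):

```r
set.seed(123)

# Stand-in posterior: 5000 customers x 5 segments, rows summing to 1
post <- matrix(rexp(5000 * 5), ncol = 5)
post <- post / rowSums(post)

# A customer's modal assignment is the segment with the highest posterior
max_post <- apply(post, 1, max)
mean(max_post < 0.5)  # share of customers classified with < 50% confidence
```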

Thanks to simulation, we can check whether Latent Class Analysis classified each customer into the right segment. Looking at the confusion table above, poLCA had a 68% accuracy rate. Random guessing would be 20% accurate, and classifying everyone into the largest segment would be 40% accurate. Human intuition would probably get the segments with the least overlap right (15%), and classifying the rest into the largest segment would push accuracy to around 55%.
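Accuracy here is the share of customers on the diagonal of the confusion table of true versus modally assigned segments. A sketch with stand-in labels (in practice the predictions come from the segment with the highest posterior probability, matched back to the true segment order):

```r
set.seed(123)

true <- sample(1:5, 5000, replace = TRUE)
# Stand-in predictions: right about 2/3 of the time, otherwise a random segment
pred <- ifelse(runif(5000) < 2/3, true, sample(1:5, 5000, replace = TRUE))

conf <- table(true = true, pred = pred)  # confusion table
sum(diag(conf)) / sum(conf)              # overall accuracy rate
```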

The table shows the accuracy of the model in estimating the 95 parameters of the simulation model. Hopefully it is clear from the above summary statistics that untangling overlapping behaviours is a tough endeavour. There is a loss of information in moving from the true distribution to the observed behaviour, which makes going backwards hard. Latent Class Analysis does at least give us an idea of how uncertain the model is, and is possibly the best we can do without being omniscient.

Final thoughts …

- Should we simplify the problem by merging the ‘data usage’ and ‘medium usage – made calls’ segments? This would increase our accuracy, but the segments would become less specific.
- I did not cover how to determine the best number of segments when the true number is unknown.
- Will a larger sample size improve matters?
- How do we deal with the stability of the segments, especially with the growing prevalence of over-the-top (OTT) content in the cellphone industry?
- How do other segmentation algorithms, such as K-means, compare?
