The Lady Loves Statistics

[This article was first published on Mango Solutions, and kindly contributed to R-bloggers.]

By Paulin Shek

Working at Mango is generally busy and fun, but also, at times, quite surreal. Lunchtime conversation amongst the consultants can get quite animated, especially when Andy Nicholls, the Head of Consultancy, finds a topic that he has a strong opinion on.

Today, it was chocolate.

In Andy’s mind, chocolate falls into two categories: “Chocolate I would eat after swimming” and “Chocolate I would not eat after swimming”.

I severely questioned his expertise, so it was now up to me to prove him wrong. Off to the supermarket I went, for my all-important “research”.

[Image: chocolate for blog]
Andy’s classification of these was as follows – * indicates chocolate bars that Andy was uncertain about.

table 1

Now for my part: I collected data from the chocolate bars, recording fields such as Weight, Calories and Protein, which were readily found on the wrapper. I also made flags for whether a chocolate bar has nuts, wafer, caramel and so on.

##               Weight Calories  Fat Carb Protein Salt HasNuts HasWafer
## KitKat Chunky   40.0      516 25.6 65.1     5.4 0.18       0        1
## Boost           48.5      515 28.5 58.5     5.9 0.30       0        0
## Dairy Milk      45.0      530 30.5 56.5     7.5 0.23       0        0
## Galaxy          42.0      546 32.4 56.0     6.7 0.25       0        0
## Twix            50.0      495 24.0 64.6     4.5 0.44       0        0
## Picnic          48.4      485 22.5 61.0     7.7 0.53       1        1
##               HasCaramel HasHoneycomb HasNougat
## KitKat Chunky          0            0         0
## Boost                  1            0         0
## Dairy Milk             0            0         0
## Galaxy                 0            0         0
## Twix                   1            0         0
## Picnic                 1            0         0
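For anyone following along, the six rows printed above could be typed into a data frame roughly like this. It is only a sketch: the full data set contained more bars than are shown here, and the name `chocolates` is borrowed from the code later in this post.

```r
# The six bars printed above, one named column per field.
# Flags are stored as 0/1 so they can be clustered alongside the numeric fields.
chocolates <- data.frame(
  Weight       = c(40.0, 48.5, 45.0, 42.0, 50.0, 48.4),
  Calories     = c(516, 515, 530, 546, 495, 485),
  Fat          = c(25.6, 28.5, 30.5, 32.4, 24.0, 22.5),
  Carb         = c(65.1, 58.5, 56.5, 56.0, 64.6, 61.0),
  Protein      = c(5.4, 5.9, 7.5, 6.7, 4.5, 7.7),
  Salt         = c(0.18, 0.30, 0.23, 0.25, 0.44, 0.53),
  HasNuts      = c(0, 0, 0, 0, 0, 1),
  HasWafer     = c(1, 0, 0, 0, 0, 1),
  HasCaramel   = c(0, 1, 0, 0, 1, 1),
  HasHoneycomb = c(0, 0, 0, 0, 0, 0),
  HasNougat    = c(0, 0, 0, 0, 0, 0),
  row.names    = c("KitKat Chunky", "Boost", "Dairy Milk",
                   "Galaxy", "Twix", "Picnic")
)

# Quick structural check: 6 observations of 11 numeric variables
str(chocolates)
```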

Already I was realising that I did not have enough chocolate for the number of fields that I’d collected! Also, did I really want Kinder Buenos with hasNuts=TRUE? I had originally expected this field to be some indicator of the “bulkiness” of the chocolate bar, and Kinder Buenos did not fit with this expectation. In the end, I decided to change the field to hasPeanuts, which excluded Kinder Buenos but kept things like Snickers and Picnic bars. I was very quickly realising the heuristic nature of clustering.

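# Centre each column, then divide by its variance (note: not the standard deviation)
normalise <- function(x) { (x - mean(x)) / var(x) }
# Apply column-wise (MARGIN = 2); the result is a numeric matrix
choc <- apply(chocolates, 2, normalise)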
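It is worth looking at what this normalise() does to columns on very different scales. Because it divides by the variance, a column's spread after scaling is 1/sd(x) rather than 1, so a high-variance column such as Calories ends up much narrower than a 0/1 flag. The sketch below compares it with a conventional z-score. Here `zscore` is an illustrative helper, not something from the original analysis, and the two columns are taken from the six rows printed earlier.

```r
normalise <- function(x) (x - mean(x)) / var(x)  # as used in this post
zscore    <- function(x) (x - mean(x)) / sd(x)   # illustrative alternative

# Two columns from the six bars printed earlier
cols <- list(
  HasWafer = c(1, 0, 0, 0, 0, 1),
  Calories = c(516, 515, 530, 546, 495, 485)
)

# Spread after the post's scaling: equals 1 / sd(x), so it differs per column
sapply(cols, function(x) sd(normalise(x)))

# Spread after z-scoring: exactly 1 for every column
sapply(cols, function(x) sd(zscore(x)))
```

Base R's scale() performs the same centring and division by the standard deviation as `zscore` here. Which scaling suits a clustering problem is a judgment call, since it decides how much each field contributes to the distances.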

Next, I ran the standard k-means clustering algorithm from the stats package. I decided to try 2, 3 and 4 clusters, because I was not convinced that Andy’s binary classification made sense, but with such a small sample I couldn’t try too many either.
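A minimal sketch of that step, using stats::kmeans() on only the six bars printed earlier, so the cluster assignments will not match the tables in this post. The two flag columns that are all zero in those six rows are left out, because normalise() would otherwise divide by a zero variance. The seed and `nstart = 25` are arbitrary choices for illustration, not settings taken from the original analysis.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed for repeatability

# Six bars from the printed output, non-constant columns only
chocolates <- read.table(header = TRUE, row.names = 1, text = "
Bar             Weight Calories Fat  Carb Protein Salt HasNuts HasWafer HasCaramel
'KitKat Chunky' 40.0   516      25.6 65.1 5.4     0.18 0       1        0
Boost           48.5   515      28.5 58.5 5.9     0.30 0       0        1
'Dairy Milk'    45.0   530      30.5 56.5 7.5     0.23 0       0        0
Galaxy          42.0   546      32.4 56.0 6.7     0.25 0       0        0
Twix            50.0   495      24.0 64.6 4.5     0.44 0       0        1
Picnic          48.4   485      22.5 61.0 7.7     0.53 1       1        1
")

normalise <- function(x) (x - mean(x)) / var(x)
choc <- apply(chocolates, 2, normalise)

# One fit per candidate number of clusters; nstart reruns each fit
# from several random starts and keeps the best one
fits <- lapply(2:4, function(k) kmeans(choc, centers = k, nstart = 25))

# Collect the cluster label of every bar, one column per value of K
clusters <- sapply(fits, function(fit) fit$cluster)
colnames(clusters) <- paste0("K=", 2:4)
clusters
```

The cluster numbers themselves are arbitrary labels. What matters when comparing columns is which bars end up sharing a label.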

table 2

We could interpret the orange coloured cells to denote “Swim bars”, and this stays consistent regardless of how many clusters we set the k-means algorithm to.

It is interesting to see that Wispas and Wispa Golds are never in the same category, which is not the case in Andy’s categorisation. However, it makes intuitive sense to others: another consultant, Aimee, said “Well, if I wanted to eat a Wispa, a Wispa Gold would not be a suitable substitute”. So there we go.

Taking K=3, we can see that the Crunchie has been identified as an outlier, but then it is hard to make sense of the blue group with K=4!

I decided that perhaps normalising every column was not the best idea: the binary values definitely needed to be adjusted, but I was less certain about things like salt. I ran the cluster analysis again, starting from the original data, but this time scaling up the binary variables (HasCaramel etc.) by multiplying them by a scalar that’s on par with the other values in the data.
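The reweighting could look something like the sketch below. The post does not say which scalar was used, so `flag_weight <- 50` is purely a hypothetical value, picked to be of a similar size to fields like Weight and Carb. As before, only the six printed bars and their non-constant columns are used, so the result is illustrative only.

```r
set.seed(1)  # fix the random starting centres for repeatability

chocolates <- read.table(header = TRUE, row.names = 1, text = "
Bar             Weight Calories Fat  Carb Protein Salt HasNuts HasWafer HasCaramel
'KitKat Chunky' 40.0   516      25.6 65.1 5.4     0.18 0       1        0
Boost           48.5   515      28.5 58.5 5.9     0.30 0       0        1
'Dairy Milk'    45.0   530      30.5 56.5 7.5     0.23 0       0        0
Galaxy          42.0   546      32.4 56.0 6.7     0.25 0       0        0
Twix            50.0   495      24.0 64.6 4.5     0.44 0       0        1
Picnic          48.4   485      22.5 61.0 7.7     0.53 1       1        1
")

# The binary flags are the columns whose names start with "Has"
flag_cols <- grep("^Has", names(chocolates), value = TRUE)

# Hypothetical scalar: the value used in the original analysis is not given
flag_weight <- 50

# Leave the numeric fields untouched and scale only the flags
weighted <- chocolates
weighted[flag_cols] <- weighted[flag_cols] * flag_weight

# Cluster the reweighted, un-normalised data
fit <- kmeans(as.matrix(weighted), centers = 3, nstart = 25)
fit$cluster
```

A larger `flag_weight` makes sharing an ingredient count for more in the distance between two bars, while a smaller one lets fields like Calories dominate.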

table 3

This time K=2 looks less reasonable, with Picnic and Snickers bars not grouped with the Mars bar. K=3 looks a bit better: the white group could almost be a “plain chocolate” group.

I think I’ve found the perfect classification with K=4 though. The blue group is clearly the “Wafer” group, all the chocolate-only bars are picked out by the purple group, and the Crunchie and Wispa Golds are put together as the “super sweet” group.

So, I’ve found the perfect classification (which is clearly better than Andy’s!). I can stop and reward myself with one of my many chocolate bars now!

A note on the normalisation step above: from a glance at the data, I could already tell that the 1, 0 values were going to be too small and hence get ignored by the clustering, which is why I decided to normalise all the values.
