
Text Mining on Wine Description

[This article was first published on François Husson, and kindly contributed to R-bloggers].

Here is an example of text mining with correspondence analysis.
Within the context of research into the characteristics of the wines from Chenin vines in the Loire Valley (French wines), a set of 10 dry white wines from Touraine were studied: 5 Touraine Protected Appellation of Origin (AOC) from Sauvignon vines, and 5 Vouvray AOC from Chenin vines.
These wines were described by 12 professionals. The instructions were: for each wine, give one or more words which, in your opinion, characterise the sensory aspects of the wine. These data were brought together in a table with the wines as rows and the words as columns, where the general term x_ij is the number of times that word j was associated with wine i (data are available here).
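As a rough illustration of how such a table can be assembled, here is a minimal sketch in base R. The `citations` data frame and its wine and word labels are hypothetical toy values, not the actual study data; the only assumption is that the raw answers are stored with one row per (wine, word) citation.

```r
# Hypothetical toy data: one row per (wine, word) citation given by a judge.
citations <- data.frame(
  wine = c("W1", "W1", "W1", "W2", "W2", "W3", "W3", "W3"),
  word = c("fruity", "fresh", "fruity", "oak", "sweet", "sweet", "fruity", "sweet")
)

# Cross-tabulate: entry [i, j] counts how often word j was given for wine i.
counts <- table(citations$wine, citations$word)

# Convert to a plain data frame of counts, a form that CA() accepts.
wine_toy <- as.data.frame.matrix(counts)
print(wine_toy)
```

Each cell of `wine_toy` then plays the role of x_ij in the description above.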

This contingency table was analysed using Correspondence Analysis (CA) to provide an image summarising the diversity of the wines. Prior to the analysis, the least frequently used words were removed and a number of “neighbouring” words were grouped together (for example, sweet, smooth, and syrupy, all of which refer to the same perception: the sweet taste of the wine).
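The preprocessing described above could be sketched as follows in base R. The toy counts, the `groups` mapping, and the `min_count` threshold are all assumptions made for illustration; the post does not state which threshold or grouping rules were actually applied.

```r
# Hypothetical toy table: wines in rows, words in columns.
counts <- matrix(
  c(3, 1, 0, 2, 0,
    0, 2, 1, 4, 0,
    1, 0, 2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("W1", "W2", "W3"),
                  c("sweet", "smooth", "syrupy", "fruity", "metallic"))
)

# Assumed mapping from each word to the group it is merged into.
groups <- c(sweet = "sweet", smooth = "sweet", syrupy = "sweet",
            fruity = "fruity", metallic = "metallic")

# rowsum() adds up rows that share a label, so transpose, sum, transpose back.
grouped <- t(rowsum(t(counts), group = unname(groups[colnames(counts)])))

# Drop words cited fewer than min_count times overall (arbitrary threshold).
min_count <- 2
filtered <- grouped[, colSums(grouped) >= min_count, drop = FALSE]
print(filtered)
```

Here the three sweetness words collapse into a single "sweet" column, and "metallic", cited only once, is dropped.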

CA is implemented using the following commands:

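library(FactoMineR)
# Read the contingency table; the first column holds the row names
wine = read.table("http://factominer.free.fr/bookV2/wine.csv",
     header=TRUE,row.names=1,sep=";",check.names=FALSE)
# Column 11 and row 31 are treated as supplementary:
# they are projected onto the axes but do not help build them
res.ca = CA(wine,col.sup=11,row.sup=31)
# Print eigenvalues, coordinates, contributions and cos2
summary(res.ca)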
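For readers curious about what CA computes, the following is a minimal base R sketch of the standard computation (a singular value decomposition of the standardised residuals from the independence model). It uses a hypothetical toy table rather than the wine data, ignores supplementary rows and columns, and is not FactoMineR's own code; the variable names are chosen for illustration only.

```r
# Hypothetical toy contingency table (not the wine data).
N <- matrix(c(4, 1, 0,
              3, 2, 1,
              0, 2, 5,
              1, 0, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("W", 1:4), c("fruity", "oak", "sweet")))

P  <- N / sum(N)        # correspondence matrix (relative frequencies)
r  <- rowSums(P)        # row masses
cm <- colSums(P)        # column masses

# Standardised residuals from the independence model outer(r, cm).
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))

dec <- svd(S)
eig <- dec$d^2          # inertia carried by each dimension
k   <- min(dim(N)) - 1  # number of non-trivial dimensions

# Principal coordinates: scale singular vectors by singular values,
# then divide each row by the square root of its mass.
row_coord <- sweep(dec$u[, 1:k, drop = FALSE], 2, dec$d[1:k], "*") / sqrt(r)
col_coord <- sweep(dec$v[, 1:k, drop = FALSE], 2, dec$d[1:k], "*") / sqrt(cm)
rownames(row_coord) <- rownames(N)
rownames(col_coord) <- colnames(N)

print(round(eig[1:k] / sum(eig) * 100, 1))  # percentage of inertia per dimension
```

The total inertia `sum(eig)` equals the chi-square statistic of the table divided by its grand total, which is why CA is often described as a decomposition of the departure from independence.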

The graph produced by the CA shows three poles of wines.

Once these three poles are established, we can go on to qualify the dimensions. The first distinguishes the Sauvignons from the Chenin wines based on freshness and flavour. The second contrasts the cask-aged Chenin wines (with an oak flavour) with those containing residual sugar (with a sweet flavour).

Having determined these outlines, the term lack of character, which was only used for wines 6 and 8, seems to appear in the right place, i.e., far from the wines which could be described as flavoursome, whether the flavour is due to the Sauvignon vines or to ageing in oak casks.

Finally, this plane offers an image of the Touraine white wines, according to which the Sauvignons are similar to one another and the Chenins are more varied. From a viticulturist’s point of view, this analysis identifies the marginal characteristics of the Chenin vine. In practice, this vine yields rather varied wines which seem particularly different from the Sauvignons, as the latter are somewhat similar to one another and rather typical.

You can find a complete description of this data in the book Exploratory Multivariate Analysis by Example Using R (Husson, Lê, Pagès).

Here are some materials: a video on another example of text mining, a video to better understand the CA method, and this video to see how to run CA with the R package FactoMineR.

You can also enroll in this MOOC.

