… or what I did on my summer vacation…
Just got back from the Elder Research two-day course “Tools for Discovering Patterns in Data”. It was a great course that (while not R-specific) provides a solid overview of data mining tools and techniques, along with insight into current applications in a wide variety of industries.
Dr. Elder is a coauthor of a book available online (and provided with the course) called “Handbook of Statistical Analysis and Data Mining Applications.” This book contains a wealth of practical examples and tutorials (most using the StatSoft STATISTICA software). Its decidedly practical emphasis lets you see how algorithms are used to discern patterns in data, and how to evaluate and compare their effectiveness on specific data sets. Functional areas covered in the tutorials include aviation safety, movie box office receipts, customer services, credit scoring, automobile brand review, quality control, business administration in a medical industry, psychological evaluation, dentistry and profit analysis. This is very helpful for those who prefer to work from the concrete to the general (rather than being handed mathematical abstractions to apply to specific situations). The tutorials might also be helpful for showing a business user why data mining matters and what value it brings to a business or organization.
The conference covered many of the same topics discussed in Introduction to Data Mining by Tan, Steinbach and Kumar. However, there were many more concrete examples and applications of techniques in specific areas of finance, industry, government and education. In that book, ensemble methods are tucked into a larger section simply titled “Classification: Alternative Techniques”. Dr. Elder went into greater detail on these topics and demonstrated the effectiveness of combining multiple models into a single model that is usually more accurate than the best of the individual component classifiers. It seems that different classifiers “see” certain parts of data sets better than others, so they tend to make their mistakes in different places. When their predictions are combined (by voting or averaging, for example), the correct calls of each classifier are largely retained while the individual errors are mostly outvoted. By combining classifiers and manipulating the training set and input features, a more accurate final model can be obtained.
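To make the "errors in different places" idea concrete, here is a minimal sketch in Python (not from the course materials). It assumes five hypothetical classifiers that are each right 70% of the time and, importantly, make their mistakes independently of one another. Real classifiers trained on the same data are usually correlated, so the gain in practice tends to be smaller. The function names and all the numbers are made up for illustration.

```python
import random


def simulate_votes(n_examples, n_classifiers, accuracy, seed=0):
    """Simulate binary labels and noisy classifiers (hypothetical setup).

    Each simulated classifier independently returns the true label with
    probability `accuracy` and the wrong label otherwise.
    """
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n_examples)]
    predictions = []
    for _ in range(n_classifiers):
        preds = [t if rng.random() < accuracy else 1 - t for t in truth]
        predictions.append(preds)
    return truth, predictions


def majority_vote(predictions):
    """Combine per-classifier 0/1 predictions by simple majority.

    With an odd number of classifiers there are no ties.
    """
    n = len(predictions)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*predictions)]


def accuracy_score(truth, preds):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)


# Assumed toy settings: 10,000 examples, 5 classifiers, each 70% accurate.
truth, predictions = simulate_votes(10000, 5, 0.7)
individual = [accuracy_score(truth, p) for p in predictions]
ensemble = accuracy_score(truth, majority_vote(predictions))

print("individual accuracies:", [round(a, 3) for a in individual])
print("majority-vote accuracy:", round(ensemble, 3))
```

Under the independence assumption, a majority of five 70% classifiers should be right roughly 84% of the time, because the vote only fails when three or more of them are wrong on the same example. Bagging and manipulating the input features, mentioned above, are ways of pushing real classifiers toward that kind of diversity.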
More detail about ensemble methods is available in another book coauthored by Dr. Elder entitled Ensemble Methods in Data Mining. This book goes into greater detail about how and when to use ensembling and includes some examples in R. The use of multiple classification techniques involves a trade-off: on the one hand ensembles seem to work in practice, but their use makes it more difficult to trace how a final combined model is constructed from the original data set. This has raised some interesting questions about the definition of complexity and the quest for simple, accurate models.
Dr. Andrew Fast presented on Text Mining and Social Network Analysis and provided some valuable insights into these rapidly developing fields. There were also a number of software demos and time to interact with other members of the Elder Research staff and conference participants.
The conference took place in Charlottesville, VA, which is a great setting with many historical and recreational attractions nearby.
So that’s what I did on my summer vacation…