by Shaheen Gauher, PhD, Data Scientist at Microsoft
In machine learning, classification is the problem of identifying the class or group to which a new observation belongs, by learning from observations whose classes are already known. In what follows, I will build a classification experiment in Azure ML Studio to predict wine quality from physicochemical data. Several classification algorithms will be applied to the data set and their performance will be compared. I will also present a tutorial on how to do a similar exercise using MRS (Microsoft R Server, formerly Revolution R Enterprise).
I will use the wine quality data set from the UCI Machine Learning Repository. The data set contains quality ratings (labels) for 1,599 red wine samples. The features are the wines’ physical and chemical properties (11 predictors). We want to use these properties to predict the quality of the wine. The experiment is shown below and can be found in the Cortana Intelligence Gallery.
There are several classification algorithms available in Azure ML, viz. Multiclass Decision Forest, Multiclass Decision Jungle, Multiclass Logistic Regression, Multiclass Neural Network and One-vs-All Multiclass, which creates a multiclass classification model from an ensemble of binary classification models. Each of these algorithms has its advantages. The Decision Forest consists of an ensemble of randomly trained decision trees. Ensemble models in general provide better coverage and accuracy than single decision trees. Building multiple random decision trees and training them independently improves generalization and resilience to noisy data. Decision Jungles are a recent extension to decision forests. They require less memory and have considerably improved generalization. Given a sufficient number of hidden layers and nodes, neural networks can in principle approximate any continuous function. However, they can be computationally expensive to train and tune because of the number of hyperparameters involved. Multiclass Logistic Regression is an extension of Logistic Regression and predicts the probability of each outcome. The best practice for finding which algorithm will perform best is to try them!
The original data had several labels, some of which had very few instances. Using the Execute R Script module as shown below, I relabel the data as Low, Med and High, reducing it to a multiclass classification problem with three classes.
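The relabeling idea can be illustrated outside Azure ML Studio. The sketch below is written in Python rather than the R used in the experiment, and the cut points (ratings up to 4 as Low, 5 and 6 as Med, 7 and above as High) are assumptions chosen for illustration only; the text does not state which thresholds were actually used.

```python
from collections import Counter

# Hypothetical cut points; the article does not state the ones actually used.
LOW_MAX = 4   # ratings <= 4 become "Low"
MED_MAX = 6   # ratings 5..6 become "Med", anything higher becomes "High"

def relabel(quality):
    """Map a numeric quality rating to one of three coarse classes."""
    if quality <= LOW_MAX:
        return "Low"
    if quality <= MED_MAX:
        return "Med"
    return "High"

# Made-up ratings standing in for the 'quality' column.
ratings = [3, 5, 6, 7, 5, 8, 4, 6]
labels = [relabel(q) for q in ratings]

# Counting the new classes shows how sparse ratings are pooled together.
class_counts = Counter(labels)
print(labels)
print(class_counts)
```

Pooling rare ratings this way gives each class more examples to learn from, at the cost of a coarser prediction target.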
In the section below I will show some results from four Multiclass Classification modules available in Azure ML Studio.
Decision Forest has the highest accuracy of all the algorithms for this data, while Neural Network performs the worst. Using the Permutation Feature Importance module in the experiment, we also see that the attributes 'alcohol', 'sulphates' and 'volatile acidity' have the highest predictive power and contribute the most to the prediction of wine quality.
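The general idea behind permutation feature importance can be sketched in a few lines: shuffle the values of one feature across rows and measure how much the model's accuracy drops. The Python sketch below is not the Azure ML module's implementation; the function names, the toy model (which looks only at 'alcohol') and the data values are all hypothetical and exist only to show the mechanics.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(labels)

def permutation_importance(model, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    # Copy each row, replacing only the chosen feature with a shuffled value.
    permuted = [dict(row, **{feature: v}) for row, v in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)

# Hypothetical stand-in model: it looks only at 'alcohol' and ignores 'pH'.
def toy_model(row):
    return "High" if row["alcohol"] >= 11.0 else "Low"

# Made-up rows; the labels are generated by the toy model itself.
rows = [
    {"alcohol": 9.4, "pH": 3.51},
    {"alcohol": 12.1, "pH": 3.20},
    {"alcohol": 10.2, "pH": 3.35},
    {"alcohol": 11.8, "pH": 3.10},
    {"alcohol": 9.8, "pH": 3.42},
    {"alcohol": 12.5, "pH": 3.28},
]
labels = [toy_model(r) for r in rows]

for name in ("alcohol", "pH"):
    print(name, permutation_importance(toy_model, rows, labels, name))
```

A feature the model ignores, such as 'pH' in this toy setup, has an importance of exactly zero, because shuffling it cannot change any prediction.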
In the sections below, I will provide a hands-on tutorial on building classification models using Microsoft R Server. I will continue to use the wine data set described above. For help with MRS functions, please refer to the MRS documentation. (MRS can be downloaded for academic use here.)
I will start by downloading the data from the UCI repository. After relabeling the data as Low, Med and High, I will convert it to an object of class "RxXdfData" using the RevoScaleR function rxDataStep(). The RevoScaleR package comes installed with MRS and contains all of the functions for data handling and preparation.
The quality ratings for the wine are in the column 'quality'. I will make a new column 'factorQuality' from the column 'quality' using the rxFactors() function, forcing the label column to be categorical. Using rxDataStep() with varsToKeep, I will then remove the column 'quality'. I will also rename the label column 'factorQuality' to 'LabelsCol' to make the code generic.
Next, I will create a column called 'splitcol' to use for splitting the data. Using the rxSplit() function, we can then split the data into training and test sets.
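The logic of tagging rows with a split column and then partitioning on it can be sketched independently of RevoScaleR. The Python below is an illustration only: the helper names, the 70/30 proportion and the random seed are assumptions, and it does not reproduce how rxSplit() works internally.

```python
import random

def add_split_column(rows, train_fraction=0.7, seed=42):
    """Tag each row as 'Train' or 'Test' in a new 'splitcol' field.

    The 70/30 proportion is an assumed value, not one taken from the article.
    """
    rng = random.Random(seed)
    return [
        dict(row, splitcol="Train" if rng.random() < train_fraction else "Test")
        for row in rows
    ]

def split_by_column(rows, column="splitcol"):
    """Group rows by the value found in the split column."""
    parts = {}
    for row in rows:
        parts.setdefault(row[column], []).append(row)
    return parts

# Made-up rows standing in for the wine samples.
rows = [{"id": i, "alcohol": 9.0 + 0.1 * i} for i in range(20)]

tagged = add_split_column(rows)
parts = split_by_column(tagged)
train = parts.get("Train", [])
test = parts.get("Test", [])
print(len(train), len(test))
```

Keeping the assignment in a column makes the split reproducible: the same rows end up in the same partition every time the data is divided on that column.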
Before applying the classification algorithms, I will create a formula for modelling by collecting the names of all the features on the right of ~ and the label on the left.
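Assembling such a formula is a matter of string manipulation. The following Python sketch shows the idea with a hypothetical helper, build_formula, and a shortened column list; excluding the 'splitcol' column from the features is an assumption about what the formula should contain.

```python
def build_formula(columns, label, exclude=("splitcol",)):
    """Return 'label ~ f1 + f2 + ...' using every column that is neither
    the label nor one of the excluded helper columns."""
    features = [c for c in columns if c != label and c not in exclude]
    return label + " ~ " + " + ".join(features)

# Shortened, illustrative column list; the real data set has 11 predictors.
columns = ["alcohol", "sulphates", "pH", "LabelsCol", "splitcol"]
formula = build_formula(columns, "LabelsCol")
print(formula)  # LabelsCol ~ alcohol + sulphates + pH
```

Building the formula from the column names, rather than typing it out, means the same code keeps working if the set of features changes.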
Next I will apply three different classification algorithms available in MRS and train the models using training data.
Lastly, I will compute the accuracy of each model and check how the models perform on the test data.
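Accuracy here is the fraction of samples whose predicted class matches the actual class. As a minimal sketch in Python, with made-up labels and hypothetical helper names rather than output from the MRS models:

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Made-up labels for six test samples.
actual = ["Low", "Med", "Med", "High", "Med", "Low"]
predicted = ["Low", "Med", "High", "High", "Med", "Med"]

test_accuracy = accuracy(actual, predicted)
counts = confusion_counts(actual, predicted)
print(round(test_accuracy, 3))
print(counts)
```

Computing the same measure on both the training and the test data makes overfitting visible: a model that scores much higher on the data it was trained on than on held-out data has not generalized well.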
The Decision Tree had the best accuracy on the training data. However, on the test data, the Boosted Tree had better accuracy. The entire code can be downloaded from here. The corresponding Jupyter notebook can be found here.