There was a time where statistician had to manually crunch number when they wanted to fit their data to a model. Since this process was so long, those statisticians usually did a lot of preliminary work researching other model who worked in the past or looking for studies in other scientific field like psychology or sociology who can influence their model with the goal to maximize their chance to make a relevant model. Then they would create a model and an alternative model and choose the one which seem more efficient.
Now that even an average computer give us incredible computing power, it’s easy to make multiple models and choose the one that best fit the data. Even though it is better to have good prior knowledge of the process you are trying to analyze and of other model used in the past, coming to a conclusion using mostly only the data help you avoid bias and help you create better models.
In this set of exercises, we’ll see how to apply the most used error metrics on your models with the intention to rate them and choose the one that is the more appropriate for the situation.
Most of those errors metrics are not part of any R package, in consequence you have to apply the equation I give you on your data. Personally, I preferred to write a function which I can easily
use on everyone of my models, but there’s many ways to code those equations. If your code is different from the one in the solution, feel free to post your code in the comments.
Answers to the exercises are available here.
We start by looking at error metrics we can use for regression model. For linear regression problem, the more used metrics are the coefficient of determination R2 which show what percentage of variance is explained by the model and the adjusted R2 which penalize model who use variables that doesn’t have an effective contribution to the model (see this page for more details). Load the
attitude data set from the package of the same name and make three linear models with the goal to explain the
rating variable. The first one use all the variables from the dataset, the second one use the variable
advance as independent variables and the third one use only the
advance variables. Then use the
summary() function to print R2 and the adjusted R2.
Another way to measure how your model fit your data is to use the Root Mean Squared Error (RMSE), which is defined as the square root of the average of the square of all the error made by your model. You can find the mathematical definition of the RMSE on this page.
Calculate the RMSE of the prediction made by your three models.
The mean absolute error (MAE) is a good alternative to the RMSE if you don’t want to penalise too much the large estimation error of your model. The MAE is given by the equation
The mathematical definition of the MAE can be found here.
Calculate the MAE of the prediction made by the 3 models.
Sometime some prediction error hurt your model than other. For example, if you are trying to predict the financial loss of a business over a period of time, under estimation of the loss would
put the business at risk of bankruptcy, while overestimation of the loss will result in a conservative model. In those case, using the Root Mean Squared Logarithmic Error (RMSLE) as an error
metric is useful since this metric penalize under estimation. The RMSLE is given by the equation given on this page.
Calculate the RMSLE of the prediction made by the three models.
Now that we’ve seen some examples of error metrics which can be used in a regression context, let’s see five examples of error metrics which can be used when you perform clustering analysis. But
first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the kmeans algorithm. Since the iris dataset has three distinct
labels, use the kmeans algorithm with three centers. Also, use set the maximum number of iterations at 50 and use the “Lloyd” algorithm. Once it’s done, take time to rearrange the labels of your
prediction so they are compatible with the factors in the iris dataset.
- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels.
- And much more
Print the confusion matrix of your model.
The easiest way to measure how well your model did categorize the data, is to calculate the accuracy, the recall and the precision of your results. Write three functions which return those individual values and calculate those metrics for your models.
The F-measure summarize the precision and recall value of your model by calculating the harmonic mean of those two values.
Write a function which return the F-measure of your model and compute twice this measure for your data: once with a parameter of 2 and then with a parameter of 0.5.
The Purity is a measure of the homogeneity of your cluster: if all your cluster regroup object of the same class, you’ll get a purity score of one and if there’s no majority class in any of the cluster, you’ll get a purity score of 0. Write a function which return the purity score of your model and test it on your predictions.
The last error metrics we’ll see today is the Dunn index, which indicate if the clusters are compact and separated. You can find the mathematical definition of the Dunn index here. Load the
cvalid package and use the
dunn() on your model, to compute the Dunn index of your classification. Note that this function take an integer vector representing the cluster partitioning as parameter.