Model Evaluation 2

December 22, 2016

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

We are committed to bringing you 100% authentic exercise sets. We also try to use as wide a variety of datasets as possible so that you see different kinds of problems. No more classifying the Titanic dataset: R ships with plenty of datasets in its libraries, and we encourage you to try as many of them as you can. In this set we will compare two models by checking their accuracy, area under the curve (AUC), and ROC performance.

It will be helpful to read Tom Fawcett’s paper ‘An introduction to ROC analysis’ before you start.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Run the following code. If you do not have the ROCR, caTools, or caret packages installed, you can install them with the install.packages() command.


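library(ROCR)     # prediction() and performance() for ROC analysis
library(caTools)  # sample.split() for a stratified train/test split
library(caret)    # provides the GermanCredit dataset
data("GermanCredit")
df1=GermanCredit
# Recode the target: 1 = "Bad" credit, 0 = "Good" credit
df1$Class=ifelse(df1$Class=="Bad",1,0)
set.seed(100)
spl=sample.split(df1$Class,SplitRatio = 0.7)
Train1=df1[spl==TRUE,]
Test1=df1[spl==FALSE,]
model1=glm(Class~.,data=Train1,family = binomial)
# type="response" returns probabilities; the default is log-odds
pred1=predict(model1,Test1,type="response")
# Confusion matrix at a 0.5 probability cutoff
table(Test1$Class,pred1>0.5)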
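
A point worth checking before applying a cutoff: for a binomial glm, predict() returns values on the log-odds (link) scale unless type = "response" is requested, in which case it returns probabilities. The sketch below uses the built-in mtcars data, chosen here purely for illustration and not part of the exercise, to show how the two scales relate.

```r
# Toy logistic regression on a built-in dataset (illustrative only, not GermanCredit)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

log_odds <- predict(fit, mtcars)                     # default: link (log-odds) scale
probs    <- predict(fit, mtcars, type = "response")  # probability scale

# plogis() is the inverse logit, so it maps log-odds to probabilities
same_scale <- all.equal(unname(plogis(log_odds)), unname(probs))
same_scale

# A probability cutoff of 0.5 corresponds to a log-odds cutoff of 0
same_classes <- identical(unname(probs > 0.5), unname(log_odds > 0))
same_classes
```

So a cutoff of 0.5 only means "more likely positive than not" when the predictions are probabilities.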

Exercise 2

Using the confusion matrix, state the accuracy of this model.
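
If you are unsure how to read accuracy off a confusion matrix, here is a minimal sketch. The labels and predicted probabilities are made up for illustration, so the numbers have nothing to do with GermanCredit.

```r
# Hypothetical actual labels and predicted probabilities (made up for illustration)
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.2, 0.7, 0.55)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(Actual = actual, Predicted = predicted > 0.5)
print(cm)

# Accuracy = correctly classified cases (the diagonal) / all cases.
# diag() assumes both FALSE and TRUE columns are present in the table.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

The diagonal holds the true negatives and true positives, so dividing their sum by the total number of cases gives the share of correct predictions.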

Exercise 3

Great. Now let’s see the ROC curve of the model. Run the code below, then use the plot() command to plot ROCRperf1.


ROCRpred1=prediction(pred1,Test1$Class)
ROCRperf1=performance(ROCRpred1,"tpr","fpr")

The plot gives us an idea of the performance of the model. Is this a good or a bad model? State your reasons.
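
To see what the ROC curve is made of, the base R sketch below computes the true positive rate and false positive rate by hand at every cutoff. It is only an illustration on made-up labels and scores, not a description of how ROCR does it internally.

```r
# Hypothetical labels and scores (made up for illustration)
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.2, 0.7, 0.55)

# Try every distinct score as a cutoff, from strictest to most lenient
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))

# True positive rate: share of actual positives flagged at each cutoff
tpr <- sapply(cutoffs, function(k) sum(score >= k & actual == 1) / sum(actual == 1))
# False positive rate: share of actual negatives flagged at each cutoff
fpr <- sapply(cutoffs, function(k) sum(score >= k & actual == 0) / sum(actual == 0))

# The ROC curve is TPR against FPR; the dashed diagonal is a random classifier
plot(fpr, tpr, type = "s", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)
```

The further the curve bows towards the top-left corner, away from the diagonal, the better the model separates the two classes.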

Exercise 4

Use the summary() function on the model to see its summary. Note that the more stars there are next to a feature, the smaller its p-value, i.e. the stronger the evidence that the feature is associated with our target variable.
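
The stars are just a shorthand for the p-value column, so you can also pull the starred terms out programmatically. The sketch below does this on a toy model fitted to the built-in mtcars data (an assumption made for illustration, not the exercise's model); one star corresponds to p < 0.05.

```r
# Toy logistic regression on a built-in dataset (illustrative only, not GermanCredit)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|)
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|z|)"]

# Terms with at least one star, i.e. p < 0.05, leaving out the intercept
starred <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
starred
```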

Exercise 5

Although we found the accuracy of the model in Exercise 2, accuracy is not the best measure: it depends on the chosen cutoff and on the class distribution in the data. A better measure is the area under the ROC curve (AUC). AUC ranges from 0 to 1, where 1 is a perfect model and 0.5 is no better than random guessing; a value below 0.5 means the model ranks cases worse than chance. It can also be read as a probability: if the AUC is 0.70, there is a 0.7 chance that the model scores a randomly chosen positive case higher than a randomly chosen negative case.
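
The probability reading of AUC can be checked by hand. The sketch below, on made-up labels and scores, compares every positive case with every negative case and counts how often the positive one gets the higher score; it then confirms the result with the equivalent rank-based (Mann-Whitney) formula.

```r
# Hypothetical labels and scores (made up for illustration)
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.2, 0.7, 0.55)

pos <- score[actual == 1]
neg <- score[actual == 0]

# Compare every positive with every negative; ties count as half a win
wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
auc_pairs <- mean(wins)

# Equivalent rank-based (Mann-Whitney) formula
r <- rank(score)
auc_rank <- (sum(r[actual == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(pairs = auc_pairs, rank = auc_rank)
```

Here the positive case wins 22 of the 24 possible pairs, so both calculations give 22/24.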

Run the code below to obtain the AUC. What is the AUC score? Is it better than the accuracy obtained in Exercise 2?

auc=performance(ROCRpred1,measure="auc")
auc@y.values[[1]]


Exercise 6

Now create another model called model2 that includes only the 11 variables that have at least one star next to their name. Hint: use the summary() command; the intercept does not count.
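
Typing a long formula by hand is error-prone, so one option is to build it from a character vector of variable names with reformulate(). The sketch below shows the idea on the built-in mtcars data; the names in `selected` are placeholders for illustration and are not the answer to this exercise.

```r
# Placeholder predictor names standing in for the starred variables
selected <- c("wt", "hp")

# Build the formula am ~ wt + hp from the character vector
f <- reformulate(selected, response = "am")
f

# Fit a logistic regression using only the selected predictors
fit2 <- glm(f, data = mtcars, family = binomial)
names(coef(fit2))
```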

Exercise 7

Now predict the target variable for the Test1 sample using model2 and store the result in pred2.

Exercise 8
Use the table() command to get the confusion matrix. Note the accuracy.

Exercise 9
What is the AUC of model2?

Exercise 10
Is model2 better than model1? If so, why?

