(This article was first published on ** Win-Vector Blog » R**, and kindly contributed to R-bloggers)

We have often been asked “why is there no Kindle edition of Practical Data Science with R on Amazon.com?” The short answer is: there is an edition you can read on your Kindle, but it comes from the publisher Manning (not Amazon.com).

The long answer is: when Amazon.com supplies a Kindle edition readers have to deal with the following:

- Amazon.com digital rights management locking the material to a single format and to Amazon.com devices/readers.
- Careless mechanical re-formatting of the book material, yielding either poor rendering or re-packaged PDFs that you can only zoom and scan across (with no true re-flow of text).

Some readers don’t like this and (rightly) complain. Some of the best books in our field have the occasional 1-star review from a thoroughly frustrated Amazon Kindle customer. As an author, you wish reviews were faceted, with completely separate and mandatory sub-scores for vendor experience, price, delivery, print quality, ebook rendering, relevance to the particular reader, and finally book quality (instead of a single rating perceived as “book quality”). But from a buyer’s point of view, giving a low rating to an item that has given you a bad experience is completely legitimate (be it for print quality or the utility of the ebook rendering).

Practical Data Science with R does have an e-copy. For our book, when we say e-copy we mean:

- An electronic copy available without any intrusive digital rights management (beyond requiring registration for the initial download, plus a watermark). These are maximally useful copies, as you can search them, print them, and place them on arbitrary devices.
- Unlimited downloads and re-downloads of your copies.
- An e-copy available in three formats: PDF, ePub, and Kindle; you can download all three.
- E-copies produced and inspected by the actual book editors during the production of the book (not a later mechanical transcription).

We offer readers more than one way to get a good e-copy, though not all customers are aware of all the options.

- Each new standard copy (though *not* the international discount reprint) includes an access code that grants single-user rights to an e-copy. This is true for any new standard edition (be it sold by Manning, Amazon, or any other bookseller). Note: used copies may have already-consumed codes, and discount international editions do not include codes (so if somebody is re-selling you a book, you will want to check whether it includes an unused code).
- Manning itself sells e-copies, where for a single discounted price you again get access to the non-DRM “e-copy” editions (again giving you all of PDF, ePub, and Kindle). We know some readers do not want a physical book and expect a discounted e-only option.
- Manning books are often available through Safari online, so you or your enterprise may already have some (restricted online) access through Safari.

In conclusion:

Manning reserves the right to be the only seller of e-only editions of Practical Data Science with R, so for a full legitimate e-only copy you must go through them. Manning includes a free e-copy code in every new standard edition of the book: wherever you buy a legitimate new copy of the standard edition, you get the same e-rights as a bonus. Used copies and discount international editions have their roles, but may not come with an e-copy (someone may have consumed the right on a used copy, and the discount international edition doesn’t include a code).

Obviously the customers and readers get to decide what is of value to them. This describes the options we were able to supply.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Win-Vector Blog » R**.


(This article was first published on ** Automated Data Collection with R Blog - rstats**, and kindly contributed to R-bloggers)

Two weeks ago, we announced that we would raffle off three hardcover copies of our ADCR book among all followers of our Twitter account @RDataCollection. Tomorrow is closing day, so it is high time to present the drawing procedure, which, as a matter of course, is conducted with R.

We start with installing the latest version of Jeff Gentry's *twitteR* package from GitHub, which makes the OAuth authentication handshake procedure very comfortable:

```
devtools::install_github("geoffjentry/twitteR")
library(twitteR)
```

Next, we have to authenticate our Twitter app. Note that this requires that we have already registered our app (for free) at https://dev.twitter.com/ and stored the credentials (API key and secret, as well as access token and secret) locally. An excellent way of storing your credentials locally as R environment variables has been described by @JennyBryan at http://stat545-ubc.github.io/bit003_api-key-env-var.html, which we follow here. To do so, we first store the tokens in an R object as name-value pairs (it goes without saying that the keys presented here are fictional):

```
credentials <- c(
"twitter_api_key=rN3Td2zZADLWZBN9Pj7X2eBN",
"twitter_api_secret=abcqBpUzE7BQ65QJ6BRzpUzjyaRCfwn3ndrUUcqDWfhCN7Fj",
"twitter_access_token=9287465372-6ckQsXGP83eaXCsQHFQFx5pUNhmYYqknnCwWScVk8n7L",
"twitter_access_token_secret=ZHUxEW5fefntdyWBBB95fuXY5umZzWXdtPKtjUEP9GDcJs6w"
)
```

In order to write the keys to a local *.Renviron* file in the home directory, we write:

```
fname <- file.path(normalizePath("~"), ".Renviron") # file.path supplies the path separator
writeLines(credentials, fname)
```

We have to do this only once to retrieve the keys in later R sessions. The benefit of this approach is that we do not have to store the keys in actual R code which we plan to publish. To see if this worked, we can retrieve the file again and inspect its content:

```
browseURL(fname)
```

After reloading R, we can use `Sys.getenv()` to retrieve the keys again and feed them to *twitteR*'s `setup_twitter_oauth()` function, which takes care of the entire handshake procedure from start to finish:

```
api_key <- Sys.getenv("twitter_api_key")
api_secret <- Sys.getenv("twitter_api_secret")
access_token <- Sys.getenv("twitter_access_token")
access_token_secret <- Sys.getenv("twitter_access_token_secret")
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
```

Now that we have access to @RDataCollection's Twitter account, we collect information about the account and the number of followers:

```
user <- getUser("RDataCollection")
user$getFollowersCount()
```

```
## 122
```

At the moment, the account has 122 followers, which gives each of them a chance of approximately 2.5 percent of winning one of the books. In order to draw from the list of followers, we extract their screen names using *twitteR*'s internal methods:

```
user_followers <- user$getFollowers()
followers_n <- length(user_followers)
followers_screennames <- vector()
for (i in 1:followers_n) {
  followers_screennames[i] <- user_followers[[i]]$screenName
}
```
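As a quick aside, the roughly 2.5 percent chance quoted above is just the three books divided by the 122 followers (a hedged back-of-the-envelope check, not part of the draw itself):

```r
# Chance of winning one of 3 books among 122 followers, in percent.
round(3 / 122 * 100, 1)  # 2.5
```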

To be fair, we exclude the authors' accounts from the tombola:

```
authors <- c("simonsaysnothin", "christianrubba", "marvin_dpr", "jonas_nijhuis", "phdwhipbot")
followers <- setdiff(followers_screennames, authors)
```

Finally we are ready to draw the three winners!

```
sample(followers, 3)
```

Well, don't worry – we won't perform the actual draw before tomorrow evening. Best of luck in winning one of the books, and a peaceful 4th Advent Sunday everyone!
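As a hedged side note (not part of the authors' procedure): calling `set.seed()` before `sample()` would make the draw reproducible, so the result could be verified by anyone afterwards. The follower names below are made-up placeholders:

```r
# Sketch of a verifiable draw: announce the seed in advance, then sample.
followers <- c("follower_a", "follower_b", "follower_c",
               "follower_d", "follower_e")  # placeholder names, not real followers
set.seed(20141221)                          # e.g., the draw date
winners <- sample(followers, 3)
print(winners)
```

Re-running the block with the same seed yields the same three winners.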

To **leave a comment** for the author, please follow the link and comment on his blog: ** Automated Data Collection with R Blog - rstats**.


(This article was first published on ** Wiekvoet**, and kindly contributed to R-bloggers)

Based on The DO loop, since I wanted a fractal Christmas tree and there is no point in inventing what has been made already. Besides, this is not the first time this year that I used R to do what has been done in SAS.

```
# Each row is a 2x2 linear transformation
# Christmas tree
L <- matrix(
  c(0.03,  0,     0,    0.1,
    0.85,  0.00,  0.00, 0.85,
    0.8,   0.00,  0.00, 0.8,
    0.2,  -0.08,  0.15, 0.22,
   -0.2,   0.08,  0.15, 0.22,
    0.25, -0.1,   0.12, 0.25,
   -0.2,   0.1,   0.12, 0.2),
  nrow = 4)

# ... and each row is a translation vector
B <- matrix(
  c(0, 0,
    0, 1.5,
    0, 1.5,
    0, 0.85,
    0, 0.85,
    0, 0.3,
    0, 0.4),
  nrow = 2)

prob = c(0.02, 0.6, 0.08, 0.07, 0.07, 0.07, 0.07)

# Iterate the discrete stochastic map
N = 1e5 # number of iterations
x = matrix(NA, nrow = 2, ncol = N)
x[, 1] = c(0, 2) # initial point
k <- sample(1:7, N, prob = prob, replace = TRUE) # values 1-7
for (i in 2:N)
  x[, i] = crossprod(matrix(L[, k[i]], nrow = 2), x[, i - 1]) + B[, k[i]] # iterate

# Plot the iteration history
png('card.png')
par(bg = 'darkblue', mar = rep(0, 4))
plot(x = x[1, ], y = x[2, ],
     col = grep('green', colors(), value = TRUE),
     axes = FALSE,
     cex = .1,
     xlab = '',
     ylab = '') #, pch = '.')
bals <- sample(N, 20)
points(x = x[1, bals], y = x[2, bals] - .1,
       col = c('red', 'blue', 'yellow', 'orange'),
       cex = 2,
       pch = 19)
text(x = -.7, y = 8,
     labels = 'Merry',
     adj = c(.5, .5),
     srt = 45,
     vfont = c('script', 'plain'),
     cex = 3,
     col = 'gold')
text(x = 0.7, y = 8,
     labels = 'Christmas',
     adj = c(.5, .5),
     srt = -45,
     vfont = c('script', 'plain'),
     cex = 3,
     col = 'gold')
dev.off() # close the device so card.png is written
```

To **leave a comment** for the author, please follow the link and comment on his blog: ** Wiekvoet**.


(This article was first published on ** YGC » R**, and kindly contributed to R-bloggers)

When I need to annotate nucleotide substitutions in a phylogenetic tree, I find that all the available software is designed to display the tree but not to annotate it. Some programs may support annotating the tree with specific data such as bootstrap values, but they are restricted to a few supported data types. It is hard or impossible to inject user-specific data.
To **leave a comment** for the author, please follow the link and comment on his blog: ** YGC » R**.


(This article was first published on ** R is my friend » R**, and kindly contributed to R-bloggers)

After successfully navigating the perilous path of CRAN submission, I’m pleased to announce that NeuralNetTools is now available! From the description file, the package provides visualization and analysis tools to aid in the interpretation of neural networks, including functions for plotting, variable importance, and sensitivity analyses. I’ve written at length about each of these functions (see here, here, and here), so I’ll only provide an overview in this post. Most of these functions have remained unchanged since I initially described them, with one important change for the Garson function. Rather than reporting variable importance as -1 to 1 for each variable, I’ve returned to the original method that reports importance as 0 to 1. I was getting inconsistent results after toying around with some additional examples and decided the original method was a safer approach for the package. The modified version can still be installed from my GitHub gist. The development version of the package is also available on GitHub. Please use the development page to report issues.

The package is fairly small but I think the functions that have been included can help immensely in evaluating neural network results. The main functions include:

`plotnet`: Plot a neural interpretation diagram for a neural network object, original blog post here.

```
# install, load packages
install.packages('NeuralNetTools')
library(NeuralNetTools)
library(neuralnet) # provides neuralnet()

# create model
AND <- c(rep(0, 7), 1)
OR <- c(0, rep(1, 7))
binary_data <- data.frame(expand.grid(c(0, 1), c(0, 1), c(0, 1)), AND, OR)
mod <- neuralnet(AND + OR ~ Var1 + Var2 + Var3, binary_data,
                 hidden = c(6, 12, 8), rep = 10, err.fct = 'ce',
                 linear.output = FALSE)

# plotnet
par(mar = numeric(4), family = 'serif')
plotnet(mod, alpha = 0.6)
```

`garson`: Relative importance of input variables in neural networks using Garson's algorithm, original blog post here.

```
# create model
library(RSNNS)
data(neuraldat)
x <- neuraldat[, c('X1', 'X2', 'X3')]
y <- neuraldat[, 'Y1']
mod <- mlp(x, y, size = 5)

# garson
garson(mod, 'Y1')
```

`lekprofile`: Conduct a sensitivity analysis of model responses in a neural network to input variables using Lek's profile method, original blog post here.

```
# create model
library(nnet)
data(neuraldat)
mod <- nnet(Y1 ~ X1 + X2 + X3, data = neuraldat, size = 5)

# lekprofile
lekprofile(mod)
```

A few other functions are available that are helpers to the main functions. See the documentation for a full list.

All the functions have S3 methods for most of the neural network classes available in R, making them quite flexible. This includes methods for `nnet` models from the nnet package, `mlp` models from the RSNNS package, `nn` models from the neuralnet package, and `train` models from the caret package. The functions also have methods for numeric vectors if the user prefers inputting raw weight vectors for each function, as for neural network models created outside of R.

Huge thanks to Hadley Wickham for his packages that have helped immensely with this process, namely devtools and roxygen2. I also relied extensively on his new web book for package development. Any feedback regarding NeuralNetTools or its further development is appreciated!

Cheers,

Marcus

To **leave a comment** for the author, please follow the link and comment on his blog: ** R is my friend » R**.


(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

Release 0.6.7 of digest package is now on CRAN and in Debian.

Jim Hester was at it again and added murmurHash. I cleaned up several sets of pedantic warnings in some of the source files and updated the test reference output, and that forms version 0.6.7.

CRANberries provides the usual summary of changes to the previous version.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Thinking inside the box **.


(This article was first published on ** Maximize Productivity with Industrial Engineer and Operations Research Tools**, and kindly contributed to R-bloggers)

Statistical analysis and data mining are considered the hottest skills on LinkedIn for 2014, according to LinkedIn's report analyzing jobs and recruiter activity on its website. I would say it's safe to say they will continue to be hot in 2015. If you are looking to hone your skills, I would suggest looking at ComputerWorld's Beginner's Guide to R. It looks like a complete tutorial and is indexed rather well.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Maximize Productivity with Industrial Engineer and Operations Research Tools**.


(This article was first published on ** A HopStat and Jump Away » Rbloggers**, and kindly contributed to R-bloggers)

I've been doing some classification with logistic regression in brain imaging recently. I have been using the ROCR package, which is helpful at estimating performance measures and plotting these measures over a range of cutoffs.

The `prediction` and `performance` functions are the workhorses of most of the analyses in ROCR I've been doing. For those who haven't used `ROCR` before, the format of the `prediction` function is:

```
prediction(predictions, labels, label.ordering = NULL)
```

where `predictions` are some predicted measure (usually continuous) for the “truth”, which are the `labels`. In many applications, `predictions` are estimated probabilities (or log odds) and the `labels` are binary values. Both arguments can take a vector, matrix, or data.frame for the predictions, but `dim(predictions)` must equal `dim(labels)`.

In this post, I'll go through creating `prediction` and `performance` objects and extracting the results.

Let's show a simple example from the `prediction` help file that uses a prediction and label vector (i.e. not a matrix). We see the data is some continuous prediction and a binary label:

```
library(ROCR)
data(ROCR.simple)
head(cbind(ROCR.simple$predictions, ROCR.simple$labels), 5)
```

```
          [,1] [,2]
[1,] 0.6125478    1
[2,] 0.3642710    1
[3,] 0.4321361    0
[4,] 0.1402911    0
[5,] 0.3848959    0
```

Now, let's make the prediction object and show its contents:

```
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
class(pred)
```

```
[1] "prediction"
attr(,"package")
[1] "ROCR"
```

```
slotNames(pred)
```

```
 [1] "predictions" "labels"      "cutoffs"     "fp"          "tp"
 [6] "tn"          "fn"          "n.pos"       "n.neg"       "n.pos.pred"
[11] "n.neg.pred"
```

We see that the returned result of `prediction` is an object of class `prediction`, which is an S4 object with a series of slots. Let's look at the length and class of each slot:

```
sn = slotNames(pred)
sapply(sn, function(x) length(slot(pred, x)))
```

```
predictions      labels     cutoffs          fp          tp          tn
          1           1           1           1           1           1
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
          1           1           1           1           1
```

```
sapply(sn, function(x) class(slot(pred, x)))
```

```
predictions      labels     cutoffs          fp          tp          tn
     "list"      "list"      "list"      "list"      "list"      "list"
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
     "list"      "list"      "list"      "list"      "list"
```

We see that each slot has length 1 and is a list.

Let's use the `ROCR.hiv` dataset to show how this works if more than one set of predictions and labels are supplied. Here we pass a list of 10 predictions and a list of labels to the `prediction` function:

```
data(ROCR.hiv)
manypred = prediction(ROCR.hiv$hiv.nn$predictions, ROCR.hiv$hiv.nn$labels)
sapply(sn, function(x) length(slot(manypred, x)))
```

```
predictions      labels     cutoffs          fp          tp          tn
         10          10          10          10          10          10
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
         10          10          10          10          10
```

```
sapply(sn, function(x) class(slot(manypred, x)))
```

```
predictions      labels     cutoffs          fp          tp          tn
     "list"      "list"      "list"      "list"      "list"      "list"
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
     "list"      "list"      "list"      "list"      "list"
```

We see that all the slots are still lists, but now they have length 10, corresponding to the 10 predictions/labels. We would get the same result if the two arguments were matrices, but that would require all predictions and labels to have the same length. Using a list of predictions/labels is a bit more flexible.

From the help file of `performance`, the syntax for this function is:

```
performance(prediction.obj, measure, x.measure = "cutoff", ...)
```

We see that the first argument is a `prediction` object, and the second is a `measure`. If you run `?performance`, you can see all the performance measures implemented.

We will do examples of some commonly estimated measures: receiver operating characteristic (ROC) curves, accuracy, area under the curve (AUC), and partial AUC (pAUC).

We will do an ROC curve, which plots the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis:

```
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a = 0, b = 1)
```

At every cutoff, the TPR and FPR are calculated and plotted. The smoother the graph, the more cutoffs the predictions have. We also plotted a 45-degree line, which represents, on average, the performance of a Uniform(0, 1) random variable; the further the curve is from this diagonal, the better. Overall, we see gains in sensitivity (true positive rate, > 80%), trading off against the false positive rate (1 - specificity), up until about 15% FPR. After an FPR of 15%, we don't see significant gains in TPR for the tradeoff of increased FPR.
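To make the axes concrete, here is a hedged base-R sketch (with made-up toy data, not the `ROCR.simple` set) of how the TPR and FPR arise at a single cutoff:

```r
# Toy scores and binary truth, assumed for illustration only.
pred  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1,   1,   0,   1,   0,   1,   0,   0)
cutoff <- 0.5
called_pos <- pred >= cutoff
tpr <- sum(called_pos & truth == 1) / sum(truth == 1)  # sensitivity
fpr <- sum(called_pos & truth == 0) / sum(truth == 0)  # 1 - specificity
c(TPR = tpr, FPR = fpr)  # 0.75, 0.25
```

Sweeping the cutoff from 1 down to 0 traces out exactly the curve that ROCR plots.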

The same can be done if you have many predictions and labels:

```
many.roc.perf = performance(manypred, measure = "tpr", x.measure = "fpr")
plot(many.roc.perf, col = 1:10)
abline(a = 0, b = 1)
```

Essentially, the `plot` function on a `performance` object with multiple predictions and labels will loop over the lists and plot the ROC curve for each one.

Overall, we see the performance of each prediction is similar. The pROC package, described in the conclusion, can test the performance between ROC curves.

**Advanced:** If you want to see how performance objects are plotted, use `getMethod("plot", signature = c(x = "performance", y = "missing"))` and `ROCR:::.plot.performance`.

You may only want to accept a false positive rate of a certain level, let's say 10%. The function `pROC` below will only keep values less than or equal to the FPR you set:

```
pROC = function(pred, fpr.stop){
  perf <- performance(pred, "tpr", "fpr")
  for (iperf in seq_along(perf@x.values)){
    ind = which(perf@x.values[[iperf]] <= fpr.stop)
    perf@y.values[[iperf]] = perf@y.values[[iperf]][ind]
    perf@x.values[[iperf]] = perf@x.values[[iperf]][ind]
  }
  return(perf)
}
```

Let's use this on the simple case and plot the partial ROC curve:

```
proc.perf = pROC(pred, fpr.stop = 0.1)
plot(proc.perf)
abline(a = 0, b = 1)
```

Thus, if we can only accept an FPR of 10%, the model gives only 50% sensitivity (TPR) at 10% FPR (1 - specificity).

In some applications of ROC curves, you want the point closest to a TPR of 1 and an FPR of 0. This cut point is “optimal” in the sense that it weighs both sensitivity and specificity equally. To determine this cutoff, you can use the code below. The code takes in BOTH the `performance` object and the `prediction` object and gives the optimal cutoff value for your predictions:

```
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN = function(x, y, p){
    d = (x - 0)^2 + (y - 1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1 - x[[ind]],
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(roc.perf, pred))
```

```
                 [,1]
sensitivity 0.8494624
specificity 0.8504673
cutoff      0.5014893
```

Now, there is a `cost` measure in the ROCR package that you can use to create a `performance` object. If you use it to find the minimum cost, then it will give you the same cutoff as `opt.cut`, but it will not give you the sensitivity and specificity.

```
cost.perf = performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]
```

```
[1] 0.5014893
```

The output from `opt.cut` and a `performance` object with measure `cost` are NOT equivalent if false positives and false negatives are not weighted equally. The `cost.fn` and `cost.fp` arguments can be passed to `performance`, corresponding to the cost of a false negative and a false positive, respectively. Let's say false positives are twice as costly as false negatives, and let's get a cut point:

```
cost.perf = performance(pred, "cost", cost.fp = 2, cost.fn = 1)
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]
```

```
[1] 0.5294022
```

Thus, we have a different “optimal” cut point with this changed cost function. In many real-life applications of biomarkers, the cost of a false positive and false negative are not the same. For example, missing someone with a disease based on a test may cost a hospital $1,000,000 in lawsuits, but treating someone who did not have the disease may cost $100,000 in treatments. In that case, the cost of a false negative is 10 times that of a false positive, strictly in monetary measures. No cost analysis is this simple and is usually based on many factors, but most analyses do not have equal cost for a false positive versus a false negative.
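That asymmetric-cost idea can be written out directly. Below is a hedged base-R sketch (the `weighted_cost` helper and the toy data are mine, not ROCR's) showing that pricing false negatives higher pushes the preferred cutoff down:

```r
# Total misclassification cost at a cutoff, with unequal FP/FN prices.
weighted_cost <- function(pred, truth, cutoff, cost.fp = 1, cost.fn = 1) {
  called_pos <- pred >= cutoff
  fp <- sum(called_pos & truth == 0)   # negatives we flagged
  fn <- sum(!called_pos & truth == 1)  # positives we missed
  cost.fp * fp + cost.fn * fn
}

pred  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1,   1,   0,   1,   0,   1,   0,   0)
# With false negatives 10x as costly, a lower cutoff wins:
weighted_cost(pred, truth, 0.5,  cost.fp = 1, cost.fn = 10)  # 1 FP + 1 FN = 11
weighted_cost(pred, truth, 0.25, cost.fp = 1, cost.fn = 10)  # 2 FP + 0 FN = 2
```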

The code is the same for the optimal cutoff for the multiple prediction data:

```
print(opt.cut(many.roc.perf, manypred))
```

```
                  [,1]       [,2]       [,3]       [,4]       [,5]
sensitivity  0.8076923  0.8205128  0.7692308  0.8205128  0.7564103
specificity  0.7902622  0.7827715  0.8501873  0.8164794  0.8464419
cutoff      -0.5749773 -0.5640632 -0.4311301 -0.5336958 -0.4863360
                  [,6]       [,7]       [,8]       [,9]      [,10]
sensitivity  0.7820513  0.7948718  0.7820513  0.7435897  0.7435897
specificity  0.8089888  0.8314607  0.8089888  0.8352060  0.8501873
cutoff      -0.5364402 -0.4816705 -0.5388664 -0.4777073 -0.4714354
```

Another popular measure is overall accuracy. This measure optimizes for correct results, but may be skewed if there are many more negatives than positives, or vice versa. Let's get the overall accuracy for the simple predictions and plot it:

```
acc.perf = performance(pred, measure = "acc")
plot(acc.perf)
```
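To see the imbalance caveat concretely, here is a hedged toy illustration (assumed data): with 95% negatives, a useless classifier that always predicts “negative” already scores 95% accuracy:

```r
# Accuracy rewards the majority class under imbalance.
truth <- c(rep(0, 95), rep(1, 5))   # 95% negatives, 5% positives
always_negative <- rep(0, 100)      # ignores the input entirely
mean(always_negative == truth)      # 0.95, despite catching no positives
```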

What if we actually want to extract the maximum accuracy and the cutoff corresponding to it? In the `performance` object, we have the slot `x.values`, which corresponds to the `cutoff` in this case, and `y.values`, which corresponds to the accuracy at each cutoff. We'll grab the index for maximum accuracy and then grab the corresponding cutoff:

```
ind = which.max( slot(acc.perf, "y.values")[[1]] )
acc = slot(acc.perf, "y.values")[[1]][ind]
cutoff = slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy = acc, cutoff = cutoff))
```

```
 accuracy    cutoff
0.8500000 0.5014893
```

Hooray! Then you can go forth and threshold your model using the `cutoff` for (in hopes) maximum accuracy in your test data.

Again, we will do the same with many predictions and labels, but must iterate over the results (using a `mapply` statement):

```
many.acc.perf = performance(manypred, measure = "acc")
sapply(manypred@labels, function(x) mean(x == 1))
```

```
 [1] 0.226087 0.226087 0.226087 0.226087 0.226087 0.226087 0.226087
 [8] 0.226087 0.226087 0.226087
```

```
mapply(function(x, y){
  ind = which.max( y )
  acc = y[ind]
  cutoff = x[ind]
  return(c(accuracy = acc, cutoff = cutoff))
}, slot(many.acc.perf, "x.values"), slot(many.acc.perf, "y.values"))
```

```
               [,1]         [,2]      [,3]       [,4]      [,5]       [,6]
accuracy 0.86376812  0.881159420 0.8666667  0.8724638 0.8724638  0.8753623
cutoff   0.02461465 -0.006091327 0.2303707 -0.1758013 0.1251976 -0.2153779
               [,7]      [,8]      [,9]      [,10]
accuracy  0.8753623 0.8724638 0.8637681 0.86376812
cutoff   -0.2066697 0.1506282 0.2880392 0.06536471
```

We see that these cutoffs are not the same as those from `opt.cut` above. This is due to the fact that the proportion of positive cases is much less than 50%.

The area under the curve summarizes the ROC curve by taking the area between the curve and the x-axis. Let's get the area under the curve for the simple predictions:

```
auc.perf = performance(pred, measure = "auc")
auc.perf@y.values
```

```
[[1]]
[1] 0.8341875
```

As you can see, the result is a scalar number, the area under the curve (AUC). This number ranges from 0 to 1, with 1 indicating 100% specificity and 100% sensitivity.
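As a side note not in the original: the AUC also has a probabilistic reading, namely the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (the Mann-Whitney statistic). A hedged base-R sketch with made-up data (`auc_manual` is my name, not ROCR's):

```r
# AUC as P(score of random positive > score of random negative); ties count half.
auc_manual <- function(pred, truth) {
  pos <- pred[truth == 1]
  neg <- pred[truth == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

pred  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1,   1,   0,   1,   0,   1,   0,   0)
auc_manual(pred, truth)  # 0.8125: 13 of 16 positive/negative pairs ordered correctly
```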

As before, if you only want to accept a fixed FPR, we can calculate a partial AUC using the `fpr.stop` argument:

```
pauc.perf = performance(pred, measure = "auc", fpr.stop = 0.1)
pauc.perf@y.values
```

```
[[1]]
[1] 0.02780625
```

Now, we see the pAUC to be **much** lower. Note that this value can range from 0 to whatever `fpr.stop` is. In order to standardize it to 1, you can divide it by `fpr.stop` to give a [0, 1] measure:

```
pauc.perf@y.values = lapply(pauc.perf@y.values, function(x) x / 0.1)
pauc.perf@y.values
```

```
[[1]]
[1] 0.2780625
```

Although this measure is more comparable to the full AUC, it is still low. Note that there is no “one” cutoff for AUC or pAUC, as they measure performance over all cutoffs. Also, plotting functions for scalar outcome measures (such as AUC) do not work for `performance` objects. The code for the multiple predictions is the same:

```
manypauc.perf = performance(manypred, measure = "auc", fpr.stop = 0.1)
manypauc.perf@y.values = lapply(manypauc.perf@y.values, function(x) x / 0.1)
manypauc.perf@y.values
```

```
[[1]]
[1] 0.500048

[[2]]
[1] 0.5692404

[[3]]
[1] 0.5182944

[[4]]
[1] 0.5622299

[[5]]
[1] 0.5379814

[[6]]
[1] 0.5408624

[[7]]
[1] 0.5509939

[[8]]
[1] 0.5334678

[[9]]
[1] 0.4979353

[[10]]
[1] 0.4870354
```

Note: use `sapply` instead of `lapply` if you want the result to be a vector.

For ROC analysis, the ROCR package has good methods and many built-in measures. Other packages, such as the pROC package, can be useful for many functions and analyses, especially for testing the difference between ROC curves. In some ways, you may want to use pROC over ROCR, especially because (when I checked Dec 18, 2014) the ROCR package was orphaned. But if you are working in ROCR, I hope this gives you some examples of how to fit the objects and extract the results.

To **leave a comment** for the author, please follow the link and comment on his blog: ** A HopStat and Jump Away » Rbloggers**.
