Occupancy model fit & AUC

July 30, 2013
By

(This article was first published on Ecology in silico, and kindly contributed to R-bloggers)

Occupancy models are used to understand species distributions while accounting for imperfect detection. In this post, I’ll demonstrate a method to evaluate the performance of occupancy models based on the area under a receiver operating characteristic curve (AUC), as published last year by Elise Zipkin and colleagues in Ecological Applications.

Suppose we are to fit a multi-year occupancy model for one species. We will evaluate the fit based on how well the model predicts occupancy in the final year of the project. Start by simulating some data (for details on the structure of these simulated data, refer to this post and references therein):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
# define convenience functions
logit <- function(x) {
  log(x/(1 - x))
}

antilogit <- function(x) {
  exp(x)/(1 + exp(x))
}

# initialize parameters
nsite <- 100
nyear <- 5
nrep <- 3

# yearly Pr(colonization)
p_beta = 0.5
beta <- logit(p_beta)

# yearly Pr(persistence)
p_rho = 0.6
rho <- logit(p_rho)

# assume there are 2 site level covariates: covariate and treatment
# response to covariate
p_alpha1 <- .6
alpha1 <- logit(p_alpha1)

# response to treatment 
p_alpha2 <- .5
alpha2 <- logit(p_alpha2)

# interaction between covariate & treatment
p_alpha3 <- .2
alpha3 <- logit(p_alpha3)

# Site-specific covariates
covariate <- rnorm(nsite, 0, 1)
treat <- rbinom(nsite, size=1, prob=.5)

# initial occupancy states
z0 <- rbinom(nsite, 1, .5)

# subsequent occupancy
z <- array(dim = c(nsite, nyear))
lpsi <- array(dim = c(nsite, nyear))
psi <- array(dim = c(nsite, nyear))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    if (t == 1) {
      # logit psi
      lpsi[j, t] <- beta + rho * z0[j] +
          covariate[j] * alpha1 + treat[j] * alpha2 +
          alpha3 * covariate[j] * treat[j]
      psi[j, t] <- antilogit(lpsi[j, t])
      z[j, t] <- rbinom(1, 1, psi[j, t]) # true occupancy state
    } else {
      lpsi[j, t] <- beta + rho * z[j, t - 1] +
        covariate[j] * alpha1 + treat[j] * alpha2 +
        alpha3 * covariate[j] * treat[j]
      psi[j, t] <- antilogit(lpsi[j, t])
      z[j, t] <- rbinom(1, 1, psi[j, t])
    }
  }
}

# detection probability
p <- 0.6

# observations
x <- array(dim = c(nsite, nyear, nrep))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    for (k in 1:nrep) {
      x[j, t, k] <- rbinom(1, 1, p * z[j, t])
    }
  }
}

# Subset the data to exclude the final year, which we'll use for model validation
xsub <- x
xsub[, nyear, ] <- NA

For illustration, I included a strong interaction between treatment and the continuous site level covariate (could be elevation, area, etc). As such, a measure of model fit such as AUC ought to identify a saturated model as the best fitting. Handily, AUC is a derived parameter, and common occupancy model parameters can be used to estimate a posterior. To generate a posterior AUC, we need predicted occupancy probabilities ($\psi$) and realized occupancy states ($Z$) in the final year. Predicted occupancy probabilities can be produced using data from previous years, and realized occupancy states are assumed to be represented by the posterior for $Z$ generated from a single-year model, fit to the data from the final year of the study.

Fitting a saturated model

Begin by modeling occupancy probabilities as a function of both covariates and their interaction, predicting $\psi$ in the final year:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
# model specification
cat("
    model{
    #### priors
    p_beta ~ dbeta(1, 1)
    beta <- log(p_beta / (1 - p_beta))
    p_rho ~ dbeta(1, 1)
    rho <- log(p_rho / (1 - p_rho))
    p_rho0 ~ dbeta(1, 1)
    rho0 <- log(p_rho0 / (1 - p_rho0))
    p ~ dbeta(1, 1)
    pa1 ~ dbeta(1, 1)
    pa2 ~ dbeta(1, 1)
    pa3~ dbeta(1, 1)
    a1 <- log(pa1 / (1 - pa1))
    a2 <- log(pa2 / (1 - pa2))
    a3 <- log(pa3 / (1 - pa3))
    
    #### occupancy model
    for (j in 1:nsite) {
      z0[j] ~ dbern(rho0)
      logit(psi[j, 1]) <- beta + rho * z0[j] + 
        a1 * covariate[j] + a2 * treat[j] + 
        a3 * covariate[j] * treat[j]
      z[j, 1] ~ dbern(psi[j, 1])
      for (t in 2:nyear) {
        logit(psi[j, t]) <- beta + rho * z[j, t-1] + 
          a1 * covariate[j] + a2 * treat[j] + 
          a3 * covariate[j] * treat[j]
        z[j, t] ~ dbern(psi[j, t])
      }
    }
    
    #### observation model
    for (j in 1:nsite){
      for (t in 1:nyear){
        mu[j, t] <- z[j, t] * p
        for (k in 1:nrep){
          x[j, t, k] ~ dbern(mu[j, t])
        }
      }
    }
    }
    ", fill=TRUE, file="saturated.txt")

# bundle data
data <- list(x = xsub, nrep = nrep, nsite = nsite, nyear = nyear, covariate=covariate, treat=treat)

# initial values
zinit <- array(dim = c(nsite, nyear))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    zinit[j, t] <- max(xsub[j, t, ])
  }
}

inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params <- c("p", "beta", "rho", "psi", "pa1", "pa2", "pa3")

# use JAGS
require(rjags)

# build model
nchain <- 3
ocmod <- jags.model(file = "saturated.txt", inits = inits,
                    data = data, n.chains = nchain)

# specify MCMC settings and start sampling
nburn <- 2000
niter <- 2000
nthin <- 2
update(ocmod, n.iter = nburn)
out <- coda.samples(ocmod, n.iter = niter,
                    thin=nthin, variable.names = params)

Now that we have our posteriors for $\psi$ at each site in the final year, we can fit a single-year model to the final year’s data to estimate $Z$.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
xlast <- x[,nyear,]

# model specification
cat("
    model{
    #### priors
    p_beta ~ dbeta(1, 1)
    beta <- log(p_beta / (1 - p_beta))
    p ~ dbeta(1, 1)
    pa1 ~ dbeta(1, 1)
    pa2 ~ dbeta(1, 1)
    pa3~ dbeta(1, 1)
    a1 <- log(pa1 / (1 - pa1))
    a2 <- log(pa2 / (1 - pa2))
    a3 <- log(pa3 / (1 - pa3))
    
    #### occupancy model
    for (j in 1:nsite) {
      logit(psi[j]) <- beta + 
        a1 * covariate[j] + a2 * treat[j] + 
        a3 * covariate[j] * treat[j]
      z[j] ~ dbern(psi[j])
    }
    
    #### observation model
    for (j in 1:nsite){
      mu[j] <- z[j] * p
      for (k in 1:nrep){
        x[j, k] ~ dbern(mu[j])
      }
    }
    }
    ", fill=TRUE, file="saturated2.txt")

# bundle data
data2 <- list(x = xlast, nrep = nrep,
              nsite = nsite, covariate=covariate,
              treat=treat)

# initial values
zinit <- array(dim = c(nsite))
for (j in 1:nsite) {
    zinit[j] <- max(xlast[j, ])
}

inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params2 <- c("p", "beta", "pa1", "pa2", "pa3", "z")

# build model
ocmod2 <- jags.model(file = "saturated2.txt", inits = inits,
                    data = data2, n.chains = nchain)

# start sampling
update(ocmod2, n.iter = nburn)
out2 <- coda.samples(ocmod2, n.iter = niter,
                    thin=nthin, variable.names = params2)

To set up the data for AUC calculations, produce site by iteration arrays for $\psi$ and $Z$:

1
2
3
4
5
6
7
8
9
10
11
12
require(ggmcmc)
ggd <- ggs(out)
ggd2 <- ggs(out2)
nstore <- niter / nthin
psi.pred <- array(dim=c(nsite, nstore*nchain))
Z.est <- array(dim=c(nsite, nstore*nchain))
for (j in 1:nsite){
  indx <- which(ggd$Parameter == paste("psi[", j, ",", nyear, "]", sep=""))
  psi.pred[j, ] <- ggd$value[indx]
  indx <- which(ggd2$Parameter == paste("z[", j, "]", sep=""))
  Z.est[j,] <- ggd2$value[indx]
}

Now generate the posterior for AUC and store data on the true and false positive rates to produce ROC curves.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
require(ROCR)
AUC1 <- rep(NA, nstore * nchain)
fpr <- array(dim=c(nstore * nchain, nsite + 1))
tpr <- array(dim=c(nstore * nchain, nsite + 1))
for (i in 1:(nstore * nchain)){
  psi.vals <- psi.pred[, i]
  Z.vals <- Z.est[, i]
  pred <- prediction(psi.vals, factor(Z.vals, levels=c("0", "1")))
  perf <- performance(pred, "auc")
  AUC1[i] <- perf@y.values[[1]]
  perf <- performance(pred, "tpr","fpr")
  fpr[i, ] <- perf@x.values[[1]]
  tpr[i, ] <- perf@y.values[[1]]
}
require(reshape2)
fprm <- melt(fpr, varnames=c("iter", "site"))
tprm <- melt(tpr, varnames = c("iter", "site"))
ROC1 <- data.frame(fpr = fprm$value,
                   tpr = tprm$value,
                   iter = rep(fprm$iter, 2))

Fitting a simpler model

Having fitted a saturated model, we can now fit a simpler model that includes only main effects:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
cat("
    model{
    #### priors
    p_beta ~ dbeta(1, 1)
    beta <- log(p_beta / (1 - p_beta))
    p_rho ~ dbeta(1, 1)
    rho <- log(p_rho / (1 - p_rho))
    p_rho0 ~ dbeta(1, 1)
    rho0 <- log(p_rho0 / (1 - p_rho0))
    p ~ dbeta(1, 1)
    pa1 ~ dbeta(1, 1)
    pa2 ~ dbeta(1, 1)
    a1 <- log(pa1 / (1 - pa1))
    a2 <- log(pa2 / (1 - pa2))
    
    #### occupancy model
    for (j in 1:nsite) {
      z0[j] ~ dbern(rho0)
      logit(psi[j, 1]) <- beta + rho * z0[j] + 
        a1 * covariate[j] + a2 * treat[j] 
      z[j, 1] ~ dbern(psi[j, 1])
      for (t in 2:nyear) {
        logit(psi[j, t]) <- beta + rho * z[j, t-1] + 
          a1 * covariate[j] + a2 * treat[j] 
        z[j, t] ~ dbern(psi[j, t])
      }
    }
    
    #### observation model
    for (j in 1:nsite){
      for (t in 1:nyear){
        mu[j, t] <- z[j, t] * p
        for (k in 1:nrep){
          x[j, t, k] ~ dbern(mu[j, t])
        }
      }
    }
    }
    ", fill=TRUE, file="mainef.txt")


# bundle data
data <- list(x = xsub, nrep = nrep, nsite = nsite, nyear = nyear, covariate=covariate, treat=treat)

# initial values
zinit <- array(dim = c(nsite, nyear))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    zinit[j, t] <- max(xsub[j, t, ])
  }
}

inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params <- c("p", "beta", "rho", "psi", "pa1", "pa2")

# build model
ocmod <- jags.model(file = "mainef.txt", inits = inits,
                    data = data, n.chains = nchain)

# specify MCMC settings and start sampling
update(ocmod, n.iter = nburn)
out <- coda.samples(ocmod, n.iter = niter,
                    thin=nthin, variable.names = params)

# Fit a single-year model for the final year to estimate Z ##
# model specification
cat("
    model{
    #### priors
    p_beta ~ dbeta(1, 1)
    beta <- log(p_beta / (1 - p_beta))
    p ~ dbeta(1, 1)
    pa1 ~ dbeta(1, 1)
    pa2 ~ dbeta(1, 1)
    a1 <- log(pa1 / (1 - pa1))
    a2 <- log(pa2 / (1 - pa2))
    
    #### occupancy model
    for (j in 1:nsite) {
      logit(psi[j]) <- beta + 
        a1 * covariate[j] + a2 * treat[j]
      z[j] ~ dbern(psi[j])
    }
    
    #### observation model
    for (j in 1:nsite){
      mu[j] <- z[j] * p
      for (k in 1:nrep){
        x[j, k] ~ dbern(mu[j])
      }
    }
    }
    ", fill=TRUE, file="mainef2.txt")

# bundle data
data2 <- list(x = xlast, nrep = nrep,
              nsite = nsite, covariate=covariate,
              treat=treat)

# initial values
zinit <- array(dim = c(nsite))
for (j in 1:nsite) {
  zinit[j] <- max(xlast[j, ])
}

inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params2 <- c("p", "beta", "pa1", "pa2", "z")

# build model
ocmod2 <- jags.model(file = "mainef2.txt", inits = inits,
                     data = data2, n.chains = nchain)

# start sampling
update(ocmod2, n.iter = nburn)
out2 <- coda.samples(ocmod2, n.iter = niter,
                     thin=nthin, variable.names = params2)


# Generating AUC posterior...
ggd <- ggs(out)
ggd2 <- ggs(out2)
nstore <- niter / nthin
psi.pred <- array(dim=c(nsite, nstore*nchain))
Z.est <- array(dim=c(nsite, nstore*nchain))
for (j in 1:nsite){
  indx <- which(ggd$Parameter == paste("psi[", j, ",", nyear, "]", sep=""))
  psi.pred[j, ] <- ggd$value[indx]
  indx <- which(ggd2$Parameter == paste("z[", j, "]", sep=""))
  Z.est[j,] <- ggd2$value[indx]
}

AUC2 <- rep(NA, nstore * nchain)
fpr <- array(dim=c(nstore * nchain, nsite + 1))
tpr <- array(dim=c(nstore * nchain, nsite + 1))
for (i in 1:(nstore * nchain)){
  psi.vals <- psi.pred[, i]
  Z.vals <- Z.est[, i]
  pred <- prediction(psi.vals, factor(Z.vals, levels=c("0", "1")))
  perf <- performance(pred, "auc")
  AUC2[i] <- perf@y.values[[1]]
  perf <- performance(pred, "tpr","fpr")
  fpr[i, ] <- perf@x.values[[1]]
  tpr[i, ] <- perf@y.values[[1]]
}

fprm <- melt(fpr, varnames=c("iter", "site"))
tprm <- melt(tpr, varnames = c("iter", "site"))
ROC2 <- data.frame(fpr = fprm$value,
                   tpr = tprm$value,
                   iter = rep(fprm$iter, 2))

Comparing models

How well did our models predict occupancy in the final year of the study, and was one better than the other? We can answer this question by inspecting posteriors for AUC (larger values are better), and the ROC curves.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
p1 <- ggplot(ROC1, aes(x=fpr, y=tpr, group=iter)) +
  geom_line(alpha=0.05, color="blue") +
  geom_abline(intercept = 0, slope = 1, linetype="dashed") +
  ylab("True positive rate") +
  xlab("False positive rate") + theme_bw()

p2 <- ggplot(ROC2, aes(x=fpr, y=tpr, group=iter)) +
  geom_line(alpha=0.05, color="red") +
  geom_abline(intercept = 0, slope = 1, linetype="dashed") +
  ylab("True positive rate") +
  xlab("False positive rate") + theme_bw()

p3 <- ggplot(data.frame(AUC = c(AUC1, AUC2),
              Model = rep(c("Saturated model", "Main effects model "),
                          each=length(AUC1))),
       aes(x=AUC, fill=Model, color=Model)) +
  geom_density(alpha=.8) + theme_bw() + ylab("Posterior density") +
  xlab("Area under receiver operating characteristic curve (AUC)") +
  scale_colour_manual(values=c("red", "blue")) +
  scale_fill_manual(values=c("red", "blue")) +
  theme(legend.position = "top") +
  theme(legend.title=element_blank()) +
  ggtitle("Posterior AUC and ROC curve comparison")

require(gridExtra)
grid.arrange(p3, arrangeGrob(p2, p1, ncol=2), ncol=1)

As expected, the model that generated the data fits better than the model that excludes the strong interaction term. Note that AUC reflects the accuracy of model predictions, and does not penalize model complexity.

To leave a comment for the author, please follow the link and comment on his blog: Ecology in silico.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.