Cross-validation and Information Criteria using JAGS

Jelle van Zweden

Jul 20, 2018, 10:41:31 AM7/20/18
to hmecology: Hierarchical Modeling in Ecology
Dear all,

I'm fairly new to this topic, but in the near future I will have to use hierarchical models as well as GLMMs to estimate trends in population sizes. One of the things I would like to work on is the inclusion of covariates, which means I'll need to use information criteria and/or cross-validation for model selection to decide which covariates should remain in my models.

Now, I used one of Marc Kéry's simulated datasets (Kéry 2010, Chapter 21) to try to implement Bayesian p-values as well as several information criteria (DIC, WAIC, and the posterior predictive loss D) and cross-validation, but I have a hard time figuring out whether I'm doing it correctly. Does anyone have experience with these methods? Can anyone tell me if and where I'm making mistakes? Many thanks for your help.

Best, Jelle


###########################################################################################################


If I run the code below, it takes about 60 minutes to produce the following table:

fold     bay.p  c.hat  bay.p.cv  c.hat.cv    DIC    WAIC       D  non.convergence  n.params.estimated
1         0.31   1.08      0.00      4.37    867     925    6174               12                1215
2         0.47   1.02      0.00      8.54    882     920    6163                9                1215
3         0.37   1.05      0.00      4.28    896     999    6346               11                1215
4         0.36   1.05      0.00      4.17    918     914    6191               17                1215
5         0.45   1.03      0.00      6.56    910     971    6296               19                1215
6         0.37   1.05      0.00      4.49    922    1015    6221               13                1215
7         0.31   1.07      0.02      3.32    912     940    6242               33                1215
8         0.25   1.10      0.01      3.18    894    1007    6340               18                1215
9         0.36   1.05      0.00      5.58    910     913    6168               10                1215
10        0.36   1.06      0.00     19.75    908     903    5885               19                1215
average   0.36   1.06      0.00      6.42  901.9   950.5  6202.6             16.1


###########################################################################################################


library(R2jags)

#######################
### Data generation ###
#######################
n.site <- 200                                                                 # number of sites
vege <- sort(runif(n = n.site, min = -1.5, max = 1.5))                        # vegetation cover

alpha.lam <- 2                                                         # intercept
beta1.lam <- 2                                                         # linear effect of vegetation
beta2.lam <- -2                                                         # quadratic effect of vegetation
lam <- exp(alpha.lam + beta1.lam*vege + beta2.lam*(vege^2))                   # lambda

par(mfrow = c(2,1))
plot(vege, lam, main = "", xlab = "", ylab = "Expected abundance", las = 1)

N <- rpois(n = n.site, lambda = lam)                                          # simulation of abundance
table(N)                                                               # distribution of abundances across sites
sum(N > 0) / n.site                                                       # empirical occupancy

plot(vege, N, main = "", xlab = "Vegetation cover", ylab = "Realized abundance")
points(vege, lam, type = "l", lwd = 2)

par(mfrow = c(2,1))
alpha.p <- 1                                                           # intercept
beta.p <- -4                                                           # linear effect of vegetation
det.prob <- exp(alpha.p + beta.p * vege) / (1 + exp(alpha.p + beta.p * vege)) # detection probability
plot(vege, det.prob, ylim = c(0,1), main = "", xlab = "", ylab = "Detection probability")

expected.count <- N * det.prob                                                # model without error
plot(vege, expected.count, main="", xlab="Vegetation cover",
     ylab="Apparent abundance", ylim = c(0, max(N)), las = 1)
points(vege, lam, type = "l", col = "black", lwd = 2) # Truth

n.survey <- 3                                                         # number of replicate counts per site
count <- array(dim = c(n.site, n.survey))                                     # array to fill in simulated counts
for(j in 1:n.survey){
  count[,j] <- rbinom(n=n.site, size=N, prob=det.prob)                        # simulation of counts
}

site <- 1:n.site                                                              # site id
max.count <- apply(count, 1, max)                                             # maximum count per site
naive.analysis <- glm(max.count ~ vege + I(vege^2), family = poisson)         # simple glm, assuming perfect detection
summary(naive.analysis)
lin.pred <- naive.analysis$coefficients[1] + naive.analysis$coefficients[2]*vege + naive.analysis$coefficients[3] * (vege*vege)
                                                                              # linear predictor of naive analysis

par(mfrow = c(1,1))
plot(vege, max.count, main = "", xlab = "Vegetation cover",
     ylab = "Abundance or count or detection probability*max.N", ylim = c(0,max(N)), las = 1)
points(vege, lam, type = "l", col = "black", lwd = 2, lty=2)                  # true abundance
points(vege, exp(lin.pred), type = "l", col = "red", lwd = 2)                 # prediction of naive analysis
points(vege, det.prob*max(N), type="l", col="blue", lty=3)                    # detection probability * maximum abundance


###################################################################################
### Binomial mixture model, including cross-validation and information criteria ###
###################################################################################

# Model definition
bin.mix <- function(){
  # Priors
  alpha.lam ~ dunif(-10, 10)
  beta1.lam ~ dunif(-10, 10)
  beta2.lam ~ dunif(-10, 10)
  alpha.p ~ dunif(-10, 10)
  beta.p ~ dunif(-10, 10)
  
  # Likelihood
  for (i in indices) {
    N[i] ~ dpois(lambda[i])
    log(lambda[i]) <- alpha.lam + beta1.lam*vege[i] + beta2.lam*vege[i]*vege[i]
    for (j in 1:n.survey) {
      lp[i,j] <- alpha.p + beta.p * vege[i]
      p[i,j] <- exp(lp[i,j]) / (1 + exp(lp[i,j]))
    }
  }
  
  for (i in train.indices){
    for (j in 1:n.survey){
      count[i,j] ~ dbin(p[i,j], N[i])
    }
  }
  
  totalN <- sum(N[])
  
  # Goodness-of-fit measures
  for (i in indices) {
    for (j in 1:n.survey) {
      count.sim[i,j] ~ dbin(p[i,j], N[i])
      exp.count[i,j] <- p[i,j] * N[i] 
      resid.actual[i,j] <- pow((count[i,j]-exp.count[i,j]),2) / (exp.count[i,j] + 0.5)
      resid.sim[i,j] <- pow((count.sim[i,j]-exp.count[i,j]),2) / (exp.count[i,j] + 0.5)
      ppd[i,j] <- pow(p[i,j], count[i,j]) * pow((1-p[i,j]), (N[i]-count[i,j])) # pointwise likelihood (note: omits the binomial coefficient)
    }
  }
  
  fit.actual <- sum(resid.actual[train.indices,])
  fit.sim <- sum(resid.sim[train.indices,])
  c.hat <- (fit.actual+0.001) / (fit.sim+0.001)
  bay.p <- step(fit.sim - fit.actual)
  
  fit.cv.actual <- sum(resid.actual[test.indices,])
  fit.cv.sim <- sum(resid.sim[test.indices,])
  c.hat.cv <- (fit.cv.actual+0.001) / (fit.cv.sim+0.001)
  bay.p.cv <- step(fit.cv.sim - fit.cv.actual)
  
}

# Model settings
nc <- 3
nb <- 5000
ni <- 25000
nt <- 4
params <- c("totalN", "alpha.lam", "beta1.lam", "beta2.lam", "alpha.p", "beta.p", "fit.actual", "fit.sim",
            "c.hat", "bay.p", "fit.cv.actual", "fit.cv.sim", "bay.p.cv", "c.hat.cv", "ppd", "count.sim")

# Shuffle dataset for cross-validation
count.initial <- max.count + 1                                               # initial values for latent N (must be at least the maximum observed count)

indices <- 1:n.site                                                          # site indices
set.seed(1)
shuffled.indices <- sample(indices)
shuffled.sites <- site[shuffled.indices]
shuffled.vege <- vege[shuffled.indices]
shuffled.count <- count[shuffled.indices,]
shuffled.count.initial <- count.initial[shuffled.indices]

# Run model including 10-fold cross-validation
start.time = Sys.time()                                                       # set timer

results.model <- list()                                                       # list to save results of each fold
results.ICs <- data.frame(matrix(NA, ncol=8, nrow=10))                        # data frame to save results of information criteria
colnames(results.ICs) <- c("bay.p", "c.hat", "bay.p.cv", "c.hat.cv", "DIC", "WAIC", "D", "non.convergence")
for (i in 1:10) {
  test.indices <- ((i-1)*round(0.1*n.site)+1):(i*round(0.1*n.site))
  train.indices <- indices[-test.indices]

  inits <- function(){list(N=shuffled.count.initial, alpha.lam=rnorm(1, 0, 1), alpha.p=rnorm(1, 0, 1),
                           beta.p=rnorm(1, 0, 1), beta1.lam=rnorm(1, 0, 1), beta2.lam=rnorm(1, 0, 1))}
  
  jags.data <- list(indices=indices, train.indices=train.indices, test.indices=test.indices, n.survey=n.survey,
                    vege=shuffled.vege, count=shuffled.count)
  
  out <- jags(jags.data, inits, params, bin.mix, n.chains = nc, n.iter = ni, n.burnin = nb, n.thin = nt)
  
  bay.p <- out$BUGSoutput$mean$bay.p
  c.hat <- out$BUGSoutput$mean$c.hat
  bay.p.cv <- out$BUGSoutput$mean$bay.p.cv
  c.hat.cv <- out$BUGSoutput$mean$c.hat.cv
  
  DIC <- out$BUGSoutput$DIC
  
  # WAIC (Gelman et al. 2014): -2*(lppd - pWAIC), with one term per site-by-survey observation
  # (sims.list$ppd has dimensions n.sims x n.site x n.survey)
  lppd <- sum(log(apply(out$BUGSoutput$sims.list$ppd, c(2, 3), mean)))        # log pointwise predictive density
  pD.waic <- sum(apply(log(out$BUGSoutput$sims.list$ppd), c(2, 3), var))      # pWAIC2: posterior variance of the log likelihood
  WAIC <- -2 * (lppd - pD.waic)
  
  # Posterior predictive loss D (Gelfand & Ghosh): squared-error fit term plus predictive-variance penalty
  count.sim.mean <- apply(out$BUGSoutput$sims.list$count.sim, c(2, 3), mean)  # E(y.rep | y) per site and survey
  sum1 <- sum((shuffled.count - count.sim.mean)^2)                            # compare to the (shuffled) data the model was fitted with
  sum2 <- sum(apply(out$BUGSoutput$sims.list$count.sim, c(2, 3), var))        # posterior predictive variance penalty
  D <- sum1 + sum2
  
  non.convergence <- sum(out$BUGSoutput$summary[, "Rhat"] > 1.1)              # number of monitored nodes with Rhat > 1.1
  
  results.model[[i]] <- list(summary=data.frame(out$BUGSoutput$summary))
                            
  results.ICs[i,] <- c(bay.p, c.hat, bay.p.cv, c.hat.cv, DIC, WAIC, D, non.convergence)
  
}
end.time = Sys.time()
elapsed.time = round(difftime(end.time, start.time, units='mins'), dig = 2)
cat(paste0('Posterior models computed in ', elapsed.time, ' minutes\n\n'))

results.ICs

Jelle van Zweden

Jul 23, 2018, 6:13:50 AM7/23/18
to hmecology: Hierarchical Modeling in Ecology

Hi all,

I realized that my post of last Friday may have been a bit too broad in its questions, so let me be more specific about what my exact questions are. Help or answers on any of these points would be appreciated.

1) Is the way in which I did the cross-validation correct, i.e. estimating the counts using only train.indices,

for (i in train.indices){
  for (j in 1:n.survey){
    count[i,j] ~ dbin(p[i,j], N[i])
  }
}

and then calculating bay.p and c.hat for both train.indices and test.indices? Does this correctly reflect within-sample versus out-of-sample validation?
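
A minimal sketch of one common score for a held-out fold, its log predictive density, computed from the monitored ppd[i,j] (object names are those from the code above; only the training counts enter the likelihood, so ppd at the test sites is evaluated against counts the model has not seen; multiplying each held-out site's visits together is just one convention, not the only one):

ppd.sims  <- out$BUGSoutput$sims.list$ppd                      # dimensions: n.sims x n.site x n.survey
ppd.test  <- apply(ppd.sims[, test.indices, , drop = FALSE],   # keep only the held-out sites
                   c(1, 2), prod)                              # joint likelihood over the n.survey visits
lppd.test <- sum(log(apply(ppd.test, 2, mean)))                # log predictive density of the held-out fold
score.cv  <- -2 * lppd.test                                    # deviance scale; sum or average over the 10 folds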

2) I found how to calculate the WAIC and D information criteria in two papers, Hooten & Hobbs (2015) and Gelman et al. (2014), but I'm not entirely sure I understood the mathematical formulas correctly. As it stands, I compute ppd (the pointwise predictive density) and count.sim (the replicated "perfect" dataset), which form the basis for these criteria, for each visit j and each site i, after which WAIC and D are based on sums over all visits and sites. However, these papers are not very specific about multiple visits and only say that values should be summed over sites. Would it perhaps be better to first take an average over visits and then sum over sites? Or does that not matter much? Or does anyone have code for calculating these for N-mixture and/or occupancy models?
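
As one way to make the "sum over sites" reading concrete, here is a minimal sketch of a site-level WAIC in which the per-visit densities are multiplied within each site before averaging over posterior draws, so that every site contributes a single term (object names as in the code above; this is one convention, not necessarily the right one for every model):

ppd.sims  <- out$BUGSoutput$sims.list$ppd              # dimensions: n.sims x n.site x n.survey
ppd.site  <- apply(ppd.sims, c(1, 2), prod)            # joint per-site likelihood, n.sims x n.site
lppd      <- sum(log(apply(ppd.site, 2, mean)))        # log pointwise predictive density, summed over sites
p.waic    <- sum(apply(log(ppd.site), 2, var))         # pWAIC2: posterior variance of the per-site log likelihood
WAIC.site <- -2 * (lppd - p.waic)
# within a cross-validation fold, restrict to the fitted sites: ppd.site[, train.indices]

One caveat: ppd as written in the model omits the binomial coefficient choose(N[i], count[i,j]); because N[i] varies across posterior draws this term does not cancel, so including it (for example via JAGS's logdensity.bin, if your JAGS version provides it) may change the WAIC values.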

3) In the papers mentioned above, the authors use the expression E(y_i | y), and I interpreted that as what I call count.sim, i.e. the replicated "perfect" dataset produced by drawing from a binomial distribution, but perhaps it should instead be the expected count, exp.count, calculated as p[i,j] * N[i]. Which one should it be?
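
For what it's worth, the posterior mean of count.sim and the posterior mean of exp.count estimate the same quantity, E(y.rep | y), because E(y.rep | p, N) = p*N; they should differ only by Monte Carlo noise (the predictive variance term in D is another matter, since Var(y.rep | y) adds binomial sampling variance on top of Var(p*N | y)). A quick check, assuming exp.count is added to the monitored params vector before fitting:

# assumes "exp.count" was added to the params vector before running jags()
E.sim <- apply(out$BUGSoutput$sims.list$count.sim, c(2, 3), mean)   # E(y.rep | y) via the simulated replicates
E.exp <- apply(out$BUGSoutput$sims.list$exp.count, c(2, 3), mean)   # E(y.rep | y) via p[i,j] * N[i]
summary(as.vector(E.sim - E.exp))                                   # differences should be Monte Carlo noise only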

4) Is any of the above different between N-mixture models, occupancy models, and non-hierarchical logistic regressions? That is, I wonder whether, when N represents the number of individuals (as in N-mixture models), you get slightly different formulas than when it effectively represents the number of visits (as in occupancy models). Does this have consequences for the calculation of cross-validation measures or information criteria?

5) In the results.ICs table I produced, the null model actually has the lowest values for DIC, WAIC, and D, which I found somewhat surprising. Is this because I miscalculated something, or is this common for these models/measures, meaning one should only compare models with covariates against each other and not against the null model?

6) Does anyone have specific experience with model selection using N-mixture and occupancy models to estimate trends over years? Is there a typical order or plan to go about it? I'm asking because there are several approaches, such as information criteria or indicator variable selection, with or without cross-validation, but I will have to do this for several hundred species and every extra step can take a lot of time, so perhaps someone has already worked through several methods and found one that works well.

Sorry if this is a bit much, but I thought this would be the right group to ask, and any ideas are welcome.

Cheers,

Jelle

John Clare

Jul 23, 2018, 3:15:49 PM7/23/18
to hmecology: Hierarchical Modeling in Ecology
You might check out this paper (if the link doesn't work: Broms et al. 2016, Model selection and assessment for multispecies occupancy models), which may help you work through some of the issues raised in points 2, 3, and 6.

One major difference between logistic regression and N-mix/Occu models is that the latter (as normally implemented in the BUGS language) incorporate latent variables within the likelihood. Apparently (https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/ecs2.1477), WAIC performs better if the likelihood used to estimate pD and the lppd marginalizes over the latent state variables. 
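
To make the marginalization idea concrete, below is a rough sketch (an illustration of the idea, not code from the linked paper) of a per-site likelihood for the N-mixture model above with the latent N summed out, computed in R from posterior draws; it assumes "lambda" and "p" are added to the monitored params, and K is an assumed upper bound for the abundance sum.

# assumes "lambda" and "p" were added to the params vector before running jags()
K <- 100                                                              # assumed upper bound for latent abundance
marg.lik <- function(counts.i, lambda.i, p.i, K) {
  Ngrid <- max(counts.i):K                                            # admissible latent abundances
  prior <- dpois(Ngrid, lambda.i)                                     # Poisson(lambda) over latent N
  obs   <- sapply(Ngrid, function(n) prod(dbinom(counts.i, n, p.i)))  # counts | N, across visits
  sum(prior * obs)                                                    # latent N summed out
}

lam.sims <- out$BUGSoutput$sims.list$lambda                           # n.sims x n.site
p.sims   <- out$BUGSoutput$sims.list$p                                # n.sims x n.site x n.survey (p is constant over visits here)
n.sims   <- nrow(lam.sims)
lik <- matrix(NA, n.sims, n.site)
for (s in 1:n.sims) {
  for (i in 1:n.site) {
    lik[s, i] <- marg.lik(shuffled.count[i, ], lam.sims[s, i], p.sims[s, i, 1], K)
  }
}
lppd.marg <- sum(log(colMeans(lik)))                                  # marginal log pointwise predictive density
p.waic    <- sum(apply(log(lik), 2, var))                             # pWAIC2 based on the marginal likelihood
WAIC.marg <- -2 * (lppd.marg - p.waic)
# within a CV fold, use only lik[, train.indices] (or lik[, test.indices] for an out-of-sample score)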

Re. 6), I don't know of any complete model-selection comparisons for estimating trends (one paper I can think of that compares indicator variable selection and out-of-sample performance with simulated data is https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/ecy.2166). One major consideration that might help limit the scope of any comparison is whether you eventually plan to choose a single "best" model or to average across several models.

Jelle van Zweden

Jul 24, 2018, 10:20:26 AM7/24/18
to hmecology: Hierarchical Modeling in Ecology
Thanks for the pointers, John! I'll definitely check out those papers. I think I'm going for a single "best" model per species for now, but perhaps model averaging is not such a bad idea, since it might allow better comparisons between species.

Cheers, Jelle