gmultmix models in JAGS; code tweaks welcomed


Jessi Brown

Jul 24, 2013, 1:38:58 PM
to unma...@googlegroups.com
Hello, folks.

I have been trying to replicate the gmultmix N-mixture models in JAGS, for a removal design with both primary and secondary periods. I may have succeeded in the ideal case where there are no missing data (thanks to Richard Chandler for the idea of the sequential binomial trials for the removal part), but things rapidly fall apart once I introduce a missing data point or two (either in the catch data or in the covariates). Does anyone have an idea of how to make the code fully general so that it allows missing data? And any suggestions for speeding up the code? It is darn slow, which is not surprising, since I'm fairly new to WinBUGS and brand-new to JAGS.

Thanks in advance for any advice or pointers.
cheers, Jessi Brown

Code appended below:

# my attempt at making JAGS code for
# the Chandler et al. 2011 temporary
# emigration N-mixture models
# removal design with secondary samples

library(unmarked)
library(rjags)
load.module("glm")

# generating some data

# Data dimensions
nSites <- 200
nSecondary <- 3
nPrimary <- 6

# Parameters
expit <- function(x) {
    1/(1+exp(-x))
}

# dummy variable to assign sites to one of two areas
area <- round(runif(nSites, 0, 1))

# super-population calculation
lambda0 <- log(200)  #5.298317
lambda1 <- log(100)-log(200) # area1 sites are less well-populated, -0.6931472
lambda <- exp(lambda0 + lambda1*area)        # mean abundance in primary period 1

# linear availability for capture; max at -2 temp
temp <- array(NA, c(nSites, nPrimary))
for(t in 1:nPrimary){
    temp[,t] <- runif(nSites, -2, 2)
}
phi0 <- -1.5459
phi1 <- -1.1422
phi <- expit(phi0 + phi1*temp)       # availability for capture

# unimodal detection probability; max at 0 time, each secondary sample
# follows in sequence
time <- array(NA, c(nSites, nSecondary, nPrimary))
for(t in 1:nPrimary){
    time[,1,t] <- runif(nSites, -2, 2)
    time[,2,t] <- time[,1,t] + abs(rnorm(nSites, 0.1, 0.05))
    time[,3,t] <- time[,2,t] + abs(rnorm(nSites, 0.1, 0.05))
    }
time2 <- time*time
p0 <- -2.1969
p1 <- 0
p2 <- -0.95069
p <- expit(p0 + p1*time + p2*time*time)    # detection probability

# Super-population size
set.seed(458)
M <- rpois(nSites, lambda)

# Population size during each primary sampling period
N <- matrix(NA, nSites, nPrimary)

# Observed counts
y <- array(NA, c(nSites, nSecondary, nPrimary))

for(i in 1:nSites) {
    for(j in 1:nPrimary) {
        # number available for capture in primary period j
        N[i,j] <- rbinom(1, M[i], phi[i,j])
        # sequential binomial removal passes: each pass samples
        # from the animals not yet caught in that period
        y[i,1,j] <- rbinom(1, N[i,j], p[i,1,j])
        Nleft1 <- N[i,j] - y[i,1,j]
        y[i,2,j] <- rbinom(1, Nleft1, p[i,2,j])
        Nleft2 <- Nleft1 - y[i,2,j]
        y[i,3,j] <- rbinom(1, Nleft2, p[i,3,j])
        }
}

plot(1:nPrimary, colSums(N), xlab="Primary period", ylab="Population available", type="o")

#############################################################
#
#  on to JAGS model
#  attempted removal model
#
#############################################################

sink(file="temp_em_covs.txt")
cat("
model {
lambda.int ~ dnorm(0, 0.001)
lambda.beta1 ~ dnorm(0, 0.001)
phi.int ~ dnorm(0, 0.001)
phi.beta1 ~ dnorm(0, 0.001)
p.int ~ dnorm(0, 0.001)
p.beta1 ~ dnorm(0, 0.001)
p.beta2 ~ dnorm(0, 0.001)

for(i in 1:nSites) {
  M[i] ~ dpois(lambda[i])
  log(lambda[i]) <- lambda.int + lambda.beta1*area[i]
  for(j in 1:nPrimary) {
      N[i,j] ~ dbin(phi[i,j], M[i])
      logit(phi[i,j]) <- phi.int + phi.beta1*temp[i,j]
      for(k in 1:nSecondary) {
          logit(p[i,k,j]) <- p.int + p.beta1*time[i,k,j] + p.beta2*time2[i,k,j]
     }
      y[i,1,j] ~ dbin(p[i,1,j], N[i,j])
      y[i,2,j] ~ dbin(p[i,2,j], N[i,j]-y[i,1,j])
      y[i,3,j] ~ dbin(p[i,3,j], N[i,j]-y[i,1,j]-y[i,2,j])
      }
  }
}

", fill=TRUE)
sink()

# Format data and create function to initiate parameters
Mst <- apply(y, 1, function(x) {
             if(all(is.na(x)))
                 return(2)
             else
                 return(sum(x, na.rm = TRUE) + 30)
         })
              
dat.cov <- list(nSites=nSites, nPrimary=nPrimary, nSecondary=nSecondary, y=y, area=area, temp=temp, time=time, time2=time2)
init.cov <- function() list(lambda.int=rnorm(1), lambda.beta1=rnorm(1), phi.int=rnorm(1), phi.beta1=rnorm(1), p.int=rnorm(1), p.beta1=rnorm(1), p.beta2=rnorm(1), M=Mst)
pars.cov <- c("lambda.int", "lambda.beta1", "phi.int", "phi.beta1", "p.int", "p.beta1", "p.beta2")

# Compile the model.
jm.cov <- jags.model("temp_em_covs.txt", dat.cov, init.cov, n.chains=2, n.adapt=5000)
# seems very sensitive to the initial values for Mst

# Draw samples from the posterior
jc.cov <- coda.samples(jm.cov, pars.cov, n.iter=10000)
# View the Markov chains
plot(jc.cov, ask=T)
summary(jc.cov) # coefficients generally in ballpark especially when run longer
# MCMC diagnostics
gelman.diag(jc.cov)    # Should be ~ <1.1
autocorr.plot(jc.cov)  # If samples are autocorrelated, you need more iters
crosscorr(jc.cov)      # Correlation among parameters

# try putting an NA in or two; uncomment to choose which data gets NA
# "fix" by just putting a zero back in
# time[1,1,5] <- NA  
y[1,1:3,5] <- NA
# temp[1,5] <- 0  

Mst <- apply(y, 1, function(x) {
             if(all(is.na(x)))
                 return(2)
             else
                 return(sum(x, na.rm = TRUE) + 30)
         })
          
dat.cov <- list(nSites=nSites, nPrimary=nPrimary, nSecondary=nSecondary, y=y, area=area, temp=temp, time=time, time2=time2)
init.cov <- function() list(lambda.int=rnorm(1), lambda.beta1=rnorm(1), phi.int=rnorm(1), phi.beta1=rnorm(1), p.int=rnorm(1), p.beta1=rnorm(1), p.beta2=rnorm(1), M=Mst)
pars.cov <- c("lambda.int", "lambda.beta1", "phi.int", "phi.beta1", "p.int", "p.beta1", "p.beta2")

# Compile the model.
jm.cov <- jags.model("temp_em_covs.txt", dat.cov, init.cov, n.chains=2, n.adapt=5000)
# apparently can't have NAs in the time or temp covariate
# sometimes will run if NAs are in the response array y
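
# quick sanity check (just base R, nothing model-specific): count the NAs
# in each element of the data list before calling jags.model()
sapply(dat.cov, function(z) sum(is.na(z)))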

Dan Linden

Jul 24, 2013, 2:23:04 PM
to unma...@googlegroups.com
Hi Jessi,

I would definitely recommend Marc Kéry and Michael Schaub's "Bayesian Population Analysis using WinBUGS" book if you are serious about running these models in JAGS.  The book provides WinBUGS code, but the authors also include an R script with everything converted to JAGS (the changes are minor).  It has tips on optimizing these models and on where you can sometimes gain efficiencies.

As for the missing data, there should be no problem with having NAs in the response variable.  Covariates definitely need values, though; one solution is to standardize the covariate values for which measurements exist and then set all the NAs to 0 (effectively assigning missing values the mean).  There are some implications of doing this that are beyond the scope of my comment - I believe some discussion can be found in the Program MARK book.  But so long as the proportion of NAs in the covariate is relatively small it shouldn't be a problem.
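
For example, something like this untested sketch (object names follow the code in the first post) would do the standardize-and-zero-fill step in R; the same idea would apply to time and time2:

# standardize using only the observed values; on the standardized scale,
# 0 corresponds to the mean of the observed values
mu.temp <- mean(as.vector(temp), na.rm = TRUE)
sd.temp <- sd(as.vector(temp), na.rm = TRUE)
temp.std <- (temp - mu.temp) / sd.temp
temp.std[is.na(temp.std)] <- 0

# pass the standardized, filled-in covariate to JAGS in place of temp
dat.cov$temp <- temp.std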

Jessi Brown

Jul 25, 2013, 1:15:14 PM
to unma...@googlegroups.com

Thanks for the advice, Dan.

Although that book is high on my wish list, I don't have it yet, nor can I find it in the few university libraries available to me. But I found the example code on their website. While there are definitely some helpful ideas, there isn't an exact analogue. It looks like they mostly went the route of filling in missing values with the mean, which I could certainly try.

Using the overall mean might make the most sense for one kind of missing value - when the surveyors simply failed to take that measurement that day (probably equipment failure). However, there are many more cases where the surveys just weren't run, so of course there are no covariates to go with them. I guess I'll have to take a close look at the real data and decide how bad things might be.

However, I am a bit puzzled about how to approach the problem of missing categorical covariates. What is the mean of multiple categories? Should I assign the missing values to categories randomly? Or choose one level as the sacrificial one... or maybe make a new level for "not recorded"? Or two new levels, "not recorded" and "survey not performed"?

Still, thanks for the thoughts so far. Any more ideas out there?

Thanks, Jessi Brown

Andrew Cox

Jul 25, 2013, 3:55:41 PM
to unma...@googlegroups.com
Hi Jessi,
Unfortunately I don't have any thoughts on how to deal with points with missing surveys.  My instinct is to delete the entire point from the analysis unless doing so 1) is prohibitive with regard to sample size, or 2) removes a non-random subset of samples.

As to the question of categorical variables, if you don't have too many levels, I wonder if you could treat them as dummy variables. So if there is a habitat category with three levels (forest, grassland, and shrub), data would look like this:

Forest   Grassland
  1          0
  0          1
  0          0
  0.33       0.33

Where the first row is forest, the second grassland, the third shrub, and the fourth the "mean."
The last row is how you "control" for categorical variables in SAS when producing estimates from GLMs, but I'm not certain whether it will work on the front end.  Just a thought - perhaps someone smarter than me will chime in.
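
For what it's worth, a rough (untested) sketch of building those 0/1 columns in R with model.matrix(), using a made-up habitat factor; the 0.33 "mean" row would still need to be added by hand:

# hypothetical three-level habitat factor (placeholder data), with shrub as the
# reference level so that shrub rows come out as 0 0, matching the table above
habitat <- factor(c("forest", "grassland", "shrub", "forest"),
                  levels = c("shrub", "forest", "grassland"))

# design matrix; drop the intercept column to keep just the two 0/1 indicators
X <- model.matrix(~ habitat)[, -1]
X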

- Andrew

Shannon Knapp

Jul 26, 2013, 1:06:57 PM
to unma...@googlegroups.com
Let me preface this by saying that this is in *no way* my area of expertise, but I suggest you look into the survey-sampling literature, in particular on issues of non-response.  I don't think the "mean" (e.g. 1/3 for a categorical variable with 3 levels) would be wise.  Methods discussed there include multiple imputation and the EM algorithm.  These are (or can be) computationally intensive, so it may be problematic to layer one computationally intensive method on top of another, but it is something you should consider.

Jeffrey Royle

Jul 26, 2013, 6:04:10 PM
to unma...@googlegroups.com
hi all,
 I agree with Shannon on this: you should generally avoid setting the missing covariates to the mean value (for a categorical variable, this would probably be the modal value).
 If there are very few missing values then this is probably not a terrible thing to do, but otherwise I suggest putting a prior distribution on the covariate, in which case BUGS/JAGS will do the imputation automatically by filling in the missing values with draws from the posterior distribution.   For a categorical random variable you could do this simply by putting:  x[i] ~ dcat(probs)
or similar, and then specifying a prior distribution for the elements of "probs".
For example, if the variable has 3 levels then you might do:

for(i in 1:3){
  alpha[i] ~ dgamma(1, 1)
  probs[i] <- alpha[i] / sum(alpha[])
}

which I think imposes a Dirichlet prior distribution on "probs".
In this way, the analysis will properly account for uncertainty about the missing covariate values.
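
To sketch how that could plug into the model from the first post (an illustrative fragment only, not tested: hab[i] is a hypothetical site-level categorical covariate, and the normal prior on temp is just one reasonable choice for imputing a continuous covariate):

# Dirichlet-type prior on the category probabilities
for(m in 1:3){
  alpha[m] ~ dgamma(1, 1)
  probs[m] <- alpha[m] / sum(alpha[])
}
# prior for a continuous covariate with missing values
mu.temp ~ dnorm(0, 0.001)
tau.temp ~ dgamma(0.001, 0.001)

for(i in 1:nSites) {
  hab[i] ~ dcat(probs[])                   # missing hab[i] are imputed from the posterior
  for(j in 1:nPrimary) {
    temp[i,j] ~ dnorm(mu.temp, tau.temp)   # missing temp[i,j] are likewise imputed
  }
}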

For likelihood methods this is less easy to do. As Shannon said, see the standard literature on EM and related approaches.

regards
andy


