Extreme RAM usage in multi-species occupancy model


domini...@gmail.com
Oct 6, 2017, 5:31:49 AM
to hmecology: Hierarchical Modeling in Ecology

Dear all,


I am trying to run a fairly simple multi-species occupancy model based on detection/non-detection data, where species are modelled as random effects. My primary aim is to estimate species richness at each of the sampling sites. I have a total of 529 sites and a pool of 556 species. The number of repeated surveys varies by site, ranging from 2 to 651. Detection probability is modelled as a function of two survey-specific covariates (Julian day and survey duration) and one site-specific covariate (road density). Occupancy probability is modelled simply as the mean. The structure of the model below is based on those in chapter 11 of the AHM book.


model {

    ## Ecological model
    for (k in 1:nspec){             # Loop over species
      logit(psi[k]) <- lpsi[k]
      for (i in 1:npen){            # Loop over sites
        z[i,k] ~ dbern(psi[k])
      }
    }

    ## Observation model
    for (k in 1:nspec){             # Species
      for (i in 1:npen){            # Sites
        for (j in 1:survey[i]){     # Index specific survey replicates for each site
          logit(p[i,j,k]) <- lp[k] + beta1[k]*Ju[i,j] + beta2[k]*Hr[i,j] + beta3[k]*Rd[i]
          mup[i,j,k] <- z[i,k] * p[i,j,k]
          Y[i,j,k] ~ dbern(mup[i,j,k])
          # Note: Y is a huge 3D array with dimensions 529 sites x 651 replicates x 556 species
        }
      }
    }

    ## Derived quantities
    for (k in 1:nspec){             # Loop over species
      Nocc.fs[k] <- sum(z[,k])      # Number of occupied sites
    }

    for (i in 1:npen){              # Loop over sites
      Nsite[i] <- sum(z[i,])        # Species richness at each site
    }

    ## Priors ##
    for (k in 1:nspec){             # Loop over species
      lpsi[k] ~ dnorm(mu.lpsi, tau.lpsi)
      lp[k] ~ dnorm(mu.lp, tau.lp)
      beta1[k] ~ dnorm(mu.beta1, tau.beta1)
      beta2[k] ~ dnorm(mu.beta2, tau.beta2)
      beta3[k] ~ dnorm(mu.beta3, tau.beta3)
    }

    mu.psi ~ dunif(0, 1)
    mu.lpsi <- logit(mu.psi)
    sd.lpsi ~ dunif(0, 5)
    tau.lpsi <- pow(sd.lpsi, -2)

    mu.p ~ dunif(0, 1)
    mu.lp <- logit(mu.p)
    sd.lp ~ dunif(0, 5)
    tau.lp <- pow(sd.lp, -2)

    mu.beta1 ~ dnorm(0, 0.1)        # Ju - Julian day
    sd.beta1 ~ dunif(0, 5)
    tau.beta1 <- pow(sd.beta1, -2)

    mu.beta2 ~ dnorm(0, 0.1)        # Hr - Survey duration
    sd.beta2 ~ dunif(0, 5)
    tau.beta2 <- pow(sd.beta2, -2)

    mu.beta3 ~ dnorm(0, 0.1)        # Rd - Road density
    sd.beta3 ~ dunif(0, 5)
    tau.beta3 <- pow(sd.beta3, -2)

}


The model does not run to completion, and after reading this post I think the problem I'm having is referred to as a "memory leak". Essentially, as I start to run the model the RAM usage surges to maximum capacity until my operating system crashes (PC specs: 2.13 GHz Intel Xeon CPU with 32 GB of RAM).


My data set is obviously very large and is no doubt part of the problem. I have run the model using a subset of 100 sites and it does work, although it runs painfully slowly. I'm using JAGS 4.2.0 and running it through jagsUI 1.4.4.


Is there a different way this model could be written so that the process is more efficient? Or have I made a rudimentary error somewhere along the line? I have limited experience with Bayesian models in general so any help or insight would be greatly appreciated!

Kery Marc
Oct 6, 2017, 5:50:01 AM
to domini...@gmail.com, hmecology: Hierarchical Modeling in Ecology

Hello Dominic,

 

These very months I have been wrestling with a similar problem with a community model in JAGS: a dynamic model with only two primary periods (= Atlas periods) and 124 species, but with a total of 3500 sites. The model doesn't run at all on my powerful desktop, and even on a supercomputer we have so far not been able to run it. So one of the things that I tried (today!) was to drop the random-effects assumptions about the species-specific effects, i.e., turning the model back into a fixed-effects model, and now it seems to run on my desktop, at least for short trial runs. I think this is not ideal, since treating species-level parameters as random would be the natural way, but treating them as fixed is not wrong and will allow me to obtain valid species richness estimates, though perhaps I would get more precision with the usual random-effects version of the model.
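For concreteness, in a model like Dominic's the switch amounts to replacing the hierarchical priors on the species-level parameters with independent vague ones; a minimal sketch (the exact priors are illustrative):

## Fixed-effects variant of the species-level priors
for (k in 1:nspec){
  lpsi[k] ~ dnorm(0, 0.1)   # each species gets its own vague prior
  lp[k] ~ dnorm(0, 0.1)     # no shared mu/tau hyperparameters to estimate
}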

 

Best regards  -- Marc


Dominic Henry
Oct 6, 2017, 6:32:07 AM
to Kery Marc, hmecology: Hierarchical Modeling in Ecology
Hi Marc,

Thanks for your reply. Interesting to hear that you are facing a similar issue! 

I haven't tried my model on a high-performance computer, but maybe I should see if I have any luck. I had thought about dropping the random effects, but I only wanted to go down that route as a last resort. I guess a sacrifice in precision is worth it if I can get an actual estimate of species richness!

Regards,

Dominic

--
Dominic Henry

Centre for Statistics in Ecology, Environment and Conservation 
Statistical Sciences Department
University of Cape Town
    

kkellner
Oct 6, 2017, 9:35:33 AM
to hmecology: Hierarchical Modeling in Ecology
Hi Dominic,

Are you saving the parameters 'p' or 'mup' in your final JAGS output? If you are and you don't really need them (or can summarize them by site/species within the model), dropping them might be one way to reduce the memory footprint, so that JAGS isn't storing hundreds or thousands of posterior replicates of those giant arrays in memory (and in the final output object). If you are saving these giant arrays to output, I would also suggest using jags.basic() from jagsUI, or just using rjags directly, so that you don't have to wait for jagsUI to calculate summary stats for thousands of individual array values.

Another approach that might help has to do with your variable number of surveys per site (2-651). If most sites are closer to the lower end of that range, you have many Y-array values that are missing (e.g., if site i was sampled 10 times, Y[i,11:651,1:nspecies] contains no useful information). I *think* the missing values would still use RAM, though.

You could try splitting your Y matrix into several "bins" grouped by the number of survey replicates to work around this. So the first bin y1 could contain sites that were sampled a maximum of 10 times, the second bin y2 a maximum of 100 times, and so on. Depending on the distribution of the number of surveys among sites, this may greatly decrease the amount of useless array space you are storing. It would require some effort to sort and break apart the observation data before running the model, and potentially more work re-combining things afterwards.

I wrote up some dummy code for the observation model illustrating what I mean:

#Input data
nspecies <- 20    #Total number of species
nsites <- 10      #Total number of sites
nreps <- c(2,6,7,3,5,12,14,75,32,45)    #Number of replicates by site

#Site group 1: 2-10 replicates
#The first 5 sites fall in this group
ind.grp1 <- c(1,5)    #Start and end site indexes for first group
y1 <- array(data1, dim=c(5,10,nspecies))    #data1: pre-sorted observations for group 1
#5 sites in this group so 1st dimension is 5; max surveys is 10 so second is 10

#Site group 2: 11-100 replicates
#Say the next 5 sites fall in this group
ind.grp2 <- c(6,10)   #Start and end site indexes for second group
y2 <- array(data2, dim=c(5,100,nspecies))   #data2: pre-sorted observations for group 2
#Also 5 sites in this group, and 2nd dim is 100 because max surveys is 100


model {

    #Observation model

    for (k in 1:nspecies){
        alpha.p[k] ~ dnorm(sp.pmean, sp.ptau)  #Random species intercept
    }

    #Group 1: max 10 replicate obs
    for (k in 1:nspecies){
        for (i in ind.grp1[1]:ind.grp1[2]){
            for (j in 1:nreps[i]){

                logit(p1[i,j,k]) <- alpha.p[k] + .....
                mup1[i,j,k] <- z[i,k] * p1[i,j,k]
                y1[i,j,k] ~ dbern(mup1[i,j,k])
            }
        }
    }

    #Group 2: 11-100 replicate obs
    for (k in 1:nspecies){ #Technically don't need to split the species loop in two, but I did it for clarity
        for (i in ind.grp2[1]:ind.grp2[2]){
            for (j in 1:nreps[i]){

                logit(p2[i,j,k]) <- alpha.p[k] + .....
                mup2[i,j,k] <- z[i,k] * p2[i,j,k]
                #Note the offset: row 1 of y2 corresponds to site ind.grp2[1]
                y2[i - ind.grp2[1] + 1, j, k] ~ dbern(mup2[i,j,k])
            }
        }
    }

}

Caveat: I haven't tested this myself.

Ken

Andrea Goijman
Oct 6, 2017, 10:12:45 AM
to kkellner, hmecology: Hierarchical Modeling in Ecology
Hi Dominic,

I'm not an expert, but I'll allow myself into the discussion because I ran into this problem too. I have a huge database from a bird monitoring program that has been going on for 10 years. To solve the issue, I created separate models for different species groups (e.g., raptors, passerines, columbiformes, etc.), and it worked perfectly, using random effects. Of course I couldn't estimate the derived parameter N, but in my case that was not central.

Cheers!

Andrea

--
Andrea Paula Goijman, Ph.D
Cell 54 11 6293 6214

Kery Marc
Oct 6, 2017, 10:28:36 AM
to kkellner, hmecology: Hierarchical Modeling in Ecology

Dear Ken,

 

Is it true that with jags.basic one cannot run a job in parallel?

 

Thanks and best regards  -- Marc

Richard Chandler
Oct 6, 2017, 10:35:19 AM
to Kery Marc, kkellner, hmecology: Hierarchical Modeling in Ecology
Hi Marc and everyone,

FWIW, you can run 'rjags' in parallel using the 'parallel' package. I also agree with the earlier comment that it's a good idea to avoid creating unnecessary nodes (especially big arrays), because JAGS will keep them in memory. Still, JAGS can be a memory hog.
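A minimal sketch of that pattern, assuming the model file and the data/inits/params objects have already been prepared (the object names here are illustrative):

library(parallel)
library(coda)

cl <- makeCluster(3)                              # one worker per chain
clusterExport(cl, c("jags.data", "inits", "params"))
chains <- parLapply(cl, 1:3, function(chain){
  library(rjags)
  m <- jags.model("model.txt", data = jags.data,
                  inits = inits[[chain]], n.chains = 1, n.adapt = 1000)
  update(m, n.iter = 1000)                        # burn-in
  coda.samples(m, variable.names = params, n.iter = 5000)
})
stopCluster(cl)
out <- mcmc.list(lapply(chains, function(x) x[[1]]))  # recombine the 3 chains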

Richard




--
Richard Chandler
Assistant Professor
Warnell School of Forestry and Natural Resources
University of Georgia

kkellner
Oct 6, 2017, 10:39:18 AM
to hmecology: Hierarchical Modeling in Ecology
You can run parallel chains in jagsUI::jags.basic() the same way you do with jagsUI::jags(). jags.basic() is identical except that it doesn't calculate summary stats for all the nodes at the end and just returns the "raw" rjags output.

If you are having memory issues, though, running chains in parallel is likely to make things worse, because you have to store multiple copies of the large arrays in RAM at the same time (one for each core).

Ken


Kery Marc
Oct 6, 2017, 12:11:55 PM
to kkellner, hmecology: Hierarchical Modeling in Ecology

Dear Ken,

 

And "update" will also work when the model has first been fit using "jags.basic", right?

kkellner
Oct 6, 2017, 12:53:44 PM
to hmecology: Hierarchical Modeling in Ecology
Marc,

That's correct: there's an update method for jags.basic objects (jagsUI:::update.jagsUIbasic). To use it, when you first call jags.basic() you need to set the argument save.model=TRUE, which saves the model object along with the rjags output so that the update can be done correctly.
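In code, that workflow looks roughly like this (a sketch; argument values are placeholders):

library(jagsUI)

out <- jags.basic(data = jags.data, inits = inits,
                  parameters.to.save = params,
                  model.file = "model.txt",
                  n.chains = 3, n.iter = 2000, n.burnin = 1000,
                  save.model = TRUE)   # keep the model object so update() works
out2 <- update(out, n.iter = 5000)     # continue sampling where the first run stopped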

Ken

Maxwell Joseph
Oct 7, 2017, 12:18:15 AM
to hmecology: Hierarchical Modeling in Ecology
It seems like the p, mup, and Y arrays potentially have a lot of entries that are irrelevant to the likelihood, since the number of repeat surveys per site ranges from 2 to 651.

It may be worth unrolling these arrays so that, if you have N surveys in total, you only represent p and Y once for each survey x species combination; both p and Y then become matrices with N rows and nspec columns, where N is the total number of surveys and nspec is the number of species. The likelihood in JAGS would then be something like:

for (i in 1:N) {        # loop over surveys, where each i refers to a site x survey combination
  for (k in 1:nspec) {  # loop over species
    logit(p[i, k]) <- lp[k] + beta1[k]*Ju[i] + beta2[k]*Hr[i] + beta3[k]*Rd[i]
    Y[i, k] ~ dbern(p[i, k] * z[site[i], k])
  }
}

This will require restructuring your detection covariates so that they are also vectors instead of 2D arrays, and you'll need an index "site" of length N that identifies the site corresponding to survey i = 1, 2, ..., N. The advantage of this approach is that your p and Y arrays will only contain elements that are relevant to the likelihood. Essentially, it lets you represent the ragged array structure that results from a variable number of repeat surveys without the cost of carrying many NA entries in the 3D arrays for p, mup, and Y.
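For example, the long-format objects could be built from the original arrays roughly like this (an untested sketch in R, using the names from Dominic's model):

site <- rep(1:npen, times = survey)   # site index for each of the N surveys
N <- length(site)
Ju.long <- Hr.long <- numeric(N)
Y.long <- matrix(NA, nrow = N, ncol = nspec)
row <- 1
for (i in 1:npen) {
  for (j in 1:survey[i]) {
    Ju.long[row] <- Ju[i, j]          # survey covariates become vectors
    Hr.long[row] <- Hr[i, j]
    Y.long[row, ] <- Y[i, j, ]        # detection data becomes an N x nspec matrix
    row <- row + 1
  }
}
Rd.long <- Rd[site]                   # expand the site covariate to survey level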

Alternatively, you might look into implementing this in Stan, where you'll marginalize z out of the model, ending up with a lower dimensional parameter space that may require less RAM.

domini...@gmail.com
Oct 7, 2017, 2:18:34 AM
to hmecology: Hierarchical Modeling in Ecology
Hi Ken,

Thanks for your feedback. I have already been using jags.basic() and have only been saving the parameters of interest. I normally change the monitored parameters and run the model again if I need other outputs. As you've noted, running JAGS in parallel results in the computer crashing three times as quickly.

The distribution of the number of replicate surveys is highly right-skewed, so I think your approach of splitting the Y matrix may be promising. It would substantially reduce the number of empty Y array values. I'm going to try to re-formulate the model based on your suggestion. I'll let you know how it goes!

Cheers,

Dominic

Kery Marc
Oct 7, 2017, 3:23:41 AM
to domini...@gmail.com, hmecology: Hierarchical Modeling in Ecology
Dear Dominic et al.,

I'm not sure whether Ken's proposed solution is identical to what Maxwell proposed, but the latter is indeed one of the first things I would do with such incredibly unbalanced data. We customarily run dynamic occupancy models for single species when we produce trend estimates for all Swiss breeding birds every year, based on opportunistic data. A typical data set for one species might have, say, 5000 sites and 100 days. If we organize the data in a regular array, then literally more than 95% of the entries are missing values, which must be estimated when the model is specified for this data arrangement. Indeed, we can't fit the models in JAGS in that way on our computers. However, when we string out the data into simple vectors, get rid of all NAs, and add data columns for the year and the day, we can run these models quite easily (i.e., within hours or a couple of days).

Best regards  --- Marc






kkellner
Oct 7, 2017, 10:34:06 AM
to hmecology: Hierarchical Modeling in Ecology
Maxwell's solution is better than mine, because it should completely get rid of all the "unimportant" entries in the array, whereas mine would only reduce them.

It might take a bit longer to set up / reformat the covariates, but given how long the model takes to run, it's probably worthwhile!

Ken

Jesse Whittington
Oct 7, 2017, 10:46:39 AM
to Kery Marc, domini...@gmail.com, hmecology: Hierarchical Modeling in Ecology
Another option with ragged arrays is to change the observation data and associated covariates into long format and use indices for species (species.id), site (site.id), and the latent state. I have not compared speeds, so I don't know if it is faster.

for (i in 1:n.obs){  # loop over all observations in long format
  logit(p[i]) <- lp[species.id[i]] + beta1[species.id[i]]*Ju[i] + beta2[species.id[i]]*Hr[i] + beta3[species.id[i]]*Rd[i]
  Y[i] ~ dbern(p[i] * z[site.id[i], species.id[i]])
}

Jesse


Morgan Tingley
Oct 8, 2017, 10:52:03 AM
to hmecology: Hierarchical Modeling in Ecology
[This might be a sub-thread]
I am intrigued that in all of this discussion no one has mentioned that perhaps Dominic should reduce his dataset. If your sampling is so unbalanced that the number of replicate surveys varies from 2 to 651, then this can cause problems of its own (not just in run-time). If you have a few rare sites where the number of repeats is orders of magnitude larger than the median, then assuming closure over all of these samples (e.g., closure over 651 samples) may not be valid, and, more critically, it may bias the likelihood to be weighted entirely by a few hyper-visited sites. If closure is equally met across all sites, and some just happened to be hyper-visited, then it might be OK. What I find far more common in ecology, however, is that closure is relaxed the more samples you have, so that a site with 2 revisits has a different practical definition of closure than a site with 651 revisits. Closure affects the probability of detection, and so a very unbalanced dataset may have biased detection probabilities that are defined entirely by the hyper-visited sites.

How to avoid this? When I've encountered it in the past, I've reduced my dataset either through filtering rules or random sub-sampling. In Dominic's case, if the median number of re-visits is k* (e.g., k* = 10), then I might reduce the hyper-visited sites down to something much closer to k* visits. This would certainly help with memory and fitting, by drastically reducing the dimensionality of the fitted matrices, but it will also reduce bias in your fitted parameters. This could (should?) also be used if you choose Max's strategy of vectorizing the data.
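A rough sketch of that sub-sampling in R, using the object names from Dominic's model (the value of k* and the random sampling rule are illustrative; Ju and Hr would need the same subsetting):

k.star <- 10                            # cap on the number of surveys per site
keep <- lapply(1:npen, function(i){
  if (survey[i] <= k.star) 1:survey[i] else sort(sample(1:survey[i], k.star))
})
survey.sub <- pmin(survey, k.star)      # new per-site replicate counts
Y.sub <- array(NA, dim = c(npen, k.star, nspec))
for (i in 1:npen) Y.sub[i, 1:survey.sub[i], ] <- Y[i, keep[[i]], ]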

--
Morgan W. Tingley
Assistant Professor
Dept of Ecology and Evolutionary Biology
University of Connecticut
Storrs, CT 06269

Dominic Henry
Oct 9, 2017, 5:10:38 AM
to Morgan Tingley, hmecology: Hierarchical Modeling in Ecology
Hi Morgan,

Thanks for your insightful comments. You've raised a valid point, and perhaps up until now I haven't devoted enough thought to the bias that may be introduced by the temporal structure of my data. To clarify, I'm working with South African bird atlas data. Each site is a 5' x 5' grid cell (pentad) and surveys have been conducted since 2007. Each atlas card that is submitted represents a repeat survey of a pentad. The pentads with a very large number of repeat surveys are situated near populated areas, while pentads with low sample sizes are more remote. As it stands, I have extracted a 3-year temporal subset to work with.

In theory I agree that increasing the level of standardisation will result in a more robust analytical approach, but this comes with a number of trade-offs. Sample size in particular can obviously be affected. In one sense it may actually be easier to deal with the 'hyper-visited' sites, because restricting the temporal window would still leave a sufficient number of repeat samples. How do we deal with the converse situation, though? Let's say, for instance, a pentad has a total of 3 repeat visits over the 3 years (one per year). Data from these sites may not meet the closure assumption, so we may have to drop them from the analysis. Given that we are trying to calculate species richness over a large geographic area, this could result in a substantial loss of coverage.

I've read your 2013 Ecology and 2012 Global Change Biology papers and see you were able to make use of data collected over multi-year periods because your particular methods were robust to violations of closure. But I'm unsure how closure relates to the fairly large spatial scale of my sampling units, as well as to the temporal window...

I would be interested to know your thoughts and how you would approach this. 

Regards,

Dominic



Morgan Tingley
Oct 11, 2017, 9:54:40 AM
to Dominic Henry, hmecology: Hierarchical Modeling in Ecology
Hi Dominic,

I'm sure others may have opinions on this, and the thread is now veering from RAM usage to the design of occupancy models/analyses, but I assume that's OK.

So this is obviously a tricky situation with no easy answer. In general, I'm not in love with using occupancy-type models on Atlas data, because your detectability depends so much on where you surveyed within a grid cell (which is a challenging observation process to model versus, e.g., weather conditions, observer, etc.). Quantification of effort seems key, both in terms of duration and extent. Here in Connecticut we are just starting a 2nd state Bird Atlas, however, so these challenges will soon be my challenges. Maybe you'll have figured it out for me?

As for the variable nature of surveyor effort across space and time with an Atlas (typical of most cit-sci data), it's important to recognize the trade-offs. I don't recommend throwing away data from sparsely-visited cells, but I do believe that throwing away data from heavily-visited cells is often necessary if you're trying to fit an occupancy-type model (where repeat visits are the sole method of fitting detectability).

I'll elaborate on my previous email. While the hyper-visited sites do give you really good resolution, it's still critical to remember that unbalanced re-visit designs can do unexpected and undesired things in occupancy models. Think of it like a species accumulation curve: after 651 surveys, you can be pretty confident of which species are present and which are absent. If you have 651 surveys per cell, then you might not even need occupancy modeling at all (in that cell), because absence is nearly perfectly known. The model recognizes this: if the probability of detection of a species on a survey is p = 0.3, then if you survey a grid cell only once, the probability of a false absence is psi * 0.7; but if you survey 651 times without a single detection, then the probability of a false absence is psi * (0.7)^651, which is essentially zero. But now imagine that there are normal/typical closure violations in your data (e.g., birds fly around), and on one survey out of 651, an observer surveying near the border of a grid cell saw a new species from outside the cell fly overhead. Now you have a species that is seen only once out of 651 times. The likelihood of that capture history is psi * p * (1-p)^650. This capture history is only 'likely' if p is close to zero. Because you're raising (1-p) to the 650th power, it also heavily skews the overall likelihood calculation. In effect, the likelihood for this single cell will unbalance your overall likelihood, even if you have 100 other grid cells that are each visited 3 times. If you have no covariates on p, then the single cell will force p to approximately 0 for the whole model, which will leave psi poorly constrained (and veering toward 1). If you do have covariates on p, then they are likely to be over-fit. Because this is a problem with any unbalanced visit design, I suggested sub-sampling heavily-visited sites to improve overall model fit (which will also normalize the effects of closure violations across the model). I have not done the simulations necessary to tell you how 'unbalanced' is 'too unbalanced', but I imagine one could look at the ratio of the median number of visits to the max number of visits and come up with some sort of guidance (to anyone reading this: please feel free to borrow this idea and do it!). I also admit that this explanation is based solely on my experience/intuition, and that this is probably a well-known modeling principle that I'm just blanking on.
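To put rough numbers on that intuition (a quick back-of-the-envelope check in R, with psi = 0.8 chosen arbitrarily):

psi <- 0.8
p <- c(0.001, 0.01, 0.05, 0.1)
psi * p * (1 - p)^650   # likelihood of exactly 1 detection in 651 visits
# roughly 4.2e-04, 1.2e-05, 1.3e-16, 1.5e-31: only p near zero is plausible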

- Morgan



Dominic Henry
Oct 12, 2017, 4:56:53 AM
to Morgan Tingley, Maxwell Joseph, hmecology: Hierarchical Modeling in Ecology
Hi Morgan, 

Thanks for taking the time to explain this further. I follow what you are saying and now understand a bit more about the potential consequences of an unbalanced design. Looking at the atlas cards in more detail revealed that there are actually many species that were recorded only once over hundreds of surveys. In this case, randomly selecting a subset of the data is probably a good idea. I would be interested to compare the difference in model outputs between the two approaches, so if I have the time I think I'll run the occupancy models on both the full and the subset data.

On the technical side of things, following Maxwell's suggestion, I restructured the model and data format to remove the redundant cell entries. Doing so resulted in a 96% decrease in the number of matrix and vector elements! I'm happy to report that the model now runs within the limits of available RAM. Thanks for the help, Maxwell.

Cheers,

Dominic

Karine Princé
Oct 12, 2017, 2:51:02 PM
to Dominic Henry, Maxwell Joseph, Morgan Tingley, hmecology: Hierarchical Modeling in Ecology
Hi all,

I apologize for joining this discussion quite late. I think Morgan pointed out (and explained well) a very important aspect that needs to be considered first in Dominic's case.

Nevertheless, regarding the computational issues many of us encounter when running occupancy models with JAGS (Marc also mentioned this earlier in the discussion), I thought I could bring some new elements for further discussion, and maybe further developments (or critiques, if some of you have already tried it). Recently, there have been several attempts (successful, it seems) by Olivier Gimenez to fit dynamic occupancy models with ADMB and, more recently, with TMB (check out his work here). The results sound promising!
I plan on chatting with Olivier about the details and about a way to apply it to my own models, but I was wondering if anyone has heard of or tried other similar attempts or developments, and how difficult it would be to jump from JAGS to TMB (the coding seems similar to C++).

~cheers
Karine

----------------------------------------------------
Karine Princé, Ph.D.
Postdoctoral Research Associate


Kery Marc
Oct 12, 2017, 3:48:02 PM
to Karine Princé, Dominic Henry, Maxwell Joseph, Morgan Tingley, hmecology: Hierarchical Modeling in Ecology
Hi Karine,

Here is my (relatively) short answer about ADMB/TMB.

I gather that both are wonderful pieces of software in the hands of a statistician. Presumably they are also wonderful for biologists IF you own code that has been written exactly for your modeling problem. However, if you as a biologist need to adapt the model to some not-so-standard situation (which is very often the case), then you will almost certainly be lost with ADMB/TMB unless you can cajole some statistician into writing the code for you.

The great advantage of software that uses more or less the original BUGS language (i.e., WinBUGS, OpenBUGS, JAGS and apparently also NIMBLE, but not Stan) is that YOU as a biologist can make a host of extensions to your model yourself, without having to run to a statistician for help. The price to pay, sometimes, is that you must wait longer for the analysis to run.

Best regards  --- Marc




Dusit Ngoprasert
Oct 12, 2017, 4:08:08 PM
to hmec...@googlegroups.com

I agree with you, Marc!



-- 
..............................................
ดุสิต งอประเสริฐ
Dusit Ngoprasert

Conservation Ecology Program
King Mongkut's University of Technology Thonburi
E-mail: dusi...@kmutt.ac.th
Tel. +66 2 470 7571
Fax. +66 2 452 3455
http://cons-ecol-kmutt.weebly.com/