Problems with MSOM Spatial predictions


Thiago Bicudo

Aug 1, 2024, 11:28:45 AM
to spOccupancy and spAbundance users
Hello Jeff, thank you so much for this group. I'm trying to run predictions using a spatial MSOM output, but I keep getting the error "vector memory exhausted (limit reached?)".

I'm trying to run the predictions over a hex grid (55,973 hexagons) covering the study area. The model was fitted using spMsPGOcc with 5 neighbors. My data have 97 bird species and 77 sampled sites, with one month of sampling.

I tried setting n.omp.threads to 10 (I have 14 cores and 36 GB of RAM available on my notebook), but nothing changed. Do you have any idea how to proceed?

Thank you so much

Thiago Bicudo, PhD Ecology
Quantitative Ecologist



Jeffrey Doser

Aug 1, 2024, 12:59:42 PM
to Thiago Bicudo, spOccupancy and spAbundance users
Hi Thiago,

Thanks for the question! The predict() function output consists of, for each species at each prediction location and each MCMC iteration, the predicted spatial random effects (w.0.samples), the predicted occurrence probabilities (psi.0.samples), and the predicted occurrence states (z.0.samples). Given you're predicting across 55,973 locations, that object will be very large and, as you observed, will likely exhaust the RAM on your machine.
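
For a sense of scale, here is a rough back-of-envelope calculation (a sketch only; it assumes 3,000 saved MCMC samples and ignores R's additional overhead):

n.samples <- 3000   # assumed number of saved posterior samples
n.sp <- 97          # species
n.cells <- 55973    # prediction locations
bytes.per.array <- n.samples * n.sp * n.cells * 8  # doubles are 8 bytes each
bytes.per.array / 1024^3  # ~121 GB for a single samples x species x cells array
# predict() returns several arrays of this size (w.0.samples, psi.0.samples,
# z.0.samples), so the full prediction object is far beyond 36 GB of RAM.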

The amount of RAM needed depends on (1) the number of species, (2) the number of prediction locations, and (3) the number of saved MCMC iterations. From that perspective, one simple way to reduce the size of the objects is to not save an overly large number of MCMC iterations (i.e., save just enough that everything has converged); the more samples you save, the larger the objects will be (both for the model fit object and for prediction). However, 55,973 locations is a lot regardless of how many samples you save, so the way to get around the RAM errors is to do the prediction in small groups of cells, one group at a time, and extract only what you need from each prediction (e.g., the medians/means and credible intervals, rather than all MCMC samples). Here is an outline of how to do this:

  1. Create an empty array to hold the median occupancy probability for each of the 55,973 cells and all 97 species.
  2. Split your 55,973 grid cells into something like 200 groups of 280 cells each (or reduce the number of cells in each group further in order to get under the RAM limits on your machine).
  3. Use the predict() function to predict at each of the 200 groups individually.
  4. From the resulting predict() object, extract the posterior median occupancy for each of those cells (and credible intervals, sds, and other things you want) and save them into their corresponding locations in the array you created in step 1. If you save more than just the median, you'll also want to create separate objects to store those components (or add an additional dimension to the array).
  5. Remove the prediction object from the workspace and clear out the memory to decrease RAM constraints.
  6. Do steps 3-5 for all cells.

I've attached a script that I used to predict across a large number of cells in a paper I'm working on. It won't run for you since you don't have the necessary data/objects, and there will be some slight differences because it uses a different model (it was a multi-season model, so I save an occupancy probability value for each site and year, whereas you would do it for each site and species), but take a look at the code in the "Do the prediction" section. There you'll see code that does exactly what I lay out above (notice I also save a couple of quantiles to form a credible interval, in addition to the median). Let me know if you have any questions.

Cheers,

Jeff



--
Jeffrey W. Doser, Ph.D.
Assistant Professor
Department of Forestry and Environmental Resources
North Carolina State University
Pronouns: he/him/his
Attachment: prediction-split-example.R

Jeffrey Doser

Aug 2, 2024, 5:32:23 AM
to Thiago Bicudo, spOccupancy and spAbundance users
Hi Thiago,

Thanks for clarifying what you're doing. There are two differences between your code and mine. First, you are saving the full prediction object from each chunk in the "predictions" list. This means that all MCMC samples for all components of the prediction object (spatial random effect, occurrence probability, binary occurrence) are saved. In my code, I only saved the median of the MCMC samples plus a few quantiles to form credible intervals, which leads to a large difference in the size of the resulting object. Because you save the full prediction object each time, you are not reducing the overall size of the object that needs to be stored; you are just building it bit by bit. You likely will not be able to save the full posterior distributions for all cells, so you'll need to choose what to save. Second, you are not pre-allocating the "predictions" list; instead it grows as you do a prediction for each group of cells. This takes up more memory than pre-allocating an array before the loop. In my code, I created an array at the beginning and then filled it up as I did the prediction.

So, if you wanted to save the 2.5% quantile, median, and 97.5% quantile of occurrence probability for each species, you could modify your code to look something like this:

chunk_size <- 200 # Number of hexagons
num_chunks <- ceiling(nrow(hex_cov) / chunk_size)
n.omp.threads <- 5

# Here I'm showing how you can store the 2.5% quantile, median, and 97.5% quantile for occurrence prob for each species.
# You can make more arrays if you want to store other things.
psi.pred.quantiles <- array(NA, dim = c(3, N, nrow(hex_cov)))
for (i in 1:num_chunks) {
  start_row <- (i - 1) * chunk_size + 1
  end_row <- min(i * chunk_size, nrow(hex_cov))
  hex_cov_chunk <- hex_cov[start_row:end_row, ]
  coords_chunk <- coords.0[start_row:end_row, ]
 
  pred_chunk <- predict(
    out_bestmodel,
    hex_cov_chunk,
    coords_chunk,
    n.omp.threads = n.omp.threads
  )
  # Store the quantiles in the psi.pred.quantiles object. Notice I have removed the list you were creating.
  psi.pred.quantiles[, , start_row:end_row] <- apply(pred_chunk$psi.0.samples, c(2, 3),
                                                     quantile, c(0.025, 0.5, 0.975))
 
  # Clear memory
  rm(pred_chunk)
  gc()
}


Hope that helps!

Jeff

On Thu, Aug 1, 2024 at 2:31 PM Thiago Bicudo <bicu...@gmail.com> wrote:
Hello Jeff, thanks for the reply. It is indeed a very large area. Because we wanted to use the results of our ecological models to predict species occurrence across the entire area of interest, we generated a grid of hexagons, each with an internal area of 3.14 hectares, across the entire study area.

Comparing your code with mine, I think we are taking the same approach. I used chunks to split the data and perform the predictions:
chunk_size <- 200 # Number of hexagons
num_chunks <- ceiling(nrow(hex_cov) / chunk_size)
n.omp.threads <- 5

predictions <- list()
for (i in 1:num_chunks) {
  start_row <- (i - 1) * chunk_size + 1
  end_row <- min(i * chunk_size, nrow(hex_cov))
  hex_cov_chunk <- hex_cov[start_row:end_row, ]
  coords_chunk <- coords.0[start_row:end_row, ]
 
  pred_chunk <- predict(
    out_bestmodel,
    hex_cov_chunk,
    coords_chunk,
    n.omp.threads = n.omp.threads
  )
 
  predictions[[i]] <- pred_chunk
 
  # Clear memory
  rm(pred_chunk)
  gc()
}

# Combine predictions from all chunks
out_pred <- do.call(rbind, predictions)


But this is not working yet, and I don't know why. I did use a large number of samples to run the models:
Samples per Chain: 125000
Burn-in: 100000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 3000


I will keep trying different approaches here. If I get something working, I'll keep you informed.

Thank you so much.


Thiago Bicudo

Aug 15, 2024, 3:27:41 PM
to spOccupancy and spAbundance users

Hello Jeff,

I hope you're doing well. I wanted to provide you with an update and seek your help once again.

I ran the code following the same approach as yours, but it seems that the prediction maps are not being constructed correctly. We believe this issue is related to how we calculated the predictions using chunks.

Since this is a model with a spatial component, it seems that each time a chunk is processed and the previous one is cleared, the spatial component might be restarting, possibly at random, within that new chunk. This is causing a wave-like effect in the prediction maps, where the waves correspond to the chunks. Because of the aggregation of points in certain regions, the spatial part of the model appears to be overly dominant compared to the environmental part.

I'm attaching one of the maps produced; the code used to generate the predictions is at the end of this e-mail.

I hope there’s an alternative solution for this, and I’m curious to hear your thoughts on the matter.

Once again, thank you so much for your help.

Best regards,
Thiago

CODE:
chunk_size <- 2000 # change according to computer processing power

num_chunks <- ceiling(nrow(hex_cov) / chunk_size)
n.omp.threads <- 13


# Psi array
psi.pred.mean <- array(NA,
                       dim = c(nrow(hex_cov), length(spp_namesID$spCodes)))

psi.pred.sd <- array(NA,
                     dim = c(nrow(hex_cov), length(spp_namesID$spCodes)))

psi.pred.quantiles <- array(NA,
                            dim = c(nrow(hex_cov),
                                    length(spp_namesID$spCodes),
                                    2))

# Richness array
z.pred.mean <- array(NA,
                     dim = c(nrow(hex_cov), 1))

z.pred.sd <- array(NA,
                   dim = c(nrow(hex_cov), 1))

z.pred.quantiles <- array(NA,
                          dim = c(nrow(hex_cov),
                                  2))


for (i in 1:num_chunks) {
  start_row <- (i - 1) * chunk_size + 1
  end_row <- min(i * chunk_size, nrow(hex_cov))
 
  # Extract chunk of data

  hex_cov_chunk <- hex_cov[start_row:end_row, ]
  coords_chunk <- coords.0[start_row:end_row, ]
 
  # Predict for the chunk

  pred_chunk <- predict(
    out_bestmodel,
    hex_cov_chunk,
    coords_chunk, # Only when working with spatial models
    n.omp.threads = n.omp.threads
  )
 
  # Psi mean
  psi.pred.mean[start_row:end_row, ] <- apply(pred_chunk$psi.0.samples,
                                              c(2, 3),
                                              mean)
 
  # Psi sd
  psi.pred.sd[start_row:end_row, ] <- apply(pred_chunk$psi.0.samples,
                                            c(2, 3),
                                            sd)
 
  # Psi quantiles
  psi.pred.quantiles[start_row:end_row, , ] <- apply(pred_chunk$psi.0.samples,
                                                     c(2, 3),
                                                     quantile,
                                                     c(0.025, 0.975))
 
  # Z mean
  z.pred.mean[start_row:end_row, ] <- apply(pred_chunk$z.0.samples,
                                            c(1, 3),
                                            sum) %>%
    apply(., 2, mean)
 
  # Z sd
  z.pred.sd[start_row:end_row, ] <- apply(pred_chunk$z.0.samples,
                                          c(1, 3),
                                          sum) %>%
    apply(., 2, sd)
 
  # Z quantiles
  z.pred.quantiles[start_row:end_row, ] <- apply(pred_chunk$z.0.samples,
                                                 c(1, 3),
                                                 sum) %>%
    apply(., 2, quantile, c(0.025, 0.975))
}

# Add species names to columns of psi arrays
colnames(psi.pred.mean) <- spp_namesID$spCodes
colnames(psi.pred.sd) <- spp_namesID$spCodes
dimnames(psi.pred.quantiles) <- list(NULL, spp_namesID$spCodes, c("2.5%", "97.5%"))
Attachment: Amazona autumnalis_white.png

Jeffrey Doser

Aug 18, 2024, 5:56:54 AM
to Thiago Bicudo, spOccupancy and spAbundance users
Hi Thiago, 

Apologies for my delay. The pattern you're seeing in the prediction output is indeed a bit weird, but it should not be a result of predicting in different chunks. When doing prediction with a spatial model in spOccupancy/spAbundance, the prediction of the spatial random effect (and the resulting occupancy probability) at an individual location depends only on (1) that location itself and (2) its proximity to the locations used to fit the model (and the values of the spatial random effects at those data locations). The estimate at an individual prediction location does not depend on which other locations you are predicting at. You can test this quickly by predicting at the first location by itself, then predicting at the first chunk of locations all together; the values you get for the first location should be essentially identical (they will differ slightly due to Monte Carlo error).
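
A minimal sketch of that check might look like the following (object names are the ones from your code; it assumes hex_cov and coords.0 are row-aligned as in your chunked prediction):

# Predict at the first location by itself, then as part of the first chunk of 200.
pred.one <- predict(out_bestmodel, hex_cov[1, , drop = FALSE],
                    coords.0[1, , drop = FALSE])
pred.chunk <- predict(out_bestmodel, hex_cov[1:200, ], coords.0[1:200, ])
# Posterior mean occurrence probability for species 1 at location 1 from each run;
# these should agree up to Monte Carlo error.
mean(pred.one$psi.0.samples[, 1, 1])
mean(pred.chunk$psi.0.samples[, 1, 1])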

I do not see any obvious bug in the code you sent, but the bizarre pattern is something I've previously seen when accidentally messing up the prediction coordinates and/or covariates in some way (e.g., the coordinates and covariates in different orderings, the coordinate columns in the wrong order, or the prediction coordinates in a different coordinate system than the one used to fit the model). So I'd suggest taking a look at those few things, and if that doesn't work, if you share with me all the objects/code needed to run the prediction myself, I can dig into it a bit more. I'll also suggest making sure you have the latest version of the package installed from CRAN (you can just reinstall with install.packages('spOccupancy')). A very early version of the package back in 2022 had an error in prediction when setting n.omp.threads > 1. I don't believe that's what's going on here, but it would be good to eliminate it as an option.
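
A few quick checks along those lines could look like this (a sketch using your object names; I'm assuming the coordinates used to fit the model are stored in out_bestmodel$coords, so check str(out_bestmodel) if that component is named differently):

# Covariates and prediction coordinates should refer to the same hexagons, row for row.
stopifnot(nrow(hex_cov) == nrow(coords.0))
# The column order of the prediction coordinates (e.g., easting then northing) ...
head(coords.0)
# ... should match the coordinates used to fit the model.
head(out_bestmodel$coords)
# Both coordinate sets should also be in the same projection/units.
apply(coords.0, 2, range)
apply(out_bestmodel$coords, 2, range)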

Kind regards,

Jeff

Thiago Bicudo

Sep 16, 2024, 2:55:39 PM
to Jeffrey Doser, spOccupancy and spAbundance users
Hi Jeff, sorry for the long delay. 

Thank you so much for your help. One way I found to make it work was to increase the area of each hexagon (so there are far fewer prediction cells) and run the code without splitting into chunks. I don't know what could generate those patterns, but I still think they were due to the chunks. If we look at the attached map, which uses the same sequence I used for the chunks, we can see a pattern similar to what we found when plotting the predictions. I double-checked the prediction coordinates and covariates and found no errors.


Now I'm facing the same memory-exhaustion problem with a different project, when trying to run the posterior predictive checks. It's again a spatial model with the following parameters:
# 7.1 Setting model parameters ----------------------------------------------
batch.length  <-     25
n.batch       <-  10000
n.burn        <- 225000
n.thin        <-     25
n.chains      <-      3

n.report      <-   1500
n.omp.threads <-     13
n.neighbors   <-      5
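# (batch.length * n.batch = 25 * 10,000 = 250,000 samples per chain;
#  (250,000 - 225,000) / 25 = 1,000 retained per chain, x 3 chains = 3,000 posterior samples)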

The model has 95 sampling sites, with 143 bird species, and uses 5 environmental variables. 

Is there any way to avoid this error, or an alternative approach? Is it possible to run it in parallel?

Thank you so much for your help. 

Thiago Bicudo, PhD Ecology
Quantitative Ecologist

Attachment: image.png

Jeffrey Doser

Sep 20, 2024, 12:24:09 PM
to spOccupancy and spAbundance users
Hi Thiago, 

Apologies for the delay. Glad to hear you figured out a way to get the predictions. I'm not sure what would be causing the problem with predicting in chunks, as I have tested that out with some real and simulated data and don't find any problems. But if you're willing to send me all the code/data objects you had when you encountered the weird predictions, that would help me try to diagnose whether there's any bigger problem with the functions going on here.

As for memory problems with posterior predictive checks: that can happen when working with a large number of species like you have. There is no way to parallelize the ppcOcc() function. There are two options you could try; both are clunky, but they would allow you to do the checks with less memory:
  1. Only do the posterior predictive check for a subset of the total MCMC samples saved in the model object. This requires subsetting all the posterior samples in the model output object (e.g., beta.samples, alpha.samples) to a smaller number of MCMC samples (e.g., take a random sample of 500 of the 3,000 samples you saved). You would also have to change the "out$n.post" component of your model output object to the number of MCMC samples you want to use for the posterior predictive check. This would likely require creating a new list so you don't overwrite the existing model object (a rough sketch of this is shown after this list).
  2. Create a new list containing only the information needed for the ppcOcc() function to perform the posterior predictive check for one species at a time. I have included a script that does this using the sfMsPGOcc() function (the approach would be the same for any other multi-species model function).
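
As a rough sketch of option 1 (component names here follow the usual spOccupancy output structure, e.g., beta.samples and alpha.samples, and "out_bestmodel" is just a placeholder for your fitted model object; adjust to whatever components your object actually contains):

n.keep <- 500
out.sub <- out_bestmodel                  # work on a copy so the original fit is untouched
n.total <- nrow(out.sub$beta.samples)     # total saved samples across chains (3,000 in your case)
keep <- sort(sample(n.total, n.keep))
# Subset every posterior-samples component (samples are stored in the first dimension).
samp.names <- grep("\\.samples$", names(out.sub), value = TRUE)
for (nm in samp.names) {
  d <- dim(out.sub[[nm]])
  if (length(d) == 2) out.sub[[nm]] <- out.sub[[nm]][keep, , drop = FALSE]
  if (length(d) == 3) out.sub[[nm]] <- out.sub[[nm]][keep, , , drop = FALSE]
}
# Set n.post to the number of retained samples as described above; depending on how
# the chains are combined internally, n.chains may also need to be set to 1.
out.sub$n.post <- n.keep
out.sub$n.chains <- 1
ppc.sub <- ppcOcc(out.sub, fit.stat = 'freeman-tukey', group = 1)
summary(ppc.sub)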
Hope that helps!

Jeff
Attachment: example-ppc-by-species.R