Proper handling of "non-data nodes" in runCrossValidate( )


Andrew

Dec 8, 2020, 2:15:36 PM
to nimble-users
Hello,

runCrossValidate( ) is throwing me an error that I suspect results from missing data in the dependent variable of a dynamic multi-species occupancy model. In the dataset, only 5 of 6 possible replicate survey visits were conducted by technicians during the first season of data collection, whereas the second season contains data for all 6 replicate visits. Therefore, the missing 6th replicate contains NAs for all sites and species in the first season. The dependent variable, called Yc, is indexed by site, species, season, and replicate, respectively.

Specifically, runCrossValidate( ) is throwing this error: 
Error in FUN(X[[i]], ...) : 
  foldFunction is returning names of non-data nodes for fold 1
Calls: runCrossValidate -> lapply -> FUN

Here is my setup code for runCrossValidate( ), and the foldFunction is perhaps of greatest interest:
##Cross-validate model results
commNmix_FoldFunction <- function(i){
  foldNodes_i <- paste0('Yc[', i,', , , ]')  # will return 'Yc[1,,,]' for i = 1 (e.g., based on nsite)
  return(foldNodes_i)
}

##Define the loss function
#The function below computes the root mean squared error
RMSE_lossFunction <- function(simulatedDataValues, actualDataValues){
  dataLength <- length(simulatedDataValues) # simulatedDataValues is a vector
  SSE <- 0
  for(i in 1:dataLength){
    SSE <- SSE + (simulatedDataValues[i] - actualDataValues[i])^2
  }
  MSE <- SSE / dataLength
  RMSE <- sqrt(MSE)
  return(RMSE)
}
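(As an aside, the same RMSE can be computed without the explicit loop, since R vectorizes the arithmetic elementwise; this is a drop-in equivalent of the function above:)

```r
# Vectorized equivalent of the loop-based RMSE loss above
RMSE_lossFunction <- function(simulatedDataValues, actualDataValues){
  sqrt(mean((simulatedDataValues - actualDataValues)^2))
}
```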

###

To use runCrossValidate( ), I simply follow the example provided in the NIMBLE package's user manual in R (see https://www.rdocumentation.org/packages/nimble/versions/0.10.0/topics/runCrossValidate).

What might the "non-data nodes" error be referring to, and how can it possibly be fixed or worked around?

Best,
Andrew


Chris Paciorek

Dec 10, 2020, 11:32:31 AM
to Andrew, nimble-users
hi Andrew,

We don't consider missing observations to be data nodes (because they don't contribute to the likelihood and in an MCMC are treated as 'parameters' in the sense that they are sampled during the MCMC).
Furthermore, we can't use those nodes for cross-validation because the values are unknown. So you'd need to have your fold function avoid specifying any nodes that are missing data nodes as part of the fold.

-Chris


Andrew

Dec 10, 2020, 2:53:20 PM
to nimble-users
Thanks for responding, Chris. Your reply confirms my suspicion: the NA records do inhibit the cross-validation. Conceptually, I understand the workaround you recommend, but I must admit I do not know how to implement it in practice.

The NA records are stored in the dependent variable at Yc[1:14, 1:26, 1, 6], for 1-14 sites, 1-26 species, season 1, and the missing 6th replicate.

How could I adapt my foldFunction to skip over such records? Could if/else statements be used somehow?

commNmix_FoldFunction <- function(i){
  foldNodes_i <- paste0('Yc[', i, ', , , ]')  # will return 'Yc[1, , , ]' for i = 1 (e.g., based on nsite)
  ##foldNodes_i <- paste0('Yc[', i, ', , 1, 6]')  # I believe this would only exclude the i-th site's 6th replicate during season 1 per fold; how can you ignore multiple records during one season in a foldFunction?
  return(foldNodes_i)
}

Before receiving your response today, I had considered simply inserting 0's for NAs during the first season, but that then inserts invalid ecological information into the process. I had also considered cross-validating the model over 1-5 replicate records only, but that then leaves out quite a bit of information from the 6th replicate visits recorded during season 2. I am stumped.

Chris Paciorek

Dec 10, 2020, 6:21:59 PM
to Andrew, nimble-users
Yes, using an if statement to handle the special case should work, or any other R code that serves to omit the nodes that are not observed but would otherwise be included for any given value of "i". 
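For instance, a minimal sketch (using the dimensions mentioned in this thread: 26 species, 2 seasons, with NAs confined to replicate 6 of season 1) could simply spell out the observed index ranges so the missing replicate is never part of the fold:

```r
# Hypothetical sketch: enumerate only observed index ranges so the missing
# 6th replicate of season 1 is never included in the fold.
# Assumes Yc has dimensions [site, 26 species, 2 seasons, 6 replicates].
commNmix_FoldFunction <- function(i){
  foldNodes_i <- c(paste0('Yc[', i, ', 1:26, 1, 1:5]'),  # season 1: replicates 1-5 only
                   paste0('Yc[', i, ', 1:26, 2, 1:6]'))  # season 2: all 6 replicates
  return(foldNodes_i)
}
```

No if statement is strictly needed here because the missing data occupy a regular block of indices; an if would be the way to go if the pattern of missingness varied by site.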

As I'm looking at this, it appears there might be a use for having the fold function take an optional additional argument (via R's ... functionality) that could be used to determine which nodes are valid data nodes. I'll bring this up with the development team. 

Andrew

Dec 30, 2020, 1:54:38 PM
to nimble-users
Hi Chris,

Thank you for your previous suggestions. For the time being, I have been running cross-validation over the nodes containing data only (i.e. the first 5 [out of 6 total] replicate surveys during each season). This at least gives results. Do you also have an update on what the Development Team suggests for running cross-validation on models with datasets containing NAs in the dependent variable(s)?

On a different note, for a dynamic N-occupancy model (i.e. including sub-models for occupancy and abundance, with associated observation models y_occ ~ dbern(psi) and y_abund ~ dbin(N, p) for the dependent data), is runCrossValidate( ) able to generate performance scores from a loss function on two processes, both occupancy and abundance? Specifically, this would yield an RMSE for the occupancy process as well as one for the abundance process.

To do so, I have been experimenting with the following code:
#Cross-validate model results
commNocc_FoldFunction <- function(i){
  foldNodes_i <- c(paste0('Yc_abund[', i, ', , , ]'), paste0('Yc_occ[', i, ', , , ]'))  # will return 'Yc_abund[1, , , ]' and 'Yc_occ[1, , , ]' for i = 1 (e.g., based on nsite)
  ##foldNodes_i <- c(paste0('Yc_occ[', i, ', , , ]'), paste0('Yc_abund[', i, ', , , ]'))  # alternative order, since the measurement-error process for Yc_abund appears first in modelCode
  return(foldNodes_i)
}

# The function below computes the root mean squared error
RMSE_lossFunction <- function(simulatedDataValues, actualDataValues){
  dataLength <- length(simulatedDataValues) # simulatedDataValues is a vector
  SSE <- 0
  for(i in 1:dataLength){
    SSE <- SSE + (simulatedDataValues[i] - actualDataValues[i])^2
  }
  MSE <- SSE / dataLength
  RMSE <- sqrt(MSE)
  return(RMSE)
}

######### Here is where I exclude the 6th replicate surveys######### 
win.data$Yc_abund <- win.data$Yc_abund[,,,1:5]
win.data$Yc_occ <- win.data$Yc_occ[,,,1:5]
const$nrep <- dim(win.data$Yc_abund)[[4]]
#####################################################

model <- nimbleModel(code = modelCode, name = "modelCode",
                     constants = const, data = win.data,
                     inits = init)
Cmodel <- compileNimble(model)
modelConf <- configureMCMC(model)

cvSamples <- runCrossValidate(MCMCconfiguration = modelConf,
                               k = 5, foldFunction = commNocc_FoldFunction,
                               lossFunction = RMSE_lossFunction,
                               MCMCcontrol = list(niter = ni, nburnin = nb))

I have also received some puzzling errors from runCrossValidate( ), errors suggesting missing quotations, parentheses, or curly braces, such as:
Error: unexpected '}' in:
"  SSE <-
  }"

or

Error: object 'SSE' not found

or

Error: unexpected '}' in:
"  dataLength <- length(simulatedDataValues) # simulatedDataValues is a vector
  SSE <-  }"

or 

Error: unexpected end of input

###

Do these error messages indicate that runCrossValidate( ) cannot handle models with more than one dependent variable, or is my syntax for the foldFunction incorrect?

Chris Paciorek

Dec 31, 2020, 11:35:07 AM
to Andrew, nimble-users
hi Andrew,

Regarding handling of NAs, I'd need more information on what your goal is in order to be able to respond usefully. Given one can't use an NA to calculate a loss, it's not clear to me what it means to include values with NAs in a held-out set. Do you just want nimble to be able to handle this for convenience - i.e., that nodes with NAs could be included in a fold but would be ignored? If so, that's something we could probably do quite easily, though the development team would have to discuss if it's something we want as part of the user interface.

Regarding your syntax errors, those indeed look odd. Would you be able to share a minimal reproducible example (off-list if you prefer) so I can look in more detail?

-chris

Andrew

Dec 31, 2020, 12:59:31 PM
to nimble-users
Hi Chris,

Thanks for the feedback.

Regarding the NAs, yeah, it's mostly about adding user convenience to runCrossValidate( ), especially in light of observational studies. You're absolutely right that loss cannot be computed from non-data; that certainly wasn't my ask, and I apologize if my previous messages were somewhat unclear.

Following your suggestions from a couple of weeks ago, I attempted to code fold functions that would "ignore" or "skip over" NAs in the dependent variables, but each of those attempts was unsuccessful, likely because I do not fully understand NIMBLE's under-the-hood operations for runCrossValidate( ). For example, it wasn't clear to me how to exclude NAs either in my fold function or in the simulatedDataValues and actualDataValues within my loss function. Instead, I took a simpler approach and left out the NAs entirely by removing the 6th replicates in both seasons a priori, which limits model validation to an extent. Recall that NAs only occurred in my dependent variables during the (missing) 6th replicate surveys in the first season.

About the syntax errors, I really don't know what went wrong, but the problem disappeared after a re-run. I made no changes to the input code (e.g., I did not see any comma, parenthesis, or bracket problems), and no errors were ultimately thrown. I have no idea what occurred the first time. If you'd still like a reproducible example, I would be happy to forward one off-list.

Best,
Andrew

Chris Paciorek

Jan 4, 2021, 12:17:43 PM
to Andrew, nimble-users
Hi Andrew,

You shouldn't need any knowledge of nimble's under-the-hood operations for runCrossValidate to set up the foldFunction and lossFunction inputs; our hope is that the help information for runCrossValidate() is sufficient. Namely, the foldFunction should return a character vector of (data) node names, and the lossFunction should take in two vectors and compute the loss.

If you'd like us to delve into this further, it would be helpful to have a reproducible example (as simple as possible) of where things go wrong for you.

-Chris

Chris Paciorek

Jan 4, 2021, 12:25:53 PM
to Andrew, nimble-users
hi Andrew,

Regarding the question of multiple RMSE values, I don't think it's directly possible to return multiple RMSE values for each fold as we've set things up to expect that the lossFunction returns a scalar. However, if within each fold, the first "m" values in each vector input to the lossFunction correspond to counts and the remaining values correspond to presence/absence and "m" is the same for each fold, you could presumably have the lossFunction set up to combine the two component losses and return the overall loss for each fold. Does that make sense?
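For example, here is a minimal sketch under the assumption that the held-out values arrive as two equal halves, abundance first, then occupancy (the function name and the half-and-half split are assumptions based on the equal-dimension arrays in this thread, not something runCrossValidate imposes):

```r
# Hypothetical combined loss: first half of the values assumed to be abundance,
# second half occupancy; returns a single scalar as runCrossValidate expects.
combined_lossFunction <- function(simulatedDataValues, actualDataValues){
  m <- length(simulatedDataValues) %/% 2   # assumption: equal halves, abundance first
  rmse <- function(sim, act) sqrt(mean((sim - act)^2))
  abundRMSE <- rmse(simulatedDataValues[1:m], actualDataValues[1:m])
  occRMSE   <- rmse(simulatedDataValues[-(1:m)], actualDataValues[-(1:m)])
  return(abundRMSE + occRMSE)              # combine the two component losses
}
```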

And if you really want them separate you could presumably run runCrossValidate()  twice, the first time where the hold-out is over occupancy data nodes and the second time where the hold-out is over abundance data nodes. That would involve two separate MCMC runs, one each time runCrossValidate is called. 

-chris

Andrew

Jan 4, 2021, 12:48:07 PM
to nimble-users
Thank you for these additional recommendations, Chris. I don't think there's a need for the development team to dive further into the first issue (i.e. working fold and loss functions around non-data nodes). For simplicity, I will compute loss over the first 5 (rather than all 6) replicates of the input datasets.

Replying to your second message, the first suggestion is a bit challenging to follow. In the current model version, my occupancy inputs (i.e. with their indices) belong to an array called Yc_occ[ site, species, season, replicate ]; whereas abundance inputs (i.e. with their indices) belong to an array called Yc_abund[ site, species, season, replicate ]. Their dimensions are thus equivalent: 14 sites, 26 species, 2 seasons, and 5 replicates (i.e. when ignoring the 6th, which is missing during season 1). Would those input dimensions work with your first suggestion?

If not, then I could certainly run runCrossValidate( ) twice, along with the requisite MCMCs. In that regard, is it permissible to hold out a portion of the occupancy inputs without simultaneously holding out the corresponding abundance inputs? In other words, should the following work to retrieve separate RMSEs:
#Cross-validate model results
occupancy_FoldFunction <- function(i){
  foldNodes_i <- paste0('Yc_occ[', i, ', , , ]')  # will return 'Yc_occ[1, , , ]' for i = 1 (e.g., based on nsite)
  return(foldNodes_i)
}

abundance_FoldFunction <- function(i){
  foldNodes_i <- paste0('Yc_abund[', i, ', , , ]')  # will return 'Yc_abund[1, , , ]' for i = 1 (e.g., based on nsite)
  return(foldNodes_i)
}

# The function below computes the root mean squared error
RMSE_lossFunction <- function(simulatedDataValues, actualDataValues){
  dataLength <- length(simulatedDataValues) # simulatedDataValues is a vector
  SSE <- 0
  for(i in 1:dataLength){
    SSE <- SSE + (simulatedDataValues[i] - actualDataValues[i])^2
  }
  MSE <- SSE / dataLength
  RMSE <- sqrt(MSE)
  return(RMSE)
}

######### Here is where I exclude the 6th replicate surveys######### 
win.data$Yc_abund <- win.data$Yc_abund[,,,1:5]
win.data$Yc_occ <- win.data$Yc_occ[,,,1:5]
const$nrep <- dim(win.data$Yc_abund)[[4]]
#####################################################

model <- nimbleModel(code = modelCode, name = "modelCode",
                     constants = const, data = win.data,
                     inits = init)
Cmodel <- compileNimble(model)
modelConf <- configureMCMC(model)

cvSamples_occ  <- runCrossValidate(MCMCconfiguration = modelConf,
                               k = 5, foldFunction =  occupancy_FoldFunction   ,
                               lossFunction = RMSE_lossFunction,
                               MCMCcontrol = list(niter = ni, nburnin = nb))

cvSamples_abund <- runCrossValidate(MCMCconfiguration = modelConf,
                               k = 5, foldFunction =  abundance_FoldFunction   ,
                               lossFunction = RMSE_lossFunction,
                               MCMCcontrol = list(niter = ni, nburnin = nb))

###

Based on your second suggestion, it seems as though holding out a portion of one of the dependent inputs (e.g., occupancy) and not the other (e.g., abundance) would not generate MCMC estimation problems for runCrossValidate( ).

Chris Paciorek

Jan 5, 2021, 12:01:31 PM
to Andrew, nimble-users
hi Andrew,

The fold function says which nodes to hold out for a given fold and the values of those nodes are concatenated and passed along to the loss function. The key is making sure that your loss function knows which of the elements it receives are for occupancy and which for abundance so it can calculate the loss appropriately for those two pieces and combine the two values together. So based on what nodes your fold function is returning, you need to know how many of the values correspond to occupancy and how many to abundance and which of the values are for occupancy and which for abundance. Given the symmetry you seem to have, it might be as simple as the first half of the nodes are occupancy and the second half are abundance and so the loss function does one thing with the first half of the values and something else with the second half of the values. 

Regarding my suggestion to run runCrossValidate twice, I may have been a bit hasty with that suggestion. If you hold out only, say, occupancy, then the corresponding abundance is not held out and does inform the parameters. That may not be what you want. 

-chris

Andrew

Jan 5, 2021, 12:35:23 PM
to nimble-users
Thanks, Chris. That makes a lot more sense to me now; I was not previously aware that runCrossValidate( ) concatenates data from multiple nodes when they're held out together by a fold function. My datasets are indeed very congruent, so I will begin testing whether I can treat the first and second halves differently within my loss function. My first thought is to specify both hold-outs in the fold function (i.e. occupancy and abundance, in that order) and then split dataLength, simulatedDataValues, and actualDataValues in half, computing the occupancy and abundance RMSEs, respectively. Please let me know if that doesn't accurately follow your suggestion.

On your last note, that was my concern as well. I suspected that holding out one without the other would be incorrect, so it's very helpful to hear your thoughts and have my intuition validated.

Andrew

Andrew D

Jan 5, 2021, 2:32:01 PM
to Chris Paciorek, nimble-users
Great; that also sounds like an excellent approach. I'll give that a shot. Thanks again, Chris. I really appreciate your time discussing this problem with me.

On Tue, Jan 5, 2021 at 2:27 PM Chris Paciorek <paci...@stat.berkeley.edu> wrote:
Yes, what you say follows my thoughts.

One other thought on a possibly useful hack here. If for some reason you wanted to hold out both occupancy and abundance together but report the loss separately for each, your loss function could ignore either the occupancy or the abundance values, returning the loss for just one of them. You could then run runCrossValidate twice: once with a loss function that computes the occupancy+abundance loss, and once with a loss function that computes, say, only the occupancy loss. Then you'd have the total loss and the two component pieces in hand. Note that the fold function would still be set up to hold out the relevant occupancy and abundance values together. I haven't looked at how the random number seeds are being set in all of this - you'd want the two calls to runCrossValidate to produce the exact same MCMC output.
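A rough sketch of that two-pass approach might look like the following, where joint_FoldFunction, total_lossFunction, and occOnly_lossFunction are hypothetical placeholders for functions you would define, and where the set.seed() calls are one attempt to make the two MCMC runs identical (whether that is sufficient given runCrossValidate's internal seed handling would need checking):

```r
# Two passes over the same folds: identical hold-outs, different loss functions.
set.seed(2021)   # attempt to match RNG state across the two calls
cv_total <- runCrossValidate(MCMCconfiguration = modelConf, k = 5,
                             foldFunction = joint_FoldFunction,
                             lossFunction = total_lossFunction,
                             MCMCcontrol = list(niter = ni, nburnin = nb))

set.seed(2021)   # reset so the second pass sees the same draws
cv_occ <- runCrossValidate(MCMCconfiguration = modelConf, k = 5,
                           foldFunction = joint_FoldFunction,
                           lossFunction = occOnly_lossFunction,
                           MCMCcontrol = list(niter = ni, nburnin = nb))
# If the total loss is the sum of the two components, the abundance piece
# per fold is then the difference between the two results.
```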

-chris



--
Andrew J. Dennhardt
Graduate Research Assistant
Michigan State University
480 Wilson Road
Room 13, Natural Resources Bldg.
East Lansing, MI 48824
Phone: 517-355-3030
Fax: 517-432-1699
Mobile: 309-738-3228
Email: denn...@msu.edu