R restarts after many iterations of Custom sampler

已查看 71 次
跳至第一个未读帖子

dirkdouw...@gmail.com

未读,
2021年8月4日 12:23:002021/8/4
收件人 nimble-users
Hi Everyone,

I am having an issue where R restarts somewhere between 20k-60k iterations when using my custom sampler. I am guessing this means there is some sort of numerical issue that occurs rarely. I managed to narrow it down to the following code chunk,

        #lp0 <- model$r[loc]*(log(model$r[loc])-log(model$r[loc]+model$mup[loc,ct]))
        lp0 <- dnbinom(x=0, size=model$r[loc], prob=       (model$r[loc]/(model$r[loc]+model$mup[loc,ct])), log=TRUE)
        q2 <- 1/(1+exp(lf[zt,1]+log(p[1,1])-lf[zt,2]-log(p[2,1])-lp0))
        q[zt,1:2] <<- c(1-q2,q2)

Here lp0 is the log probability that a negative binomial is 0. Either with the commented or uncommented line here it breaks. I know this is what is breaking it as the following does work, 

       
        lp0 <-  -model$mup[loc,ct]
        q2 <- 1/(1+exp(lf[zt,1]+log(p[1,1])-lf[zt,2]-log(p[2,1])-lp0))
        q[zt,1:2] <<- c(1-q2,q2)

Now lp0 is the Poisson log probability of a 0 which is just minus the mean. This works but obviously does not give correct Posterior as I am using a negative binomial model. 

In the setup I define q as, 

    q <- matrix(nrow=numnodes,ncol=2)
    q[1,1:2] <- c(-.99,-.99)

So, what I think is happening is that by numerical error q is being assigned the wrong type and this crashes R. Maybe it is take log(0) or something? I honestly don't know enough about C++ to understand how to fix this properly.

Thanks!

Perry de Valpine

未读,
2021年8月4日 13:39:242021/8/4
收件人 dirkdouw...@gmail.com、nimble-users
Hi Dirk,

I'm not sure on first glance what might be going on.  If the index loc is ever out of range for model$r, that could be the problem.  I'm pointing out that one because that is the one that occurs in triggering code and not in the non-triggering code.  I don't think numerical issues (NaNs or Infs, e.g.) alone should cause a problem that crashes R.

Here are some debugging ideas, which also came up in a thread yesterday.

If you do 

nimbleOptions(buildInterfacesForCompiledNestedNimbleFunctions = TRUE)

then you can find and inspect your sampler more easily.  Look at all samplers by 

Cmcmc$samplerFunctions # replace "Cmcmc" with your compiled MCMC
 
Pick out your custom sampler (use the name argument when creating it to make finding it easier).

Then you can inspect variables such as

Cmcmc$samplerFunctions[[ <your index> ]]$q

You can send control to R for debugging purposes by creating a nimbleRcall.  e.g. before the lp0 line you could call "my_debug_fn(model$r, loc, model$mup, ct)" where my_debug_fn is an R function that inspects the inputs and looks for anything malformed or invalid.  Or you could have this function record an output log file of all values so that you get a record of the state of the variables that occurred when it crashed.   But it's hard when you can't reproduce exactly when the problem occurs.

It's worth saying that if it is a memory glitch (something over-writing where it shouldn't), these can be painful to debug because the problem might occur somewhere else and then only be triggered (as a crash) where you are looking.  Or it might occur in one iteration but not be triggered until a later iteration.  However it sounds like you have your finger on a line of code that, when changed, removes the bug, so that is an excellent lead.

I hope that helps.

Perry

--
You received this message because you are subscribed to the Google Groups "nimble-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nimble-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nimble-users/89122a27-b658-4132-aacb-548782468bean%40googlegroups.com.

dirkdouw...@gmail.com

未读,
2021年8月4日 18:01:152021/8/4
收件人 nimble-users
Thanks Perry!

I like your idea of writing to a text file. Unfortunately it is not really working for me as the sampler just runs at a snails pace when I try this and I don't get the error till the 20k iteration mark or so. I would have to run for an eternity.

A big issue I am facing is just how rare the error must be. The sampler is attached to 157 nodes so the error occurs in around 1/ 3,140,000 iterations of the sampler. This is why I am thinking it must be an numeric error or something, however, I tried adding a small amount to all log calculations and it still doesn't work. 

Is there any way to see where the MCMC breaks and just examine that iteration?

Perry de Valpine

未读,
2021年8月4日 19:11:142021/8/4
收件人 dirkdouw...@gmail.com、nimble-users
Hi Dirk,

Yes, you can potentially see where the MCMC breaks, but things are always harder to pin down when your bug manifests as a crash.  Here are some more ideas in case something is useful.

You can take control of the master MCMC function by copying and pasting buildMCMC from nimble's source code.  It's "just" a nimbleFunction and if you copy and paste it locally (and change the "name = 'MCMC'" argument), you can use your local copy in place of nimble's buildMCMC.  Then for example in the master loop at line 209 (default time=FALSE case) you can insert "print(i)" and you'll get a rolling output of iteration counts until the crash.  Sometimes crashes can suppress final output that has not yet been flushed, so it might not be perfect.  

If the error is reproducible, i.e. occurs on the same iteration every time (assuming you used set.seed to get the same RNG sequence), and you can find that iteration by the above method, you can arrange to let the sampler know when to go into a debugging mode.  For example, you could use the model to share information like this:

In the model, add a dummy node:
debugging ~ dbern(0.5) # This node is isolated, not connected to the rest of the model.  set debugging = 0 in inits

In the master MCMC loop in buildMCMC, say you have determined that the problem occurs during iteration 20345, so insert:
if(iter == 20345) model$debugging <<- 1 # Use the "debugging" nodes to share information with the samplers

In your custom sampler:
if(model$debugging) my_debug_fn(<inputs>)

where my_debug_fn is a nimbleRcall and <inputs> is whatever you want to inspect.

You can pause for inspection by putting a debugger (browser) on the R function called by the nimbleRcall.  Say that is called R_my_debug_fn
debug(R_my_debug_fn)

Then at iteration 20345, your sampler will call my_debug_fn which will call R_my_debug_fn which will put you in a browser.  You can then inspect the compiled model object and/or the compiled MCMC object.

This bug sounds really mysterious so I hope these strategies are some help.

-Perry

dirkdouw...@gmail.com

未读,
2021年8月13日 19:53:402021/8/13
收件人 nimble-users
Thanks for the suggestions Perry!

I managed to find the problem, it was in this part of the code,

        newp <- expit(logit(model$p1[c_nei,2])+add)
        loS <- loS+model$S[c_nei,2]*log(newp)+(1-model$S[c_nei,2])*log(1-newp)

Due to numerical inaccuracy newp would sometimes be 1. In that case the second line gives -Inf*0=NaN (the S variables are binary). This led to some NaN values being sampled which were used to index matrixes in the sampler which restarted R. To fix it I just replaced the above with,

        newp <- expit(logit(model$p1[c_nei,2])+add)
        if(model$S[c_nei,2]==1){
          loS <- loS +log(newp)
        }else{
          loS <-loS +log(1-newp)
        }

which works. To find the problem I wrote to log files like you suggested,

        if(is.na(loS)|is.nan(loS)){
          R_write_output(x=c(loS,loc,lf[1,2],oS,c_nei,add,model$p1[c_nei,2],newp,
                             model$S[c_nei,2],(1-model$S[c_nei,2])*log(1-newp),
                             model$S[c_nei,2]*log(newp)))
          R_write_output_broken(x=c(loS))
        }

here the first function wrote the variables in a text file and the second line stops the sampler but does not restart R. I only write the log file if a NaN is found which solves issues with having a million log files.  

Perry de Valpine

未读,
2021年8月16日 11:29:282021/8/16
收件人 dirkdouw...@gmail.com、nimble-users
Fantastic.  Glad to hear the debugging ideas helped you track down what was happening.

回复全部
回复作者
转发
0 个新帖子