convergence problem with a Gaussian Process Model


Enrico Ryunosuke Crema

Jul 6, 2021, 11:38:22 AM
to nimble-users
Dear All,
I'm attempting to fit a Gaussian process quantile regression with measurement error, and started by playing with a simpler version of the model fitted to some simulated data with known parameters. The model runs fairly quickly, but fails to converge even after 1 million iterations (with 500k burn-in). I then fitted the same model and parameters with Stan (via the rethinking package), and that worked fine. I know that for some processes HMC can be more efficient (and I'm very much looking forward to having this implemented in nimble!), but I am wondering if there is something glaringly obvious that I missed, as the trace plots clearly show disagreement between chains.
[attachment: traceplot_nimble.jpeg]

The attached script simulates my entire workflow, and the figures compare the posteriors of the Stan vs nimble models.
[attachments: nimble_posterior.jpeg, stan_posterior.jpeg, GP_param_prediction.jpeg]
I also copied and pasted the script of the main model below. Thanks in advance for any suggestions!

All Best,
Enrico

## Squared-exponential (GPL2) covariance function with a nugget term on the diagonal
cov_GPL2 <- nimbleFunction(
  run = function(dists = double(2), rhosq = double(0), etasq = double(0), sigmasq = double(0)) {
    returnType(double(2))
    n <- dim(dists)[1]
    result <- matrix(nrow = n, ncol = n, init = FALSE)
    deltaij <- matrix(nrow = n, ncol = n, init = TRUE)  # zero matrix
    diag(deltaij) <- 1                                  # identity, so the nugget applies only on the diagonal
    for(i in 1:n)
      for(j in 1:n)
        result[i, j] <- etasq * exp(-rhosq * dists[i, j]^2) + sigmasq * deltaij[i, j]
    return(result)
  })
Ccov_GPL2 <- compileNimble(cov_GPL2)

dispersalinference <- nimbleCode({
  for (i in 1:N){
    # Model
    mu[i] <- alpha + (s[i] + beta) * d[i]
    Y[i] ~ dnorm(mu[i], sd = sigma)
  }
  # Priors
  alpha ~ dnorm(3000, sd = 200)
  beta ~ dnorm(0, sd = 3)
  sigma ~ dexp(0.05)
  etasq ~ dexp(1)
  rhosq ~ dexp(1000)
  # Gaussian process on the spatially varying slopes s
  mu_s[1:N] <- 0
  cov_s[1:N, 1:N] <- cov_GPL2(dists[1:N, 1:N], rhosq, etasq, 0.001)
  s[1:N] ~ dmnorm(mu_s[1:N], cov = cov_s[1:N, 1:N])
})

## Setup inits & constants
inits <- list(alpha = 3000, beta = -3, sigma = 200, s = rep(0, N), etasq = 2, rhosq = 0.001)
inits$cov_s <- Ccov_GPL2(dmat, inits$rhosq, inits$etasq, 0.001)
inits$s <- t(chol(inits$cov_s)) %*% rnorm(N)
inits$s <- inits$s[ , 1]  # so nimble gets a vector rather than a one-column matrix
constants <- list(N = N, dists = dmat, d = coords$d)
data <- list(Y = coords$Y)
## Compile
model <- nimbleModel(dispersalinference, constants = constants, data = data, inits = inits)
cModel <- compileNimble(model)
conf <- configureMCMC(model)
conf$addMonitors('s')
## Replace the default samplers on s with a single block random-walk sampler
conf$removeSamplers('s[1:100]')
conf$addSampler('s[1:100]', 'RW_block', control = list(adaptFactorExponent = 0.2))
MCMC <- buildMCMC(conf)
cMCMC <- compileNimble(MCMC, project = cModel)
## MCMC
samples.nimble <- runMCMC(cMCMC, nchains = 2, niter = 1000000, thin = 100, nburnin = 500000, samplesAsCodaMCMC = TRUE)

[attachment: spatial_regression_dispersal_nimble.R]

Enrico Ryunosuke Crema

Jul 6, 2021, 1:24:08 PM
to nimble-users
OK... setting adaptFactorExponent to 0.8 seems to have fixed this somehow. I'm a bit confused, as my understanding from the manual was that decreasing adaptFactorExponent generally leads to improved performance, but I might have misinterpreted this?
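For reference, this corresponds to changing the sampler configuration in the script above along these lines (only adaptFactorExponent changes, from 0.2 to 0.8):

conf$removeSamplers('s[1:100]')
conf$addSampler('s[1:100]', 'RW_block', control = list(adaptFactorExponent = 0.8))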

Daniel Turek

Jul 6, 2021, 2:52:06 PM
to Enrico Ryunosuke Crema, nimble-users
Very briefly - there's no "hard and fast rule" about whether larger or smaller values of "adaptFactorExponent" will improve or hinder the speed of convergence. It is model- and situation-dependent, and I suppose is an open area of study.

The default value for both the "RW" and the "RW_block" samplers is adaptFactorExponent = 0.8. Larger values, say adaptFactorExponent = 1 or 2, will greatly reduce the magnitude of the adaptation of the scale (standard deviation) of the RW proposals; more specifically, the magnitude of the recurring adaptations of the proposal scale will decrease (and nearly vanish) very quickly as the number of MCMC iterations increases. In contrast, very small (positive) values of adaptFactorExponent, say adaptFactorExponent = 0.1, will allow the magnitude of the adaptations of the RW proposal scale to persist for larger numbers of MCMC iterations. The limiting case of adaptFactorExponent = 0 would mean no decay at all in the magnitude of the adaptation, but that is not valid to use: it would lead to an invalid MCMC.
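As a rough illustration of the effect (a sketch only, not the exact nimble internals: it assumes the per-adaptation magnitude decays approximately like 1 / ((timesAdapted + 3)^adaptFactorExponent); the exact rule is in the sampler source in MCMC_samplers.R):

## Sketch: how quickly the relative adaptation magnitude fades for a few exponents,
## assuming a decay of roughly 1 / ((timesAdapted + 3)^adaptFactorExponent)
timesAdapted <- 1:500
decay <- sapply(c(0.1, 0.8, 2), function(expo) 1 / ((timesAdapted + 3)^expo))
matplot(timesAdapted, decay, type = "l", lty = 1, log = "y",
        xlab = "adaptation step", ylab = "relative adaptation magnitude")
legend("topright", legend = paste("adaptFactorExponent =", c(0.1, 0.8, 2)),
       col = 1:3, lty = 1)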

Hope this helps,
Daniel



Perry de Valpine

Jul 7, 2021, 11:57:26 AM
to Daniel Turek, Enrico Ryunosuke Crema, nimble-users
I will add a few points to this. It's an interesting example and thanks for presenting it so clearly, Enrico.  In general I am not sure if the adaptive random-walk Metropolis-Hastings samplers will work well in very high dimensions.  If you are interested in experimenting, it could be interesting to try a set of smaller blocks (e.g. 10 blocks of 10 nodes each, or even blocks with overlapping sets of nodes) that together cover all the dimensions.  Another parameter that enters into the adaptation scheme is the adaptInterval.  This is the number of iterations between adaptation steps.  If you decrease this, it will be loosely similar to increasing adaptFactorExponent, but not equivalent.  The source code is in sampler_RW_block in MCMC_samplers.R.  Last but not least, it can matter whether you are starting from good initial values.  If the initial values are far from the posterior region of interest, then some of the adaptation (self-tuning) will occur in a region not very representative of the main posterior region of interest.  Hence, initial values can impact the quality of adaptation.  In your example, it looks like at least some of the random effect marginal posteriors are bimodal.  It would be interesting to see if that's the case in Stan as well, to confirm that they are exploring the same posterior.
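A minimal sketch of the smaller-blocks idea, assuming N = 100 as in the script above (the block size of 10 and the adaptInterval value are just illustrative choices, not recommendations):

## Sketch: 10 RW_block samplers of 10 nodes each instead of one 100-dimensional block
conf$removeSamplers('s[1:100]')
for (k in 1:10) {
  nodes <- paste0('s[', (k - 1) * 10 + 1, ':', k * 10, ']')
  conf$addSampler(nodes, 'RW_block',
                  control = list(adaptFactorExponent = 0.8, adaptInterval = 100))
}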

-Perry


Enrico Crema

Jul 7, 2021, 12:29:03 PM
to Perry de Valpine, Daniel Turek, nimble-users

Thank you Perry and Daniel - this is extremely helpful. In the meantime I also compared the MCMC efficiency of RW_block (using adaptFactorExponent = 0.8) against AF_slice. It's based on only 5 runs, but the results show that the latter was nearly an order of magnitude more efficient (by the way, thanks for dedicating a slot to this topic in the workshop!). As for your question regarding bi-modality - both Stan and the run with adaptFactorExponent set to 0.8 returned unimodal posteriors for all random effects, so presumably the apparent bi-modality was just the chains getting stuck in different regions?
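For anyone trying the same comparison, the AF_slice setup is presumably along these lines (a sketch; everything else in the workflow is unchanged from the first message):

## Sketch: automated factor slice sampler on the block of random effects
conf$removeSamplers('s[1:100]')
conf$addSampler('s[1:100]', 'AF_slice')
MCMC <- buildMCMC(conf)
cMCMC <- compileNimble(MCMC, project = cModel)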

Enrico

Perry de Valpine

Jul 7, 2021, 12:33:08 PM
to Enrico Crema, Daniel Turek, nimble-users
Ah, that makes sense about the apparent bi-modality.

The result on automated factor slice sampling is very interesting.  Thanks for trying that and sharing the result.

One more option I wanted to mention with the RW_block sampler is the 'tries' control parameter.  If you set this >1, it will simply run the sampler 'tries' times in a row, to give it more propose-accept-reject tries.  This essentially changes the balance of sampling effort on the block vs the other parameters.
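A minimal sketch of that option (tries = 5 is an arbitrary illustrative value):

## Sketch: run the block proposal several times per iteration via the 'tries' control
conf$removeSamplers('s[1:100]')
conf$addSampler('s[1:100]', 'RW_block',
                control = list(adaptFactorExponent = 0.8, tries = 5))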

-Perry