Significant slowdown with NUTS in Nimble after first run

Matheus Castro

May 7, 2025, 10:44:36 AM
to nimble-users
Hello everyone, I hope you're doing well.

I'm encountering a strange performance issue with Nimble (specifically using the NUTS sampler) and could really use some help.

I'm currently working on a paper where I plan to use Nimble with the NUTS sampler. As part of this, I need to run several simulation studies. It's essential for me to be able to re-estimate the same model on different datasets and with different initial values without recompilation.

Everything works smoothly when using the Random Walk (RW) sampler. However, when I switch to NUTS, I experience a major slowdown after the first run. (I'm not comparing execution times between different samplers, but rather between consecutive runs using the same sampler!)

Unfortunately, I can't share the code for my full model due to project restrictions, but the issue can be reproduced with a simplified version like the one below:

#######################################
rm(list = ls())

library("nimble")
library("nimbleHMC")
library("bench")

## Choosing parameters
n <- 200
mu <- 10
sigma <- 3

# Setting the MCMC
norm_code <- nimbleCode({
  # Model
  sigma <- exp(delta)
  for (i in 1:n) {
    y[i] ~ dnorm(mu, sd = sigma)
  }
  # Priors
  mu ~ dnorm(0, sd = 10)
  delta ~ dnorm(0, sd = 10)
})
y <- rnorm(n, mu, sigma)
const_list <- list(n = n)
data_list <- list(y = y)
inits <- function() {
  list(mu = mean(y), delta = log(sd(y)))
}

model <- nimbleModel(code = norm_code, constants = const_list,
                     data = data_list, inits = inits(), buildDerivs = TRUE)
Cmodel <- compileNimble(model)
monitors <- c("mu", "sigma")
MCMCconf <- configureMCMC(model, monitors = monitors)

HMC <- buildMCMC(MCMCconf)
CHMC <- compileNimble(HMC, project = model)


## First dataset (works fine, runs fast)
y <- rnorm(n, mu, sigma)
Cmodel$setData(y = y)
inits <- function() {
  list(mu = mean(y), delta = log(sd(y)))
}
Cmodel$setInits(inits())
time1 <- bench_time(mcmc <- runMCMC(CHMC, niter = 2000, nburnin = 1000,
                    thin = 1, nchains = 4))

## Second dataset (same code but waaay slower)
y <- rnorm(n, mu, sigma)
Cmodel$setData(y = y)
inits <- function() {
  list(mu = mean(y), delta = log(sd(y)))
}
Cmodel$setInits(inits())
time2 <- bench_time(mcmc <- runMCMC(CHMC, niter = 2000, nburnin = 1000,
                    thin = 1, nchains = 4))

list(time1 = time1, time2 = time2)

#######################################
#$time1
#process    real
#  402ms   398ms

#$time2
#process    real
#  1.13m   1.13m
# ~170 times slower!
#######################################

If I switch from configureHMC() (i.e., using NUTS) to configureMCMC() (i.e., RW sampler), the performance is consistent across runs:

#######################################
#$time1
#process    real
#103.6ms  99.5ms

#$time2
#process    real
#103.4ms  99.8ms
#######################################

Note: In this toy model, the second NUTS run takes ~1 minute, which is manageable. However, with my full model, this slowdown makes repeated simulations practically infeasible. Also, I've run this code multiple times, and the slowdown consistently occurs.

Has anyone else experienced this? Any insight or workaround would be greatly appreciated.

Thanks in advance!

Chris Paciorek

May 8, 2025, 12:29:56 PM
to Matheus Castro, nimble-users
hi Matheus,

That does indeed seem very odd. I just tried your example on my Linux and Mac machines and was able to reproduce the issue in both cases.
The progress bar the second time around advances slowly (with some "jerkiness"/inconsistent progress), so to some extent it seems the individual iterations are slower rather than that something is getting hung up.

I will try to dig into it over the next few days.

As a (possibly unappealing, and shouldn't-be-needed) work-around you could rebuild and recompile the model and MCMC before the second run. For your real use case, I'm not sure how that extra build/compilation time will compare to the MCMC run-time, but hopefully it is a relatively small fraction and will allow you to proceed. If that's not the case, you might also be able to only re-do some of those steps (possibly using the `resetFunctions` argument to `compileNimble`).
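In rough outline, the partial re-do might look something like the following (a sketch only, untested; it reuses the object names from your example, with HMC2/CHMC2 simply being new names for the rebuilt objects):

y <- rnorm(n, mu, sigma)          # new dataset
Cmodel$setData(y = y)
Cmodel$setInits(inits())
HMC2  <- buildMCMC(MCMCconf)      # rebuild the MCMC algorithm
CHMC2 <- compileNimble(HMC2, project = model, resetFunctions = TRUE)
mcmc  <- runMCMC(CHMC2, niter = 2000, nburnin = 1000, thin = 1, nchains = 4)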

(As a side note, the code you provided does not use `configureHMC`, so I switched to that...)

-chris

Chris Paciorek

May 8, 2025, 12:37:32 PM
to Matheus Castro, nimble-users
Here's another wrinkle. If I use the "manual" method of running the MCMC: `CHMC$run(niter=2000, nburnin=1000)`, I don't see the slowdown.

So that's another work-around. You can read about using $run in the manual. Note that you would need to handle running multiple chains yourself if you want more than one chain.
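In rough form, that might look like the following (a sketch only, untested, reusing the objects from your example):

samples <- vector("list", 4)
for (chain in 1:4) {
  Cmodel$setInits(inits())                              # fresh initial values per chain
  CHMC$run(niter = 2000, nburnin = 1000, reset = TRUE)  # reset = TRUE (the default) starts the samplers fresh
  samples[[chain]] <- as.matrix(CHMC$mvSamples)         # extract this chain's samples
}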

-chris

Matheus Castro

May 8, 2025, 3:26:21 PM
to nimble-users
Hello, Chris!

First of all, congratulations on the development of NIMBLE; it’s an outstanding package. Its flexibility for modeling and custom sampler development is remarkable and has been incredibly valuable for my research. For instance, I was able to obtain excellent results using the NUTS sampler within a mixture model while keeping the standard data augmentation scheme and a discrete sampler, something I couldn’t achieve with other R packages for MCMC.

Regarding the issue, I’ve so far tested it on two Linux machines (both x86_64 with Ubuntu-based distributions). I plan to test it on a Windows 11 machine later today and will report back if the results differ. In the meantime, here is some new information:

1) Convergence concern: After reading your post, I looked into convergence diagnostics. In one replicate, the first run achieved an n.eff of over 3300, while the second run barely reached 150. The difference is clearly visible in the traceplots—the second run doesn’t converge properly. So the issue appears to impact not just execution time, but also the quality of inference.

Note: For items 2 and 3, I simulated only one chain.
2) Script 2 – I tested the following sequence:
 - Estimate (via runMCMC) using dataset y1
 - Estimate using dataset y2
 - Re-estimate using dataset y1
The results were:
  $time1
  process    real
     81ms    81ms
  $time2
  process    real
    12.9s   12.9s
  $time3
  process    real
   82.5ms  82.5ms
So, reusing runMCMC() with the same dataset, even after a problematic estimation, doesn’t introduce any delay. This leads me to wonder whether some dataset-dependent autodiff variables used by NUTS aren’t being properly updated between runs. (A rough sketch of this sequence is shown after item 3 below.)

3) Script 3 – I also tested the sequence as suggested:
 - runMCMC() with dataset y1
 - CHMC$run() with dataset y2
 - CHMC$run() with dataset y1
Results:
  $time1
  process    real
   82.4ms  82.5ms
  $time2
  process    real
      12s     12s
  $time3
  process    real
   78.1ms  78.1ms
Again, runMCMC() is significantly faster, and CHMC$run() matches its speed only when using the same dataset.
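For clarity, the item-2 sequence is roughly the following (my shorthand here, not the attached script2.R itself):

y1 <- rnorm(n, mu, sigma)
y2 <- rnorm(n, mu, sigma)

Cmodel$setData(y = y1); Cmodel$setInits(list(mu = mean(y1), delta = log(sd(y1))))
time1 <- bench_time(runMCMC(CHMC, niter = 2000, nburnin = 1000))   # fast

Cmodel$setData(y = y2); Cmodel$setInits(list(mu = mean(y2), delta = log(sd(y2))))
time2 <- bench_time(runMCMC(CHMC, niter = 2000, nburnin = 1000))   # much slower

Cmodel$setData(y = y1); Cmodel$setInits(list(mu = mean(y1), delta = log(sd(y1))))
time3 <- bench_time(runMCMC(CHMC, niter = 2000, nburnin = 1000))   # fast again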

I’ve attached both scripts for clarity and reproducibility.

At this stage of my project, I’m recompiling the model for each replicate in my simulation study, which is extending the timeline by a few weeks.

Thank you again for your time and support—I truly appreciate your help,
Matheus Castro
script2.R
script3.r

Matheus Castro

May 9, 2025, 9:23:16 AM
to nimble-users
Hello again,

I’ve run a few more tests that may be useful.

If nimbleModel() is constructed with data = NULL (or simply without passing a data list), the model no longer experiences a slowdown during the second run. However, it leads to poor estimation (even in the first run).

Model setups:
- Model 1 (model_1):
nimbleModel(code = norm_code, constants = const_list, data = data_list, buildDerivs = TRUE)

- Model 2 (model_2):
nimbleModel(code = norm_code, constants = const_list, buildDerivs = TRUE)

Note: I ran four chains for each model (more than one chain is needed to compute Rhat). A reproducible script is attached [script4.R].
Timing Results:
$model1_time1
process    real
  433ms   433ms
$model1_time2
process    real
  1.23m   1.23m

$model2_time1
process    real
  171ms   171ms
$model2_time2
process    real
  172ms   172ms
   
We no longer see the slowdown in execution time (model_2). But now, take a look at the summary (true_mu = 10 and true_sigma = 3):
MCMCvis::MCMCsummary(mcmc1)  # model1_time1
           mean   Rhat   n.eff
mu        10.37   1.00    3555
sigma      3.01   1.00    3585

MCMCvis::MCMCsummary(mcmc2)  # model1_time2
           mean   Rhat   n.eff
mu         9.70   1.09      44
sigma      2.81   1.76      60

MCMCvis::MCMCsummary(mcmc3)  # model2_time1
           mean   Rhat   n.eff
mu        0.121   1.00    3686
sigma  2.45e+10   1.28    4000

MCMCvis::MCMCsummary(mcmc4)  # model2_time2
           mean   Rhat   n.eff
mu       -0.112   1.00    4206
sigma  1.16e+12   1.27    4000

Model 2 converges (ish) and runs fast. However, it seems that it is not considering the data—for example, look at the posterior mean of sigma.

However, the attached script5.R shows that switching from configureHMC() to configureMCMC() (i.e., no longer using the NUTS sampler) allows model1 to run without any issues, but model2 still produces absurd estimates for sigma. So, I'm not sure whether creating a model without data and then using Cmodel$setData() is actually allowed in NIMBLE. According to Section 6.1.1.2 of the manual (Providing (or changing) data and initial values for an existing model), I don't see why this would be a problem.

Thanks again for your time. I hope this information is helpful,
Matheus Castro
script5.r
script4.R

Daniel Turek

May 9, 2025, 10:01:20 AM
to Matheus Castro, nimble-users
Matheus, this was extremely helpful. Thank you for identifying this problem in the first place, and for the extensive test cases and scripts you've provided. They were quite helpful in diagnosing the issue.

It seems this boiled down to a relatively simple issue. The recently added automatic-differentiation (AD) system for nimble, which supports derivative-based sampling algorithms such as HMC, has some built-in efficiencies for calculating numerical derivatives. Among those is "taping" the operations the first time a call to nimDerivs takes place, which fixes some of the assumptions of the derivative calculations (including the flow of execution through if-then-else sequences and, critical to this matter, the data values involved in the calculations). This "taping" allows subsequent calls to nimDerivs to execute faster.

Here, when the data values in the model changed between runs, the AD "taped operations" (which are internal to the HMC samplers) became invalid. Unfortunately, there wasn't a hook for triggering these operations to be "re-taped" if/when the data changed.

A small update has just been made to nimbleHMC, which triggers re-taping of the AD operations on the first iteration of any MCMC algorithm.  This is actually triggered by the internal "reset" method of the samplers, which is called at the onset of every new MCMC chain.
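To give a sense of the mechanism, here is a sketch of the general nimDerivs pattern for derivatives of model calculations (an illustration only, not the actual nimbleHMC internals):

gradLogProb <- nimbleFunction(
  setup = function(model, wrtNodes) {
    calcNodes  <- model$getDependencies(wrtNodes)
    derivsInfo <- makeModelDerivsInfo(model, wrtNodes, calcNodes)
    updateNodes   <- derivsInfo$updateNodes
    constantNodes <- derivsInfo$constantNodes   # the data nodes end up here
  },
  run = function(reset = logical(0, default = FALSE)) {
    # the first call records ("tapes") the operations, including the values of
    # the constantNodes; later calls reuse the tape unless reset = TRUE
    d <- nimDerivs(model$calculate(calcNodes), wrt = wrtNodes, order = 1,
                   model = model, updateNodes = updateNodes,
                   constantNodes = constantNodes, reset = reset)
    returnType(ADNimbleList())
    return(d)
  }
)

If the data values change (e.g. via setData), a call with reset = TRUE is needed for the derivatives to see the new values; triggering exactly that kind of re-taping at the start of each chain is what the nimbleHMC update does.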

If you're willing to try installing this updated version of nimbleHMC from a different branch called "ADreset", it would be helpful to know what you observe.  I'm optimistic this will resolve the problem you identified.  The new version of nimbleHMC can be installed using the code below:

remove.packages('nimbleHMC')
library(remotes)
remotes::install_github('nimble-dev/nimbleHMC', ref = 'ADreset', subdir = 'packanimbleHMC')
library(nimbleHMC)

Thank you again, and please let us know how this works.

Cheers,
Daniel

Matheus Castro

May 9, 2025, 1:13:37 PM
to nimble-users
Daniel, thanks!

I'm currently running one of my simulation studies using nimbleHMC from the ADreset branch, skipping compilation in each iteration. That solved the issue! It's clearly much faster now, and the convergence is as expected.

For my first test, I chose a scenario with a small model and dataset. In such cases, the compilation time was even more painful, as it can take several times longer than the execution itself. Also, with smaller models, I can quickly generate a larger number of iterations, which helps in assessing convergence. In this scenario, with about 100 iterations (100 estimations on different datasets), I feel confident that the sampler results are (stochastically) equivalent to the version that compiles at every iteration.

Since I need to run several simulations across multiple scenarios, skipping the compilation phase will significantly reduce the overall runtime. This will be extremely helpful!

(Sidenote: I think there's a typo—the subdir argument should be "nimbleHMC".)

Thanks a lot for the time you've invested in solving this—I truly appreciate your help,
Matheus