Significant slowdown with NUTS in Nimble after first run

Matheus Castro

May 7, 2025, 10:44:36 AM
to nimble-users
Hello everyone, I hope you're doing well.

I'm encountering a strange performance issue with Nimble (specifically using the NUTS sampler) and could really use some help.

I'm currently working on a paper where I plan to use Nimble with the NUTS sampler. As part of this, I need to run several simulation studies. It's essential for me to be able to re-estimate the same model on different datasets and with different initial values without recompilation.

Everything works smoothly when using the Random Walk (RW) sampler. However, when I switch to NUTS, I experience a major slowdown after the first run. (I'm not comparing execution times between different samplers, but rather between consecutive runs using the same sampler!)

Unfortunately, I can't share the code for my full model due to project restrictions, but the issue can be reproduced with a simplified version like the one below:

#######################################
rm(list = ls())

library("nimble")
library("nimbleHMC")
library("bench")

## Choosing parameters
n <- 200
mu <- 10
sigma <- 3

# Setting the MCMC
norm_code <- nimbleCode({
  # Model
  sigma <- exp(delta)
  for (i in 1:n) {
    y[i] ~ dnorm(mu, sd = sigma)
  }
  # Priors
  mu ~ dnorm(0, sd = 10)
  delta ~ dnorm(0, sd = 10)
})
y <- rnorm(n, mu, sigma)
const_list <- list(n = n)
data_list <- list(y = y)
inits <- function() {
  list(mu = mean(y), delta = log(sd(y)))
}

model <- nimbleModel(code = norm_code, constants = const_list,
                     data = data_list, inits = inits(), buildDerivs = TRUE)
Cmodel <- compileNimble(model)
monitors <- c("mu", "sigma")
MCMCconf <- configureMCMC(model, monitors = monitors)

HMC <- buildMCMC(MCMCconf)
CHMC <- compileNimble(HMC, project = model)


## First dataset (works fine, runs fast)
y <- rnorm(n, mu, sigma)
Cmodel$setData(y = y)
inits <- function() {
  list(mu = mean(y), delta = log(sd(y)))
}
Cmodel$setInits(inits())
time1 <- bench_time(mcmc <- runMCMC(CHMC, niter = 2000, nburnin = 1000,
                    thin = 1, nchains = 4))

## Second dataset (same code but waaay slower)
y <- rnorm(n, mu, sigma)
Cmodel$setData(y = y)
inits <- function() {
  list(mu = mean(y), delta = log(sd(y)))
}
Cmodel$setInits(inits())
time2 <- bench_time(mcmc <- runMCMC(CHMC, niter = 2000, nburnin = 1000,
                    thin = 1, nchains = 4))

list(time1 = time1, time2 = time2)

#######################################
#$time1
#process    real
#  402ms   398ms

#$time2
#process    real
#  1.13m   1.13m
# ~170 times slower!
#######################################

If I switch from configureHMC() (i.e., using NUTS) to configureMCMC() (i.e., RW sampler), the performance is consistent across runs:

#######################################
#$time1
#process    real
#103.6ms  99.5ms

#$time2
#process    real
#103.4ms  99.8ms
#######################################

Note: In this toy model, the second NUTS run takes ~1 minute, which is manageable. However, with my full model, this slowdown makes repeated simulations practically infeasible. Also, I've run this code multiple times, and the slowdown consistently occurs.

Has anyone else experienced this? Any insight or workaround would be greatly appreciated.

Thanks in advance!

Chris Paciorek

May 8, 2025, 12:29:56 PM
to Matheus Castro, nimble-users
hi Matheus,

That does indeed seem very odd. I just tried your example on my Linux and Mac machines and was able to reproduce the issue in both cases.
The progress bar the second time around advances slowly (with some "jerkiness"/inconsistent progress), so to some extent it seems the individual iterations are slower rather than that something is getting hung up.

I will try to dig into it over the next few days.

As a (possibly unappealing, and shouldn't-be-needed) work-around you could rebuild and recompile the model and MCMC before the second run. For your real use case, I'm not sure how that extra build/compilation time will compare to the MCMC run-time, but hopefully it is a relatively small fraction and will allow you to proceed. If that's not the case, you might also be able to only re-do some of those steps (possibly using the `resetFunctions` argument to `compileNimble`).
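In rough outline, the partial re-do might look something like the following (a sketch only, untested; it reuses the object names from your example, with HMC2/CHMC2 simply being new names for the rebuilt objects):

y <- rnorm(n, mu, sigma)          # new dataset
Cmodel$setData(y = y)
Cmodel$setInits(inits())
HMC2  <- buildMCMC(MCMCconf)      # rebuild the MCMC algorithm
CHMC2 <- compileNimble(HMC2, project = model, resetFunctions = TRUE)
mcmc  <- runMCMC(CHMC2, niter = 2000, nburnin = 1000, thin = 1, nchains = 4)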

(As a side note, the code you provided does not use `configureHMC`, so I switched to that...)

-chris

Chris Paciorek

May 8, 2025, 12:37:32 PM
to Matheus Castro, nimble-users
Here's another wrinkle. If I use the "manual" method of running the MCMC: `CHMC$run(niter=2000, nburnin=1000)`, I don't see the slowdown.

So that's another work-around. You can read about using $run in the manual. Note that you would need to handle running multiple chains yourself if you want more than one chain.
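In rough form, that might look like the following (a sketch only, untested, reusing the objects from your example):

samples <- vector("list", 4)
for (chain in 1:4) {
  Cmodel$setInits(inits())                              # fresh initial values per chain
  CHMC$run(niter = 2000, nburnin = 1000, reset = TRUE)  # reset = TRUE (the default) starts the samplers fresh
  samples[[chain]] <- as.matrix(CHMC$mvSamples)         # extract this chain's samples
}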

-chris

Matheus Castro

May 8, 2025, 3:26:21 PM
to nimble-users
Hello, Chris!

First of all, congratulations on the development of NIMBLE; it’s an outstanding package. Its flexibility for modeling and custom sampler development is remarkable and has been incredibly valuable for my research. For instance, I was able to obtain excellent results using the NUTS sampler within a mixture model while keeping the standard data augmentation scheme and a discrete sampler, something I couldn’t achieve with other R packages for MCMC.

Regarding the issue, I’ve so far tested it on two Linux machines (both x86_64 with Ubuntu-based distributions). I plan to test it on a Windows 11 machine later today and will report back if the results differ. In the meantime, here is some new information:

1) Convergence concern: After reading your post, I looked into convergence diagnostics. In one replicate, the first run achieved an n.eff of over 3300, while the second run barely reached 150. The difference is clearly visible in the traceplots—the second run doesn’t converge properly. So the issue appears to impact not just execution time, but also the quality of inference.

Note: For items 2 and 3, I simulated only one chain.
2) Script 2 – I tested the following sequence:
 - Estimate (via runMCMC) using dataset y1
 - Estimate using dataset y2
 - Re-estimate using dataset y1
The results were:
  $time1
  process    real
     81ms    81ms
  $time2
  process    real
    12.9s   12.9s
  $time3
  process    real
   82.5ms  82.5ms
So, reusing runMCMC() with the same dataset, even after a problematic estimation, doesn’t introduce any delay. This leads me to wonder whether some dataset-dependent autodiff variables used by NUTS aren’t being properly updated between runs. (A rough sketch of this sequence is shown after item 3 below.)

3) Script 3 – I also tested the sequence as suggested:
 - runMCMC() with dataset y1
 - CHMC$run() with dataset y2
 - CHMC$run() with dataset y1
Results:
  $time1
  process    real
   82.4ms  82.5ms
  $time2
  process    real
      12s     12s
  $time3
  process    real
   78.1ms  78.1ms
Again, runMCMC() is significantly faster, and CHMC$run() matches its speed only when using the same dataset.
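For clarity, the item-2 sequence is roughly the following (my shorthand here, not the attached script2.R itself):

y1 <- rnorm(n, mu, sigma)
y2 <- rnorm(n, mu, sigma)

Cmodel$setData(y = y1); Cmodel$setInits(list(mu = mean(y1), delta = log(sd(y1))))
time1 <- bench_time(runMCMC(CHMC, niter = 2000, nburnin = 1000))   # fast

Cmodel$setData(y = y2); Cmodel$setInits(list(mu = mean(y2), delta = log(sd(y2))))
time2 <- bench_time(runMCMC(CHMC, niter = 2000, nburnin = 1000))   # much slower

Cmodel$setData(y = y1); Cmodel$setInits(list(mu = mean(y1), delta = log(sd(y1))))
time3 <- bench_time(runMCMC(CHMC, niter = 2000, nburnin = 1000))   # fast again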

I’ve attached both scripts for clarity and reproducibility.

At this stage of my project, I’m recompiling the model for each replicate in my simulation study, which is extending the timeline by a few weeks.

Thank you again for your time and support—I truly appreciate your help,
Matheus Castro
script2.R
script3.r

Matheus Castro

May 9, 2025, 9:23:16 AM
to nimble-users
Hello again,

I’ve run a few more tests that may be useful.

If nimbleModel() is constructed with data = NULL (or simply without passing a data list), the model no longer experiences a slowdown during the second run. However, it leads to poor estimation (even in the first run).

Model setups:
- Model 1 (model_1):
nimbleModel(code = norm_code, constants = const_list, data = data_list, buildDerivs = TRUE)

- Model 2 (model_2):
nimbleModel(code = norm_code, constants = const_list, buildDerivs = TRUE)

Note: I ran four chains for each model (more than one chain is needed to compute Rhat). A reproducible script is attached [script4.R].
Timing Results:
$model1_time1
process    real
  433ms   433ms
$model1_time2
process    real
  1.23m   1.23m

$model2_time1
process    real
  171ms   171ms
$model2_time2
process    real
  172ms   172ms
   
We no longer see the slowdown in execution time (model_2). But now, take a look at the summary (true_mu = 10 and true_sigma = 3):
MCMCvis::MCMCsummary(mcmc1)  # model1_time1
           mean   Rhat   n.eff
mu        10.37   1.00    3555
sigma      3.01   1.00    3585

MCMCvis::MCMCsummary(mcmc2)  # model1_time2
           mean   Rhat   n.eff
mu         9.70   1.09      44
sigma      2.81   1.76      60

MCMCvis::MCMCsummary(mcmc3)  # model2_time1
           mean   Rhat   n.eff
mu        0.121   1.00    3686
sigma  2.45e+10   1.28    4000

MCMCvis::MCMCsummary(mcmc4)  # model2_time2
           mean   Rhat   n.eff
mu       -0.112   1.00    4206
sigma  1.16e+12   1.27    4000

Model 2 converges (ish) and runs fast. However, it seems that it is not considering the data—for example, look at the posterior mean of sigma.

However, the attached script5.R shows that switching from configureHMC() to configureMCMC() (i.e., no longer using the NUTS sampler) allows model1 to run without any issues, but model2 still produces absurd estimates for sigma. So, I'm not sure whether creating a model without data and then using Cmodel$setData() is actually allowed in NIMBLE. According to Section 6.1.1.2 of the manual (Providing (or changing) data and initial values for an existing model), I don't see why this would be a problem.

Thanks again for your time. I hope this information is helpful,
Matheus Castro
script5.r
script4.R

Daniel Turek

May 9, 2025, 10:01:20 AM
to Matheus Castro, nimble-users
Matheus, this was extremely helpful. Thank you for identifying this problem in the first place, and for the extensive test cases and scripts you've provided. They were quite helpful in diagnosing the issue.

It seems this boiled down to a relatively simple issue. The recently added automatic-differentiation (AD) system for nimble, which supports derivative-based sampling algorithms such as HMC, has some built-in efficiencies for calculating numerical derivatives. Among those is "taping" the operations the first time a call to nimDerivs takes place, which fixes some of the assumptions of the derivative calculations (including the flow of execution through if-then-else sequences and, critical to this matter, the data values involved in the calculations). This "taping" allows subsequent calls to nimDerivs to execute faster.

Here, when the data values in the model changed between runs, the AD "taped operations" (which are internal to the HMC samplers) became invalid. Unfortunately, there wasn't a hook for triggering these operations to be "re-taped" if/when the data changed.

A small update has just been made to nimbleHMC, which triggers re-taping of the AD operations on the first iteration of any MCMC algorithm.  This is actually triggered by the internal "reset" method of the samplers, which is called at the onset of every new MCMC chain.
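To give a sense of the mechanism, here is a sketch of the general nimDerivs pattern for derivatives of model calculations (an illustration only, not the actual nimbleHMC internals):

gradLogProb <- nimbleFunction(
  setup = function(model, wrtNodes) {
    calcNodes  <- model$getDependencies(wrtNodes)
    derivsInfo <- makeModelDerivsInfo(model, wrtNodes, calcNodes)
    updateNodes   <- derivsInfo$updateNodes
    constantNodes <- derivsInfo$constantNodes   # the data nodes end up here
  },
  run = function(reset = logical(0, default = FALSE)) {
    # the first call records ("tapes") the operations, including the values of
    # the constantNodes; later calls reuse the tape unless reset = TRUE
    d <- nimDerivs(model$calculate(calcNodes), wrt = wrtNodes, order = 1,
                   model = model, updateNodes = updateNodes,
                   constantNodes = constantNodes, reset = reset)
    returnType(ADNimbleList())
    return(d)
  }
)

If the data values change (e.g. via setData), a call with reset = TRUE is needed for the derivatives to see the new values; triggering exactly that kind of re-taping at the start of each chain is what the nimbleHMC update does.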

If you're willing to try installing this updated version of nimbleHMC from a different branch called "ADreset", it would be helpful to know what you observe.  I'm optimistic this will resolve the problem you identified.  The new version of nimbleHMC can be installed using the code below:

remove.packages('nimbleHMC')
library(remotes)
remotes::install_github('nimble-dev/nimbleHMC', ref = 'ADreset', subdir = 'packanimbleHMC')
library(nimbleHMC)

Thank you again, and please let us know how this works.

Cheers,
Daniel

Matheus Castro

May 9, 2025, 1:13:37 PM
to nimble-users
Daniel, thanks!

I'm currently running one of my simulation studies using nimbleHMC from the ADreset branch, skipping compilation in each iteration. That solved the issue! It's clearly much faster now, and the convergence is as expected.

For my first test, I chose a scenario with a small model and dataset. In such cases, the compilation time was even more painful, as it can take several times longer than the execution itself. Also, with smaller models, I can quickly generate a larger number of iterations, which helps in assessing convergence. In this scenario, with about 100 iterations (100 estimations on different datasets), I feel confident that the sampler results are (stochastically) equivalent to the version that compiles at every iteration.

Since I need to run several simulations across multiple scenarios, skipping the compilation phase will significantly reduce the overall runtime. This will be extremely helpful!

(Sidenote: I think there's a typo—the subdir argument should be "nimbleHMC".)

Thanks a lot for the time you've invested in solving this—I truly appreciate your help,
Matheus