Hi Remy,
Thanks for the question and the very clear reproducible code. It was puzzling, but I think I know what is going on.
First I checked the time spent in each sampler, using
cmcmcRmodel$run(niter, time = TRUE)
cmcmcRmodel$samplerTimes
The second sampler accounts for essentially all of the difference in computing time (about 8 seconds of difference on my machine, out of 25 or 34 seconds total in the two cases, so quite noticeable).
From
confRModel$printSamplers()
we see that the second sampler is for eps. This sampler has the expensive computations of M as dependencies.
Yet the raw computational cost of the pieces of M is not itself counter-intuitive.
system.time(for(i in 1:100) cRmodel$calculate("M")) # Loop 100 times for some averaging of costs
This is nearly the same in the two cases, but slightly *faster* when you've used 2 sub-matrices instead of 10, which makes sense and is the opposite of the weird MCMC result. Also note that there is some R overhead when nimble evaluates cRmodel$calculate("M"): determining which nodes need calculation and setting up a loop over them happens in R, which may account for some of the slowdown for 10. But the difference is 2.1 seconds (case 10) vs. 2.0 seconds (case 2) on my machine, so nowhere near the magnitude of the MCMC difference.
So this suggests it is not raw computation time of the M pieces that is the issue.
I think what is happening is that invalid proposals (outside the support of the prior) in the block sampler are immediately rejected without wasting time computing dependencies, and this is much more likely to happen with 10 elements of eps than with 2.

To explain more: your eps[1:q] is a vector node, so by default the MCMC configuration samples it as a block using an adaptive random-walk Metropolis-Hastings sampler. The prior is your dUnifVector, which makes every element uniform from 1 to 3. The RW-MH block sampler makes multivariate normal proposals, whose scale and correlation are adapted over many iterations and may be quite poor (arbitrary) choices at the start. Like many of our samplers, after it makes a proposal it first checks the log probability of the prior. If that is -Inf (meaning the proposal is in a region of zero prior probability), it immediately rejects the proposal, and that iteration of that sampler is done. When this occurs, the dependency computations are *not* done. In your case, every time an invalid proposal is made, the cost of computing M (all of its pieces) is skipped because the proposal will be rejected regardless.

A block proposal in 10 dimensions with a uniform prior is much more likely to have at least one element fall outside the valid range (1 to 3), and thus to have the whole vector proposal immediately rejected. I also noticed from the posterior that elements of eps can spend many iterations up against a boundary value, making it likely that a proposal will land on the invalid side of that boundary. In other words, the case of 10 goes faster because it more often makes wasted proposals for eps[1:10] that are rejected without incurring the cost of computing M (and then W, lambda and Z).
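To see how quickly the dimension matters, here is a minimal sketch (plain Python, not NIMBLE code; the per-element probability p is a made-up illustration, not measured from your model) of the chance that a block proposal is immediately rejected:

```python
# If each element of a block proposal independently falls outside the
# prior's support (here, outside [1, 3]) with probability p, then the
# whole d-dimensional proposal is invalid -- and immediately rejected,
# skipping the expensive computation of M -- whenever *any* element is
# out of bounds.

def prob_immediate_rejection(p, d):
    """Probability that at least one of d elements is out of bounds."""
    return 1 - (1 - p) ** d

# Illustrative per-element probability of crossing the boundary:
p = 0.2
print(prob_immediate_rejection(p, 2))   # 2-element block:  about 0.36
print(prob_immediate_rejection(p, 10))  # 10-element block: about 0.89
```

So even a modest per-element chance of straying outside [1, 3] makes the 10-dimensional block skip the dependency computations most of the time, which is exactly the pattern in the timing results above.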
I'm not sure of the full context here, but some ideas are: you could set up eps as scalar nodes so they are sampled one by one; and/or you could set up eps on a different scale, like logit(eps), so it does not have the hard boundaries from the uniform. Or you may have just wanted to make sense of what nimble is doing.
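To illustrate the reparameterization idea: a scaled inverse-logit maps an unconstrained real value into the open interval (1, 3), so a sampler working on the unconstrained scale can never propose outside the support. A sketch in plain Python (the function name is just for illustration; in a model you would write the equivalent transform in your NIMBLE code):

```python
import math

def eps_from_unconstrained(x):
    """Map an unconstrained real x into (1, 3): 1 + 2 * inverse-logit(x)."""
    return 1 + 2 / (1 + math.exp(-x))

print(eps_from_unconstrained(0.0))    # midpoint of the interval: 2.0
print(eps_from_unconstrained(-10.0))  # close to the lower boundary, never 1
print(eps_from_unconstrained(10.0))   # close to the upper boundary, never 3
```

On this scale a value can sit arbitrarily close to a boundary without any proposal being invalid, so no iterations are spent on immediate rejections.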
The cost of the actual multivariate normal draw and adaptation inside the RW-MH block sampler in this case is very small compared to the model computations.
I hope that is clear!
Perry