So, I did some very quick tests with a 15-D correlated Gaussian with unit variance and various degrees of correlation (up to rho = 0.99).
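For concreteness, here is a sketch of this kind of target in Python (an equicorrelated covariance, i.e. rho on all off-diagonals, is just one simple way to set it up; the exact structure is not essential to the point):

```python
import numpy as np

def make_target(D=15, rho=0.9):
    """Correlated Gaussian test target with unit variances.

    Uses an equicorrelated covariance (rho on all off-diagonals),
    which is one simple choice; positive-definite for rho in
    (-1/(D-1), 1), so rho = 0.99 is fine for D = 15.
    """
    cov = np.full((D, D), rho) + (1.0 - rho) * np.eye(D)
    prec = np.linalg.inv(cov)  # precision matrix

    def logpdf(x):
        # Unnormalized log-density is enough for MCMC.
        return -0.5 * x @ prec @ x

    def grad_logpdf(x):
        # Exact gradient, for comparison against numerical schemes.
        return -prec @ x

    return logpdf, grad_logpdf
```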
I compared three samplers: NUTS with numerical gradient, NUTS with exact gradient, and an adaptive slice sampling method that estimates the covariance matrix, and subsequently the slicing directions, from warmup samples (the method I would otherwise be using; so both NUTS and the slice sampler have an adaptive component).
Up to rho = 0.9, the numerical gradient does okay at least on this very simple case.
The number of effective samples (Neff) per "gradient evaluation" is comparable between the numerical and analytical gradients (perhaps with a tiny reduction in efficiency of the former, less than 5-10%).
Of course the numerical method takes a much larger number of function evaluations (2*D function evals per gradient evaluation, if using a central difference scheme).
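For reference, the central difference scheme I mean is the standard one, which costs exactly 2*D evaluations of the log-density per gradient:

```python
import numpy as np

def central_diff_grad(f, x, h=1e-5):
    """Central-difference gradient estimate: 2*D evaluations of f.

    Truncation error is O(h^2) per component, but in practice the
    achievable accuracy is limited by floating-point cancellation,
    which is part of the problem when HMC trajectories get long.
    """
    D = x.size
    g = np.empty(D)
    for i in range(D):
        e = np.zeros(D)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g
```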
However, performance of the numerical gradient drops for the largest value of rho that I tried (rho = 0.99); there is a difference of almost a factor of 10 in Neff per gradient evaluation between numerical and exact gradient.
For values of rho of 0.9 or less, the comparison between NUTS with numerical gradient and the slice sampler is strongly in favour of the slice sampler, which is up to 20 times better in terms of Neff per function evaluation. The advantage of the slice sampler becomes astronomical for rho = 0.99.
I tried using HMC (with dual-averaging) and numerical gradient, and it seems to perform better than NUTS with numerical gradient (I just picked a unit-length trajectory) -- this suggests that the trajectories in NUTS end up being too long and accumulate too much numerical error.
This was very quick and dirty -- I just wanted to get some ballpark.
Clearly this is not a fair comparison, since the adaptive slice sampler can quite easily estimate the shape of the covariance matrix.
NUTS would get a bigger advantage for complex density shapes. However, if the shape of the distribution is particularly complex, NUTS with numerical differentiation would also start to perform poorly.
Things can get better with better differentiation schemes, but they would require more function evaluations.
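For example, a fourth-order central difference scheme reduces the truncation error to O(h^4) per component, but doubles the cost to 4*D function evaluations per gradient -- a sketch:

```python
import numpy as np

def fourth_order_grad(f, x, h=1e-3):
    """Fourth-order central-difference gradient: 4*D evaluations of f.

    Standard five-point stencil (the center point cancels out):
    f'(x) ~ (f(x-2h) - 8 f(x-h) + 8 f(x+h) - f(x+2h)) / (12 h),
    with O(h^4) truncation error -- more accurate, but twice the
    cost of the plain central-difference scheme.
    """
    D = x.size
    g = np.empty(D)
    for i in range(D):
        e = np.zeros(D)
        e[i] = h
        g[i] = (f(x - 2*e) - 8*f(x - e) + 8*f(x + e) - f(x + 2*e)) / (12.0 * h)
    return g
```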
In conclusion, I think it might not be worth it for me to implement HMC/NUTS with naive numerical differentiation.
It seems that any gain might be swamped by the loss in efficiency due to accumulated numerical error and the increased cost in function evaluations to estimate the gradient numerically.
Of course there might be clever solutions to this problem, but I'd rather focus on other methods at this point -- coming back to HMC/NUTS when I can compute an exact gradient.
Best,
Luigi