I have tried the call 20 times, with init_r=2, and in every case the ADVI algorithm converges to similar parameter values, which are not those given by the MCMC run.
This actually looks like premature convergence. The alpha estimates are all small (roughly 4 to 8), yet sigma_alpha comes out around 250, which just doesn't make sense in the context of this model: the alphas should have mean mu_alpha and standard deviation sigma_alpha, but sigma_alpha here is orders of magnitude larger than the actual dispersion of the alphas. See the output below.
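To make the inconsistency concrete, here is a quick tally (the numbers are copied from the posterior-mean column of the default-tuning table below):

```r
# Posterior means of alpha[1..30] from the default-tuning ADVI run below
alpha_means <- c(5.29, 5.13, 6.71, 5.67, 5.86, 5.76, 4.54, 5.61, 6.01, 6.02,
                 6.45, 5.64, 6.15, 5.44, 5.79, 5.92, 6.22, 5.19, 5.48, 5.34,
                 5.12, 5.39, 6.09, 6.45, 6.22, 4.77, 5.70, 5.07, 5.10, 7.89)

# Empirical dispersion of the alphas vs. the reported sigma_alpha estimate
sd(alpha_means)   # about 0.66, while sigma_alpha is estimated at ~268
```

So the fitted hierarchical scale is roughly 400 times larger than the actual spread of the group-level parameters it is supposed to describe.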
On the other hand, if I run
rats_stanified_ADVI <- vb(rats_stanified_model, data = dataIndexed,
                          eta = 5, adapt_engaged = FALSE, iter = 1000000,
                          tol_rel_obj = 0.00001, grad_samples = 1)
Then the optimisation very slowly (over the course of a million iterations) reaches an ELBO of around -500, which is much better than under the automated tuning, and the parameter estimates look sensible. So this looks like non-convergence of the stochastic optimisation under the default settings.
Higher values of eta give faster convergence, when the optimisation starts at all: with eta = 10 I often get the error "Error: stan::variational::normal_meanfield::calc_grad: The number of dropped evaluations has reached its maximum amount (100). Your model may be either severely ill-conditioned or misspecified."
I have tried setting grad_samples=20 (rather than the default 1) and this does not seem to affect convergence.
Maybe the issue is that the algorithm uses plain stochastic gradient ascent? Perhaps a stochastic BFGS scheme would work better here... there may be severe degeneracies far from the ELBO mode.
---
> rats_stanified_ADVI <- vb(rats_stanified_model, data = dataIndexed)
This is Automatic Differentiation Variational Inference.
(EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)
Gradient evaluation took 0 seconds
1000 iterations under these settings should take 0 seconds.
Adjust your expectations accordingly!
Begin eta adaptation.
Iteration: 1 / 250 [ 0%] (Adaptation)
Iteration: 50 / 250 [ 20%] (Adaptation)
Iteration: 100 / 250 [ 40%] (Adaptation)
Iteration: 150 / 250 [ 60%] (Adaptation)
Iteration: 200 / 250 [ 80%] (Adaptation)
Success! Found best value [eta = 1] earlier than expected.
Begin stochastic gradient ascent.
iter ELBO delta_ELBO_mean delta_ELBO_med notes
100 -1e+006 1.000 1.000
200 -19339.5 29.376 57.751
300 -1140.4 24.903 15.959
400 -1043.6 18.701 15.959
500 -1041.6 14.961 1.000
600 -1041.7 12.468 1.000
700 -1040.7 10.687 0.093
800 -1039.9 9.351 0.093
900 -1042.2 8.312 0.002 MEDIAN ELBO CONVERGED
Drawing 1000 samples from the approximate posterior... COMPLETED.
> print(rats_stanified_ADVI)
Inference for Stan model: rats_stanified.
1 chains, each with iter=1000; warmup=0; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.
mean sd 2.5% 25% 50% 75% 97.5%
alpha[1] 5.29 12.35 -18.91 -2.88 5.50 13.32 30.97
alpha[2] 5.13 11.65 -18.21 -2.72 5.21 13.10 27.34
alpha[3] 6.71 12.25 -18.82 -1.25 6.99 15.35 29.40
alpha[4] 5.67 12.50 -18.86 -3.10 5.84 14.16 30.59
alpha[5] 5.86 10.69 -14.72 -1.52 5.86 13.49 25.66
alpha[6] 5.76 10.34 -13.66 -1.45 5.54 12.58 26.96
alpha[7] 4.54 11.47 -17.63 -3.13 5.07 12.32 26.26
alpha[8] 5.61 10.18 -13.70 -1.18 5.64 12.68 25.83
alpha[9] 6.01 11.36 -16.30 -1.34 5.94 13.41 28.14
alpha[10] 6.02 12.31 -18.78 -2.11 6.00 14.34 29.69
alpha[11] 6.45 12.54 -17.74 -2.36 6.52 14.72 30.59
alpha[12] 5.64 11.63 -17.19 -2.10 5.76 13.92 27.68
alpha[13] 6.15 9.93 -13.25 -0.16 6.44 13.13 24.75
alpha[14] 5.44 12.51 -18.39 -3.08 5.67 13.84 29.51
alpha[15] 5.79 9.21 -12.49 -0.35 5.97 11.96 23.69
alpha[16] 5.92 9.53 -12.63 -0.52 5.96 12.46 24.37
alpha[17] 6.22 9.44 -13.35 -0.32 6.33 12.44 24.77
alpha[18] 5.19 12.19 -19.80 -3.01 5.42 13.70 28.20
alpha[19] 5.48 11.83 -18.01 -2.68 4.98 13.20 29.42
alpha[20] 5.34 10.84 -15.85 -1.66 5.12 12.41 27.20
alpha[21] 5.12 10.65 -15.06 -1.84 5.20 12.35 26.27
alpha[22] 5.39 11.82 -17.38 -2.63 5.50 13.35 28.47
alpha[23] 6.09 11.62 -15.12 -2.01 5.82 13.99 29.30
alpha[24] 6.45 8.56 -10.11 0.56 6.01 12.44 23.11
alpha[25] 6.22 10.09 -12.89 -0.60 6.00 13.26 26.57
alpha[26] 4.77 10.36 -15.14 -2.36 4.70 11.74 25.53
alpha[27] 5.70 8.86 -11.63 -0.17 5.60 11.37 22.79
alpha[28] 5.07 10.55 -16.59 -1.78 4.80 12.08 25.96
alpha[29] 5.10 11.06 -16.63 -2.37 5.20 12.35 26.58
alpha[30] 7.89 13.58 -19.32 -1.44 8.46 16.88 33.43
beta[1] -87.31 12.52 -111.70 -95.78 -87.78 -78.10 -64.21
beta[2] 4.23 2.74 -1.20 2.46 4.23 6.01 9.79
beta[3] 4.45 3.33 -2.36 2.25 4.40 6.66 10.73
beta[4] 4.18 3.20 -1.99 2.08 4.23 6.36 10.48
beta[5] 4.39 3.08 -1.61 2.41 4.48 6.49 10.30
beta[6] 4.47 3.11 -1.66 2.41 4.54 6.63 10.49
beta[7] 4.24 2.63 -0.74 2.37 4.28 6.02 8.99
beta[8] 4.12 3.13 -2.01 2.07 4.11 6.17 10.39
beta[9] 4.19 2.79 -1.14 2.17 4.19 6.11 9.75
beta[10] 4.39 2.56 -0.75 2.64 4.50 6.14 9.22
beta[11] 4.37 2.85 -1.29 2.37 4.34 6.35 10.14
beta[12] 4.46 2.85 -1.07 2.46 4.45 6.31 10.19
beta[13] 4.42 2.62 -0.69 2.64 4.48 6.24 9.44
beta[14] 4.26 3.21 -2.08 2.06 4.40 6.54 10.19
beta[15] 4.09 2.81 -1.06 2.09 4.00 6.04 9.50
beta[16] 4.14 2.99 -1.51 2.01 4.21 6.11 10.17
beta[17] 4.37 3.14 -1.63 2.21 4.39 6.42 10.57
beta[18] 4.12 2.57 -0.79 2.29 4.11 5.92 9.06
beta[19] 4.23 3.15 -2.22 2.15 4.20 6.47 10.17
beta[20] 4.12 2.98 -1.77 2.15 4.12 6.05 10.20
beta[21] 4.43 2.92 -1.31 2.56 4.38 6.35 10.49
beta[22] 4.15 2.74 -1.21 2.28 4.13 5.96 9.44
beta[23] 4.14 2.68 -1.18 2.40 4.11 5.92 9.28
beta[24] 4.21 2.79 -1.18 2.41 4.10 6.04 9.82
beta[25] 4.38 2.33 -0.08 2.73 4.42 6.05 8.76
beta[26] 4.28 3.22 -2.27 2.17 4.42 6.49 10.21
beta[27] 4.41 3.05 -1.66 2.33 4.38 6.51 10.39
beta[28] 4.13 2.92 -1.55 2.16 4.09 6.20 9.83
beta[29] 4.41 2.77 -1.36 2.51 4.38 6.28 9.82
beta[30] 3.93 2.97 -1.85 2.00 4.03 5.89 9.63
mu_alpha 4.34 3.08 -1.75 2.24 4.34 6.56 9.93
mu_beta 5.37 1.81 1.79 4.14 5.33 6.56 8.90
sigma_y 4.21 0.57 3.15 3.79 4.24 4.59 5.35
sigma_alpha 267.60 14.45 241.58 257.46 267.58 277.35 296.42
sigma_beta 12.10 1.73 9.07 10.93 12.00 13.20 15.83
alpha0 2.94 0.38 2.28 2.68 2.91 3.17 3.73
lp__ -87.31 12.52 -111.70 -95.78 -87.78 -78.10 -64.21
Approximate samples were drawn using VB(meanfield) at Fri Jan 15 15:50:28 2016.
We recommend genuine 'sampling' from the posterior distribution for final inferences!
---
Results when using
rats_stanified_ADVI <- vb(rats_stanified_model, data = dataIndexed,
                          eta = 5, adapt_engaged = FALSE, iter = 1000000,
                          tol_rel_obj = 0.00001)
> print(rats_stanified_ADVI)
Inference for Stan model: rats_stanified_mod.
1 chains, each with iter=998; warmup=0; thin=1;
post-warmup draws per chain=998, total post-warmup draws=998.
mean sd 2.5% 25% 50% 75% 97.5%
alpha[1] 240.03 2.83 234.59 238.08 240.03 241.92 245.53
alpha[2] 247.77 2.76 242.14 245.89 247.78 249.67 253.21
alpha[3] 252.44 2.83 246.92 250.51 252.50 254.24 257.98
alpha[4] 232.41 2.78 227.16 230.41 232.49 234.37 237.64
alpha[5] 231.65 2.76 226.38 229.76 231.66 233.54 236.80
alpha[6] 249.77 2.88 244.24 247.75 249.78 251.77 255.23
alpha[7] 228.63 2.94 222.76 226.82 228.57 230.57 234.39
alpha[8] 248.53 2.90 242.55 246.63 248.51 250.53 253.91
alpha[9] 283.21 2.81 277.57 281.45 283.22 285.10 288.52
alpha[10] 219.16 2.84 213.33 217.35 219.12 220.90 225.10
alpha[11] 258.26 2.76 252.86 256.33 258.43 260.15 263.30
alpha[12] 228.12 2.59 223.17 226.36 228.12 229.87 233.49
alpha[13] 242.34 2.97 236.40 240.25 242.36 244.34 247.90
alpha[14] 268.32 2.85 262.42 266.55 268.29 270.32 273.93
alpha[15] 242.85 2.82 237.53 240.88 242.84 244.77 248.28
alpha[16] 245.18 2.90 239.46 243.27 245.17 247.08 251.17
alpha[17] 232.14 2.82 226.51 230.23 232.13 234.00 237.70
alpha[18] 240.38 2.73 235.12 238.52 240.28 242.18 245.85
alpha[19] 253.61 2.68 248.15 251.79 253.67 255.33 259.18
alpha[20] 241.48 2.97 235.54 239.49 241.51 243.49 247.14
alpha[21] 248.54 2.76 243.29 246.67 248.58 250.44 253.62
alpha[22] 225.10 2.90 219.53 223.15 225.13 227.11 230.60
alpha[23] 228.46 2.86 222.76 226.60 228.39 230.34 234.05
alpha[24] 245.02 2.99 239.37 242.99 244.98 247.10 250.83
alpha[25] 234.48 2.70 229.39 232.58 234.35 236.36 239.94
alpha[26] 253.98 2.71 248.58 252.24 253.98 255.73 259.25
alpha[27] 254.21 2.94 248.44 252.18 254.21 256.15 259.96
alpha[28] 243.03 2.77 237.68 241.19 242.93 244.89 248.46
alpha[29] 217.75 2.84 212.34 215.69 217.80 219.62 223.47
alpha[30] 241.57 2.81 235.95 239.69 241.58 243.42 247.06
beta[1] 106.78 3.50 99.69 104.41 106.74 109.16 113.65
beta[2] 6.08 0.25 5.58 5.90 6.08 6.25 6.55
beta[3] 7.13 0.27 6.58 6.95 7.13 7.32 7.64
beta[4] 6.52 0.24 6.04 6.34 6.51 6.68 6.98
beta[5] 5.33 0.25 4.82 5.15 5.33 5.50 5.84
beta[6] 6.58 0.27 6.06 6.39 6.59 6.76 7.12
beta[7] 6.15 0.28 5.58 5.97 6.16 6.34 6.67
beta[8] 5.97 0.25 5.48 5.80 5.97 6.14 6.47
beta[9] 6.44 0.26 5.90 6.27 6.44 6.62 6.96
beta[10] 7.12 0.26 6.61 6.95 7.12 7.28 7.64
beta[11] 5.82 0.25 5.35 5.66 5.82 5.98 6.29
beta[12] 6.83 0.26 6.33 6.64 6.82 7.00 7.34
beta[13] 6.12 0.24 5.66 5.96 6.12 6.28 6.58
beta[14] 6.16 0.26 5.66 5.99 6.16 6.33 6.65
beta[15] 6.69 0.25 6.20 6.53 6.70 6.86 7.15
beta[16] 5.41 0.24 4.94 5.25 5.41 5.57 5.86
beta[17] 5.95 0.26 5.43 5.78 5.94 6.12 6.46
beta[18] 6.28 0.26 5.77 6.10 6.27 6.45 6.80
beta[19] 5.86 0.27 5.35 5.69 5.87 6.04 6.37
beta[20] 6.39 0.26 5.91 6.21 6.39 6.57 6.89
beta[21] 6.05 0.25 5.57 5.88 6.05 6.22 6.57
beta[22] 6.43 0.26 5.94 6.26 6.43 6.60 6.93
beta[23] 5.83 0.26 5.33 5.65 5.84 6.02 6.33
beta[24] 5.78 0.26 5.26 5.61 5.78 5.96 6.30
beta[25] 5.89 0.26 5.40 5.71 5.88 6.06 6.42
beta[26] 6.90 0.25 6.42 6.75 6.90 7.07 7.39
beta[27] 6.55 0.27 6.00 6.37 6.56 6.73 7.03
beta[28] 5.89 0.25 5.41 5.72 5.89 6.05 6.41
beta[29] 5.87 0.24 5.40 5.71 5.87 6.03 6.34
beta[30] 5.65 0.23 5.19 5.49 5.65 5.81 6.08
mu_alpha 6.11 0.26 5.59 5.94 6.10 6.28 6.63
mu_beta 242.44 2.64 237.40 240.71 242.48 244.27 247.67
sigma_y 6.17 0.10 5.97 6.09 6.17 6.24 6.38
sigma_alpha 6.16 0.40 5.45 5.89 6.14 6.43 7.01
sigma_beta 14.73 1.96 11.27 13.39 14.53 15.81 19.33
alpha0 0.54 0.07 0.42 0.50 0.54 0.59 0.69
lp__ 106.78 3.50 99.69 104.41 106.74 109.16 113.65