ADVI and Rats example - very different parameter estimates


Julian King

Jan 14, 2016, 11:18:30 PM
to Stan users mailing list
Hello!

Very keen to use the ADVI functionality of Stan. I am running RStan v2.9.

I have tried to run the Rats (from BUGS) example, but the parameter estimates for the alphas are very different under NUTS and ADVI.

For instance:


rats_model <- stan_model(file = "rats.stan")
rats <- sampling(rats_model, data = data, chains = chains,
                 init = init, iter = iter)
print(rats)


Gives:
Inference for Stan model: rats.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

                 mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
alpha[1]       239.90    0.08  2.59  234.71  238.15  239.94  241.59  245.25  1000    1
alpha[2]       247.76    0.09  2.72  242.34  245.88  247.75  249.77  252.79  1000    1
alpha[3]       252.58    0.09  2.86  246.74  250.62  252.70  254.46  258.19  1000    1


But:
rats_model <- stan_model(file = "rats.stan")
rats_advi <- vb(rats_model, data = data)

Gives:
--
Inference for Stan model: rats.
1 chains, each with iter=1000; warmup=0; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=1000.

                  mean      sd     2.5%      25%      50%      75%    97.5%
alpha[1]         11.24   17.75   -24.17    -0.48    11.05    22.64    45.84
alpha[2]          8.78   15.95   -21.72    -2.18     8.36    19.63    40.85
alpha[3]         11.94   16.10   -19.12     0.76    12.32    22.52    43.54
--

I note that the sigmasq_alpha parameter is unreasonably large:
--
sigmasq_alpha 61314.73 7601.37 48000.31 55781.03 60716.20 66063.05 77905.34
--

The ADVI algorithm claims that it has converged:
------
Begin eta adaptation.
Iteration:   1 / 250 [  0%]  (Adaptation)
Iteration:  50 / 250 [ 20%]  (Adaptation)
Iteration: 100 / 250 [ 40%]  (Adaptation)
Iteration: 150 / 250 [ 60%]  (Adaptation)
Iteration: 200 / 250 [ 80%]  (Adaptation)
Success! Found best value [eta = 1.000] earlier than expected.

Begin stochastic gradient ascent.
  iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
   100   -11859.3             1.000            1.000
   200    -1167.0             5.081            9.162
   300    -1067.9             3.418            1.000
   400    -1069.6             2.564            1.000
   500    -1067.7             2.052            0.093
   600    -1068.6             1.710            0.093
   700    -1067.2             1.466            0.002   MEDIAN ELBO CONVERGED

Drawing 1000 samples from the approximate posterior... COMPLETED.
----

Any idea what's going on here? This is a pretty simple model, so I wouldn't have expected such a disparity.

The same behaviour occurs with "rats_stanified.stan" as well.

Thanks in advance!

Ben Goodrich

Jan 14, 2016, 11:34:12 PM
to Stan users mailing list
On Thursday, January 14, 2016 at 11:18:30 PM UTC-5, Julian King wrote:
Any idea what's going on here? This is a pretty simple model, so I wouldn't have expected such a disparity.
 
It is not so much the complexity of the model but whether the posterior distribution in the unconstrained space is multivariate normal (and uncorrelated in the meanfield case) and whether the globally optimal multivariate normal distribution is found. This seems not to hold for many of the BUGS examples.
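To make this concrete, here is a toy sketch (illustrative Python, nothing to do with the rats model): one classic way the mean-field family can fail, even when the optimization succeeds, is on a correlated Gaussian target, where the KL(q||p)-optimal factorized Gaussian matches the means but takes each variance from the diagonal of the precision matrix, understating the marginal uncertainty.

```python
# Toy sketch (not Stan): mean-field VI against a correlated 2-D Gaussian
# target N(0, Sigma) with Sigma = [[1, rho], [rho, 1]]. The factorized
# Gaussian q minimizing KL(q || p) takes each variance to be
# 1 / (Sigma^{-1})_{ii}, which here equals 1 - rho^2. The true marginal
# variance is 1, so the mean-field fit understates the uncertainty.
rho = 0.9
precision_ii = 1.0 / (1.0 - rho**2)   # diagonal entry of Sigma^{-1}
meanfield_var = 1.0 / precision_ii    # = 1 - rho^2, approximately 0.19
true_marginal_var = 1.0

print(meanfield_var, true_marginal_var)
```

The stronger the correlation, the worse the mismatch, regardless of how well the ELBO is optimized.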

Ben

Andrew Gelman

Jan 14, 2016, 11:39:01 PM
to stan-...@googlegroups.com
One thing you can try is different starting points.  We’ve had lots of discussions in the Stan meeting about ADVI failing on many of these simple examples, and it’s never clear whether the problem is with the variational solution or with the algorithm not converging.  Our guess is that in these settings the result is highly sensitive to initial values and tuning parameters.  We have hopes that ADVI’s tuning can be improved so it will give reasonable answers more reliably.  For now, I think it’s best to be safe and run from multiple inits.
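As a toy picture of why multiple inits help (plain Python, unrelated to the rats model or to ADVI's actual optimizer): run gradient ascent on a deliberately bimodal objective from several random starting points and compare where the runs finish. When the answers disagree, you know the result is init-sensitive.

```python
import random

# Toy illustration: gradient ascent on the bimodal objective
# f(x) = -(x^2 - 1)^2, whose maxima sit at x = -1 and x = +1.
# Different random initializations converge to different optima,
# which is exactly what running from multiple inits is meant to expose.
def grad(x):                          # f'(x) = 4 * x * (1 - x^2)
    return 4.0 * x * (1.0 - x * x)

def ascend(x0, step=0.05, iters=500):
    x = x0
    for _ in range(iters):
        x += step * grad(x)
    return x

random.seed(1)
finals = sorted(round(ascend(random.uniform(-2.0, 2.0)), 3)
                for _ in range(10))
print(set(finals))                    # with this seed: {-1.0, 1.0}
```

Each run reports "converged", but the ten runs split between two different answers; agreement across inits is the cheap sanity check.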

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Julian King

Jan 15, 2016, 12:33:11 AM
to Stan users mailing list, gel...@stat.columbia.edu
I have tried the call 20 times, with init_r=2, and in all of these cases the ADVI algorithm converges to similar parameter values, which are not those given by the MCMC run.

This actually looks like premature convergence.  The alpha values are all around 4, but the parameter sigma_alpha is around 250, which just doesn't make sense in the context of this model.  The alphas should have mean mu_alpha and standard deviation sigma_alpha, but that's clearly not true here... sigma_alpha is orders of magnitude larger than the dispersion between the alphas.  See the output below.

On the other hand, if I run

rats_stanified_ADVI <- vb(rats_stanified_model, data = dataIndexed,
                          eta = 5, adapt_engaged = FALSE, iter = 1000000,
                          tol_rel_obj = 0.00001, grad_samples = 1)

Then the model very very slowly (over the course of 1 million iterations) gets to an ELBO of around -500, which is much better than under the automated tuning.  The parameter estimates also look good here now as well.  So this looks like non-convergence of the stochastic optimisation under the default parameters.

Higher values of eta give me faster convergence when the run starts successfully (with eta = 10 I often get the error "Error: stan::variational::normal_meanfield::calc_grad: The number of dropped evaluations has reached its maximum amount (100). Your model may be either severely ill-conditioned or misspecified.").

I have tried setting grad_samples=20 (rather than the default 1) and this does not seem to affect convergence.

Maybe the issue is the fact that the algorithm uses gradient ascent?  Maybe stochastic BFGS would work better here... perhaps there are severe degeneracies when far away from the ELBO mode.


---

> rats_stanified_ADVI <- vb(rats_stanified_model, data = dataIndexed)

This is Automatic Differentiation Variational Inference.

(EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)

Gradient evaluation took 0 seconds
1000 iterations under these settings should take 0 seconds.
Adjust your expectations accordingly!

Begin eta adaptation.
Iteration:   1 / 250 [  0%]  (Adaptation)
Iteration:  50 / 250 [ 20%]  (Adaptation)
Iteration: 100 / 250 [ 40%]  (Adaptation)
Iteration: 150 / 250 [ 60%]  (Adaptation)
Iteration: 200 / 250 [ 80%]  (Adaptation)
Success! Found best value [eta = 1] earlier than expected.

Begin stochastic gradient ascent.
  iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
   100    -1e+006             1.000            1.000
   200   -19339.5            29.376           57.751
   300    -1140.4            24.903           15.959
   400    -1043.6            18.701           15.959
   500    -1041.6            14.961            1.000
   600    -1041.7            12.468            1.000
   700    -1040.7            10.687            0.093
   800    -1039.9             9.351            0.093
   900    -1042.2             8.312            0.002   MEDIAN ELBO CONVERGED

Drawing 1000 samples from the approximate posterior... COMPLETED.
> print(rats_stanified_ADVI)
Inference for Stan model: rats_stanified.
1 chains, each with iter=1000; warmup=0; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=1000.

              mean    sd    2.5%    25%    50%    75%  97.5%
alpha[1]      5.29 12.35  -18.91  -2.88   5.50  13.32  30.97
alpha[2]      5.13 11.65  -18.21  -2.72   5.21  13.10  27.34
alpha[3]      6.71 12.25  -18.82  -1.25   6.99  15.35  29.40
alpha[4]      5.67 12.50  -18.86  -3.10   5.84  14.16  30.59
alpha[5]      5.86 10.69  -14.72  -1.52   5.86  13.49  25.66
alpha[6]      5.76 10.34  -13.66  -1.45   5.54  12.58  26.96
alpha[7]      4.54 11.47  -17.63  -3.13   5.07  12.32  26.26
alpha[8]      5.61 10.18  -13.70  -1.18   5.64  12.68  25.83
alpha[9]      6.01 11.36  -16.30  -1.34   5.94  13.41  28.14
alpha[10]     6.02 12.31  -18.78  -2.11   6.00  14.34  29.69
alpha[11]     6.45 12.54  -17.74  -2.36   6.52  14.72  30.59
alpha[12]     5.64 11.63  -17.19  -2.10   5.76  13.92  27.68
alpha[13]     6.15  9.93  -13.25  -0.16   6.44  13.13  24.75
alpha[14]     5.44 12.51  -18.39  -3.08   5.67  13.84  29.51
alpha[15]     5.79  9.21  -12.49  -0.35   5.97  11.96  23.69
alpha[16]     5.92  9.53  -12.63  -0.52   5.96  12.46  24.37
alpha[17]     6.22  9.44  -13.35  -0.32   6.33  12.44  24.77
alpha[18]     5.19 12.19  -19.80  -3.01   5.42  13.70  28.20
alpha[19]     5.48 11.83  -18.01  -2.68   4.98  13.20  29.42
alpha[20]     5.34 10.84  -15.85  -1.66   5.12  12.41  27.20
alpha[21]     5.12 10.65  -15.06  -1.84   5.20  12.35  26.27
alpha[22]     5.39 11.82  -17.38  -2.63   5.50  13.35  28.47
alpha[23]     6.09 11.62  -15.12  -2.01   5.82  13.99  29.30
alpha[24]     6.45  8.56  -10.11   0.56   6.01  12.44  23.11
alpha[25]     6.22 10.09  -12.89  -0.60   6.00  13.26  26.57
alpha[26]     4.77 10.36  -15.14  -2.36   4.70  11.74  25.53
alpha[27]     5.70  8.86  -11.63  -0.17   5.60  11.37  22.79
alpha[28]     5.07 10.55  -16.59  -1.78   4.80  12.08  25.96
alpha[29]     5.10 11.06  -16.63  -2.37   5.20  12.35  26.58
alpha[30]     7.89 13.58  -19.32  -1.44   8.46  16.88  33.43
beta[1]     -87.31 12.52 -111.70 -95.78 -87.78 -78.10 -64.21
beta[2]       4.23  2.74   -1.20   2.46   4.23   6.01   9.79
beta[3]       4.45  3.33   -2.36   2.25   4.40   6.66  10.73
beta[4]       4.18  3.20   -1.99   2.08   4.23   6.36  10.48
beta[5]       4.39  3.08   -1.61   2.41   4.48   6.49  10.30
beta[6]       4.47  3.11   -1.66   2.41   4.54   6.63  10.49
beta[7]       4.24  2.63   -0.74   2.37   4.28   6.02   8.99
beta[8]       4.12  3.13   -2.01   2.07   4.11   6.17  10.39
beta[9]       4.19  2.79   -1.14   2.17   4.19   6.11   9.75
beta[10]      4.39  2.56   -0.75   2.64   4.50   6.14   9.22
beta[11]      4.37  2.85   -1.29   2.37   4.34   6.35  10.14
beta[12]      4.46  2.85   -1.07   2.46   4.45   6.31  10.19
beta[13]      4.42  2.62   -0.69   2.64   4.48   6.24   9.44
beta[14]      4.26  3.21   -2.08   2.06   4.40   6.54  10.19
beta[15]      4.09  2.81   -1.06   2.09   4.00   6.04   9.50
beta[16]      4.14  2.99   -1.51   2.01   4.21   6.11  10.17
beta[17]      4.37  3.14   -1.63   2.21   4.39   6.42  10.57
beta[18]      4.12  2.57   -0.79   2.29   4.11   5.92   9.06
beta[19]      4.23  3.15   -2.22   2.15   4.20   6.47  10.17
beta[20]      4.12  2.98   -1.77   2.15   4.12   6.05  10.20
beta[21]      4.43  2.92   -1.31   2.56   4.38   6.35  10.49
beta[22]      4.15  2.74   -1.21   2.28   4.13   5.96   9.44
beta[23]      4.14  2.68   -1.18   2.40   4.11   5.92   9.28
beta[24]      4.21  2.79   -1.18   2.41   4.10   6.04   9.82
beta[25]      4.38  2.33   -0.08   2.73   4.42   6.05   8.76
beta[26]      4.28  3.22   -2.27   2.17   4.42   6.49  10.21
beta[27]      4.41  3.05   -1.66   2.33   4.38   6.51  10.39
beta[28]      4.13  2.92   -1.55   2.16   4.09   6.20   9.83
beta[29]      4.41  2.77   -1.36   2.51   4.38   6.28   9.82
beta[30]      3.93  2.97   -1.85   2.00   4.03   5.89   9.63
mu_alpha      4.34  3.08   -1.75   2.24   4.34   6.56   9.93
mu_beta       5.37  1.81    1.79   4.14   5.33   6.56   8.90
sigma_y       4.21  0.57    3.15   3.79   4.24   4.59   5.35
sigma_alpha 267.60 14.45  241.58 257.46 267.58 277.35 296.42
sigma_beta   12.10  1.73    9.07  10.93  12.00  13.20  15.83
alpha0        2.94  0.38    2.28   2.68   2.91   3.17   3.73
lp__        -87.31 12.52 -111.70 -95.78 -87.78 -78.10 -64.21

Approximate samples were drawn using VB(meanfield) at Fri Jan 15 15:50:28 2016.
We recommend genuine 'sampling' from the posterior distribution for final inferences!

---



Results if using
rats_stanified_ADVI <- vb(rats_stanified_model, data = dataIndexed,
                          eta = 5, adapt_engaged = FALSE, iter = 1000000,
                          tol_rel_obj = 0.00001)

> print(rats_stanified_ADVI)
Inference for Stan model: rats_stanified_mod.
1 chains, each with iter=998; warmup=0; thin=1; 
post-warmup draws per chain=998, total post-warmup draws=998.

              mean   sd   2.5%    25%    50%    75%  97.5%
alpha[1]    240.03 2.83 234.59 238.08 240.03 241.92 245.53
alpha[2]    247.77 2.76 242.14 245.89 247.78 249.67 253.21
alpha[3]    252.44 2.83 246.92 250.51 252.50 254.24 257.98
alpha[4]    232.41 2.78 227.16 230.41 232.49 234.37 237.64
alpha[5]    231.65 2.76 226.38 229.76 231.66 233.54 236.80
alpha[6]    249.77 2.88 244.24 247.75 249.78 251.77 255.23
alpha[7]    228.63 2.94 222.76 226.82 228.57 230.57 234.39
alpha[8]    248.53 2.90 242.55 246.63 248.51 250.53 253.91
alpha[9]    283.21 2.81 277.57 281.45 283.22 285.10 288.52
alpha[10]   219.16 2.84 213.33 217.35 219.12 220.90 225.10
alpha[11]   258.26 2.76 252.86 256.33 258.43 260.15 263.30
alpha[12]   228.12 2.59 223.17 226.36 228.12 229.87 233.49
alpha[13]   242.34 2.97 236.40 240.25 242.36 244.34 247.90
alpha[14]   268.32 2.85 262.42 266.55 268.29 270.32 273.93
alpha[15]   242.85 2.82 237.53 240.88 242.84 244.77 248.28
alpha[16]   245.18 2.90 239.46 243.27 245.17 247.08 251.17
alpha[17]   232.14 2.82 226.51 230.23 232.13 234.00 237.70
alpha[18]   240.38 2.73 235.12 238.52 240.28 242.18 245.85
alpha[19]   253.61 2.68 248.15 251.79 253.67 255.33 259.18
alpha[20]   241.48 2.97 235.54 239.49 241.51 243.49 247.14
alpha[21]   248.54 2.76 243.29 246.67 248.58 250.44 253.62
alpha[22]   225.10 2.90 219.53 223.15 225.13 227.11 230.60
alpha[23]   228.46 2.86 222.76 226.60 228.39 230.34 234.05
alpha[24]   245.02 2.99 239.37 242.99 244.98 247.10 250.83
alpha[25]   234.48 2.70 229.39 232.58 234.35 236.36 239.94
alpha[26]   253.98 2.71 248.58 252.24 253.98 255.73 259.25
alpha[27]   254.21 2.94 248.44 252.18 254.21 256.15 259.96
alpha[28]   243.03 2.77 237.68 241.19 242.93 244.89 248.46
alpha[29]   217.75 2.84 212.34 215.69 217.80 219.62 223.47
alpha[30]   241.57 2.81 235.95 239.69 241.58 243.42 247.06
beta[1]     106.78 3.50  99.69 104.41 106.74 109.16 113.65
beta[2]       6.08 0.25   5.58   5.90   6.08   6.25   6.55
beta[3]       7.13 0.27   6.58   6.95   7.13   7.32   7.64
beta[4]       6.52 0.24   6.04   6.34   6.51   6.68   6.98
beta[5]       5.33 0.25   4.82   5.15   5.33   5.50   5.84
beta[6]       6.58 0.27   6.06   6.39   6.59   6.76   7.12
beta[7]       6.15 0.28   5.58   5.97   6.16   6.34   6.67
beta[8]       5.97 0.25   5.48   5.80   5.97   6.14   6.47
beta[9]       6.44 0.26   5.90   6.27   6.44   6.62   6.96
beta[10]      7.12 0.26   6.61   6.95   7.12   7.28   7.64
beta[11]      5.82 0.25   5.35   5.66   5.82   5.98   6.29
beta[12]      6.83 0.26   6.33   6.64   6.82   7.00   7.34
beta[13]      6.12 0.24   5.66   5.96   6.12   6.28   6.58
beta[14]      6.16 0.26   5.66   5.99   6.16   6.33   6.65
beta[15]      6.69 0.25   6.20   6.53   6.70   6.86   7.15
beta[16]      5.41 0.24   4.94   5.25   5.41   5.57   5.86
beta[17]      5.95 0.26   5.43   5.78   5.94   6.12   6.46
beta[18]      6.28 0.26   5.77   6.10   6.27   6.45   6.80
beta[19]      5.86 0.27   5.35   5.69   5.87   6.04   6.37
beta[20]      6.39 0.26   5.91   6.21   6.39   6.57   6.89
beta[21]      6.05 0.25   5.57   5.88   6.05   6.22   6.57
beta[22]      6.43 0.26   5.94   6.26   6.43   6.60   6.93
beta[23]      5.83 0.26   5.33   5.65   5.84   6.02   6.33
beta[24]      5.78 0.26   5.26   5.61   5.78   5.96   6.30
beta[25]      5.89 0.26   5.40   5.71   5.88   6.06   6.42
beta[26]      6.90 0.25   6.42   6.75   6.90   7.07   7.39
beta[27]      6.55 0.27   6.00   6.37   6.56   6.73   7.03
beta[28]      5.89 0.25   5.41   5.72   5.89   6.05   6.41
beta[29]      5.87 0.24   5.40   5.71   5.87   6.03   6.34
beta[30]      5.65 0.23   5.19   5.49   5.65   5.81   6.08
mu_alpha      6.11 0.26   5.59   5.94   6.10   6.28   6.63
mu_beta     242.44 2.64 237.40 240.71 242.48 244.27 247.67
sigma_y       6.17 0.10   5.97   6.09   6.17   6.24   6.38
sigma_alpha   6.16 0.40   5.45   5.89   6.14   6.43   7.01
sigma_beta   14.73 1.96  11.27  13.39  14.53  15.81  19.33
alpha0        0.54 0.07   0.42   0.50   0.54   0.59   0.69
lp__        106.78 3.50  99.69 104.41 106.74 109.16 113.65

Andrew Gelman

Jan 15, 2016, 12:58:04 AM
to Julian King, Stan users mailing list
Interesting, thanks for experimenting!

Dustin Tran

Jan 15, 2016, 6:21:10 AM
to stan-...@googlegroups.com
With the normal hierarchical model that Rats uses, it’d be surprising if ADVI’s posterior mean estimates are really that off. A quick diagnostic: can you try initializing ADVI with the posterior mean estimate from NUTS? ADVI should not move very much from that initialization. This would tell us whether the problem is the optimization or the limited expressiveness of the normal mean-field approximation.

Dustin

Dustin Tran

Jan 15, 2016, 6:27:42 AM
to stan-...@googlegroups.com, Julian King
Oops, just saw this (it was under a different thread than the previous message). Yes, this makes sense; I interpret it as the adaptation of eta being bad. That is, one can either force the relative tolerance for convergence to be very small, as you did, or force the step size eta to be very small. In any case this is a tell-tale sign that black box variational inference (and stochastic optimization in general) needs a better convergence diagnostic.

Dustin

Alp Kucukelbir

Jan 15, 2016, 2:32:06 PM
to Stan users mailing list, jking...@gmail.com
agreed. this definitely seems like a false 'convergence'.

probabilistic line search may help in situations like this; we're investigating this line of research.

cheers,
alp

terrance savitsky

Jan 15, 2016, 3:00:36 PM
to stan-...@googlegroups.com, jking...@gmail.com
Dustin and Alp,  What is your opinion about manually setting the "tol_rel_obj" option?  ADVI works well for my model and data, though I receive noticeably different results when setting tol_rel_obj = 0.001.  Does changing this setting make it more likely that the gradient optimization will bore more deeply into a local optimum (assuming the optimum is local) or jump between local optima?

thank you for your advice,

Terrance
--
Thank you, Terrance Savitsky

Julian King

Jan 15, 2016, 7:16:29 PM
to Stan users mailing list, jking...@gmail.com
The challenge is that to get convergence you need upwards of one million iterations.  The convergence diagnostic makes sense, but it assumes that the algorithm is making appropriate progress at each step (this sort of diagnostic is fine if your algorithm has quadratic convergence properties, i.e. Newton's method).  The problem here is that the algorithm does not make sufficient improvement at each step.

As I noted, some sort of stochastic quasi-Newton method may be needed.  Gradient ascent is a terrible method for many problems.  Convergence can be made arbitrarily bad simply by rescaling the problem (e.g. expand one axis by a factor of ten and convergence will be roughly ten times slower).  This is a very bad property for an algorithm to have.  Any sort of second-derivative information, probably regularised, would almost certainly be helpful here (even if noisy).
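The rescaling claim is easy to check on a toy problem (illustrative Python, not the ADVI optimizer): plain gradient descent on a quadratic whose axes have different curvatures. The stable step size is limited by the stiffest direction, so the iteration count to converge along the flat direction grows linearly with the conditioning.

```python
# Gradient descent on f(x) = 0.5 * (x1^2 + k * x2^2). Stretching one
# axis (raising k) forces a smaller stable step size, so convergence
# along the other axis slows down proportionally.
def iters_to_converge(k, tol=1e-6, max_iters=10**6):
    x1, x2 = 1.0, 1.0
    step = 1.0 / k                      # largest "safe" step given curvature k
    for i in range(1, max_iters + 1):
        x1 -= step * x1                 # gradient of 0.5 * x1^2 is x1
        x2 -= step * k * x2             # gradient of 0.5 * k * x2^2 is k * x2
        if abs(x1) < tol and abs(x2) < tol:
            return i
    return max_iters

fast = iters_to_converge(k=1)           # well-scaled problem
slow = iters_to_converge(k=100)         # one axis stretched 100x
print(fast, slow)                       # the stretched problem takes ~1400x longer
```

A quasi-Newton or preconditioned method would rescale the axes and remove this dependence, which is the point being made above.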

Julian King

Jan 15, 2016, 7:18:37 PM
to Stan users mailing list
RStan currently does not allow a set of initial values to be fed to the ADVI module.  You can only set init_r, which controls the scale of random initialisation.

My experimentation suggests that it's the optimisation, because if you run the optimisation with the right parameters you (eventually) get to similar parameter values to MCMC.  Ideally, ADVI should be used to produce parameter estimates which can be fed to MCMC, not the other way around ;)

Andrew Gelman

Jan 15, 2016, 7:23:32 PM
to Stan users mailing list, jking...@gmail.com
Let me repeat that I think it will make sense to use multiple starting points rather than running anything a million iterations!

Alp Kucukelbir

Jan 16, 2016, 7:15:17 AM
to stan-...@googlegroups.com, jking...@gmail.com
leveraging second order derivative information is definitely something
the community is investigating. for instance:

@article{fan2015fast,
  author  = {Fan, Kai and Wang, Ziteng and Beck, Jeff and Kwok, James and Heller, Katherine},
  journal = {arXiv preprint arXiv:1509.02866},
  title   = {Fast Second-Order Stochastic Backpropagation for Variational Inference},
  year    = {2015}
}

we could also consider something similar for ADVI.

Bob Carpenter

Jan 16, 2016, 1:33:22 PM
to stan-...@googlegroups.com
I think the point's that it might take 1M iterations
to get convergence to within epsilon using SGD. Often,
though, in my experience, you can get close enough for
stats with many many fewer iterations. So you might
find SGD converging to within 1e-3 faster than
L-BFGS but converging more slowly to within 1e-6.

Alp & Dustin: Is there really no rescaling of parameters
going on? That's going to be a killer for a lot of models,
because it's hard to get users to give you a standardized model.

In NUTS, we tune a mass matrix which determines relative
parameter scale (and optionally rotation); the big problem
is that it's a global estimate, that it has to be estimated
during warmup when we haven't gotten to where we want to go yet,
and many of the posteriors we care about have varying curvature.
So something like L-BFGS makes a lot of sense because it
estimates the curvature directly (rather than using draws to
estimate covariance).

- Bob

Bob Carpenter

Jan 16, 2016, 1:34:22 PM
to stan-...@googlegroups.com, jking...@gmail.com
We can do one better than "second-order backprop" --- we have
general second-order autodiff!

- Bob

Dustin Tran

Jan 16, 2016, 1:38:16 PM
to stan-...@googlegroups.com, jking...@gmail.com
The inverse Hessian/Fisher information, used in the second-order optimization, would be most appropriate for rescaling parameters. The current first-order optimization we use with an adaptive learning rate tries to do this as well: you can interpret the adaptive learning rate as a diagonal approximation to the inverse Fisher information. In other words, the learning rate attempts to rescale parameters but it can only do so much as it compromises for a simpler per-iteration complexity.
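A minimal sketch of that idea (toy Python, using an AdaGrad-style per-coordinate step as a stand-in for ADVI's actual adaptive rule): on an ill-conditioned quadratic, the diagonal rescaling rebalances the two curvatures, while a single fixed step sized for the stiff coordinate crawls along the flat one.

```python
import math

# Compare fixed-step gradient descent with a per-coordinate adaptive
# step (AdaGrad-style: eta / sqrt(accumulated g^2)) on
# f(x) = 0.5 * (x1^2 + 100 * x2^2). The adaptive step acts as a
# diagonal preconditioner, as described above.
def fixed_step(iters=200):
    x1, x2, step = 1.0, 1.0, 0.01       # step limited by the k=100 direction
    for _ in range(iters):
        x1, x2 = x1 - step * x1, x2 - step * 100.0 * x2
    return max(abs(x1), abs(x2))

def adagrad(iters=200, eta=0.5, eps=1e-12):
    x1, x2 = 1.0, 1.0
    s1 = s2 = 0.0                       # accumulated squared gradients
    for _ in range(iters):
        g1, g2 = x1, 100.0 * x2
        s1, s2 = s1 + g1 * g1, s2 + g2 * g2
        x1 -= eta * g1 / (math.sqrt(s1) + eps)
        x2 -= eta * g2 / (math.sqrt(s2) + eps)
    return max(abs(x1), abs(x2))

print(fixed_step(), adagrad())          # the adaptive run finishes far ahead
```

Here the gradient is noiseless and the curvature ratio constant, so the diagonal rescaling is exact; with stochastic ELBO gradients it can only "do so much", per the compromise mentioned above.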

Dustin

Bob Carpenter

Jan 16, 2016, 1:48:21 PM
to stan-...@googlegroups.com
Right --- we can calculate the inverse Hessian and even condition
it so that it's positive definite (see Michael's arXiv paper
on SoftAbs). But it's a cubic operation in the number of parameters
(so big data's not the problem per se, big parameters are).

Or you could approximate it efficiently like L-BFGS does. I have
to imagine someone's played with that kind of thing mixed with SGD.

- Bob

Andrew Gelman

Jan 16, 2016, 10:21:09 PM
to stan-...@googlegroups.com
Yes, but _my_ point is that, if it takes 1M iterations to get convergence to within epsilon, I bet you could learn a fuckload by running 4 chains for 1000 iterations each. It might be that one of these 4 is close, or something can be learned about adaptation, or whatever.

Bob Carpenter

Jan 17, 2016, 1:00:23 AM
to stan-...@googlegroups.com
Absolutely. Everyone I know that uses SGD does
it for quick answers that are close enough, not
to get very tight estimates (it's not so good at that
using only gradients and only part of the data at a time).

Running multiple inits is a good idea. That's especially
true if you lower learning rate as you go --- you can stop
prematurely because learning rate goes to zero, not
because you converged.
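A toy illustration of that failure mode (plain Python, not ADVI's actual schedule): with a step-size schedule that decays too fast, the total distance the iterate can travel is finite, so it freezes well short of the optimum and looks "converged".

```python
# Gradient descent on f(x) = 0.5 * x^2 (optimum at 0) under two
# learning-rate schedules. A summable schedule (c / t^2) stalls at a
# point away from the optimum -- the learning rate went to zero, not
# the error. A non-summable schedule (c / t) keeps shrinking the error.
def run(schedule, iters=1000):
    x = 1.0
    for t in range(1, iters + 1):
        x -= schedule(t) * x            # gradient of 0.5 * x^2 is x
    return x

stalled = run(lambda t: 0.5 / t**2)     # steps summable: stops short
steady = run(lambda t: 0.5 / t)         # steps non-summable: keeps going
print(stalled, steady)
```

The stalled run reports essentially zero change per iteration, which is indistinguishable from genuine convergence to any diagnostic that only watches the objective's deltas.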

- Bob