Could Stan have a gradient-descent optimizer?


Andrew Gelman

Aug 25, 2016, 12:57:33 PM
to stan development mailing list
Hi all. Could Stan have a gradient-descent optimizer? This came up in our GMO discussion.
A

Bob Carpenter

Aug 25, 2016, 1:27:17 PM
to stan...@googlegroups.com
I'm guessing what you really want is stochastic gradient
with auto-tuning.

If you know how to do the tuning, that won't be a problem;
if you don't, it's a research issue.

Stochastic gradient means somehow figuring out how to handle
splitting the data and the priors. I've been proposing
to have the member variables holding data in the generated C++ code
be mutable, which would mean you could swap subsets of the
data in and out from the outside programmatically without having
to reload everything. But that doesn't solve the issue of what
to do with the priors.

- Bob

> On Aug 25, 2016, at 6:57 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
>
> Hi all. Could Stan have a gradient-descent optimizer? This came up in our GMO discussion.
> A
>

Andrew Gelman

Aug 25, 2016, 4:18:36 PM
to stan development mailing list, Dustin Tran, Aki Vehtari (Aki.Vehtari@aalto.fi)
Hi, Bob, I'm cc-ing Dustin (who suggested the idea) and Aki (who was in the meeting where it came up). Dustin said that for certain well-behaved problems, such as GLMs with normal priors, gradient descent can be much faster than L-BFGS, and that this could make a difference for GMO. I don't think he was talking about stochastic gradient, actually.
A

Bob Carpenter

Aug 25, 2016, 5:09:25 PM
to stan...@googlegroups.com, Dustin Tran, Aki Vehtari (Aki.Vehtari@aalto.fi)
If that's what Dustin's saying, then he's almost certainly talking
about stochastic gradient. The speed comes because you move more
per likelihood evaluation. If you have to evaluate the whole
likelihood for each step and take a linear step in the gradient,
you can't move very far and it's expensive. If you use stochastic
gradient, the evaluations are much faster, so on net, convergence
is much faster. It will take longer to get to many decimal places
of accuracy compared to L-BFGS, but we don't usually care about that.

I built SGD for L1-regularized multi-logit regression when I
worked at Alias-i; it's the industry standard classifier, so
to speak (well, was until deep belief nets came into vogue).
But you had to tune learning rate, batch size, and schedule for
lowering the learning rate pretty carefully. Much of the literature
is very misleading because they show you times for optimal tuning
parameters, which are great for SGD. But I know there's been
work on auto-tuning it.
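For concreteness, here is a minimal NumPy sketch (not Stan code, and the toy least-squares problem and names are illustrative) of the plain minibatch SGD described above, showing exactly the knobs that need hand-tuning: the initial learning rate, the batch size, and the decay schedule.

```python
import numpy as np

def sgd_least_squares(X, y, lr0=0.1, batch_size=32, decay=0.01,
                      epochs=100, seed=0):
    """Plain minibatch SGD for least squares.  lr0, batch_size, and decay
    are the tuning knobs that typically need careful hand-adjustment."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    t = 0
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            lr = lr0 / (1.0 + decay * t)  # 1/t-style decay schedule
            grad = X[idx].T @ (X[idx] @ beta - y[idx]) / len(idx)
            beta -= lr * grad
            t += 1
    return beta

# Toy problem: recover known coefficients from noisy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=500)
beta_hat = sgd_least_squares(X, y)
```

With poorly chosen `lr0` or `decay`, the same loop diverges or stalls, which is the tuning fragility described above.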

- Bob

Avraham Adler

Aug 26, 2016, 9:44:38 AM
to stan development mailing list, dus...@cs.columbia.edu, Aki.V...@aalto.fi
On Thursday, August 25, 2016 at 5:09:25 PM UTC-4, Bob Carpenter wrote:
> If that's what Dustin's saying, then he's almost certainly talking
> about stochastic gradient. [...] But I know there's been
> work on auto-tuning it.
>
> - Bob


This page has a decent overview of the various types of gradient descent out there, including self-adaptive types: http://sebastianruder.com/optimizing-gradient-descent/. As a non-programmer, I tend to use NLOPT(r)'s version of L-BFGS (when I have gradients) or subplex when I don't. NLOPT is released under the LGPL or lighter, so if y'all want, you may be able to fold it in as-is if it plays nicely with Stan.

Avi

Bob Carpenter

Aug 26, 2016, 11:17:19 AM
to stan...@googlegroups.com
Thanks for the link. There are a lot more of these
algorithms out there and the main problem I have with
them all is that they're usually only tested on one or
two kinds of problems in the context of an application
or NIPS paper.

Maybe Dustin knows what we should be using for robustness
for the kinds of problems we have in Stan where not all
parameters are necessarily on the same scale and where
we get varying curvature over the posterior due to hierarchical
and other complex models.

We have an auto-diff based L-BFGS in Stan, which is
our default optimizer for (penalized) MLE.

- Bob

P.S. Early stopping is just an ad hoc approach to regularization.

P.P.S. You also need special algorithms for L1 regularization if you
want to really get zero estimates for coefficients (like in
Vowpal Wabbit or LingPipe).

Joshua N Pritikin

Aug 26, 2016, 11:30:28 AM
to stan...@googlegroups.com
On Fri, Aug 26, 2016 at 05:16:10PM +0200, Bob Carpenter wrote:
> Thanks for the link. There are a lot more of these
> algorithms out there and the main problem I have with
> them all is that they're usually only tested on one or
> two kinds of problems in the context of an application
> or NIPS paper.

I agree. I've had to make modifications to get NLOPT's BFGS optimizer to
work with non-linear constraints. I submitted my changes upstream months
ago, but I don't think they've been merged yet. My impression is that new
development moves at a snail's pace.

> > As a non-programmer, I tend to use NLOPT(r)'s version of L-BFGS
> > (when I have gradients) or subplex when I don't. NLOPT is released
> > under L-GPL or lighter, so if y'all want, you may be able to fold it
> > in as is if it plays nicely with Stan.

--
Joshua N. Pritikin, Ph.D.
Virginia Institute for Psychiatric and Behavioral Genetics
Virginia Commonwealth University
PO Box 980126
800 E Leigh St, Biotech One, Suite 1-133
Richmond, VA 23219
http://people.virginia.edu/~jnp3bc

Dustin Tran

Aug 26, 2016, 1:56:05 PM
to stan...@googlegroups.com
I meant (non-stochastic) gradient descent. In general, somewhere along the massive Stan wishlist, it would be nice for Stan's optimizers to separate the implementation of the iterative update from the implementation of the learning rate that scales it. That would let Stan's optimize(), ADVI, and any future optimization method benefit from the same internals.

For example, I don't recall the specific model/dataset, but I remember there was a case where ADVI in fact converged faster than Stan's optimize(). This may be because ADVI uses gradient descent and does not use a pseudo-second-order update as L-BFGS does.

> Maybe Dustin knows what we should be using for robustness
> for the kinds of problems we have in Stan where not all
> parameters are necessarily on the same scale and where
> we get varying curvature over the posterior due to hierarchical
> and other complex models.

This is partly what Alp and I tried to address with the auto-tuning of the step-size in ADVI, although I don’t think it succeeded too much in terms of robustness for all the kinds of problems we have in Stan. The adaptive learning rate that’s used can be viewed as a diagonal matrix approximation to the inverse Fisher information, which is the optimal schedule but expensive. This diagonal learning rate is in-between the scalar learning rate that gradient descent classically uses and the low rank learning rate that LBFGS uses.
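The diagonal adaptive learning rate described above can be illustrated with an AdaGrad-style rule (a sketch in plain NumPy, not ADVI's actual implementation): each coordinate gets its own step size from its accumulated squared gradients, which makes progress roughly invariant to per-parameter scale.

```python
import numpy as np

def adagrad_descent(grad_fn, x0, lr=0.5, eps=1e-8, steps=500):
    """Gradient descent with a diagonal adaptive learning rate
    (AdaGrad-style): each coordinate's step is scaled by the root of its
    accumulated squared gradients, a cheap diagonal stand-in for the
    inverse Fisher information."""
    x = np.array(x0, dtype=float)
    g2 = np.zeros_like(x)  # running sum of squared gradients per coordinate
    for _ in range(steps):
        g = grad_fn(x)
        g2 += g * g
        x -= lr * g / (np.sqrt(g2) + eps)  # per-coordinate step size
    return x

# Badly scaled quadratic: the two coordinates differ in curvature by 100x,
# but the diagonal scaling makes progress nearly identical on both.
scales = np.array([1.0, 100.0])
grad = lambda x: scales * x  # gradient of 0.5 * sum(scales * x**2)
x_opt = adagrad_descent(grad, [5.0, 5.0])
```

A single scalar learning rate would have to be small enough for the stiff coordinate, and would then crawl on the flat one; the diagonal rate sidesteps that trade-off.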

Dustin

Andrew Gelman

Aug 26, 2016, 8:56:29 PM
to stan...@googlegroups.com
Once we've written the GMO paper (which now uses various Python code for its examples because the Stan optimizer is too slow), we can see whether the relevant optimizers can be put into Stan. I think it will be important to have a fast, practical GMO implementation in Stan, and I think the details of what is needed will be clearer once we have GMO all written up.
A

Bob Carpenter

Aug 26, 2016, 9:21:17 PM
to stan...@googlegroups.com
OK. I'm not quite sure what needs to be separated here,
but the optimizer code's relatively simple and should be
easy to refactor. And non-stochastic gradient descent's trivial
without changing anything in Stan, so we can certainly do that.
And it sounds like there's already an example in ADVI. I
do think it would be good to abstract out the optimizer
from the rest of ADVI. Do you have a general optimizer
interface in mind?

But I'm probably missing something here because this sounds
too easy. We can wait for a paper and example Python code.

- Bob

Andrew Gelman

Aug 26, 2016, 9:58:41 PM
to stan...@googlegroups.com
I think it probably is easy, it just wasn't already in Stan. But we can see what Dustin says.

Marcus Brubaker

Aug 28, 2016, 1:30:01 PM
to stan...@googlegroups.com
If someone really wanted to add a batch gradient descent optimizer, it would be trivial in Stan right now: simply write a dummy version of a QNUpdate (see the lbfgs_update and bfgs_update classes for examples) that doesn't keep any history around and doesn't adjust the gradient direction at all.  All the other line-search machinery stays the same.  If you want to do the line search differently, that's also possible, as that part of the code is also encapsulated.
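The dummy-QNUpdate idea can be sketched outside Stan's C++; the following Python sketch (hypothetical names, not Stan's actual classes) shows the same separation: the update object only picks a search direction, while a shared Armijo backtracking line search picks the step length.

```python
import numpy as np

class DummyUpdate:
    """Degenerate quasi-Newton 'update': keeps no history and leaves the
    search direction as the plain negative gradient.  (A Python analogue
    of the dummy QNUpdate idea; the real lbfgs_update/bfgs_update classes
    live in Stan's C++.)"""
    def direction(self, grad):
        return -grad

def minimize_with_linesearch(f, grad_f, x0, update, step0=1.0, shrink=0.5,
                             c=1e-4, tol=1e-8, max_iter=10000):
    """Generic optimizer loop: the update object picks the direction,
    while the shared Armijo backtracking line search picks the step."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = update.direction(g)
        t = step0
        # Backtrack until the sufficient-decrease (Armijo) condition holds.
        while f(x + t * d) > f(x) + c * t * (g @ d):
            t *= shrink
        x = x + t * d
    return x

# Smooth convex test objective 0.5*|x|^2 + 0.25*x0^4, minimized at 0.
f = lambda x: 0.5 * x @ x + 0.25 * x[0] ** 4
grad_f = lambda x: x + np.array([x[0] ** 3, 0.0])
x_min = minimize_with_linesearch(f, grad_f, [2.0, -3.0], DummyUpdate())
```

Swapping in an L-BFGS-style update class would change only `direction()`; the line-search code is untouched, which is the encapsulation being described.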

That said, I really am hard pressed to see an advantage to adding a simple (batch) gradient descent optimizer.  It should basically always be beaten by L-BFGS, assuming you're using the same convergence criteria.  If there are models where this isn't the case, I'd love to hear about them, but most likely the default convergence thresholds need to be adjusted or other tweaks made (those values may not have been set optimally).  I think adding a whole other optimizer isn't worth the added testing and maintenance overhead.

Cheers,
Marcus

Andrew Gelman

Aug 28, 2016, 5:57:58 PM
to stan development mailing list, Marcus Brubaker, Dustin Tran
Hi, I'm cc-ing Dustin because he seemed pretty sure that BFGS was too slow.  Dustin was sure enough about this that he actually bypassed Stan in our GMO experiments because he felt the optimization was going too slowly.
A

Bob Carpenter

Aug 28, 2016, 6:16:44 PM
to stan...@googlegroups.com, Marcus Brubaker, Dustin Tran
Unless there's a problem with the implementation, that's always
been the word on the street with gradient descent vs. L-BFGS.
Dustin --- was that BFGS or L-BFGS? And it's straight-up
gradient descent we're talking about, right, not some
mini-batch stochastic version? I don't have a lot of experience
with anything but stochastic gradient, so this is all just
hearsay from me.

- Bob

Marcus Brubaker

Aug 29, 2016, 11:03:24 AM
to stan...@googlegroups.com, Dustin Tran
This has always been my experience in practice as well, i.e., that (L-)BFGS beats gradient descent, often quite badly.  But that comes with the caveat that you're making an apples-to-apples comparison by using the same definitions and tolerances to assess convergence.

Gradient descent can sometimes seem to terminate faster than L-BFGS, but that is generally because it has hit a plateau where the gradients nearly vanish, or a highly curved valley where it can no longer make good progress.  It can look like it has converged, but it has really failed to find a true minimum.
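The curved-valley failure mode is easy to reproduce on a classic test function; here is a small SciPy sketch (a toy comparison constructed for illustration, not from the thread) pitting L-BFGS against fixed-step gradient descent on the Rosenbrock function.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# L-BFGS handles the curved valley of the Rosenbrock function well.
res = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B')

# Plain fixed-step gradient descent crawls along the same valley.
x = x0.copy()
for _ in range(2000):
    x -= 5e-4 * rosen_der(x)

# After 2000 fixed-size steps, gradient descent is typically still short
# of the minimum at (1, 1) that L-BFGS reaches.
```

The fixed step has to be small enough to stay stable on the steep valley walls, which is exactly why progress along the valley floor is so slow.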

Anyway, if there are cases where it looks like L-BFGS in Stan isn't performing as well as it should I'd be happy to take a quick look at them.  It may be the case that we need to look at adjusting the convergence thresholds.

Cheers,
Marcus

Avraham Adler

Aug 29, 2016, 12:28:14 PM
to stan development mailing list, dus...@cs.columbia.edu
On Monday, August 29, 2016 at 11:03:24 AM UTC-4, Marcus Brubaker wrote:
> This has always been my experience in practice as well, i.e., that (L-)BFGS beats gradient descent, often quite badly.  But that comes with the caveat that you're making an apples-to-apples comparison by using the same definitions and tolerances to assess convergence.
>

My experience as well, FWIW.

If I could make a couple of suggestions, the first would be that Stan may be better served by adding an optimizer that does NOT need gradients or Hessians, which would allow optimization of truly psychopathic objective functions.

On the other hand, if every distribution allowed in Stan is required to have calculable gradients and Hessians, perhaps a trust-region method would make sense: it could take advantage of the Hessian, and in my very limited experience that helps with convergence speed as well.

Thanks,

Avi

Andrew Gelman

Aug 29, 2016, 4:58:48 PM
to stan development mailing list, Marcus Brubaker, Dustin Tran
Ok, I think Dustin has some examples so it looks like we can move forward on this!
A

Marcus Brubaker

Aug 30, 2016, 2:42:50 PM
to stan development mailing list, Marcus Brubaker, Dustin Tran
Stan has (very old) support for a Newton optimizer which uses Hessian information but it is very slow because the Hessian information is computed with finite differences.  Once second-order gradient information is available (specifically Hessian-vector products), adding an implementation of something like Truncated Newton would be a great idea.

Since the Stan language and function library has been strongly designed to support HMC, almost everything you can implement should be differentiable, so I don't see a good reason to implement a non-gradient based optimizer.  That said, implementing some form of global optimizer might be useful to complement the local optimizers.

If Dustin can send along some concrete examples, I'd love to see them.

Cheers,
Marcus

Dustin Tran

Aug 30, 2016, 3:13:48 PM
to stan...@googlegroups.com, Marcus Brubaker
Hi all,

Yes I have a very specific example in mind where we want to estimate fixed effects (regression parameters) and variance components in a linear mixed effects model. There is a closed-form solution for the fixed effects—conditional on the variance components—that is equivalent to an update from Newton’s method. This is what lme4 uses under the hood and is the main bottleneck in the computation (O(np^2) complexity for an n x p design matrix).
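The conditional fixed-effects solve described above is ordinary generalized least squares; a NumPy sketch (a hypothetical helper, assuming a dense marginal covariance V rather than lme4's sparse machinery):

```python
import numpy as np

def fixed_effects_gls(X, y, V):
    """Closed-form fixed effects conditional on the variance components:
    beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y.  Forming X' V^{-1} X is the
    O(n p^2) bottleneck for an n x p design matrix (solving against a
    dense V costs more still)."""
    Vinv_X = np.linalg.solve(V, X)  # V^{-1} X
    Vinv_y = np.linalg.solve(V, y)  # V^{-1} y
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)

# Sanity check: with V = I this reduces to ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
beta_true = np.array([2.0, -1.0, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)
beta_hat = fixed_effects_gls(X, y, np.eye(200))
```

For large n and p, a gradient step costs O(np) per iteration, which is why gradient descent can win over repeatedly forming the O(np^2) closed-form solve.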

I implemented this, L-BFGS, and gradient descent (with an adaptive diagonal learning rate) in Python. You can vary settings of n and p to show that gradient descent outperforms the other two due to the computational complexity of their iterative updates.

In general, I think all gradient descent-based optimizers share the following pattern:

> In a gradient-based optimization, we iteratively update values of $\phi$ using the gradient
> $\nabla_\phi\log p(\phi|y)$ evaluated at the previous iterate; that is, at iteration $t$ for $t=1,2,\ldots$,
> \begin{equation}
> \phi_{t+1} = \phi_t + \rho_t \nabla_\phi \log p(\phi_t | y).
> \end{equation}
> The value $\rho_t$ is a sequence of scalars or a sequence of matrices of diagonal, low rank, or even full rank structure.  This dimensionality choice serves as a tradeoff between the statistical efficiency of the iterative update and the computational efficiency of performing the update.  The dimensionality choice underpins different numerical algorithms: in classical Fisher scoring and Newton's method, $\rho_t$ is a full rank matrix constructed from the inverse Fisher information~\citep{osborne1992fisher}; in quasi-Newton methods such as (L)BFGS, the matrix has low rank~\citep{nocedal2006numerical}; in gradient descent, $\rho_t$ is traditionally scalar-valued, or diagonal in the case of recent adaptive learning rates~\citep{duchi2011adaptive}.

(excerpt from the GMO draft)
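The pattern in the excerpt can be made concrete with a small NumPy sketch (an illustration, not GMO or Stan code): the iterative update is fixed, while rho_t plugs in as a scalar or diagonal schedule. On a diagonal quadratic, the Fisher-like diagonal rate converges immediately where the scalar rate needs many steps.

```python
import numpy as np

def gradient_ascent(grad_log_p, phi0, rho, steps=500):
    """The shared pattern from the excerpt: phi_{t+1} = phi_t + rho_t * grad.
    `rho(t, g)` returns a scalar or a diagonal (vector) learning rate, so
    different schedules plug into the same iterative update."""
    phi = np.array(phi0, dtype=float)
    for t in range(1, steps + 1):
        g = grad_log_p(phi)
        phi = phi + rho(t, g) * g
    return phi

# log p(phi | y) = -0.5 * phi' A phi (up to a constant), with mode at 0.
A = np.diag([1.0, 50.0])
grad_log_p = lambda phi: -A @ phi

scalar_rho = lambda t, g: 0.02                # classical fixed scalar rate
diagonal_rho = lambda t, g: 1.0 / np.diag(A)  # diagonal, Fisher-like scaling

mode_scalar = gradient_ascent(grad_log_p, [4.0, 4.0], scalar_rho)
mode_diag = gradient_ascent(grad_log_p, [4.0, 4.0], diagonal_rho, steps=5)
```

A full-rank or low-rank rho_t (Newton, (L-)BFGS) would slot into the same `rho` hook, which is the refactoring argument being made.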

This is why I'm in favor of an optimize refactor that separately implements the iterative update and the computation of the step-size schedule. We can then use this for stan-optimize, ADVI, and later on for other optimization algorithms such as GMO, and for SGD once we solve the data-subsampling conundrum.

> If I could make a couple of suggestions, the first would be that Stan may be better served by adding an optimizer that does NOT need gradients or Hessians, which would allow optimization of truly psychopathic objective functions.

I don't know enough about gradient-free optimizers to tell whether there's a common pattern that also covers that case. Nelder-Mead could probably fit the above pattern, but probably not Bayesian optimization or random-walk methods.

Dustin

Marcus Brubaker

Aug 30, 2016, 4:57:50 PM
to stan...@googlegroups.com
Outperforms in what sense?  Wall-clock time?  Number of iterations/gradients?  And to reach a specific value of the objective, or proximity to the solution, or some other measure of convergence?  The L-BFGS update should be quite fast for small histories (by default most implementations use 5-10, which is usually more than enough), but that is implementation-dependent.

In general, the assumption is that for most problems the objective and gradient evaluations will take much longer than computing the step.  Of course, for sufficiently simple or highly optimized models this may not always be true.

Maybe we should find a time to chat about this.  It might be easier and more efficient than a back and forth via email.

Can you send your python code and/or a Stan model and data file for this?

Regardless, as I mentioned elsewhere the code is already more or less factorized the way you're describing.

Cheers,
Marcus

Dustin Tran

Aug 30, 2016, 5:25:37 PM
to stan...@googlegroups.com
Outperforms in wall-clock time to reach a specific value of the objective (the marginal mode). 

> Regardless, as I mentioned elsewhere the code is already more or less factorized the way you're describing.

Great, we’re on the same page. :) I won’t have time to dig into the implementation of gradient descent in Stan for a while. But when I do we should definitely chat before I touch anything.

Andrew, Aki and I are working on the GMO paper. When we're done, we'll have a better understanding of what we need to implement GMO in Stan. The gradient descent work is specifically to beat lme4's optimizer for large enough n and p; it may not end up in Stan anyway, because some of it is model-specific (for example, caching intensive computations like XX^T).

Dustin

Bob Carpenter

Aug 30, 2016, 5:56:20 PM
to stan...@googlegroups.com
In experiments I did ages ago (Feb 2014) at Andrew's behest,
Stan did outperform glm() for large enough problems and even
outperformed lm() when the model was coded directly as follows
and called with init=0 and all tolerances at 1e-8:

data {
  int<lower=0> N;
  int<lower=1> K;
  vector[N] y;
  matrix[N,K] x;
}
parameters {
  vector[K] beta;
}
transformed parameters {
  real<lower=0> squared_error;
  squared_error <- dot_self(y - x * beta);
}
model {
  increment_log_prob(-squared_error);
}
generated quantities {
  real<lower=0> sigma_squared;
  sigma_squared <- squared_error / N;
}

You could use (N - 1) if you want the usual estimate
of sigma^2.

This should only get faster as our matrix operations like
dot_self are further optimized.

I don't seem to have saved any of the output of the runs
and can't find it in email. You had to get pretty big
problems that took a couple minutes to run before Stan
was faster.

For logistic regression, I just used:

data {
  int<lower=1> K;              // number of predictors (including intercept)
  int<lower=0> N;              // number of observations
  matrix[N,K] x;               // predictors
  int<lower=0,upper=1> y[N];   // outcomes
}
parameters {
  vector[K] beta;              // coeffs
}
model {
  y ~ bernoulli_logit(x * beta);
}

- Bob