variational approximation arithmetic operations


Bob Carpenter

Nov 19, 2015, 12:30:25 AM
to stan...@googlegroups.com
The full rank (dense covariance) and mean field (diagonal covariance)
approximations store these parameters for a multivariate normal:

* full rank: mean vector, Cholesky factor

* mean field: mean vector, log std dev vector

But the operations are all carried out elementwise on the
stored parameters:

square(), sqrt(), operator+, operator+=, operator/, operator/=, operator*
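
For illustration, here's a hedged sketch of the kind of storage
I mean (the class and member names are made up, not the actual
Stan classes); every operation just maps over the stored vectors:

  #include <Eigen/Dense>

  // hypothetical mean field container: two vectors, all ops elementwise
  struct meanfield_params {
    Eigen::VectorXd mu;     // mean vector
    Eigen::VectorXd omega;  // log std dev vector

    meanfield_params operator+(const meanfield_params& rhs) const {
      return {mu + rhs.mu, omega + rhs.omega};
    }

    // square() hits the raw stored values, whatever scale they're on
    meanfield_params square() const {
      return {mu.array().square().matrix(),
              omega.array().square().matrix()};
    }
  };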

Somehow I would have expected that with a diagonal covariance,
the full rank and mean field operations would work out to the
same thing. That is, operator* applied to a quantity stored as
a log would add the log of the value rather than multiply the
stored value.

So I guess the question is, what am I missing here?

- Bob

Alp Kucukelbir

Nov 19, 2015, 8:26:27 AM
to stan development mailing list
hi bob,

could you elaborate? what do you mean by "working with logs"?

we also use these objects to store running norms of previous gradients and such. manipulating these objects is where we mostly use these operators. it makes sense (to us) for everything to be elementwise.

cheers
alp

Bob Carpenter

Nov 19, 2015, 11:47:26 AM
to stan...@googlegroups.com
It's not the elementwise nature of it; it's that the mean field
algorithm stores its scale on a log scale, whereas the full rank
one stores a Cholesky factor of the covariance matrix, yet both
do the same updates.

If I have two log quantities, say log(sigma1) and log(sigma2),
and I want to add the underlying values, I need log_sum_exp, i.e.,

log_sum_exp(log(sigma1), log(sigma2)) = log(sigma1 + sigma2)

That is, you add them on the linear scale and then take the log.
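
In code, the numerically stable version of that identity is the
standard max trick:

  #include <algorithm>
  #include <cmath>

  double log_sum_exp(double a, double b) {
    double m = std::max(a, b);
    return m + std::log(std::exp(a - m) + std::exp(b - m));
  }

  // log_sum_exp(std::log(s1), std::log(s2)) == std::log(s1 + s2)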

So I wouldn't expect the same algorithm to work with elementwise
operations as defined in the full rank and mean field algorithms.

I also don't understand how square() or sqrt() is supposed to
work with quantities that can be negative, like mean vectors.

- Bob

Alp Kucukelbir

Nov 19, 2015, 1:39:29 PM
to stan development mailing list
hi bob,

re: the log. that's fine because we take gradients in the log space too. so no need for the log_sum_exp trick here. we just march around the log space and account for the jacobian in the rest of the ADVI algorithm.

re: the operations. we only use the square/sqrt/multiply operations for the adaptive stepsize sequence. these calculations are on the norm of previous gradients. we store both the gradient and these norms using instantiations of the variational family.

what we actually need is just some sort of container class that has the same "shape" as our variational family. (i.e. 2 vectors for mean-field, 1 vector + 1 L matrix for full-rank).
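
to give the flavor of it, the stepsize update is roughly like this
(a hedged sketch, not our actual code; the names and constants are
illustrative):

  #include <Eigen/Dense>
  #include <cmath>

  // rmsprop-style stepsize: the running squared-gradient norm s has
  // the same shape as the gradient, and square / sqrt / divide are
  // all elementwise
  Eigen::VectorXd adaptive_step(const Eigen::VectorXd& grad,
                                Eigen::VectorXd& s,  // running norm
                                int k,               // iteration
                                double eta, double alpha, double tau) {
    s = alpha * grad.array().square().matrix() + (1.0 - alpha) * s;
    return (eta / std::sqrt(static_cast<double>(k)))
           * (grad.array() / (tau + s.array().sqrt())).matrix();
  }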

happy to explain this in person, if that would be better.

cheers
alp

Bob Carpenter

Nov 19, 2015, 2:48:26 PM
to stan...@googlegroups.com
> On Nov 19, 2015, at 1:39 PM, Alp Kucukelbir <akucu...@gmail.com> wrote:
>
> hi bob,
>
> re: the log. that's fine because we take gradients in the log space too. so no need for the log_sum_exp trick here. we just march around the log space and account for the jacobian in the rest of the ADVI algorithm.

OK. I'd still expect different algorithm behavior because
the mean field version works in

(mean, log(sd)),

whereas full rank works in

(mean, cholesky(Sigma)).

The Cholesky entries are on the linear sd scale
whereas the mean field ones are on the log sd scale.
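
Concretely, suppose an update adds 0.1 elementwise to the scale
parameters. On the mean field side that takes log(sigma) to
log(sigma) + 0.1, which multiplies sigma by exp(0.1) =~ 1.105; on
the full rank side it takes a diagonal Cholesky entry sigma to
sigma + 0.1. The same elementwise update moves the two
approximations differently.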

Also, the diagonal of the Cholesky is constrained to be
positive. Should it be log transformed so everything's
unconstrained, or is the algorithm OK with violating constraints?

> re: the operations. we only use the square/sqrt/multiply operations for the adaptive stepsize sequence. these calculations are on the norm of previous gradients. we store both the gradient and these norms using instantiations of the variational family.

OK, I was confused by the name of the class and the basic
doc. It needs to make this usage clear, which will be easy
to add.

It might make sense to pull out some base classes with implementations.
We can do that without requiring virtual functions. Ideally,
each subclass would add subclass-specific behavior. It helps to
be able to look at a type/class for a variable and see which
operations apply --- if you overload a class with two intended uses
it's harder to understand.
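
Something like the CRTP would do it; here's a hedged sketch with
made-up names, not a concrete proposal:

  #include <Eigen/Dense>

  template <typename Derived>
  struct base_approx {
    // shared elementwise update, implemented once, with no virtual
    // dispatch: the derived class just exposes its parameter storage
    Derived& operator+=(const Derived& rhs) {
      Derived& self = static_cast<Derived&>(*this);
      self.params() += rhs.params();
      return self;
    }
  };

  struct meanfield_approx : base_approx<meanfield_approx> {
    Eigen::VectorXd theta;  // stacked (mean, log sd), illustrative
    Eigen::VectorXd& params() { return theta; }
    const Eigen::VectorXd& params() const { return theta; }
  };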

> what we actually need is just some sort of container class that has the same "shape" as our variational family. (i.e. 2 vectors for mean-field, 1 vector + 1 L matrix for full-rank).

Is the gradient also a multivariate normal in any
useful sense? (I'm always amazed at its properties.)

> happy to explain this in person, if that would be better.

I think I get it now. I'll think about how it might be reorganized
or redocumented. That's something I can do without getting in your way.
And it doesn't need to be done right now.

- Bob

Alp Kucukelbir

Nov 19, 2015, 4:50:53 PM
to stan development mailing list
> The Cholesky entries are on the linear sd scale
> whereas the mean field ones are on the log sd scale.
>
> Also, the diagonal of the Cholesky is constrained to be
> positive. Should it be log transformed so everything's
> unconstrained or is the algorithm OK with violating constraints?

This is OK. We use the non-unique Cholesky factorization in our algorithm. The diagonals need not be positive.

> OK, I was confused by the name of the class and the basic
> doc. It needs to make this usage clear, which will be easy
> to add.

I'd appreciate any and all help on this.

> It might make sense to pull out some base classes with implementations.
> We can do that without requiring virtual functions. Ideally,
> each subclass would add subclass-specific behavior. It helps to
> be able to look at a type/class for a variable and see which
> operations apply --- if you overload a class with two intended uses
> it's harder to understand.

Agreed. Again, point me in the right direction and I'll take it from there.



> Is the gradient also a multivariate normal in any
> useful sense? (I'm always amazed at its properties.)

Not to my knowledge, no.

Bob Carpenter

Nov 19, 2015, 5:21:26 PM
to stan...@googlegroups.com

> On Nov 19, 2015, at 4:50 PM, Alp Kucukelbir <akucu...@gmail.com> wrote:
>
>> The Cholesky entries are on the linear sd scale
>> whereas the mean field ones are on the log sd scale.
>>
>> Also, the diagonal of the Cholesky is constrained to be
>> positive. Should it be log transformed so everything's
>> unconstrained or is the algorithm OK with violating constraints?
>
> This is OK. We use the non-unique Cholesky factorization in our algorithm. The diagonals need not be positive.

I'm in over my head on linear algebra here.
How do you make sure that L * L' is positive definite?

>> OK, I was confused by the name of the class and the basic
>> doc. It needs to make this usage clear, which will be easy
>> to add.
>
> I'd appreciate any and all help on this.

As I just wrote in an issue comment, I'll take a pass when
the dust settles and clean up as much as I can.

>> It might make sense to pull out some base classes with implementations.
>> We can do that without requiring virtual functions. Ideally,
>> each subclass would add subclass-specific behavior. It helps to
>> be able to look at a type/class for a variable and see which
>> operations apply --- if you overload a class with two intended uses
>> it's harder to understand.
>
> Agreed. Again, point me in the right direction and I'll take it from there.

You should focus on the algorithms. I can do this kind of
stuff.

>> Is the gradient also a multivariate normal in any
>> useful sense? (I'm always amazed at its properties.)
>
> Not to my knowledge, no.

OK, then some basic naming needs to change. No big deal. It
won't change functionality. And again, I can do it later.

- Bob

Alp Kucukelbir

Nov 20, 2015, 11:28:26 AM
to stan development mailing list
> I'm in over my head on linear algebra here.
> How do you make sure that L * L' is positive definite?

Positive semi-definite is enough. In any case, it doesn't bother us: because the gradients are noisy, we never drive a diagonal term down to 0 (with probability 1). And ADVI seeks a local optimum, so uniqueness is not a problem (a diagonal term can be -x or x).
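
(the reason L * L' can't fail here is the usual quadratic form
argument: for any vector x,

  x' (L L') x = (L' x)' (L' x) = ||L' x||^2 >= 0

so L * L' is positive semi-definite whatever the signs of the
entries of L.)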

Bob Carpenter

Nov 20, 2015, 2:23:27 PM
to stan...@googlegroups.com
I think we need to take this up in a higher bandwidth channel,
because I still don't think I've been clear enough.

- Bob

Alp Kucukelbir

Nov 20, 2015, 3:11:27 PM
to stan...@googlegroups.com
sounds good. we can chat in person :)