could you elaborate? what do you mean by "working with logs"?
we also use these objects to store running norms of previous gradients and such. manipulating these objects is where we mostly use these operators. it makes sense (to us) for everything to be elementwise.
cheers
alp
re: the log. that's fine because we take gradients in the log space too, so there's no need for the log_sum_exp trick here. we just move around in the log space and account for the jacobian in the rest of the ADVI algorithm.
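A minimal sketch of what taking gradients "in the log space" can look like, assuming a scale parameter written as sigma = exp(omega). The toy objective `f` and all function names here are hypothetical and are not taken from the ADVI code; the sketch only shows the chain-rule factor exp(omega) that appears when differentiating with respect to omega.

```python
import math


def f(sigma):
    # Hypothetical toy objective, maximized at sigma = 1.
    return -0.5 * sigma ** 2 + math.log(sigma)


def df_dsigma(sigma):
    # Derivative of the toy objective with respect to sigma.
    return -sigma + 1.0 / sigma


def grad_wrt_omega(omega):
    # Chain rule for sigma = exp(omega): df/domega = df/dsigma * exp(omega).
    sigma = math.exp(omega)
    return df_dsigma(sigma) * sigma


# Compare the chain-rule gradient with a central finite difference in omega.
omega0 = 0.3
eps = 1e-6
analytic = grad_wrt_omega(omega0)
numeric = (f(math.exp(omega0 + eps)) - f(math.exp(omega0 - eps))) / (2 * eps)

# Plain gradient ascent on omega; sigma = exp(omega) stays positive throughout.
omega_t = omega0
for _ in range(200):
    omega_t += 0.1 * grad_wrt_omega(omega_t)
sigma_t = math.exp(omega_t)
```

Because every update is made to omega, the scale never has to be clipped or exponentiated in a numerically delicate way during the optimization itself.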
re: the operations. we only use the square/sqrt/multiply operations for the adaptive stepsize sequence. these calculations are on the norm of previous gradients. we store both the gradient and these norms using instantiations of the variational family.
what we actually need is just some sort of container class that has the same "shape" as our variational family (i.e. 2 vectors for mean-field, 1 vector + 1 Cholesky factor matrix L for full-rank).
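A rough sketch of such a container for the mean-field case, with elementwise square/sqrt/multiply used to build an adaptive step size. The class name `MeanFieldShaped`, the field names, and the AdaGrad-style accumulation in `adaptive_step` are assumptions made for illustration; the thread does not spell out the actual stepsize sequence or class layout.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MeanFieldShaped:
    """Hypothetical container with the 'shape' of a mean-field family: two vectors."""
    mu: List[float]
    omega: List[float]

    def map(self, fn):
        # Apply fn to every entry of both vectors.
        return MeanFieldShaped([fn(x) for x in self.mu],
                               [fn(x) for x in self.omega])

    def zip_with(self, other, fn):
        # Combine two same-shaped containers entry by entry.
        return MeanFieldShaped(
            [fn(a, b) for a, b in zip(self.mu, other.mu)],
            [fn(a, b) for a, b in zip(self.omega, other.omega)],
        )

    def square(self):
        return self.map(lambda x: x * x)

    def sqrt(self):
        return self.map(math.sqrt)

    def multiply(self, other):
        return self.zip_with(other, lambda a, b: a * b)

    def add(self, other):
        return self.zip_with(other, lambda a, b: a + b)


def adaptive_step(grad, running_sq, eta=0.1, tau=1.0):
    # Assumed AdaGrad-style rule: accumulate squared gradients, then
    # scale each gradient entry by eta / (tau + sqrt(accumulated square)).
    running_sq = running_sq.add(grad.square())
    scale = running_sq.sqrt().map(lambda r: eta / (tau + r))
    return grad.multiply(scale), running_sq


grad = MeanFieldShaped(mu=[3.0, -4.0], omega=[0.0, 1.0])
history = MeanFieldShaped(mu=[0.0, 0.0], omega=[0.0, 0.0])
step, history = adaptive_step(grad, history)
# step.mu is approximately [0.075, -0.08]; step.omega is approximately [0.0, 0.05]
```

The gradient, the running squared norms, and the resulting step all reuse the same container type, which is why elementwise semantics are the natural choice here. A full-rank analogue would hold one vector and one matrix instead of two vectors.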
happy to explain this in person, if that would be better.
cheers
alp
This is OK. We use a non-unique Cholesky factorization in our algorithm: the diagonal entries need not be positive.
> OK, I was confused by the name of the class and the basic
> doc. It needs to make this usage clear, which will be easy
> to add.
I'd appreciate any and all help on this.
> It might make sense to pull out some base classes with implementations.
> We can do that without requiring virtual functions. Ideally,
> each subclass would add subclass-specific behavior. It helps to
> be able to look at a type/class for a variable and see which
> operations apply --- if you overload a class with two intended uses
> it's harder to understand.
Agreed. Again, point me in the right direction and I'll take it from there.
> Is the gradient also a multivariate normal in any
> useful sense? (I'm always amazed at its properties.)
Not to my knowledge, no.
Positive semi-definite is enough. In any case, it doesn't bother us: because our gradients are noisy, we'll never drive a diagonal term exactly to 0 (with probability 1). And ADVI seeks a local optimum, so non-uniqueness is not a problem (a diagonal term can be -x or x).
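A small sketch of why the sign ambiguity is harmless, assuming the covariance is formed as L times its transpose. Flipping the sign of an entire column of L (diagonal entry included) leaves the product unchanged. The helper names below are hypothetical.

```python
def times_transpose(L):
    # Compute L * L^T for a square matrix given as a list of rows.
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def flip_column(L, col):
    # Negate every entry in one column, including its diagonal entry.
    return [[-x if j == col else x for j, x in enumerate(row)] for row in L]


L = [[2.0, 0.0],
     [1.0, 3.0]]
L_flipped = flip_column(L, 0)  # first column becomes (-2, -1), so the diagonal entry is now negative

sigma = times_transpose(L)
sigma_flipped = times_transpose(L_flipped)
# Both products equal [[4.0, 2.0], [2.0, 10.0]].
```

Both factors describe the same covariance, so an optimizer that lands on either one has found the same distribution.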