Multimodality in Unit Vector Transform?

Michael Betancourt

Aug 18, 2013, 1:58:17 PM
to stan...@googlegroups.com
Right now the unit vector transform treats the hyperspherical angles as unconstrained -- this is mathematically fine but leads to serious non-identifiability in the transformation. Marcus, I believe you implemented this code -- is there any reason you didn't transform from the principal domain of sin/cos to (-inf, inf) with a logit?

Marcus Brubaker

Aug 18, 2013, 2:27:41 PM
to stan...@googlegroups.com
Yeah, there's a trade-off with using the logit transform in that it introduces an arbitrarily placed (relative to the posterior) energy barrier and singularity, making it hard to explore the space well.  The non-identifiability can easily be resolved by post-processing modulo pi or 2pi.
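
For reference, a rough sketch of that post-processing done inside the Stan program (assuming the raw angle is exposed as a parameter named theta and using 2pi as the period; the same fold could just as well be applied to the saved samples afterwards):

generated quantities {
  // fold the unconstrained angle back onto its principal domain [0, 2*pi)
  real theta_folded;
  theta_folded <- theta - 2 * pi() * floor(theta / (2 * pi()));
}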

Cheers,
Marcus



Michael Betancourt

Aug 18, 2013, 2:54:15 PM
to stan...@googlegroups.com
But I see no logit in the code -- it's using the implicit modularity of sin/cos. And that post-processing is exactly what causes the non-identifiability in the posterior.

Marcus Brubaker

Aug 18, 2013, 3:01:11 PM
to stan...@googlegroups.com
Right, I chose to make the sampling easier and the parameterization less arbitrary by not using the logit.  It can be *made* (mostly) identifiable by taking x = mod(y, pi) (with y the unconstrained coordinate), or something like that; I forget the specifics right now.

Cheers,
Marcus

Michael Betancourt

Aug 18, 2013, 3:12:29 PM
to stan...@googlegroups.com
Okay, I see what you mean -- we're talking about slightly different things. It is my understanding that your parameterization should actually make sampling worse, because no matter how many intermediate modular divisions you add, the unconstrained space will always be multimodal. Logit-transforming an octant of the hypersphere, however, ensures that the unconstrained space has the same modality as the constrained space.

Marcus Brubaker

Aug 18, 2013, 3:53:06 PM
to stan...@googlegroups.com
I'm not really talking about multi-modality per se.  Let's say we use a logit transform to keep parameters in [0, 2*pi].  If there is a single dominant mode (on the hypersphere) at 2pi/0, then it will be almost impossible for the sampler to capture that mode accurately with the logit transform: going from 2pi - epsilon to 2pi + epsilon (i.e., epsilon) requires going from a very large positive number to a very large negative number on the unconstrained scale, and the posterior probability in between will be low.
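
Concretely, a rough sketch of the problematic case (assuming a von Mises target whose mode sits exactly at the 0/2pi boundary; the interval constraint is what Stan would implement internally with a scaled logit):

parameters {
  // mapped to the reals with a scaled logit, so the 0/2*pi boundary
  // becomes +/- infinity on the unconstrained scale
  real<lower=0, upper=2 * pi()> theta;
}
model {
  theta ~ von_mises(0, 5);  // a single mode on the circle, split across the boundary
}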

I agree that not using the logit introduces multimodality in the unconstrained space because of the mod 2pi issue.  Thinking about it more, this is likely to screw up adaptation pretty badly because the unconstrained distribution won't have any finite moments, which is perhaps what you're worried about?

Not sure of the best solution here.  Perhaps there are some other parameterizations of unit vectors we could look at?

Cheers,
Marcus

Michael Betancourt

Aug 18, 2013, 4:08:50 PM
to stan...@googlegroups.com
You might be able to look at sections of the hypersphere and avoid the periodic boundary conditions. I was going to add another level of transformation to avoid the sin/cos calls, so I'll have to do this anyway.

Michael Betancourt

Aug 18, 2013, 5:49:23 PM
to stan...@googlegroups.com
Whoa, it's worse than I thought.

In hindsight, it should have been obvious that this would be a problem: S^{1} (the 1-dimensional sphere, or circle) is a manifold without boundary, and so no 1-1 mapping to the real line exists.

Without a 1-1 map we'd have to resort to a multiple covering, with the obvious choice being a helix. Indeed, this is what Marcus had previously implemented. The problem is that he forgot about how densities transform under maps that aren't 1-1 -- you have to include ALL the zeros. For the helix, there are an infinite number of zeros, and the transformation should read

p(z) = \sum_{n = -\infty}^{\infty} p( \theta(z) + 2 \pi n ) | \cos( \theta(z) + 2 \pi n ) |

which, to say the least, presents some implementation problems.

Implementing distributions defined on manifolds without boundaries is going to require a different approach. At this point I would not be against un-exposing the unit vector and von Mises distributions.

Marcus Brubaker

Aug 18, 2013, 8:49:45 PM
to stan...@googlegroups.com
I don't think it's quite as dire as that.  If you ignore issues related to adaptation, the resulting Markov chain is fine.  Consider it this way:

1. Assume we define the hyperspherical parameters on a limited domain, e.g., x \in [0, 2pi].  This means we only need one term of that infinite sum, which is what we're doing now.
2. When we do HMC (or MH) there is a question of what to do if the parameter ends up outside of [0, 2pi].  One thing we could do is, at the end of every step, reset the coordinate modulo 2pi, e.g., x <- mod(x', 2pi).  This is fine because the mapping is completely invariant to shifts of +/- 2pi, so nothing changes.  You can think of this another way: assume we set up a local chart where y \in [x - pi, x + pi], do the simulation in y, and then go back to the global chart as the representation.
3. Finally, if you never need to do any kind of estimation or diagnostics in x, then you never actually need to bother correcting back to [0, 2pi] during sampling.

That basically is where things are now.

Cheers,
Marcus

Michael Betancourt

Aug 19, 2013, 3:43:24 AM
to stan...@googlegroups.com
> I don't think it's quite as dire as that. If you ignore issues related to adaptation, the resulting Markov chain is fine. Consider it this way:

I disagree.

> 1. Assume we define the hyperspherical parameters on a limited domain, e.g., x \in [0, 2pi]. This means we only need one term of that infinite sum, which is what we're doing now.

The hyperspherical angles are always defined on [0, 2pi] -- otherwise they wouldn't be hyperspherical angles. The infinite sum comes when we try to extend [0, 2pi] to \mathcal{R} -- if modular division shows up, then an infinite number of terms are required.

> 2. When we do HMC (or MH) there is a question of what to do if the parameter ends up outside of [0, 2pi]. One thing we could do is, at the end of every step, reset the coordinate modulo 2pi, e.g., x <- mod(x', 2pi). This is fine because the mapping is completely invariant to shifts of +/- 2pi, so nothing changes. You can think of this another way: assume we set up a local chart where y \in [x - pi, x + pi], do the simulation in y, and then go back to the global chart as the representation.

Modular division is not equivalent to an atlas. In particular, an atlas is assumed to be a minimal covering of charts: for S^{1} that would be two charts, whereas the modular division defines an infinite number of charts. Also, charts can't be defined on [0, 2pi] or (0, 2pi) or (0, 2pi] or [0, 2pi), because the resulting domains aren't open.

But more importantly, if we work in charts then we can't compute expectations in the typical fashion. We have to compute expectations on each chart and then average the results in any overlap using partitions of unity -- not exactly something you'd want to implement. Essentially, that's what the infinite sum in the change of variables is trying to represent anyway.

> 3. Finally, if you never need to do any kind of estimation or diagnostics in x, then you never actually need to bother correcting back to [0, 2pi] during sampling.

If adaptation or diagnostics weren't necessary then HMC would have been adopted by everyone a long time ago. Not exactly a realistic example.

Marcus Brubaker

Aug 19, 2013, 2:44:52 PM
to stan...@googlegroups.com
The method of local coordinates is pretty well spelled out in V.4.2 of Hairer et al.'s book.  In their notation, my described use of modular division is finding z_n in Alg. 4.5.  All of that aside, if you need to do coherent expectations in the unconstrained space, I agree this is an issue.  Expectations in the constrained space work exactly as expected.

Since the adaptation is effectively computing expectations in the unconstrained space, it's a potentially big issue, so the question is how to solve it.

As a stop-gap you could do the logit transform but I really dislike what that does to the space.  I think it makes the sampling geometry harder than it needs to be.

Could we do the following to parameterize S^n?

- Let the unconstrained parameters x be a vector in R^(n+1)
- The mapping to constrained parameters is then y = f(x) = x / norm(x)
- Under the hood, place a regularizing distribution on the quantity norm(x) (maybe an inverse gamma or something to keep it away from the singularity at zero).

It would take some effort to work out the volume correction here, but it should be doable as there is a one-to-one mapping between x and (y, norm(x)).

Cheers,
Marcus

Marcus Brubaker

Aug 19, 2013, 3:11:52 PM
to stan...@googlegroups.com
Actually, I just realized the volume correction isn't too bad.

x is an unconstrained vector in R^n+1
c is a dummy variable which will be fixed to c=1
y is the constrained point on S^n
r is a strictly positive auxiliary variable

Then the transformation is:

f(x,c) = ( c g(x), ||x||^2 )
        = ( y, r )

where g(x) = x/norm(x)

The Jacobian d{y,r}/d{x,c} can be computed (fixing c = 1) and should be full rank so long as x != 0.

In Stan we could add, as part of the log determinant term during transformation, a basic Inverse Gamma prior on r.

Thoughts?  This should preserve the (relatively) nice geometry of S^n in the unconstrained space and not introduce any complications for adaptation.  In fact, the more I think about this, I think it should work better than hyper-spherical coordinates as it has only one singularity at x = 0 which should never actually be hit in practice.
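
In rough user-level Stan syntax, a sketch for n = 3 with an arbitrarily chosen inverse gamma (and with the log-determinant term that the built-in transform would add omitted):

parameters {
  vector[4] x;                    // unconstrained point in R^(n+1), here n = 3
}
transformed parameters {
  vector[4] y;
  real<lower=0> r;
  y <- x / sqrt(dot_self(x));     // constrained point on S^3
  r <- dot_self(x);               // squared norm, the auxiliary variable
}
model {
  r ~ inv_gamma(2, 1);            // regularizer keeping x away from the singularity at 0
}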

Cheers,
Marcus

Michael Betancourt

Aug 19, 2013, 4:34:07 PM
to stan...@googlegroups.com
I agree that the logit is bad news, but this parameterization is also bad news. The problem is that there's a scale non-identifiability -- multiply the unconstrained space by a constant and the constrained parameters will still lie on the unit sphere. Consequently the chain will drift and expectations will remain unreliable. I don't think there's an immediate answer. If I have time this week I'll try looking into other implementations of periodic variables.

Marcus Brubaker

Aug 19, 2013, 4:47:07 PM
to stan...@googlegroups.com
The scale issue you raise is the reason I mentioned the under-the-hood inverse gamma on r.  It makes the parameterization fully identifiable and will prevent drift.

Cheers,
marcus


Michael Betancourt

Aug 19, 2013, 5:14:51 PM
to stan...@googlegroups.com
Right, but it will need to be tuned: a soft constraint will allow significant drifting, while a hard constraint will induce significant non-convexity and curvature variations. In fact, the resulting posterior looks suspiciously like the pathological distribution I use for my HMC talks.

If it's easy then it's worth implementing; we just have to be really careful about its performance.

Marcus Brubaker

Aug 19, 2013, 5:49:28 PM
to stan...@googlegroups.com
Right, it would be a soft constraint.  I've used similar constructions during optimization, and it turns out that the violation of the constraint (i.e., the magnitude of drift in r) is small unless there are large (i.e., > 90 degree) moves on the sphere in a single step.  This is because the transformation is such that dot(x, df/dx) = 0, where x is the unconstrained parameter and df/dx is the gradient of an objective that operates on x only through the constrained vector y.

All of this is to say that yes, there are potentially some thorny theoretical issues with this approach, but my experience with it in other spaces suggests that they may not show up in practice.

It wouldn't be hard to implement I don't think, although I don't have time to do it myself right now.  Are you working on a model that needs it?

Cheers,
Marcus

Michael Betancourt

Aug 19, 2013, 6:08:53 PM
to stan...@googlegroups.com
> Right, it would be a soft constraint. I've used similar constructions during optimization, and it turns out that the violation of the constraint (i.e., the magnitude of drift in r) is small unless there are large (i.e., > 90 degree) moves on the sphere in a single step. This is because the transformation is such that dot(x, df/dx) = 0, where x is the unconstrained parameter and df/dx is the gradient of an objective that operates on x only through the constrained vector y.
>
> All of this is to say that yes, there are potentially some thorny theoretical issues with this approach, but my experience with it in other spaces suggests that they may not show up in practice.

My concern is less theoretical and more based on what I've seen with HMC (and raw gradient based optimizers) and these kinds of problems. RHMC or CG/BFGS/Newton would perform much better.

> It wouldn't be hard to implement I don't think, although I don't have time to do it myself right now. Are you working on a model that needs it?

No, it just came up at the Stan meeting last week. There is no pressing need, and I'm curious to do some research before I would attempt it.

Ben Goodrich

Aug 19, 2013, 9:15:26 PM
to stan...@googlegroups.com
On Monday, August 19, 2013 3:11:52 PM UTC-4, Marcus Brubaker wrote:
Actually, I just realized the volume correction isn't too bad.

x is an unconstrained vector in R^n+1
c is a dummy variable which will be fixed to c=1
y is the constrained point on S^n
r is a strictly positive auxiliary variable

Then the transformation is:

f(x,c) = ( c g(x), ||x||^2 )
        = ( y, r )

where g(x) = x/norm(x)

The Jacobian d{y,r}/d{x,c} can be computed (fixing c = 1) and should be full rank so long as x != 0.

In Stan we could add, as part of the log determinant term during transformation, a basic Inverse Gamma prior on r.

Thoughts?  This should preserve the (relatively) nice geometry of S^n in the unconstrained space and not introduce any complications for adaptation.  In fact, the more I think about this, I think it should work better than hyper-spherical coordinates as it has only one singularity at x = 0 which should never actually be hit in practice.

I agree that it would be good to have a normalize() transformation. You've basically described the need for a spherical density, which can be characterized as the product of a uniformly distributed vector on the unit sphere and a random radius with some density appropriate for a non-negative number. But the optimal way to do this with NUTS II would be to put standard normal priors on the elements of x and whatever prior the user wants on the radius. It's less straightforward to think about a prior on y with a Jacobian determinant correction to imply some prior on x and r.

Ben

Michael Betancourt

Aug 20, 2013, 3:55:34 AM
to stan...@googlegroups.com
I'll just reiterate that such a geometry is not easy for HMC:

[inline image]

Ben Goodrich

Aug 20, 2013, 8:56:37 AM
to stan...@googlegroups.com
On Tuesday, August 20, 2013 3:55:34 AM UTC-4, Michael Betancourt wrote:
I'll just reiterate that such a geometry is not easy for HMC:

[inline image]
Cool :). But just doing a uniform on a sphere is not so bad, as long as you do it by putting standard normals on the primitives (z) instead of trying to put the prior on the bounded parameter (x):

library(rstan)

model_code <- '
parameters {
  vector[3] z;
}
transformed parameters {
  vector[3] x;
  x <- z / sqrt(dot_self(z));
}
model {
  z ~ normal(0.0, 1.0);
}
'

fit <- stan(model_code = model_code)
pairs(fit) # attached

Ben
sphere.png

Ben Goodrich

Aug 20, 2013, 1:44:00 PM
to stan...@googlegroups.com
I still am not seeing what I think I am supposed to be seeing. I compiled this Stan model to sample from the unit K-sphere

data { int<lower=2> K; }
parameters { vector[K] z; }
transformed parameters { vector[K] x; x <- z / sqrt(dot_self(z)); }
model { z ~ normal(0.0, 1.0); }

Then I did this from the command-line

for K in {2..300}; do echo "K <- $K ;" > sphere_data.txt; ./sphere --data=sphere_data.txt --refresh=0; grep "^# Step size =" samples.csv >> stepsize.txt; echo "K = $K"; done

To get the attached file of step sizes. They do trend downward as K gets bigger but seem to level off at 0.7 or so, which seems safe.

Ben
 
stepsize.txt

Marcus Brubaker

Aug 20, 2013, 2:44:18 PM
to stan...@googlegroups.com
Michael,

I don't think the resulting geometry would necessarily be as bad as that.  For one, we can control the steepness of the ring (e.g., through the distribution on r); all we really care about is that there is zero probability at 0.

I have some code to test some of these theories out.  I'll post when I get a chance to play with them.

Cheers,
Marcus


annulus.png

Marcus Brubaker

Aug 20, 2013, 2:49:15 PM
to stan...@googlegroups.com
Ben,

All that model is doing is sampling z from a normal distribution; the posterior doesn't depend on the transformed x at all.  I.e., you could move x from a transformed parameter to a generated quantity.  In fact, this is a well-known way to sample from the unit sphere: draw a vector of IID normal(0, 1) samples and then normalize the result.

Cheers,
Marcus

Ben Goodrich

Aug 20, 2013, 3:12:07 PM
to stan...@googlegroups.com
On Tuesday, August 20, 2013 2:49:15 PM UTC-4, Marcus Brubaker wrote:
All that model is doing is sampling z from a normal distribution; the posterior doesn't depend on the transformed x at all.  I.e., you could move x from a transformed parameter to a generated quantity.  In fact, this is a well-known way to sample from the unit sphere: draw a vector of IID normal(0, 1) samples and then normalize the result.

Right; that is what I was trying to do. Mike said on the Hangout this morning that for high K the stepsize would get really small due to the posterior having a high-dimensional donut shape, which I didn't understand because the mode in this case is at the origin.

Ben

Michael Betancourt

Aug 20, 2013, 4:34:30 PM
to stan...@googlegroups.com
MCMC doesn't sample from the mode! The mode is the point of maximum probability density, but not of highest probability mass, which is what matters for MCMC.

The problem with your test is that you didn't include the Jacobian, which is where the nasty geometry here is going to hide -- i.e., you're not sampling from a uniform distribution on the hypersphere.

Michael Betancourt

Aug 20, 2013, 4:36:52 PM
to stan...@googlegroups.com
> I don't think the resulting geometry would necessarily be as bad as that. For one, we can control the steepness of the ring (e.g., through the distribution on r); all we really care about is that there is zero probability at 0.

But if you loosen the steepness you also make the model less identifiable -- when you've overparameterized the model there's always going to be a compromise. Also, I doubt the singularity at zero will matter much (especially in high dimensions) since the measure is essentially zero on the computer.

Anyway, I look forward to seeing what you uncover.

Ben Goodrich

Aug 20, 2013, 5:11:08 PM
to stan...@googlegroups.com
On Tuesday, August 20, 2013 4:34:30 PM UTC-4, Michael Betancourt wrote:
MCMC doesn't sample from the mode!  The mode is the point of maximum probability density, but not of highest probability mass, which is what matters for MCMC.

The problem with your test is that you didn't include the Jacobian, which is where the nasty geometry here is going to hide -- i.e., you're not sampling from a uniform distribution on the hypersphere.

But you don't need to adjust for the Jacobian determinant when you put the prior on the primitive parameters. You only have to deal with those difficulties when you try to put the prior on the constrained transformed parameter. If you try to put a uniform prior on x, then you get the lack of scale identifiability on the z. But if you fix the scale with the prior on z, then it is uniform on the hypersphere, as Marcus was saying:

https://en.wikipedia.org/wiki/N-sphere#Uniformly_at_random_from_the_.28n.C2.A0.E2.88.92.C2.A01.29-sphere
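
For completeness, the reason the direction comes out uniform: the standard normal density on z depends on z only through its norm,

p(z) \propto \exp( -\frac{1}{2} \| z \|^2 ),

so it is rotationally invariant, and the normalized vector x = z / \| z \| is uniform on the hypersphere no matter what \| z \| turns out to be.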

Ben

Michael Betancourt

Aug 20, 2013, 5:15:09 PM
to stan...@googlegroups.com
Right, I forgot the azimuthal part of the Jacobian integrates out to unity.

Ben Goodrich

Aug 20, 2013, 5:39:35 PM
to stan...@googlegroups.com
On Tuesday, August 20, 2013 5:15:09 PM UTC-4, Michael Betancourt wrote:
Right, I forgot the azimuthal part of the Jacobian integrates out to unity.

So, if a user does something like

parameters {
  unit_vector[K] x;
}

and wants a uniform prior on x, we have a problem because the scale of the primitives underlying x is not identified under the normalizing transformation. But if Stan secretly adds a term to the log-posterior that evaluates a chi-squared density with K degrees of freedom at the dot_self() of the primitives, then are we good?
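
In user-level syntax, that secret term would amount to roughly the following sketch (x_raw is a hypothetical name for the primitives; any isotropic weight on them leaves the direction uniform on the sphere):

data {
  int<lower=2> K;
}
parameters {
  vector[K] x_raw;                      // primitives underlying x
}
transformed parameters {
  vector[K] x;
  x <- x_raw / sqrt(dot_self(x_raw));   // the normalizing transformation
}
model {
  // the "secret" term: a chi-square density with K degrees of freedom
  // evaluated at the squared norm of the primitives
  dot_self(x_raw) ~ chi_square(K);
}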

Either way, I think we need to add a normalize() function to the library.

Ben
