distribution for a simplex

716 views
Skip to first unread message

Tamas Papp

unread,
Oct 24, 2013, 11:10:04 AM10/24/13
to stan-users mailing list
Hi,

I am trying to estimate an economic model on time use, and I ran into
some difficulties. I will try to summarize what I am doing below, in
the hope of receiving some help. Not all the questions are specific to
STAN, but I am looking for a specification which is "STAN-friendly"
because I would like to use it for the project.

In my setup, people can allocate their time between leisure, home
production, and market work (sum normalized to 1, hence the simplex).
We have time use data on these choices. Below, I show the model for
nonworkers for simplicify.

Given some preference parameters (alpha, beta below), theory predicts
certain time shares. It seemed natural to assume that actual, observed
values are a Dirichlet random draws of these with the given means, so I
wrote up this model in Stan:

--8<---------------cut here---------------start------------->8---
// model for time shares of nonworkers

data {
int<lower=0> N;
simplex[2] lhshares[N]; // lshare, hshare
}
parameters {
simplex[3] alphabeta;
real<lower=0> precision;
}
transformed parameters {
real<lower=0,upper=1> alpha;
real<lower=0,upper=1> beta;
real<lower=0,upper=1> hc;
real<lower=0,upper=1> lc;
alpha <- alphabeta[1];
beta <- alphabeta[2];
hc <- beta/(1-alpha); // comes from economic theory
lc <- (1-(alpha+beta))/(1-alpha); // also from theory
}
model {
alphabeta ~ dirichlet(rep_vector(1,3)); // flat prior
precision ~ chi_square(2); // seemed like a reasonable prior
for (i in 1:N) {
vector<lower=0>[2] lh;
lh[1] <- lc*precision;
lh[2] <- hc*precision;
lhshares[i] ~ dirichlet(lh);
}
}
--8<---------------cut here---------------end--------------->8---

STAN has difficulties finding an initial value for MCMC, and if I give
it one, it has problems with the step size:

--8<---------------cut here---------------start------------->8---
+ . + SAMPLING FOR MODEL 'model2' NOW (CHAIN 1).
Error in eval(expr, envir, enclos) :
No acceptably small step size could be found. Perhaps the posterior is not continuous?
error occurred during calling the sampler; sampling not done
--8<---------------cut here---------------end--------------->8---

I think that this is because some observations have 0 time shares for
home production, which would require Dirichlet parameters < 1, but at
the same time the distribution is unimodal, which would imply Dirichlet
parameters > 1.

If I drop these observations, I still run into occasional difficulties
(this does not happen every time):

--8<---------------cut here---------------start------------->8---
SAMPLING FOR MODEL 'model2' NOW (CHAIN 3).
Iteration: 1 / 3000 [ 0%] (Warmup)Error : Error in function boost::math::lgamma<d>(d): numeric overflow
error occurred during calling the sampler; sampling not done
--8<---------------cut here---------------end--------------->8---

I still don't understand why this happens. I am assuming that the
Dirichlet is not flexible enough: given the means, it only has one
additional degree of freedom.

So I guess the main question is: how should I extend this model with a
more flexible distribution? It it possible to generalize the Dirichlet
distribution, eg make it overdispersed, and somehow account for the fact
that data has observations on the edge of the simplex?

I am using STAN and RStan 2.0.0, but I think that my problems are
conceptual, not a bug in STAN.

Any suggestions would be appreciated.

Best,

Tamas

Bob Carpenter

unread,
Oct 24, 2013, 3:44:10 PM10/24/13
to stan-...@googlegroups.com
My first thought was that hc and lc may be violating
their constraints, but that can't happen with the simplex you have.

> model {
> alphabeta ~ dirichlet(rep_vector(1,3)); // flat prior

You can drop the flat prior, though it doesn't add much overhead.

> precision ~ chi_square(2); // seemed like a reasonable prior
> for (i in 1:N) {
> vector<lower=0>[2] lh;
> lh[1] <- lc*precision;
> lh[2] <- hc*precision;
> lhshares[i] ~ dirichlet(lh);
> }
> }
> --8<---------------cut here---------------end--------------->8---
>
> STAN has difficulties finding an initial value for MCMC,

Right --- see below.


? and if I give
> it one, it has problems with the step size:
>
> --8<---------------cut here---------------start------------->8---
> + . + SAMPLING FOR MODEL 'model2' NOW (CHAIN 1).
> Error in eval(expr, envir, enclos) :
> No acceptably small step size could be found. Perhaps the posterior is not continuous?
> error occurred during calling the sampler; sampling not done
> --8<---------------cut here---------------end--------------->8---
>
> I think that this is because some observations have 0 time shares for
> home production,

A true 0 value in an element of lhsares[i] is outside the support of the Dirichlet.
\
It's going to return -infinity, and that will reject any inits. Then if you give
it inits, you'll reject any sample, then a smaller step size
will be tried, until you get the above message.

Do you have the raw count data that went into lhshares? I'd think that'd
be easier to work with.

Because lhsares[i] is binary, you could reduce that dirichlet(lh) to
a beta.

> which would require Dirichlet parameters < 1, but at
> the same time the distribution is unimodal, which would imply Dirichlet
> parameters > 1.

It's that they imply Dirichlet parameters = 0, which we don't support.
Similarly for the beta, for which a question just like this one came
up a while back on the list.


> If I drop these observations, I still run into occasional difficulties
> (this does not happen every time):
>
> --8<---------------cut here---------------start------------->8---
> SAMPLING FOR MODEL 'model2' NOW (CHAIN 3).
> Iteration: 1 / 3000 [ 0%] (Warmup)Error : Error in function boost::math::lgamma<d>(d): numeric overflow
> error occurred during calling the sampler; sampling not done
> --8<---------------cut here---------------end--------------->8---

That's a separate problem. If you see it this early in warmup, it's
because Stan hasn't lowered the step size enough. The Hamiltonian simulation
algorithm takes a step in a direction determined by the gradient, and this can take it
into bad floating point land or can violate a constraint. Then we lower
the step size until this doesn't happen, though it can still happen if you
get close to a boundary, and then the Metropolis accept step cleans up to
make sure the samples are still right.

> I still don't understand why this happens. I am assuming that the
> Dirichlet is not flexible enough: given the means, it only has one
> additional degree of freedom.

It's really the zeros that are the problem. And the discretization of
the Hamiltonian.

>
> So I guess the main question is: how should I extend this model with a
> more flexible distribution? It it possible to generalize the Dirichlet
> distribution, eg make it overdispersed, and somehow account for the fact
> that data has observations on the edge of the simplex?

Yes, it is. You can use something like a multivariate logistic distribution.
That lets you model full covariance. (We don't have that coded directly, so you
need to run a multivariate normal through softmax.) An in-between step
without covariance is possible by doing the above with zero correlations (and
then you don't need anything multivariate). This allows the counts to vary by
dimension rather than in sum.

Having said this, the Dirichlet can wander off into high values if you
don't crank down the prior on the precision to keep things reasonable.

In any case, this isn't going to help with the out-of-simplex support issue.

> I am using STAN and RStan 2.0.0, but I think that my problems are
> conceptual, not a bug in STAN.

Agreed --- the bug in 2.0.0 was just for multivariate_normal distributions (and
there, not even the cholesky or precision parameterizations, just the
covariance parameterization).

- Bob

Reply all
Reply to author
Forward
0 new messages