Multinomial distribution & regression


David Westergaard

Sep 27, 2015, 12:54:27 PM
to Stan users mailing list
Hi,

I am trying to run a regression model on data that falls into three classes, with a number of covariates. Since the aim of the project is to find the probability of being in class 1, 2, or 3 given the covariates (and the effect of the covariates on that probability), I thought it could be done using a simple multinomial regression model. Due to the very large number of data points, I tried to count how many samples fall into each combination of class and covariates (I read on these forums that sampling can be faster using aggregated groups instead of individual samples for very large data sets). I am not even sure this is proper to do, so please correct me if I am going on a wild goose chase. Anyhow, I ran into the problem that Stan's implementation of the multinomial distribution explicitly calculates the total N from the input vector y, and it cannot be given as an argument. I have tried looking around on the forums, but either nothing is there or I am using the wrong search terms.

I tried to implement it by simply looping over the number of counts for each group, but that did not work well, since the multinomial distribution expects a vector. https://gist.github.com/demodw/ce5e4f7f7e83184049cc

Any advice is appreciated.


Cheers,
David

  

Bob Carpenter

Sep 28, 2015, 1:21:57 AM
to stan-...@googlegroups.com
If you have three outcomes and observe the second one
5 times for linear predictor theta[n], then you
can do something like this

{
  int y[5];
  for (i in 1:5)
    y[i] <- 2;    // outcome 2, observed 5 times

  y ~ categorical_logit(theta[n]);
}

The open bracket creates a new local scope, so you can drop
this anywhere. The _logit suffix means that softmax is applied
to theta[n] to get the usual categorical.

Alternatively, you can do this

increment_log_prob(5 * categorical_logit_log(2, theta[n]));

The above should produce the same posterior as the following loop

for (i in 1:count[n])
  y[n] ~ categorical_logit(theta[n]);

You can break a multinomial down this way, too.
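Bob's weighted-increment trick can be checked numerically. The sketch below (Python with NumPy, not from the thread) shows that the multinomial log-likelihood of a count vector differs from the sum of count-weighted categorical log-probabilities only by a constant that does not involve theta, so the two formulations give the same posterior.

```python
import math
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

y = np.array([0, 5, 0])             # counts for the 3 outcomes
theta = np.array([0.2, 1.3, -0.7])  # arbitrary linear predictor
p = softmax(theta)

# Multinomial log-pmf: log N! - sum log y_k! + sum y_k * log p_k
n = y.sum()
const = math.lgamma(n + 1) - sum(math.lgamma(yk + 1) for yk in y)
log_mult = const + np.dot(y, np.log(p))

# Count-weighted categorical terms: here just 5 * log p[1]
log_cat = np.dot(y, np.log(p))

# They differ only by the multinomial coefficient, a constant in theta
print(abs(log_mult - (log_cat + const)) < 1e-9)  # True
```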

We should really have a multinomial_logit to match categorical.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

David Westergaard

Sep 28, 2015, 1:55:22 AM
to stan-...@googlegroups.com
I see, that is quite useful. What is the smartest way to combine this with different combinations of covariates in Stan?

E.g. I have,
Group 1, count = 5, sex = 1, age group = 1
Group 1, count = 2, sex = 1, age group = 2
Group 1, count = 2, sex = 2, age group = 1
Group 2, count = 10, sex = 1, age group = 1
Group 2, count = 15, sex = 2, age group = 1
Group 3, count = 20, sex = 1, age group = 1
Group 3, count = 20, sex = 2, age group = 1
… And so forth. Would I have to specify every single combination a priori?

Cheers,
David





Bob Carpenter

Sep 28, 2015, 1:44:25 PM
to stan-...@googlegroups.com
It might be easiest to do this outside of Stan
if you don't have group-level covariates. If you
do, the data format you have it in will work, with
group, count, sex, and age-group all being parallel
arrays of integers.

- Bob

David Westergaard

Sep 28, 2015, 3:20:40 PM
to stan-...@googlegroups.com
I am not quite sure I understand. I tried creating three individual arrays, one for each class, and then just running increment_log_prob three times (the model can be seen at https://gist.github.com/demodw/9e69a07e24804afeb663), though I am not quite sure it behaves as I would expect.

The coefficients on the regression look a bit odd. The three intercepts (95% interval in parentheses) are -0.93 (-43.71 to 32.93), 0.85 (-16.23 to 26.47), 4.94 (-38.02 to 62.57). All the Rhat values are ~1.01, and the effective number of samples > 100, after 500 iterations across 3 chains. I might let it run a bit longer, just to see if it actually hasn't converged yet.

I tried adding some structure to pull in the coefficients a bit by adding hyperpriors, though these were just uniform priors on the range 0-5, and it only helped slightly. Reading through some of Gelman's older work and searching on Google, I couldn't really see whether there are any recommendations for hyperpriors on regression coefficients.

Cheers,
David

Bob Carpenter

Sep 28, 2015, 3:38:19 PM
to stan-...@googlegroups.com
I don't see why theta_one is an array. I thought you
were trying to collect sufficient statistics?

- Bob

Bob Carpenter

Sep 28, 2015, 3:41:26 PM
to stan-...@googlegroups.com
Let me try to be a little clearer.

count: 7, outcome: 1, group: 1
count: 1, outcome: 2, group: 1
count: 3, outcome: 1, group: 2
...

That is, 7 times items belonging to group 1 had outcome 1, and
1 time they had outcome 2. For instance, the groupings might
be income brackets and the outcomes votes for Republican, Democrat, or other.

You can then code this up as a multinomial, which in pseudocode is

(7,1,0) ~ multinomial(theta[1]);

where theta[1] is the simplex of predictions for group 1. Of course,
you'll need to build the array (7, 1, 0) --- you can't literally use
that notation in Stan yet.
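Bob notes the count array has to be assembled somewhere. A sketch of that aggregation step outside of Stan, in Python/NumPy (the record layout and variable names are made up for illustration):

```python
import numpy as np

# Tabulated records as (group, outcome, count), matching Bob's example
records = [(1, 1, 7), (1, 2, 1), (2, 1, 3)]
n_groups, n_outcomes = 2, 3

# Build the per-group count vector y[g] that multinomial(...) expects
y = np.zeros((n_groups, n_outcomes), dtype=int)
for group, outcome, count in records:
    y[group - 1, outcome - 1] += count  # 1-based ids -> 0-based indices

print(y[0])  # [7 1 0]  -- the (7, 1, 0) vector from Bob's pseudocode
```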

- Bob

David Westergaard

Sep 28, 2015, 5:12:14 PM
to stan-...@googlegroups.com
I see. I was hoping to avoid that. I deliberately did not write out all group combinations in my previous example. The full number of covariate combinations I work with is 5,700, but the data is very sparse (I observe maybe ~300 different combinations). The full set with all covariates is,

Group (1,2,3), Count (integer), Sex (1,2), Age_i (1-10), Age_j (1-10), Type_i (1-4), Type_j (1-4)

and in reality these groups are combinations of ~1,700 other observed variables, so I was hoping to fit this into one big hierarchical model (~1 million regression models, each with up to 5,700 data points - but typically many fewer), but I am not sure I can fit that into memory if I have to expand all the data.

I think I am also struggling a bit with how theta is defined. Say I observe the following (with a reduced number of covariates),

Group: 1, count: 7, sex: 1, age_i: 1
Group: 1, count: 2, sex: 2, age_i: 1
Group: 2, count: 10, sex: 1, age_i: 5
Group: 2, count: 1, sex: 1, age_i: 1
Group: 3, count: 15, sex: 2, age_i: 5

How would you construct theta? I would think I could just make a linear combination of the combinations that I actually observe.

Cheers,
David

Bob Carpenter

Sep 28, 2015, 5:18:39 PM
to stan-...@googlegroups.com

> On Sep 28, 2015, at 5:12 PM, David Westergaard <da...@harsk.dk> wrote:
>
> [...]
>
> How would you construct theta? I would think I could just make a linear combination of the combinations that I actually observe.

Right. Somewhere you need to define the count and theta
for each combination. theta should be defined the same way
as without combinations.

Then you just need to include the data as you indicate above,
where C is the number of combinations:

data {
  int group[C];
  int count[C];
  int sex[C];
  int age[C];
  ...

But critically, you need the outcomes for each c in 1:C. This
is just a single outcome:

int y[C];

then you define

for (c in 1:C) {
  vector[K] theta;    // K = number of outcome categories
  theta <- ... define here ...
  increment_log_prob(count[c] * categorical_logit_log(y[c], theta));
}

OR, and this is an OR, not an AND, if you have multiple outcomes
observed per combination of predictors, you can instead define y as

int y[C,K];

where K is the number of possible outcomes. Then you just need:

for (c in 1:C) {
  vector[K] theta;
  theta <- ...
  y[c] ~ categorical_logit(theta);
}

- Bob

David Westergaard

Sep 28, 2015, 6:42:55 PM
to stan-...@googlegroups.com
Questions inline.
Sorry for the repeated emails, but this is the step where I think I fail to understand how to compute theta. categorical_logit_log does not accept a real as its second parameter, so it complains when I feed it one. I adjusted the code to what you suggested: https://gist.github.com/demodw/f2e581f312424f86b858 - except for y. Isn't y just what I name group/class?


OR, and this is an OR, not an AND, if you have multiple outcomes
observed per combination of predictors, you can instead define y as

  int y[C,K];

where K is the number of possible outcomes. Then you just need:

 for (c in 1:C) {
   vector[K] theta;
   theta <- ...
   y[c] ~ categorical_logit(theta);
 }

That’s useful to know. Thanks.

Cheers,
David

-- 

David Westergaard

Sep 29, 2015, 10:30:46 AM
to stan-...@googlegroups.com
OK. Re-reading your comments a few times, I think I finally understood it and managed to implement it as a model.

I restructured the data to be in the format,

count_outcome1, count_outcome2, count_outcome3, count_group, age_i, age_j, gender

So I can feed it directly into the model as,

(count_outcome1, count_outcome2, count_outcome3) ~ multinomial(softmax(b0 + b_age_i + b_age_j + b_gender))


When I run the model and re-generate the number of counts per outcome in the generated quantities block, it matches very closely what I am actually observing, so it seems to be working. The coefficient estimates are, however, still very odd. In many cases the 95% HDI ranges from -50 ± 10 to 50 ± 10. I am not quite sure why. It seems like the model has converged (all Rhat < 1.1), though some variables have a low number of effective samples (<20). But even for variables with a large number of effective samples (>150) and Rhat < 1.01, the estimates are very broad. I tried putting uniform hyperpriors on the priors of the coefficients, but even that did not help much. I am, however, wondering if the uniform is appropriate for that, or if there are other go-to hyperpriors. Any advice on this one?

Cheers,
David

Michael Betancourt

Sep 29, 2015, 10:56:14 AM
to stan-...@googlegroups.com

When I run the model and re-generate the number of counts per outcome in the generated quantities block, it matches very closely what I am actually observing, so it seems to be working. The coefficient estimates are, however, still very odd. In many cases the 95% HDI ranges from -50 ± 10 to 50 ± 10. I am not quite sure why. It seems like the model has converged (all Rhat < 1.1), though some variables have a low number of effective samples (<20). But even for variables with a large number of effective samples (>150) and Rhat < 1.01, the estimates are very broad. I tried putting uniform hyperpriors on the priors of the coefficients, but even that did not help much. I am, however, wondering if the uniform is appropriate for that, or if there are other go-to hyperpriors. Any advice on this one?

If the simulated data are consistent with the measured data, then why do you
claim that the coefficients are odd?  It sounds more like your model is
non-identified and these extreme values are entirely valid.  If there's
reason to prefer small coefficients, then you should encode that in
informative (or at least weakly informative) priors!
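The non-identification Michael describes is easy to see numerically: softmax is invariant to adding a constant to every element, so the likelihood alone puts no constraint on the overall location of the coefficients. A quick check (Python sketch, not from the thread):

```python
import numpy as np

def softmax(theta):
    # subtract the max for numerical stability; result is unchanged
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

theta = np.array([1.0, -0.5, 2.0])
shifted = theta + 100.0   # add any constant to every coefficient

# Identical probabilities -> the likelihood cannot pin down the constant,
# which is why a prior (or fixing one element to 0) is needed.
print(np.allclose(softmax(theta), softmax(shifted)))  # True
```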

Bob Carpenter

Sep 29, 2015, 2:46:28 PM
to stan-...@googlegroups.com
Andrew's been urging us away from uniform priors on intervals
and towards unconstrained priors (except as necessary, such as
lower=0 for scale/sd/variance parameters) with as
much information as possible packed into their parametric
form. This assumes you know the scale of the answers you expect.

- Bob

David Westergaard

Feb 8, 2016, 5:37:23 AM
to stan-...@googlegroups.com
Hi,

I have been changing the model a bit since the last one. I've fixed one class to always have coefficient = 0, to avoid identifiability problems.

When I run the model, it seems to run quite fast at the start (~50 samples per day). However, when it reaches about ~200 samples it slows down dramatically, and only draws about 1 sample per day. I'm wondering if I've unintentionally come up with a problematic posterior. I fixed one of the classes, so that K[6] = 0, which should fix the identifiability problem with softmax. I have tried to set wide priors, to give Stan some room to explore. As I have understood it, multinomial regression coefficients can be interpreted as relative risks. The priors I have specified now should give each coefficient a value of about ±3, which would give risks in the interval exp(±3) ≈ [0.05, 20], which are perfectly reasonable relative risk values.
I do have a rather large number of data points, 7,493,720 to be exact. They are, however, split over 187,343 combinations. This is because there are 40 combinations of discrete covariates per data point (2 sex and 10 age). There are 6 different outcomes.

Could there be any other problems, or bad specifications?

I have uploaded the model to https://gist.github.com/demodw/84e46c6bac8eb1e55d99. Any pointers on how to improve the efficiency would also be much appreciated! I could not find a multi_logit function, and softmax only takes vectors, so I had to do the log likelihood increment per data point.

Also, when I am running diagnostics, should I look at Rhat values for the raw (K-1) coefficients, or for the K coefficients? And can I somehow exclude the (K-1) coefficients from the output?


Variable names might be a bit silly, but it helps me keep track of where in the hierarchy I am. :-)


Thanks in advance!

David




Daniel Lee

Feb 8, 2016, 10:11:13 AM
to stan-users mailing list
Hi David,

On Mon, Feb 8, 2016 at 5:37 AM, David Westergaard <da...@harsk.dk> wrote:
When I run the model, it seems to run quite fast in the start (~50 samples per day). However, when it reaches about ~200 samples, it slows down dramatically, and only draws about 1 sample per day. I’m wondering if I’ve unintentionally come up with a problematic posterior.

Are you using the default algorithm? (NUTS with a diagonal mass matrix)
How many warmup iterations? What you may be seeing is a slowdown during warmup with a change in adaptation. After warmup, you typically won't see drastic differences between samples unless you run into problems in the posterior distribution that weren't discovered in warmup.
 
I fixed one of the classes, so that K[6] = 0, which should fix the identifiability problem with softmax. I have tried to set wide priors, to give Stan some room to explore. As I have understood it, multinomial regression coefficients can be interpreted as a relative risk. The priors I have specified now should give each coefficient a value -+3, which would give risks in the interval exp(+-3) = [0.05, 20], which are perfectly reasonable relative risk values. 
I do have a rather large number of data points, 7,493,720 to be exact. They are, however, split over 187,343 combinations. This is because there are 40 combinations of discrete covariates per data point ( 2 sex and 10  age). There are 6 different outcomes. 

Could there be any other problems, or bad specifications?

Try a second level of non-centered parameterization for the hyperparameters.
 

I have uploaded the model to https://gist.github.com/demodw/84e46c6bac8eb1e55d99. Any pointers on how to improve the efficiency would also be much appreciated! I could not find a multi_logit function, and softmax only takes vectors, so I had to do the log likelihood increment per data point.

Also, when I am running diagnostics, should I look at Rhat values for the raw (K-1) coefficients, or for the K coefficients?

Just look at the K-1 coefficients. If you know the 0th one is always 0, then Rhat will be NaN (or some other nonsensical number).
 
And can I somehow exclude the (K-1) coefficients from the output?

Depends on the interface.

David Westergaard

Feb 8, 2016, 10:19:40 AM
to stan-...@googlegroups.com
Hi Daniel,

Thanks for the reply! I have some additional questions inline.

On 08 Feb 2016, at 16:11, Daniel Lee <bea...@alum.mit.edu> wrote:

Hi David,

On Mon, Feb 8, 2016 at 5:37 AM, David Westergaard <da...@harsk.dk> wrote:
When I run the model, it seems to run quite fast in the start (~50 samples per day). However, when it reaches about ~200 samples, it slows down dramatically, and only draws about 1 sample per day. I’m wondering if I’ve unintentionally come up with a problematic posterior.

Are you using the default algorithm? (NUTS with a diagonal mass matrix)
How many warmup iterations? What you may be seeing is a slow down in warmup with a change in adaptation. After warmup, you typically won't see drastic differences between samples unless you run into problems in the posterior distribution that weren't discovered in warmup.

I’m doing 400 warmup and 400 samples, so it was only about halfway through the warm-up. Using CmdStan with default parameters for sampling. 


 
I fixed one of the classes, so that K[6] = 0, which should fix the identifiability problem with softmax. I have tried to set wide priors, to give Stan some room to explore. As I have understood it, multinomial regression coefficients can be interpreted as a relative risk. The priors I have specified now should give each coefficient a value -+3, which would give risks in the interval exp(+-3) = [0.05, 20], which are perfectly reasonable relative risk values. 
I do have a rather large number of data points, 7,493,720 to be exact. They are, however, split over 187,343 combinations. This is because there are 40 combinations of discrete covariates per data point ( 2 sex and 10  age). There are 6 different outcomes. 

Could there be any other problems, or bad specifications?

Try a second level of non-centered parameterization for the hyperparameters.

How so? Parametrize hyper_* so that they are also pulled from a normal(0,1) distribution? Would I just multiply the hyper_hyper value on afterwards?


 

I have uploaded the model to https://gist.github.com/demodw/84e46c6bac8eb1e55d99. Any pointers on how to improve the efficiency would also be much appreciated! I could not find a multi_logit function, and softmax only takes vectors, so I had to do the log likelihood increment per data point.

Also, when I am running diagnostics, should I look at Rhat values for the raw (K-1) coefficients, or for the K coefficients?

Just look at the K-1 coefficients. If you know the 0th one is always 0, then Rhat will be NaN (or some other nonsensical number).
 
And can I somehow exclude the (K-1) coefficients from the output?

Depends on the interface.

I’m using CmdStan2.9 and I can’t see any way of specifying which parameters to save. Moving the *_raw parameters to the model block does not seem to be a possibility, since I am using them for assignment in the transformed parameters block.

Cheers,

Daniel Lee

Feb 8, 2016, 10:28:23 AM
to stan-users mailing list
On Mon, Feb 8, 2016 at 10:19 AM, David Westergaard <da...@harsk.dk> wrote:
Hi Daniel,

Thanks for the reply! I have some additional questions inline.

On 08 Feb 2016, at 16:11, Daniel Lee <bea...@alum.mit.edu> wrote:

Hi David,

On Mon, Feb 8, 2016 at 5:37 AM, David Westergaard <da...@harsk.dk> wrote:
When I run the model, it seems to run quite fast in the start (~50 samples per day). However, when it reaches about ~200 samples, it slows down dramatically, and only draws about 1 sample per day. I’m wondering if I’ve unintentionally come up with a problematic posterior.

Are you using the default algorithm? (NUTS with a diagonal mass matrix)
How many warmup iterations? What you may be seeing is a slow down in warmup with a change in adaptation. After warmup, you typically won't see drastic differences between samples unless you run into problems in the posterior distribution that weren't discovered in warmup.

I’m doing 400 warmup and 400 samples, so it was only about halfway through the warm-up. Using CmdStan with default parameters for sampling. 


There's a description of adaptation in the Stan manual. At ~200 iterations it would have changed the adaptation parameters, which may cause it to slow down.

 
[...]

Could there be any other problems, or bad specifications?

Try a second level of non-centered parameterization for the hyperparameters.

How so? Parametrize hyper_* so that they are also pulled from a normal(0,1) distribution? Would I just multiply the hyper_hyper value on afterwards?


Exactly. And you'd update the lowest level parameters too.
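A minimal sketch of what the non-centered transform means here (Python/NumPy, with made-up values for the hyperparameters mu and tau): the sampler works with unit-scale raw parameters, and the location and scale are multiplied on afterwards, exactly as David guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 3.0, 0.5          # hypothetical hyperparameter values

# Centered:      beta ~ normal(mu, tau)
# Non-centered:  beta_raw ~ normal(0, 1);  beta = mu + tau * beta_raw
beta_raw = rng.normal(0.0, 1.0, size=100_000)
beta = mu + tau * beta_raw

# Same implied distribution, but the sampler explores the unit-scale
# beta_raw, which is decorrelated from mu and tau.
print(round(beta.mean(), 1), round(beta.std(), 1))  # 3.0 0.5
```

Applying a second level just repeats the same trick for the hyperparameters themselves.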

 


 

[...]
 
And can I somehow exclude the (K-1) coefficients from the output?

Depends on the interface.

I’m using CmdStan2.9 and I can’t see any way of specifying which parameters to save. Moving the *_raw parameters to the model block does not seem to be a possibility, since I am using them for assignment in the transformed parameters block.

Parameters and transformed parameters are saved in the output. If you want things that you've declared in transformed parameters not to appear in the output at all, move the declaration to the model block and the definition there as well.


David Westergaard

Feb 8, 2016, 11:01:50 AM
to Stan users mailing list
OK. I updated the code with parameterisation of all coefficients (https://gist.github.com/demodw/84e46c6bac8eb1e55d99). I had to move the hyperpriors into a loop, though, since these are real[] and there is no operation for multiplying a collection of reals (real[]) by a real.

I cannot figure out how to move all the raw parameters outside the parameters block, though. I cannot do any assignment in the model block, so that has to be done in the transformed parameters block. And I cannot declare variables used in the transformed parameters block in the model block, so I am a bit stuck.

I'm still wondering if there are more operations I could write more efficiently. I also still think it's a bit odd that the slowdown happens at ~200 iterations. Wouldn't it start already in the slow adaptation window, around 75 iterations?

David

David Westergaard

Feb 9, 2016, 4:48:24 AM
to stan-...@googlegroups.com
After changing the code, it's seemingly running even slower now. In the last 15 hours it has only drawn 18 and 10 samples (I'm running two chains). Could this just be a really slow initial adaptation, and it will speed up later? Or will it only get worse during the second adaptation period?

edit: did something change with the Stan group? For whatever reason I'm not seeing my own post on the web interface (https://groups.google.com/forum/#!categories/stan-users/general), but I am getting it by email.

David 

Bob Carpenter

Feb 9, 2016, 4:02:15 PM
to stan-...@googlegroups.com
There's a lot of variation due to warmup. If something's
going this slowly and it's not because of a huge density
function (that is, there are lots of leapfrog steps per
iteration), then you want to think about reparameterizing
if at all possible to remove extreme curvature and correlations.

The K-variable parameterization of the simplex with softmax
is seriously non-identified, which can often be a problem
in sampling --- it produces a narrow ridge of high density,
as described in the problematic posteriors section.

One thing you can do is move to only dealing with (K-1)
parameters, but that leads to asymmetry in the predictors.
Either that, or you need strong priors to identify the
regression.

You can speed up a lot of your loops with vectorization, e.g.,

for (d in 1:D) {
  // transform hyperpriors to have scale = hyper-hyperprior
  hyper_0[d] <- hyper_0_raw[d] * hyper_hyper_0;
  hyper_sex[d] <- hyper_sex_raw[d] * hyper_hyper_sex;

can become

hyper_0 <- hyper_0_raw * hyper_hyper_0;
hyper_sex <- hyper_sex_raw * hyper_hyper_sex;

and even more importantly,


for (d in 1:D) {
  beta_0_raw[d] ~ normal(0, 1);

can become

to_vector(beta_0_raw) ~ normal(0, 1);

if you declare the relevant vectors/matrices instead of arrays (see the manual
chapter on arrays vs. vectors --- you need vectors to do vector
operations).
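The looped and vectorized versions are mathematically identical; the win in Stan comes from evaluating the gradient with far fewer autodiff nodes, not from changing the math. A quick numeric check of that equivalence (Python sketch, standard-normal log-density written out by hand):

```python
import numpy as np

def normal_lpdf(x):
    # log-density of a standard normal, elementwise
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

beta_0_raw = np.array([0.3, -1.2, 0.7, 2.1])

# Looped increments, one term per coefficient...
loop_lp = sum(normal_lpdf(b) for b in beta_0_raw)

# ...equal a single vectorized statement
vec_lp = normal_lpdf(beta_0_raw).sum()

print(np.isclose(loop_lp, vec_lp))  # True
```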

- Bob

> On Feb 9, 2016, at 4:48 AM, David Westergaard <da...@harsk.dk> wrote:
>
> Changing the code, it's seemingly running even slower now. In the last 15 hours it's only drawn 18 and 10 samples (I'm running two chains). Could this just be a really slow initial adaption, and it will speed up later? Or will it only get worse during the second adaption period?
>

David Westergaard

Feb 10, 2016, 4:02:40 AM
to stan-...@googlegroups.com
I am not sure how to quantify the number of leapfrog steps per iteration. According to the diagnostics, the gradient evaluation took 17.98 seconds, which is not too bad I'd say. Are there any other diagnostic parameters I can look at? By re-parameterising, do you mean something like multinomial-Poisson? E.g. http://stats.stackexchange.com/questions/105282/logistic-multinomial-regression-as-two-multiple-poisson-regressions. I know that this is the typical frequentist approach, since it's difficult to get a multinomial regression to converge using MLE. Or so I've been told, at least.

I thought the multinomial softmax would be fully identified by setting one of the elements to 0. That's what I got from reading https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model, at least. I am getting quite a few errors that hyper_0 is -nan at the start, but I guess that is due to the constraints on the scale parameters.

I updated the code with your vectorisation suggestions at https://gist.github.com/demodw/84e46c6bac8eb1e55d99. I am not yet sure if it's running faster, but at least the code is cleaner.

Cheers,
David

Bob Carpenter

Feb 10, 2016, 2:33:17 PM
to stan-...@googlegroups.com
I meant reparameterizing the same likelihood, not moving to an
approximation. For instance, using K-1 in softmax. If you
do set one of the elements to 0 in a K-vector softmax, that
will be identified. I must have missed that in your model.

- Bob

David Westergaard

Feb 11, 2016, 11:16:01 AM
to stan-...@googlegroups.com
To be fair, I actually think the use of multiple Poisson models instead of a multinomial is equivalent - in MLE terms, at least, and I've found a few articles with a Bayesian focus saying the same. The downside is that this kind of model causes a parameter explosion. I tried to reparameterise the multinomial as a Poisson log-linear model (https://gist.github.com/demodw/8c16d1eaa359aa36348d), following the recipe at http://data.princeton.edu/wws509/notes/c6.pdf (pages 6-7). It seems like it has a hard time adapting, or I messed something up. It has not gotten to the first iteration after a few hours, and I keep getting the errors:

"Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
validate transformed params: hyper_0[1] is -nan, but must be greater than or equal to 0
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.”

Normally this would be fine; hyper_0 is, after all, constrained to be > 0. Usually I only see the error for the first few iterations.

The last element of K should be zero. At least, I force the last element of every beta coefficient to be zero. Is there a chance the hierarchical structure or prior distributions could be causing adaptation problems? Or could it be the distribution in each group? Group 1 usually has a few million counts, whereas groups 2-6 are relatively rare.
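The multinomial-Poisson equivalence David refers to can be verified numerically: a set of independent Poissons factors exactly into a Poisson for the total times a multinomial for the split. A sketch (Python, with hypothetical counts and rates):

```python
import math

def pois_lpmf(y, lam):
    # Poisson log-pmf: y * log(lam) - lam - log(y!)
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def mult_lpmf(y, p):
    # Multinomial log-pmf for count vector y with probabilities p
    n = sum(y)
    return (math.lgamma(n + 1) - sum(math.lgamma(yk + 1) for yk in y)
            + sum(yk * math.log(pk) for yk, pk in zip(y, p)))

y = [7, 1, 3]              # hypothetical counts per outcome
lam = [6.0, 2.0, 2.5]      # hypothetical Poisson rates
total = sum(lam)
p = [l / total for l in lam]

# prod_k Pois(y_k; lam_k) == Pois(N; sum lam) * Mult(y; N, lam / sum lam)
lhs = sum(pois_lpmf(yk, lk) for yk, lk in zip(y, lam))
rhs = pois_lpmf(sum(y), total) + mult_lpmf(y, p)
print(abs(lhs - rhs) < 1e-9)  # True
```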

David

David Westergaard

Feb 19, 2016, 11:20:09 AM
to stan-...@googlegroups.com
I made some tweaks and managed to at least get it running. Now the problem is, it's not converging! Normally I would just throw more iterations at the model, but that does not seem like a good solution in this case. There are some indications that something else is going on, something I don't quite understand. I read an excellent post by Michael Betancourt at https://groups.google.com/forum/#!topic/stan-users/hjYWqJZfSgw (I don't know why I had not found that post until now - it was so useful! It should really be exposed more), and I've been using the suggestions from there to diagnose the model.

The simplified model can be seen at https://gist.github.com/demodw/98424b08f42bcd8e09dc. Just to re-iterate, this is a hierarchical multinomial regression model over D experiments, and in each experiment there are the same K outcomes. K=6 is set as the reference category and fixed to be zero. There is a hyperprior on outcomes 1 and 2, and another on outcomes 3, 4, and 5.

I decreased the size of my data set a lot and removed all of the covariates (age, age^2 and sex) to make diagnostics simpler. This R code will generate input data of the same shape and in the same size range:

library(rstan)  # provides stan_rdump

NROW <- 100
K <- 6
NChapters <- 3
Count <- matrix(nrow=NROW, ncol=K)
Count[,1] <- round(runif(NROW, 20, 2000))
Count[,2] <- round(runif(NROW, 20, 2000))
Count[,3] <- round(runif(NROW, 20, 300))
Count[,4] <- round(runif(NROW, 20, 300))
Count[,5] <- round(runif(NROW, 20, 300))
Count[,6] <- round(runif(NROW, 1e6, 4e6))

# Quite a lot of outcomes from group 3, 4, or 5 are zero.
Count[,3][sample(NROW, 30, replace=FALSE)] <- 0
Count[,4][sample(NROW, 30, replace=FALSE)] <- 0
Count[,5][sample(NROW, 30, replace=FALSE)] <- 0

Di <- seq_len(NROW)
DUnique <- NROW
N <- NROW

D <- NROW
ICD_Chapter <- round(runif(NROW, 1, NChapters))
ICD_B <- round(runif(NROW, 1, NROW))
ICD_A <- round(runif(NROW, 1, NROW))

ICD_combination_index_length <- 10
ICD_combination_index <- round(runif(NROW, 1, ICD_combination_index_length))

stan_rdump(c("NROW", "K", "NChapters", "Count", "Di", "DUnique", "N", "D", "ICD_Chapter", "ICD_B", "ICD_A", "ICD_combination_index", "ICD_combination_index_length"), file='dummy.R')


Test 1. Default settings. 
R-hat for the beta_0_raw's are between 1.1 and 3154. Neff between 2 and 28. Clearly not converged. Trace plots show that chains get stuck and don't mix at all; see http://imgur.com/Mz916tr. 3 divergences. A lot of iterations seem to hit the maximum tree depth.

Test 2. Increased acceptance probability to 0.9
Same story. A bit of mixing on some parameters, but overall no mixing at all. 9 divergences. Does not hit max tree depth.

Test 3. Decrease step size to 0.1. 
R-hats between 1.06 and 15333. Some chains are trying to explore values instead of just sticking with the same values for all iterations, but most are just stuck. See http://imgur.com/a3wpmI1. 8 divergences. Does not hit max tree depth.

Test 4. Decrease step size to 0.01.
R-hat between 1.09 and 641. Same problem; chains are not exploring any values. Does not hit max tree depth.

Test 5. Decrease step size to 0.01 and increase delta to 0.95
No difference, same result. 

Test 6. Decrease step size to 0.01 and increase delta to 0.95, 2000 warmup steps, 2000 samples.
No difference. 


Same 6 tests, using a Student-t prior with df=3, center=0, scale=10. Same results. High R-hat values, no mixing. 

Adding hyper-hyper priors onto the hyper priors does not change the situation either. 

The odd thing is, if I remove the priors on the scale, all chains mix beautifully, all R-hat values are <1.0001, and there are zero divergences. Why would putting a prior on the scale suddenly change this? Could there be identifiability issues in the posterior? The prior on the scale of the coefficients should be “weakly informative”, according to posts I’ve read on this forum. The density of the converged beta_0_raw’s has most of its mass around -6 to -8. This is probably because the reference category always has a rather large count. I don’t see why a student-t with 3 df and a rather large hyperprior on the scale should not be able to explore this area.

Could anyone offer some insight into what might be wrong? 

Cheers,
David

Bob Carpenter

unread,
Mar 1, 2016, 6:43:56 PM3/1/16
to stan-...@googlegroups.com
I missed that this hadn't gotten a response, so I'm trying
to catch up. And I'm actually quite interested in the ICD
coding problem (particular inter-rater agreement and gold
standard inference --- there's an example in the manual for
the Dawid and Skene model).

I couldn't follow all the indexing here:


for (d in 1:D) {
  // Scale coefficients
  beta_0[d, 1] <- beta_0_raw[d, 1] * hyper_icd_code[ICD_A[d]];
  beta_0[d, 2] <- beta_0_raw[d, 2] * hyper_icd_code[ICD_B[d]];

  for (k in 3:5) {
    beta_0[d, k] <- beta_0_raw[d, k] * hyper_icd_combination[d];
  }
  beta_0[d, 6] <- 0;
}

so I suspect that's where your issues are at. In particular, I
don't understand why ICD_A[d] is being used as an index for
scale for the (d,1) and (d,2) entries of beta_0.

You don't need theta_hat, you can just do

for (n in 1:N)
  Count[n] ~ multinomial(softmax(beta_0[Di[n]]'));

where ' is for transpose; that's because beta_0[Di[n]]
is the Di[n]-th row of beta_0, so it's a row vector.
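For reference, the reason the aggregated-counts formulation from the start of the thread gives the same posterior as per-observation categorical statements is that the multinomial log-density and the count-weighted sum of categorical log-probabilities differ only by the log multinomial coefficient, a constant that does not depend on the parameters. A quick numerical check (Python/numpy sketch; the counts and coefficients are made up):

```python
import numpy as np
from math import lgamma

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def multinomial_logpmf(y, p):
    n = y.sum()
    coef = lgamma(n + 1) - sum(lgamma(k + 1) for k in y)
    return coef + float((y * np.log(p)).sum())

def weighted_categorical_log(y, p):
    # Sum of categorical log-probs over all observations, grouped by counts.
    return float((y * np.log(p)).sum())

y = np.array([3, 1, 6])  # aggregated counts per category (illustrative)

# The gap between the two targets is the same for any parameter values,
# so both define the same posterior up to an additive constant.
diffs = [multinomial_logpmf(y, softmax(b)) - weighted_categorical_log(y, softmax(b))
         for b in (np.array([0.3, -1.2, 0.5]), np.array([2.0, 0.0, -1.0]))]
assert abs(diffs[0] - diffs[1]) < 1e-9
```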

If the data's separable, you need priors that will control
the growth of the parameters away from +/- infinity.

The main reason putting a prior on something can cause
non-convergence is that it's not consistent with the data.
Again, check out the separability of the data.

- Bob

David Westergaard

unread,
Mar 2, 2016, 3:55:00 AM3/2/16
to stan-...@googlegroups.com

> On 02 Mar 2016, at 00:42, Bob Carpenter <ca...@alias-i.com> wrote:
>
> I missed that this hadn't gotten a response, so I'm trying
> to catch up. And I'm actually quite interested in the ICD
> coding problem (particular inter-rater agreement and gold
> standard inference --- there's an example in the manual for
> the Dawid and Skene model).

Right — I’ll have a look at this example. I might be able to extract something, even though we are not looking at the coding problem per se. I’ll also be sure to send a copy of the article to the Stan list when it gets published. ;-)

>
> I couldn't follow all the indexing here:
>
>
> for (d in 1:D) {
> // Scale coefficients
> beta_0[d, 1] <- beta_0_raw[d, 1] * hyper_icd_code[ICD_A[d]];
> beta_0[d, 2] <- beta_0_raw[d, 2] * hyper_icd_code[ICD_B[d]];
>
> for (k in 3:5) {
> beta_0[d, k] <- beta_0_raw[d, k] * hyper_icd_combination[d];
> }
> beta_0[d, 6] <- 0;
> }
>
> so I suspect that's where your issues are at. In particular, I
> don't understand why ICD_A[d] is being used as an index for
> scale for the (d,1) and (d,2) entries of beta_0.

Yeah, I guess this one requires a bit of an explanation. What we are looking at is co-occurrences of diagnoses. So imagine four outcome categories: 0, A, B, A+B. Category 1 corresponds to neither A nor B being reported, category 2 to only A being observed, and so on. We are looking at all possible combinations over a large number of diagnoses, and the idea is that the base rate of having diagnosis A is approximately the same over all combinations. This may not be true for some combinations, but for the vast majority it is indeed true.
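To make that category coding concrete, here is a tiny sketch (the function name is illustrative, not from the model; the 3/4 assignment is my assumption for the "and so on" part):

```python
# Map two binary diagnosis indicators to the four outcome categories
# described above: 1 = neither, 2 = A only, 3 = B only, 4 = A+B.
def category(has_a: bool, has_b: bool) -> int:
    return 1 + int(has_a) + 2 * int(has_b)

assert category(False, False) == 1  # neither A nor B reported
assert category(True, False) == 2   # only A observed
assert category(False, True) == 3   # only B observed
assert category(True, True) == 4    # both A and B observed
```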

>
> You don't need theta_hat, you can just do
>
> for (n in 1:N)
> Count[n] ~ multinomial(softmax(beta_0[Di[n]]'));
>
> where ' is for transpose; that's because beta_0[Di[n]]
> is the Di[n]-th row of beta_0, so it's a row vector.
>
> If the data's separable, you need priors that will control
> the growth of the parameters away from +/- infinity.
>
> The main reason putting a prior on something can cause
> non-convergence is that it's not consistent with the data.
> Again, check out the separability of the data.

I found this out the hard way. The problem is not so much the separability of the data, but the volume of data. Having a disease is a rare event, having a combination of two diseases is an even rarer event. Having coefficient priors centered around zero does not really accommodate this, since most coefficients seem to be in the range -12 to -6.

Having said that, it’s running, albeit a bit slow. I’ll have to wait for it to finish in a week or so to check convergence and tree depth. A few questions on speed,

1. Which is faster? R-like indexing (i.e. beta_0[, 6] <- 0;) or normal indexing (beta_0[d, 6] <- 0;)?

2. If A is a matrix[I, J] and A_Raw is a matrix[I, J], is there a faster way to scale A than,

for (i in 1:I) {
  A[i] <- A_Raw[i] * sd + center;
}

3. If B is a matrix[I, J] and B_Raw is a matrix[I, J], is there a faster way to scale B with a hyperprior C, which is a vector[J], than:

for (i in 1:I) {
  B[i] <- B_Raw[i] .* C;
}


4. If I add another covariate, say beta_age, does the following line of code actually do what I want for multinomial-softmax regression?

theta_hat <- softmax(to_vector(beta_0[Di[n]] + beta_age[Di[n]] * AgeZ[n]));

and can it be re-written as,

theta_hat <- softmax(beta_0[Di[n]]' + beta_age[Di[n]] * AgeZ[n]');

(I know the theta_hat variable is not needed, just keeping it for clarity :-))


Thanks,
David

PS. multinomial_logit would be awesome ;-)

Bob Carpenter

unread,
Mar 3, 2016, 7:50:02 PM3/3/16
to stan-...@googlegroups.com
How many iterations are you running? I'd suggest running fewer
to see how it's going.

> A few questions on speed,
>
> 1. Which is faster? R-like indexing (i.e. beta_0[, 6] <- 0;) or normal indexing (beta_0[d, 6] <- 0;)?

The first one won't work because you're trying to assign a scalar to an array.

>
> 2. if A is a matrix[I, J] and A_Raw is a matrix[I, J], is there a faster way to scale A than,
>
> for (i in 1:I) {
> A[i] <- A_Raw[i] * sd + center
> }

Yes, matrix operations:

A <- A_Raw * sd + center;

> 3. If B is a matrix[I, J] and B_Raw is a matrix[I, J], is there a faster way to scale B with a hyperprior C, which is a vector[J], than:
>
> for (i in 1:I) {
> B[i] <- B_Raw[i] .* C
> }

Not really, but also that's not going to be particularly efficient
because matrices are organized by column and B[i] is pulling out
a row. I don't know what you do elsewhere, but you might want
to use an array of vectors for B. The loop's the same, but it's
more efficient.
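Bob's point about column-major storage can be seen with numpy, which exposes the same layout distinction (Fortran order mirrors how Stan stores matrices; this is an illustration, not Stan itself):

```python
import numpy as np

# Column-major (Fortran-order) matrix, like Stan's matrix type.
B = np.asfortranarray(np.arange(12.0).reshape(3, 4))

row = B[0, :]   # row extraction: strided, non-contiguous memory access
col = B[:, 0]   # column extraction: contiguous memory access

assert not row.flags['C_CONTIGUOUS'] and not row.flags['F_CONTIGUOUS']
assert col.flags['F_CONTIGUOUS']
```

An array of vectors stores each vector contiguously, which is why it is the friendlier layout when the code keeps pulling out whole rows.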

> 4. If I add another covariate, say beta_age, does this following line of code actually do what I want for multinomial-softmax regression?
>
> theta_hat <- softmax(to_vector(beta_0[Di[n]] + beta_age[Di[n]] * AgeZ[n]))
>
> and can it be re-written to,
>
> theta_hat <- softmax(beta_0[Di[n]]' + beta_age[Di[n]] * AgeZ[n]')

What are the types of the variables? The to_vector operation applied
to a row vector does the same thing as transpose. So you'd probably
just want

(beta_0[Di[n]] + beta_age[Di[n]] * AgeZ[n])'

to avoid transposing twice. But even better, make beta_0 an array
of vectors and you won't need any copies.

> (I know the theta_hat variable is not needed, just keeping it for clarity :-))

That's fine. It's better to do that than pack a huge expression in
as an argument. It can cause an extra copy, but that's no big deal
given everything else going on on the math side.

> Thanks,
> David
>
> PS. multinomial_logit would be awesome ;-)

Agreed. We just need someone to write it.

- Bob

David Westergaard

unread,
Mar 23, 2016, 6:12:57 PM3/23/16
to stan-...@googlegroups.com
I've made some progress on the model, but now I've run into another problem: convergence. In particular, even for a very small data set the model requires a tree depth >= 13 to avoid saturation.
I've downscaled my problem to only include 10 combinations of diseases, and thereby 200 data points (2 genders * 10 age groups). The events I am trying to model are rather rare, and quite a few of the groups have a zero count. I've added a pseudo-count of 1 to all groups, which seems to have helped convergence a lot (before that I couldn't even get it to converge).

Since these are rare events, I used the advice from one of the Gelman articles and put student_t(3, 0, 10) on the intercept, student_t(3, 0, 5) on the male/female priors, and student_t(3, 0, 2.5) on the age/age^2 priors. Inspecting the results from a run indicates that the intercept has a range of -15 to 0, the male/female coefficient is between -10 and 5, age between -5 and 5, and age^2 between -4 and 6.

The trace of the log-posterior looks a bit odd, almost discrete. I am guessing this is due to the large amount of data, though. All the other parameters mix really well: all R-hat values < 1.01, Neff is as high as it can get, and the Monte Carlo SE is ~0.03 for all parameters. The step sizes for the four chains were 0.00044, 0.00042, 0.00049, and 0.00047. The elements of the diagonal mass matrix are:

Chain 1: 14.043, 12.9052, 12.6201, 15.2883, 10.0836, 12.2389, 13.1901, 9.86744, 14.9441, 11.394, 12.4317, 13.917, 12.2393, 14.5692, 10.5179, 11.5033, 9.76957, 12.3624, 12.5164, 13.7865, 11.6927, 10.7039, 10.0851, 15.55, 13.2367, 11.4659, 13.269, 14.1504, 14.6697, 13.2124, 15.4585, 15.867, 12.0979, 12.6642, 13.7026, 11.044, 15.9293, 16.3373, 13.9431, 13.8957, 15.6302, 17.7608, 12.4378, 15.1601, 10.4172, 12.3482, 15.4909, 14.1474, 11.2272, 15.4267, 14.059, 12.8937, 12.6428, 15.2869, 10.0817, 12.207, 13.177, 9.85136, 14.9569, 11.389, 12.4311, 13.912, 12.2376, 14.5378, 10.522, 11.5037, 9.77343, 12.3466, 12.5015, 13.7852, 11.6593, 10.7428, 10.1734, 15.6586, 13.2999, 11.2834, 13.4555, 13.9869, 14.6193, 13.2874, 15.432, 16.0997, 11.9823, 12.7166, 13.495, 10.9287, 15.643, 16.4322, 13.7044, 13.9563, 15.5406, 17.6067, 12.5299, 15.1678, 10.4524, 12.3442, 15.5319, 14.1552, 11.3709, 15.2928, 14.0759, 12.8734, 12.6454, 15.2762, 10.0792, 12.2066, 13.1888, 9.85841, 14.9579, 11.379, 12.44, 13.9121, 12.237, 14.5394, 10.5253, 11.5035, 9.77561, 12.3469, 12.4988, 13.7812, 11.4006, 10.6821, 10.2495, 15.7533, 13.2385, 11.3722, 13.3422, 14.0025, 14.586, 13.2241, 15.5479, 16.0409, 11.951, 12.6098, 13.7558, 10.8319, 15.7109, 16.405, 13.6344, 13.8813, 15.5486, 17.7878, 12.4304, 15.0649, 10.3472, 12.3262, 15.5715, 14.2462, 11.2461, 15.243, 0.00630113, 0.168866, 0.0277536, 0.00769466, 0.000554683, 0.0165438, 0.03188, 0.00206091, 0.000314294, 0.00823307, 0.00106522, 0.000583316, 0.000260593, 0.0128949, 0.00697205, 0.000336129, 0.00123224, 0.00217515, 0.00677697, 0.00024906, 0.246635, 0.625805, 0.203487, 0.681022, 0.326638, 0.253975, 0.486852, 0.243647, 0.208211, 0.153335, 0.377292, 0.525492, 0.165444, 0.563578, 0.296085, 0.208064, 0.483836, 0.191759, 0.193616, 0.18571, 0.632175, 0.505585, 0.300669, 0.52072, 0.661498, 1.00162, 0.700852, 0.507173, 0.615634, 0.346109, 0.0039918, 0.0773755, 0.0120854, 0.00445177, 0.000444362, 0.00687884, 0.0159115, 0.00131414, 0.000185726, 0.00509776, 0.000639602, 
0.000310973, 0.000144766, 0.00590522, 0.00446192, 0.000216318, 0.000766526, 0.000833341, 0.0041347, 0.000172936, 0.183275, 0.459222, 0.118468, 0.580052, 0.266814, 0.145126, 0.396592, 0.126621, 0.122745, 0.0948417, 0.256231, 0.415748, 0.0848545, 0.429427, 0.243844, 0.116547, 0.428908, 0.0839072, 0.110291, 0.122927, 0.563359, 0.445666, 0.214986, 0.439856, 0.64846, 0.964654, 0.613833, 0.418005, 0.54097, 0.242135

Chain 2: 12.6517, 13.1982, 15.2656, 20.9405, 8.73418, 17.482, 16.5468, 11.7412, 9.5733, 14.1408, 11.1094, 12.3986, 13.7714, 9.54041, 13.0299, 16.0783, 15.1513, 9.81916, 8.80175, 10.8002, 12.7513, 14.1615, 11.5979, 12.6386, 13.4697, 19.5252, 12.3008, 14.9694, 15.7475, 18.9507, 12.265, 11.7888, 12.1943, 10.8437, 13.2154, 10.0719, 14.6342, 12.8728, 13.3879, 13.8314, 15.8223, 16.6057, 16.263, 13.6203, 12.4016, 16.1087, 12.155, 15.1825, 11.4547, 16.9843, 12.6454, 13.0691, 15.3317, 20.9983, 8.73331, 17.4265, 16.5529, 11.7559, 9.57266, 14.1711, 11.1193, 12.3998, 13.7665, 9.54608, 13.0833, 16.0789, 15.1623, 9.83349, 8.8448, 10.7998, 13.0554, 14.0544, 11.6285, 12.4366, 13.5089, 19.7318, 12.33, 15.0357, 15.8051, 19.1019, 12.0503, 11.9396, 12.0963, 10.8348, 13.2689, 10.0881, 14.8307, 12.8654, 13.2574, 13.7798, 15.7874, 16.6641, 16.2162, 13.5524, 12.4889, 16.1128, 12.276, 14.8675, 11.4081, 16.8622, 12.6329, 13.1104, 15.3413, 20.9762, 8.72999, 17.4245, 16.556, 11.7568, 9.569, 14.1391, 11.114, 12.4015, 13.7693, 9.53651, 13.0602, 16.0794, 15.1518, 9.83596, 8.84041, 10.7995, 13.0017, 14.1547, 11.6693, 12.5618, 13.4853, 19.6915, 12.2769, 15.0388, 15.8927, 19.133, 12.193, 11.6323, 12.1016, 10.8778, 13.1324, 10.1105, 14.5407, 12.8486, 13.2688, 13.8299, 15.72, 16.8677, 16.3508, 13.5722, 12.4325, 16.0945, 12.0867, 15.0638, 11.5947, 16.9571, 0.00612744, 0.158244, 0.0256539, 0.00695229, 0.000741133, 0.0196976, 0.0238623, 0.00231263, 0.000371107, 0.00633083, 0.000742321, 0.000465582, 0.000274826, 0.0164803, 0.00579741, 0.000295521, 0.00135117, 0.00213075, 0.00653021, 0.000194419, 0.416122, 0.489884, 0.310308, 0.726444, 0.351643, 0.253591, 0.420063, 0.173823, 0.184423, 0.165494, 0.273572, 0.496527, 0.145313, 0.512722, 0.31031, 0.221953, 0.568573, 0.187525, 0.187225, 0.168169, 0.609625, 0.717075, 0.287121, 0.740436, 0.628548, 0.805814, 0.888566, 0.416725, 0.667942, 0.314826, 0.00369409, 0.074658, 0.0108575, 0.00387918, 0.000601269, 0.00817617, 0.0123586, 0.0015731, 0.000210346, 0.00404489, 
0.000444415, 0.000248328, 0.000154655, 0.00754708, 0.00378347, 0.000201781, 0.000817857, 0.000810737, 0.00396325, 0.000131968, 0.313518, 0.361978, 0.175671, 0.626368, 0.293768, 0.143101, 0.333206, 0.0892489, 0.106372, 0.100504, 0.192008, 0.389598, 0.0742991, 0.404543, 0.250599, 0.128054, 0.506297, 0.0882338, 0.109104, 0.112346, 0.544274, 0.62666, 0.196595, 0.649313, 0.597799, 0.725706, 0.788202, 0.321053, 0.549322, 0.224486

Chain 3: 13.5858, 16.4967, 9.62417, 12.6452, 11.2762, 12.4675, 10.7077, 10.2308, 8.44112, 11.635, 8.1961, 13.5138, 10.9681, 8.36842, 12.2783, 10.8616, 12.6745, 8.45959, 12.0043, 10.1173, 14.8228, 11.43, 10.7602, 10.9639, 11.3344, 12.3334, 10.52, 18.1337, 11.6404, 11.176, 12.318, 14.802, 12.4747, 14.5665, 16.841, 15.0325, 12.8604, 13.4395, 12.7977, 12.0013, 13.2944, 12.9255, 16.5516, 17.9323, 14.037, 11.3665, 11.9487, 14.4922, 18.0052, 9.29764, 13.584, 16.2652, 9.62638, 12.6789, 11.2786, 12.4955, 10.7472, 10.2323, 8.43955, 11.6204, 8.19819, 13.5014, 10.9604, 8.34732, 12.2703, 10.859, 12.6617, 8.4509, 12.0006, 10.1132, 14.6719, 11.4477, 10.7859, 11.0018, 11.2352, 12.2959, 10.5911, 17.9625, 11.5444, 11.2262, 12.3286, 14.8234, 12.5637, 14.5765, 16.9907, 14.759, 12.7046, 13.4062, 12.5785, 11.9255, 13.2798, 12.9289, 16.6792, 18.0736, 14.123, 11.4587, 11.9927, 14.3993, 17.9988, 9.37532, 13.5994, 16.2589, 9.6206, 12.6757, 11.2872, 12.5049, 10.734, 10.2362, 8.44015, 11.6364, 8.19836, 13.5027, 10.9646, 8.35344, 12.2837, 10.8576, 12.6633, 8.4521, 11.9886, 10.1123, 14.7945, 11.4844, 10.7079, 11.1838, 11.3044, 12.3058, 10.5185, 17.9163, 11.5391, 11.1558, 12.2922, 14.9144, 12.5042, 14.6884, 16.9788, 14.7791, 12.8587, 13.3222, 12.6964, 11.9718, 13.3227, 12.9526, 16.6625, 17.7786, 13.8858, 11.292, 11.8326, 14.5355, 17.7381, 9.23665, 0.00592309, 0.183133, 0.029214, 0.00776546, 0.000527819, 0.0171359, 0.0350736, 0.00218855, 0.000385955, 0.00666091, 0.00106682, 0.000581924, 0.000350509, 0.0135115, 0.00753286, 0.000256792, 0.0011891, 0.00265174, 0.00621715, 0.000224548, 0.338416, 0.379644, 0.229462, 0.714838, 0.302272, 0.285944, 0.435157, 0.182661, 0.178482, 0.16883, 0.24413, 0.539745, 0.17103, 0.452812, 0.31316, 0.233799, 0.541428, 0.170556, 0.157633, 0.173223, 0.429549, 0.570643, 0.312018, 0.783902, 0.797453, 0.776367, 0.625457, 0.609488, 0.565653, 0.316253, 0.00366809, 0.0852558, 0.0130769, 0.00430994, 0.000431811, 0.00712439, 0.018174, 0.0014633, 0.000217246, 0.00417147, 
0.000622008, 0.000309303, 0.000202462, 0.00610843, 0.00484937, 0.000168908, 0.000737878, 0.000998908, 0.00363807, 0.000157435, 0.238181, 0.283965, 0.131504, 0.621425, 0.258211, 0.167534, 0.352247, 0.0948844, 0.104927, 0.100312, 0.168589, 0.41855, 0.0867583, 0.35403, 0.255756, 0.132605, 0.493144, 0.076803, 0.0890774, 0.117263, 0.379172, 0.521678, 0.216933, 0.73017, 0.785032, 0.740184, 0.560742, 0.470628, 0.501667, 0.220406

Chain 4: 13.1763, 16.5098, 15.4034, 13.8657, 10.6507, 10.3166, 16.4773, 11.8088, 8.36818, 14.497, 11.1554, 10.4438, 10.9992, 14.3087, 9.05866, 11.1774, 12.533, 7.96157, 11.209, 9.71656, 16.1628, 9.25901, 13.4676, 15.2443, 16.8357, 15.445, 10.9829, 14.0603, 14.2375, 12.4452, 20.7462, 12.2698, 11.2416, 14.1829, 11.918, 15.7241, 13.8112, 14.9496, 12.9549, 13.6769, 16.8764, 14.456, 14.6275, 21.3555, 12.4405, 8.97836, 14.2752, 9.37907, 11.6585, 13.8758, 13.1459, 16.5488, 15.3587, 13.8773, 10.6541, 10.2438, 16.4726, 11.8037, 8.3731, 14.5257, 11.1541, 10.4405, 10.9955, 14.2982, 9.0645, 11.1809, 12.5254, 7.96125, 11.2134, 9.71418, 16.3146, 9.39194, 13.4011, 15.1885, 16.7852, 15.1981, 10.9059, 13.96, 14.2362, 12.3599, 20.5686, 12.2921, 11.0917, 14.1926, 11.8003, 15.6312, 13.8358, 14.9293, 12.9193, 13.8031, 16.9673, 14.5985, 14.5285, 21.2587, 12.6549, 9.1595, 14.2539, 9.30112, 11.4856, 13.8597, 13.1554, 16.5678, 15.3697, 13.8745, 10.6478, 10.2507, 16.4803, 11.8078, 8.37247, 14.5025, 11.1571, 10.4392, 10.9965, 14.3135, 9.06643, 11.1785, 12.5258, 7.95977, 11.2207, 9.71714, 16.28, 9.06573, 13.5115, 15.3766, 16.9871, 15.209, 10.8473, 14.0154, 14.2428, 12.4182, 20.6872, 12.2378, 11.1113, 13.9848, 11.7774, 15.6907, 13.7205, 14.9596, 12.8978, 13.7909, 17.0551, 14.591, 14.8288, 21.2051, 12.3777, 9.16848, 14.1289, 9.17777, 11.5811, 13.7875, 0.00536931, 0.150293, 0.0197464, 0.00756281, 0.000424949, 0.0236527, 0.026742, 0.00181321, 0.000306034, 0.00507678, 0.000750228, 0.000512726, 0.000327057, 0.0142834, 0.0048135, 0.000249385, 0.00106866, 0.00189048, 0.00673375, 0.000223599, 0.381365, 0.522543, 0.231816, 0.66972, 0.317955, 0.260824, 0.389522, 0.208647, 0.211149, 0.134335, 0.285031, 0.541061, 0.200892, 0.512403, 0.357446, 0.25518, 0.45006, 0.195701, 0.163371, 0.176426, 0.522214, 0.714288, 0.286548, 0.447662, 0.665323, 0.728012, 0.711062, 0.537686, 0.563243, 0.245985, 0.00327885, 0.0683658, 0.0087004, 0.00409966, 0.000375249, 0.0100826, 0.014027, 0.00126191, 0.000186632, 0.00329028, 
0.000450937, 0.000267839, 0.000192614, 0.00655278, 0.00315431, 0.000158082, 0.00063755, 0.000727211, 0.00402587, 0.00016389, 0.271243, 0.392769, 0.133933, 0.589111, 0.276024, 0.151302, 0.304693, 0.106973, 0.119053, 0.0835544, 0.197475, 0.420097, 0.105766, 0.405149, 0.265535, 0.146508, 0.401087, 0.0914857, 0.0928488, 0.114282, 0.456261, 0.612131, 0.202884, 0.391701, 0.642547, 0.686604, 0.643917, 0.428764, 0.476866, 0.167072

I’m not sure how much info this adds or how to interpret it, though. I’ve attached some diagnostic plots.

What I am mostly interested in troubleshooting is why the model requires such a large tree depth. I guess it could be because the coefficients are mostly negative, and I am only using weakly informative priors. I am not sure how to use more informative priors, though. I tried using a normal distribution instead of the Cauchy, but the lack of heavy tails gave convergence problems. A student_t with 1 degree of freedom did not solve the problem either. Re-parameterizing the student_t as a normal + gamma mixture had no effect either. I have also tried initializing the parameters with the maximum likelihood estimate, but that just increased the tree depth even further. Adding a prior to the variance for each coefficient did not seem to solve it either.

Any suggestions on how to improve this model?

Cheers,
David


multinomial_model.stan
mixed_traces.png
log_posterior.png
stan_diag.png
treedepth.png
pairs.png

Andrew Gelman

unread,
Mar 23, 2016, 7:07:13 PM3/23/16
to stan-...@googlegroups.com
A quick look and I see you have an intercept, a coef for male, and a coef for female. That’s nonidentified. Better to have an intercept and then a predictor which equals -1 for male and +1 for female. Then you just have 2 coefs and all should go better.
A

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
<multinomial_model.stan>

<log_posterior.png>

<stan_diag.png>

<treedepth.png>

<pairs.png>

<mixed_traces.png>

David Westergaard

unread,
Mar 23, 2016, 7:09:25 PM3/23/16
to stan-...@googlegroups.com
Interesting. I thought using the two coefficients was the same as using a nested indexing approach, such as in BUGS.

I’ll give it a try, thanks!

David

Bob Carpenter

unread,
Mar 23, 2016, 7:17:57 PM3/23/16
to stan-...@googlegroups.com
For any constant C,

(male, female, intercept)

has the same likelihood as

(male + C, female + C, intercept - C).

A prior is only going to weakly identify this,
especially if it's a t with low df.
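A quick numerical illustration of this invariance (the numbers are made up):

```python
import numpy as np

# Linear predictors with separate intercept, male, and female coefficients.
intercept, male, female, C = 0.4, 1.2, -0.8, 5.0

eta = np.array([intercept + male, intercept + female])
eta_shifted = np.array([(intercept - C) + (male + C),
                        (intercept - C) + (female + C)])

# The predictors, and hence the likelihood, are identical for any C:
assert np.allclose(eta, eta_shifted)

# Gelman's suggested fix, eta = intercept + sex * b with sex in {-1, +1},
# has a single gender coefficient, so no such trade-off exists.
```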

- Bob

David Westergaard

unread,
Mar 24, 2016, 2:32:13 PM3/24/16
to stan-...@googlegroups.com
That really helped a lot! The model is running a lot faster now, though it’s still a bit saturated during the warmup phase. When I scale up a bit, to 1000 combinations (20000 data points), the model slows down a lot and hits max tree depth in nearly every iteration of the warmup phase. I tried tightening the priors a bit, but it did not really help (see multinomial_nopriors.stan). After 8 hours and 300 iterations I gave up (in an odd turn of events, the variational Bayes algorithm was able to converge on a solution in <5 minutes, impressive!) and tried adding some layers of priors (see multinomial_allpriors.stan). The warmup phase seems to be going a bit faster now, but it is still quite slow, and not nearly fast enough for the full data set of ~8 million data points.

So, I’m wondering if there is anything I could improve on in the model, like the prior structure, loops, or anything that makes computation heavy?

Thanks in advance,
David




multinomial_nopriors.stan
multinomial_allpriors.stan

David Westergaard

unread,
Mar 29, 2016, 7:58:02 AM3/29/16
to stan-...@googlegroups.com
OK, a bit more experimenting, and I noticed that there is (apparently) a lot of correlation between some variables. I removed the Age^2 term since it was fully correlated with the Age term.

In the simple model, with no shared parameters, the pairs plot shows a near-perfect correlation between the coefficients for intercept (beta_0) and age (beta_A) (fig1.png). Age is a continuous variable taken as the mean value of the persons in the particular group, scaled to have mean=0 and sd=1. I tried switching to another configuration where I had an age parameter for each age interval. This had no real effect, and led to a lot of iterations hitting max tree depth. After reading some user posts on the forum, I tried switching to a formulation with a multi_normal prior on the coefficients, but this did not really help (fig3.png). I’ve also tried adding priors to the coefficient scales, but this did not really help, and the Neff for a lot of the scale priors was quite low (0.3-0.4). Switching to a non-centered parameterization did not help either.

On a small subset of the data the models converge very fast (within 50 warmup iterations), but on larger sets they hit max tree depth a lot during warmup and take very small step sizes. If I discard the age and sex covariates, 5000 combinations take about 1 hour to run. However, if I include covariates, 100 combinations take overnight (6-8 hours) to finish.

So, to sum up the questions,

1. Does this correlation between input variables affect the sampler? Is there an identifiability issue? And if so, how can I improve?
2. Is there a clever way of including the Age^2 term and breaking the correlation to age?
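On question 2, one standard trick (my suggestion, not from the thread) is to center age before squaring: for a roughly symmetric age distribution this removes most of the raw age/age^2 correlation, and orthogonal polynomials (R's poly(age, 2)) remove it entirely. A quick check on hypothetical simulated ages:

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=1000)  # hypothetical ages

# Raw age and age^2 are almost perfectly correlated...
r_raw = np.corrcoef(age, age ** 2)[0, 1]

# ...but centering age before squaring breaks most of that correlation.
age_c = age - age.mean()
r_centered = np.corrcoef(age_c, age_c ** 2)[0, 1]

assert abs(r_raw) > 0.95
assert abs(r_centered) < 0.3
```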

If it helps, I can provide a synthetic data set. Also, the models are right now written for clarity and not computational efficiency. :-)

Best,
David





normal.stan
multi_normal.stan
fig1.png
fig2.png
fig3.png

Krzysztof Sakrejda

unread,
Mar 29, 2016, 11:21:49 AM3/29/16
to Stan users mailing list


On Tuesday, March 29, 2016 at 7:58:02 AM UTC-4, David Westergaard wrote:
OK, a bit more of experimenting, and I noticed that there is (apparently) a lot of correlation between some variables. I removed the Age^2 term since it was fully correlated with the Age term.

In the simple model, with no shared parameters, the pairs plot show that there is a near perfect correlation between the coefficients for intercept (beta_0) and age (beta_A) (fig1.png). Age is a continous variable that is taken as the mean value of the persons in the particular group, scaled to have mean=0 and sd=1. I tried to switch to another configuration, where I had a age parameter for each age interval. This had no real effect, and lead to a lot of iterations hitting tree depth. After reading some users posts on the forum, I tried switching to a formulation where I had a multi_normal prior on the coefficients, but this did not really help (fig3.png). I’ve also tried adding priors to the coefficient scales, but this did not really help, and the Neff for a lot of the scale priors were quite low (0.3-0.4). Switching to a non-centered parameterisation did not help either.

On a small subset of the data the models are converging very fast (within 50 warmup iterations), but on larger sets it hits tree depth a lot during warmup, and is taking very small step sizes.

Is this true of all small subsets, or only some of them?  Breaking up the data into a bunch of small subsets and running them all should identify a problematic subset.  I imagine there's a batch of data that conflicts with the priors (or with all the rest of the data).  If the munging isn't too heavy, it would be worth going this way rather than trying to find a solution that works with conflicting/difficult data.

...Also, what does a N(-6,3) prior look like when you combine it with the rest and plug it through the softmax and the multinomial? I have some intuition about that, but I would simulate from that mess before spending a lot of time on it.  Is category 1 the non-disease/most-common category?

Krzysztof

David Westergaard

unread,
Mar 29, 2016, 11:38:52 AM3/29/16
to stan-...@googlegroups.com
On 29 Mar 2016, at 17:21, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:



On Tuesday, March 29, 2016 at 7:58:02 AM UTC-4, David Westergaard wrote:
OK, a bit more of experimenting, and I noticed that there is (apparently) a lot of correlation between some variables. I removed the Age^2 term since it was fully correlated with the Age term.

In the simple model, with no shared parameters, the pairs plot show that there is a near perfect correlation between the coefficients for intercept (beta_0) and age (beta_A) (fig1.png). Age is a continous variable that is taken as the mean value of the persons in the particular group, scaled to have mean=0 and sd=1. I tried to switch to another configuration, where I had a age parameter for each age interval. This had no real effect, and lead to a lot of iterations hitting tree depth. After reading some users posts on the forum, I tried switching to a formulation where I had a multi_normal prior on the coefficients, but this did not really help (fig3.png). I’ve also tried adding priors to the coefficient scales, but this did not really help, and the Neff for a lot of the scale priors were quite low (0.3-0.4). Switching to a non-centered parameterisation did not help either.

On a small subset of the data the models converge very fast (within 50 warmup iterations), but on larger sets they hit max tree depth a lot during warmup and take very small step sizes.

Is this true of all small subsets or only a select subset or only some small subsets?  Breaking up the data into a bunch of small subsets and running them all should identify a problematic subset.  I imagine there's a batch of data that conflicts with the priors (or with all the rest of the data).  If the the munging isn't too heavy it would be worth it to go this way rather than try to find a solution that works with conflicting/difficult data.  

So far I’ve tried ~15 different subsets, each consisting of 10 combinations (Di = 10). I had some problems before adding a pseudo-count of +1 to all observations, since there were quite a few groups with 0 individuals. I’m not really sure how I would identify the batch of data conflicting with the priors, since all traces and pairs plots look OK. Some of the parameters take the value of the prior due to a low amount of data, but that should be OK. 


...Also, what does a N(-6,3) prior look like when you combine it with the rest, plug it through a softmax and the multinomial? I have some intuition about that but I would simulate from that mess before I spent a lot of time on it.  

I set the center at -6 since this corresponds to a probability of about 0.25% of having any disease (irrespective of gender and age), which seems to fit the data when I just calculate #people with disease A / #people without disease A for all diseases. 
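As a sanity check on that mapping (an illustrative snippet in Python rather than the thread's R): with the reference category's linear predictor pinned at 0, a single intercept of -6 under the softmax reduces to an inverse-logit, which lands almost exactly on the quoted 0.25%.

```python
import math

# With two categories (reference at 0), softmax reduces to inverse-logit,
# so an intercept of -6 maps to a baseline probability of about 0.25%.
p = math.exp(-6) / (1 + math.exp(-6))
print(round(100 * p, 2))  # ≈ 0.25 (percent)
```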

Some simple simulation in R,

softmax <- function(row) {
    # Softmax row to be between 0 and 1. 
    exp_row <- exp(row)
    return( exp_row / sum(exp_row) )
}
P <- softmax(c(0, rnorm(1, -6, 3), rnorm(1, -6, 3), rnorm(1, -11, 3), rnorm(1, -11, 3), rnorm(1, -11, 3)))
rmultinom(1, size=5.5*10^6, P)

This indicates that the low probabilities could be a problem if there were only a few data points, but the number of individuals (~5.5 million) is quite large, so I’m not sure whether this is actually a problem. 
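The same prior check translated to Python (an illustrative sketch using numpy; only the six-category setup and the prior means are taken from the R simulation above, everything else is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 1000 prior draws of the six-category linear predictor: reference
# category pinned at 0, two N(-6,3) and three N(-11,3) intercepts.
n = 1000
eta = np.column_stack([
    np.zeros(n),
    rng.normal(-6, 3, size=(n, 2)),
    rng.normal(-11, 3, size=(n, 3)),
])
p = softmax(eta)

# Median prior probability for each of the five rare categories.
print(np.median(p[:, 1:], axis=0))
```

The medians come out around exp(-6) ≈ 0.25% for the N(-6,3) categories and far lower for the N(-11,3) ones, with a long tail of draws squeezed much closer to zero still.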


Is category 1 the non-disease/most-common category?

Yes. I’m working on data from a full population, so there are always roughly 5 million people (divided over the different groups) in category 1, depending on how rare a condition is. 

Could the large number of individuals in the reference group be a problem? And if so, is there a way to control for this?


David



 
If I discard the age and sex covariates, 5000 combinations take about 1 hour to run. However, if I include covariates, 100 combinations take overnight to finish (6-8 hours).

So, to sum up the questions,

1. Does this correlation between input variables affect the sampler? Is there an identifiability issue? And if so, how can I improve things?
2. Is there a clever way of including the Age^2 term while breaking its correlation with Age?

If it helps, I can provide a synthetic data set. Also, the models are right now written for clarity and not computational efficiency. :-)

Best,
David



 


Krzysztof Sakrejda

unread,
Mar 29, 2016, 12:02:04 PM3/29/16
to Stan users mailing list


On Tuesday, March 29, 2016 at 11:38:52 AM UTC-4, David Westergaard wrote:

On 29 Mar 2016, at 17:21, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:



On Tuesday, March 29, 2016 at 7:58:02 AM UTC-4, David Westergaard wrote:
OK, a bit more of experimenting, and I noticed that there is (apparently) a lot of correlation between some variables. I removed the Age^2 term since it was fully correlated with the Age term.

In the simple model, with no shared parameters, the pairs plot show that there is a near perfect correlation between the coefficients for intercept (beta_0) and age (beta_A) (fig1.png). Age is a continous variable that is taken as the mean value of the persons in the particular group, scaled to have mean=0 and sd=1. I tried to switch to another configuration, where I had a age parameter for each age interval. This had no real effect, and lead to a lot of iterations hitting tree depth. After reading some users posts on the forum, I tried switching to a formulation where I had a multi_normal prior on the coefficients, but this did not really help (fig3.png). I’ve also tried adding priors to the coefficient scales, but this did not really help, and the Neff for a lot of the scale priors were quite low (0.3-0.4). Switching to a non-centered parameterisation did not help either.

On a small subset of the data the models are converging very fast (within 50 warmup iterations), but on larger sets it hits tree depth a lot during warmup, and is taking very small step sizes.

Is this true of all small subsets or only a select subset or only some small subsets?  Breaking up the data into a bunch of small subsets and running them all should identify a problematic subset.  I imagine there's a batch of data that conflicts with the priors (or with all the rest of the data).  If the the munging isn't too heavy it would be worth it to go this way rather than try to find a solution that works with conflicting/difficult data.  

So far I’ve tried with ~15 different subsets, each consisting of 10 combination (Di = 10). I had some problems prior to adding a pseudo count of +1 to all observations, since there was quite a few groups with 0 individuals. I’m not really sure how I would identify the batch of data conflicting with priors, since all traces and pairs plots look OK. Some of the parameters takes the value of the prior due to a low amount of data, but that should be OK. 


...Also, what does a N(-6,3) prior look like when you combine it with the rest, plug it through a softmax and the multinomial? I have some intuition about that but I would simulate from that mess before I spent a lot of time on it.  

I set the center at -6 since this correspond to a probability of 0.25% of having any disease (irrespective of gender and age), which seems to fit the data when I just calculate #people with disease A / #people without disease A for all diseases. 

Some simple simulation in R,

softmax <- function(row) {
    # Softmax row to be between 0 and 1. 
    exp_row <- exp(row)
    return( exp_row / sum(exp_row) )
}
P <- softmax(c(0, rnorm(1, -6, 3), rnorm(1, -6, 3), rnorm(1, -11, 3), rnorm(1, -11, 3), rnorm(1, -11, 3)))
rmultinom(1, size=5.5*10^6, P)

Indicates that the low probability could be a problem if there was a low number of data points, but the number of individuals (~5.5 millions) is quite large, so I’m not sure if there is a problem. 
 
The counts don't matter as much as the geometry of the parameter space.

f <- function(N) softmax(cbind(0, rnorm(N, -6, 3), rnorm(N, -6, 3), rnorm(N, -11, 3), rnorm(N, -11, 3), rnorm(N, -11, 3)))

tp <- f(1000)

colnames(tp) <- letters[1:6]

tpm <- reshape2::melt(tp)

library(ggplot2)
ggplot(data=tpm, aes(x=value)) + geom_histogram() + facet_wrap( ~ Var2) + scale_y_log10()


So the result here is that this prior is really squished up against zero by the softmax.  It's very possible that with a prior like this you need a large step size to explore the lower end of the posterior (when probabilities are VERY small) and a tiny step size to explore the very upper end (when probabilities are closer to 2%).  By itself that could result in the sampler choosing a smallish step size to deal with the upper tail of the posterior and then skating along on the lower tail until it hits the max treedepth.  Have you looked at what sort of step size the tuning produces when you have problematic runs?  With CmdStan you can just open the .csv files to look at this, so for diagnosing this I would run RStan with the option to produce .csv files and check what happens for runs that seem to fail or take too long. 
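Pulling the adapted step size out of a CmdStan output file takes only a few lines. This is a hypothetical sketch in Python (the file name and surrounding workflow are made up; the "# Step size = ..." comment line is what CmdStan writes into its CSV after adaptation):

```python
def adapted_step_size(csv_path):
    # Scan a CmdStan output CSV for the adaptation comment line of
    # the form "# Step size = <value>" and return the value.
    with open(csv_path) as f:
        for line in f:
            if line.startswith("# Step size"):
                return float(line.split("=")[1])
    return None  # no adaptation info found (e.g. warmup was skipped)
```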

To fix the problem you would want to change the prior so that the sampler doesn't have to go so far along a flat surface towards zero.  Naively (on my end, I just have experience from log/logit transforms), a standard deviation of three is too big for the low end of the softmax transform at a mean of -6.

I feel like a broken record on this particular issue with priors b/c it usually comes up with logistic regressions so it's really common but this is a good softmax example of it.
 



--
You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/y6PPWzu7xfw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+unsubscribe@googlegroups.com.

David Westergaard

unread,
Mar 29, 2016, 12:51:45 PM3/29/16
to stan-...@googlegroups.com
I’m always running CmdStan. I haven’t gotten around to changing my compiler options in R to use ICC. :-)

Anywho, that's an interesting take on it. I see now what you mean about the probabilities squeezing up against zero.

Step sizes with the normal(-6,3) and normal(-11,3) priors were in the range 0.001 to 0.0001. The other major problem is that the tree depth is capping out, but I guess this is a consequence of the step size. 


To fix the problem you would want to fix the prior so that the sampler doesn't have to go so far along a flat surface towards zero.  Naively (on my end, I just have experience from log/logit transforms) the standard deviation of three is too big for the low end of the softmax transform at a mean of -6.

Right. I guess my approach of just setting the prior center to the point estimate of the base probability was too naive. Now the question is, how do I incorporate the information I actually have? From the point estimates I know that:

Classes 2 and 3 should have intercepts in the range -11 to -1
Classes 4, 5, and 6 should have intercepts in the range -12 to -4. 

My first intuition would be to use a student_t with a low number of degrees of freedom, e.g. 3, to support this wide range. For instance,

student_t(3, -6, 1) has 99% of its mass in the -10 to -1 range 
student_t(3, -8, 1) has 99% of its mass in the -12 to -3 range

If I plug this into the code you wrote above, I end up with the same result: the mass is concentrated at zero. I’m not really sure which other priors to use in this case; I guess a cauchy(0, 1) would give the same result. Any suggestions for other priors, or for how to troubleshoot this further?
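For what it's worth, interval claims like the ones above are easy to check by simulation (a sketch in Python/numpy, not from the thread; the locations are taken as -6 and -8, which is presumably what was meant given the negative intercept ranges):

```python
import numpy as np

rng = np.random.default_rng(0)

def central_interval(df, loc, scale=1.0, mass=0.99, n=1_000_000):
    # Monte Carlo estimate of the central interval holding `mass`
    # of a student_t(df, loc, scale) prior.
    draws = loc + scale * rng.standard_t(df, size=n)
    tail = (1.0 - mass) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return lo, hi

print(central_interval(3, -6))   # roughly (-11.8, -0.2)
print(central_interval(3, -8))   # roughly (-13.8, -2.2)
```

The 99% intervals come out a bit wider than the quoted ranges, which doesn't change the basic point: most of the mass still maps to probabilities squeezed against zero after the softmax.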

Thanks,
David  





Krzysztof Sakrejda

unread,
Mar 29, 2016, 1:17:36 PM3/29/16
to Stan users mailing list
Yes, those go hand in hand, and both are also connected to very slow models (you can increase the max tree depth, but then the model just takes a lot of steps to explore the flat part near zero).
 


To fix the problem you would want to fix the prior so that the sampler doesn't have to go so far along a flat surface towards zero.  Naively (on my end, I just have experience from log/logit transforms) the standard deviation of three is too big for the low end of the softmax transform at a mean of -6.

Right. I guess my approach at just entering the base probability at the point estimate was too naive. Now the question is, how do I incorporate that I actually know this information? From the point estimate I know that,

Class 2 and 3 should have an intercept in the range -11 to -1
Class 4, 5, and 6 should have an intercept in the range -12 to -4. 

My first intuition would be to use student_t with a low degree of freedom, e,g 3., to support this wide range. For instance,

student_t(3, 6, 1) has 99% of its mass in the -10 to -1 range 
student_t(3, 8, 1) has 99% of its mass in the -12 to -3 range

I guess the source of the issue is the different curvature between the lower and upper ends of this range, regardless of which prior you use.  There are alternatives to softmax out on google/arxiv that look like they have less curvature in this region and would allow the sampler to choose a larger step size, but I don't know enough to make a recommendation. (Bob? Michael?) 

Bob Carpenter

unread,
Mar 29, 2016, 6:33:10 PM3/29/16
to Stan users mailing list
If variables are getting pushed up against a lower bound of
zero, it may just be where the probability mass is. In such cases,
it might make sense to manually replace them with zero (and if they
are hierarchical scale terms, replace their density with some kind of complete
pooling).

I have no experience with alternatives to softmax / multinomial logistic.

Are you running into the same issue with simulated data? These
problems can arise in well-formulated and well-coded models if
the model is highly misspecified for the data.

- Bob

Krzysztof Sakrejda

unread,
Mar 29, 2016, 9:37:34 PM3/29/16
to Stan users mailing list
Bob, sure it's where the mass is but if the curvature changes there it still makes for crappy sampling! K

David Westergaard

unread,
Mar 30, 2016, 5:34:36 AM3/30/16
to stan-...@googlegroups.com
Some questions inline

On 30 Mar 2016, at 00:32, Bob Carpenter <ca...@alias-i.com> wrote:

If variables are getting pushed up against a lower bound of
zero, it may just be where the probability mass is.  In such cases,
it might make sense to manually replace them with zero

You mean, if the probability for some class is unreasonably low, say 10^-10 (or in that range), just set it to 0?

(and if they
are hierarchical scale terms, replace their density with some kind of complete
pooling).

I am not quite sure I follow this one. Could you elaborate a bit? Do you mean that, instead of having a per-group scale prior, I should have a global prior that is shared between all groups? 


I have no experience with alternatives to softmax / multinomial logistic.

Are you running into the same issue with simulated data?  These
problems can arise in well-formulated and well-coded models if
the model is highly misspecified for the data.

Simulated data behaves the same. I still find it odd, though, that I can easily fit 5000 combinations in ~1 hour if I only have an intercept, but it slows down so badly when I add more covariates. From what I’ve seen so far, the covariates do not really have a large effect (somewhere in the -2 to 2 range for age, and -0.5 to 0.5 for sex).

From these discussions, I guess there are only two reasonable approaches:

1. Get the baseline risk closer to 0
2. Split up the joint model hierarchical model into an individual regression per group

(1) I could solve by sampling matched controls from the population, instead of just taking all individuals. 

(2) I am not sure how good an idea this is. One (amongst many) of the arguments for using Bayesian methods is the joint posterior, which safeguards against false positives and the pitfalls of multiple testing. If I ran, say, 300,000 individual regression models, I would also run into a multiple-testing problem, wouldn’t I?

Maybe a better approach to (2) would be to fix one disease. Say I am investigating diseases A, B, C, and D. The joint hierarchical model would include the pairs (A,B), (A,C), (A,D), (B,C), (B,D), (C,D). Splitting this up into individual models would give 6 individual regressions. But what if I fix one disease, say A, and fit a model that includes pairs (A,B), (A,C), (A,D), another that includes pairs (B,C), (B,D), and a third that includes (C,D)? Any thoughts on this? I am not really sure it would fix the problem with the softmax going towards zero, but at least the model would be done before I finish my PhD. :-)

And thanks a lot for the answers, much appreciated!

David

David Westergaard

unread,
Mar 30, 2016, 10:19:39 AM3/30/16
to stan-...@googlegroups.com
On another note, I think you are absolutely right about the problem being that many of the classes are bumping up against zero. I tried two other kinds of softmax, called spherical and Taylor softmax (http://arxiv.org/pdf/1511.05042v3.pdf), and now the chains are moving a lot faster, at a much lower tree depth, and converging. The downside is that I don’t think this kind of softmax makes much sense in a probabilistic setting, and it loses all interpretability, unless I can somehow calculate back, but I don’t think that’s possible. :-)
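For readers who don't want to dig through the paper, the order-2 Taylor softmax can be sketched in a few lines. This is an illustrative Python version based on my reading of the paper's definition, not code from the thread (the thread's models are written in Stan):

```python
import numpy as np

def taylor_softmax(z):
    # Order-2 Taylor softmax: replace exp(z) with its second-order
    # Taylor expansion 1 + z + z^2/2, which is strictly positive
    # (it equals (z + 1)^2 / 2 + 1/2), so the output is still a
    # valid probability vector.
    e = 1.0 + z + 0.5 * z**2
    return e / e.sum(axis=-1, keepdims=True)

print(taylor_softmax(np.array([0.0, -6.0, -11.0])))
```

One thing worth noting: the polynomial 1 + z + z^2/2 is minimized at z = -1 and grows again for more negative z, so a linear predictor of -11 actually receives more mass than one at 0. That is one concrete way to see the interpretability loss mentioned above.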

David

Krzysztof Sakrejda

unread,
Mar 30, 2016, 1:07:24 PM3/30/16
to Stan users mailing list
On Wednesday, March 30, 2016 at 10:19:39 AM UTC-4, David Westergaard wrote:
On another note, I think you are absolutely right about the problem being that many of the classes are bounding up against zero. I tried another two other kinds of softmax, called spherical and taylor softmax (http://arxiv.org/pdf/1511.05042v3.pdf), and now the chains are moving a lot faster, at a much lower tree depth, and converging. The downside is that I don’t think this kind of softmax makes much sense in a probabilistic setting, and it looses all interpretability, unless I can somehow calculate back, but I don’t think that’s possible. :-)

That's really cool! I love it when mathematical intuition actually gets you somewhere useful. Thanks so much for checking it out and reporting back. What kind of interpretability are you thinking of? The spherical/Taylor softmaxes are pretty close mathematically to the usual one, so I think there should be a way to get comparable outputs...

P.S. I can put the alternative softmaxes on my longer-term list of things to try to get into the Stan math and language libraries, but I will need to come up with a simple problem that demonstrates the problem/solution, so not anytime soon (it's a long list anyway...).


 


Bob Carpenter

unread,
Mar 31, 2016, 6:08:20 PM3/31/16
to stan-...@googlegroups.com
Sorry for the confusion. If they're probability variables, then
you probably want some kind of prior that keeps the intervals from
collapsing. We usually run into this problem with hierarchical priors
going to zero, which is where you can replace them with complete pooling.

And thanks for the arXiv pointer.

- Bob