Extremely slow categorical logit model


ricka...@gmail.com

Jan 8, 2016, 4:03:31 PM
to stan-...@googlegroups.com
Hello,

I've created a categorical logit model that, like the title says, is extremely slow to finish sampling; my other models take no longer than an hour, and this one takes roughly 24 hours to converge.

In each trial I have 6 subjects: 3 of subject type I and 3 of subject type II. One of those six is indicated by the response variable, and the two groups should have different intercepts, as the task is harder for type II than it is for type I. Each subject participates as both types, and there are ~400 subjects and ~120,000 trials. I also wanted to zero-center the individual subject effects.

Here's the model code:

data {
    int<lower = 0> n_subjects;
    int<lower = 0> n_trials;
    int<lower = 0, upper = n_subjects - 1> subject_type1[n_trials, 3];
    int<lower = 0, upper = n_subjects - 1> subject_type2[n_trials, 3];
    int<lower = 1, upper = 6> trials[n_trials];
}
parameters {
    vector[n_subjects] raw_subject_type1_effects;
    vector[n_subjects] raw_subject_type2_effects;

    real subject_type1_intercept;
    real subject_type2_intercept;
    real<lower = 0> sigma_subject_type1;
    real<lower = 0> sigma_subject_type2;
}
transformed parameters {
    vector[n_subjects] subject_type1_effects;
    vector[n_subjects] subject_type2_effects;

    subject_type1_effects <- raw_subject_type1_effects - mean(raw_subject_type1_effects);
    subject_type2_effects <- raw_subject_type2_effects - mean(raw_subject_type2_effects);
}
model {
    vector[6] alpha[n_trials];

    subject_type1_intercept ~ normal(0, 10);
    subject_type2_intercept ~ normal(0, 10);
    sigma_subject_type1 ~ cauchy(0, 2.5);
    sigma_subject_type2 ~ cauchy(0, 2.5);

    raw_subject_type1_effects ~ normal(0, sigma_subject_type1);
    raw_subject_type2_effects ~ normal(0, sigma_subject_type2);

    for (i in 1:n_trials) {
        for (j in 1:3) {
            alpha[i][j] <- subject_type1_intercept + subject_type1_effects[subject_type1[i][j] + 1];
        }
        for (j in 1:3) {
            alpha[i][j + 3] <- subject_type2_intercept + subject_type2_effects[subject_type2[i][j] + 1];
        }

        trials[i] ~ categorical_logit(alpha[i]);
    }
}

My first question is: is it so slow because categorical_logit cannot be vectorized? I originally tried vectorizing it, but the Stan parser told me that categorical_logit could not take a vector of vectors, so I couldn't do what I did with bernoulli_logit in other models.

Secondly, whether or not that's the case, is there any way to speed this up? I'm a novice to using Stan, so it's very possible there's some glaring inefficiency that I just don't see.

Any tips conceptually and regarding efficiency would be much appreciated.

Thanks,
Rick

Bob Carpenter

Jan 8, 2016, 4:43:14 PM
to stan-...@googlegroups.com
You should switch to a non-centered parameterization.
That's also about centering the standard deviation on
the log scale, which is the real killer. So this:

> raw_subject_type1_effects ~ normal(0, sigma_subject_type1);
> raw_subject_type2_effects ~ normal(0, sigma_subject_type2);

should be ~ normal(0, 1) and then multiply the sigma_xxx back
in on the outside.
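Concretely, the non-centered version for one of your effect vectors would look something like this (same variable names as your model; this is a sketch, not tested against your data):

parameters {
    vector[n_subjects] raw_subject_type1_effects;
    real<lower = 0> sigma_subject_type1;
}
transformed parameters {
    vector[n_subjects] subject_type1_effects;
    // multiply the unit-scale effects by sigma here, instead of
    // sampling the effects directly at scale sigma
    subject_type1_effects <- sigma_subject_type1 * raw_subject_type1_effects;
}
model {
    raw_subject_type1_effects ~ normal(0, 1);
    sigma_subject_type1 ~ cauchy(0, 2.5);
}

and the same for the type-2 effects.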

Also, this

> subject_type2_effects <- raw_subject_type2_effects - mean(raw_subject_type2_effects);

is not what you want to do. You already gave them a zero-mean
prior --- that's enough. What you're doing will hurt HMC because
it collapses multiple parameters and leads to non-identifiability.

The other problem with identifiability is the K vs. (K-1) linear
predictor parameterizations of a K-way categorical outcome. Without strong
priors, the categorical logit parameters aren't well identified.
You can move to a (K-1)-predictor parameterization rather
than a K-predictor parameterization --- we still haven't gotten
around to building that in, but there's a discussion in the manual
of how to do it --- see the note on identifiability in the
categorical_logit section.

- Bob


ricka...@gmail.com

Jan 10, 2016, 1:12:00 AM
to stan-...@googlegroups.com
Hey Bob,

Thanks for the tips - I have a few questions:

Can you expound a little bit upon why centering the standard deviation on the log scale is so inefficient? I'm wondering because I've done this on other models and didn't realize it was so problematic.

When you suggest that the subject effects should be ~ normal(0, 1) and multiply the sigma on the outside - is this what I've seen referred to as the Matt trick? Would this look like:

sigma_subject_type1 ~ cauchy(0, 2.5);
raw_subject_type1_effects ~ normal(0, 1);

And in transformed parameters:
subject_type1_effects <- raw_subject_type1_effects * sigma_subject_type1;

Is that the correct application of the Matt trick?

Also, can you explain why the K zero-centered predictors approach has trouble with identifiability, as opposed to K - 1? I thought that I had resolved the identifiability problems by zero-centering them. The Kruschke chapter on softmax regression seems to offer both of those as solutions.

Thanks,
Rick

Bob Carpenter

Jan 10, 2016, 1:26:15 AM
to stan-...@googlegroups.com
I can do better than expounding:

http://arxiv.org/abs/1312.0906

Yes, we used to call it "the Matt trick" until we realized
it already had a name. Matt independently discovered the
benefit of non-centered parameterizations. Matt was also playing
with reparameterizations based on inverse CDFs, but we never
implemented any of those.

And yes, that's the correct application.

The manual discusses the problem with K vs. K-1, but the
basic idea is that softmax is invariant under adding a constant
to all the values.
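Spelling that out, for any constant c:

    softmax(alpha + c)[k] = exp(alpha[k] + c) / sum_j exp(alpha[j] + c)
                          = exp(c) exp(alpha[k]) / (exp(c) sum_j exp(alpha[j]))
                          = softmax(alpha)[k]

so adding c to every component leaves the probabilities unchanged, and the likelihood can't distinguish alpha from alpha + c.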

- Bob

ricka...@gmail.com

Jan 10, 2016, 1:42:56 AM
to stan-...@googlegroups.com
Thanks a ton, that clears up quite a bit for me.

What I meant about the K vs. K - 1 identifiability: I knew about the issue with K predictors being invariant upon adding a constant, but I thought that was resolved by zero-centering the predictors. Is that not the case? Or did you mean that because you originally suggested I should switch to a non-centered parameterization then I would thus need to go to K - 1 for identifiability? Sorry for the confusion here, I just want to make sure I understand.

One last question - when I was trying a centered parameterization, why is it that having the prior be ~ normal(0, ...) is enough and I don't need to subtract the mean? I've seen examples where this is done, e.g. http://marktheballot.blogspot.com/2015/07/is-stan-man-or-should-i-junk-jags.html (EDIT: originally chose a bad example). Did I misunderstand something, or was it unnecessary on his part to subtract the mean?

-Rick

Bob Carpenter

Jan 10, 2016, 2:55:17 AM
to stan-...@googlegroups.com
What sort of works for Gibbs won't work for Stan. You can't
sample from a non-identified posterior. Some people (like
Andrew in some of his books) have tried to use Gibbs to sample
from an improper prior and then try to normalize, like zero-centering
the predictors.

You can weakly identify the model by using a prior that controls
the coefficients going into the predictors, but it tends to be
better behaved with the K-1 parameterization.
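For example, a generic K-way sketch of the (K-1)-parameterization (the variable names here are made up, not from your model): pin one category's linear predictor at zero and estimate the rest.

data {
    int<lower = 2> K;
    int<lower = 1> N;
    int<lower = 1, upper = K> y[N];
}
parameters {
    vector[K - 1] alpha_raw;
}
transformed parameters {
    vector[K] alpha;
    // fix category K's predictor at zero; only K - 1 free parameters
    alpha <- append_row(alpha_raw, rep_vector(0, 1));
}
model {
    alpha_raw ~ normal(0, 5);
    for (n in 1:N)
        y[n] ~ categorical_logit(alpha);
}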

You don't ever want to be doing the normalization after the fact.
If you sample from a prior with mean of zero, you don't need to then
further subtract the mean to get it exact. Same principle as
above --- you can't fix an improper posterior, but it's not even
improper in this case.

Zero-centering is usually just the default for regression. That
often presupposes that the predictors are themselves standardized.

- Bob




Andrew Gelman

Jan 10, 2016, 4:03:58 PM
to stan-...@googlegroups.com
What Bob said. Sampling from an improper posterior and post-processing can work in Gibbs, but it won't work for autotuned HMC as we have in Stan.