Help speeding up Stan code?


Avi Feller

Jul 18, 2013, 8:30:09 AM
to stan-...@googlegroups.com
Dear Wonderful Stan Mailing List,

As part of a larger project, I've been working on a causal inference problem known as "model-based Instrumental Variables". I had pulled together a decent MCMC algorithm by hand, but Andy convinced me that I should try it out with Stan. I'm happy to report that I've already seen a dramatic speed-up, even though the model isn't a natural candidate for Stan---so I'm very pleased!

I'm writing because I need to greatly expand the current model (e.g., increase the number of mixture components; add in hierarchical effects; etc.). But, before I do, I want to make sure that I have the stripped-down model running as efficiently as possible. While I've done my best, I'm pretty new to Stan and generally lousy at coding. This is especially true since the model is a combination of mixture modeling and hierarchical regression. And, of course, there's the distinct possibility that I'm also making an inferential mistake...

I'm particularly interested in ways of speeding up the current "transformed parameters" section, which seems to be the slowest part. In a usual MCMC, this part is executed via data augmentation, which is obviously impossible in Stan. 

Attached are (1) R script that generates fake data (based largely on an article by Imbens & Rubin, 1997) and runs the model; and (2) the stripped-down Stan model.

I really do appreciate the help from this exceptional group.

Many thanks,
Avi

stan_script_for_list.R
stan_mixture_iv_no_re.stan

Marcus Brubaker

Jul 18, 2013, 3:39:07 PM
to stan-...@googlegroups.com
Hi Avi,

I took a quick look at your model.  The only thing I could see to speed it up is that your priors could be vectorized at least partially.  In 1.3 vectorization over matrices was not supported, but you could certainly remove the inner loop which would speed things up a bit.
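For illustration (hypothetical shapes and names: I'm assuming the priors are set in a double loop over an array of vectors, `vector[D] gamma[K]`), removing the inner loop would look something like:

```stan
// before: one sampling statement per scalar element
for (k in 1:K)
  for (d in 1:D)
    gamma[k][d] ~ cauchy(0, 2.5);

// after: drop the inner loop; the density vectorizes over each vector,
// so there is one call per row instead of D
for (k in 1:K)
  gamma[k] ~ cauchy(0, 2.5);
```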

Otherwise it looks to be pretty efficient as far as I can see.

Cheers,
Marcus



--
You received this message because you are subscribed to the Google Groups "stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Ben Goodrich

Jul 18, 2013, 6:19:39 PM
to stan-...@googlegroups.com
On Thursday, July 18, 2013 8:30:09 AM UTC-4, Avi Feller wrote:
I'm particularly interested in ways of speeding up the current "transformed parameters" section, which seems to be the slowest part. In a usual MCMC, this part is executed via data augmentation, which is obviously impossible in Stan. 

This isn't the slowest part:

transformed parameters {
  simplex[K] theta[N];

  // set up probs
  // used 'softmax' previously, but this seemed more flexible
  for (n in 1:N) {
    for (k in 1:(K-1)) {
      theta[n,k] <- exp(gamma[k] * x[n]);
    }
    theta[n,K] <- 1;
    theta[n] <- theta[n] / sum(theta[n]);
  }
}

but it should be a little faster if you use the softmax() function. What you have here relies on the auto-differentiation mechanism to do the derivatives, while most named functions in Stan have analytic derivatives. The auto-differentiation requires memory allocations and so forth.
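For concreteness, a sketch of that change (following the names in the quoted code, and assuming gamma[k] * x[n] is a scalar linear predictor as in the loop above):

```stan
transformed parameters {
  simplex[K] theta[N];
  for (n in 1:N) {
    vector[K] eta;
    for (k in 1:(K-1))
      eta[k] <- gamma[k] * x[n];
    eta[K] <- 0;               // baseline category: exp(0) == 1, as before
    theta[n] <- softmax(eta);  // exponentiates and normalizes in one step
  }
}
```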

Also, I don't think it is out of the question to do data augmentation in Stan, although I haven't seen anyone do it yet and the performance may not be great. You could just declare the augmented quantities as parameters, although you would have to write a bunch of code to restrict their support and adjust lp__ for the change of variables.

Ben

Bob Carpenter

Jul 18, 2013, 7:51:52 PM
to stan-...@googlegroups.com


On 7/18/13 6:19 PM, Ben Goodrich wrote:
> On Thursday, July 18, 2013 8:30:09 AM UTC-4, Avi Feller wrote:
>
> I'm particularly interested in ways of speeding up the current "transformed parameters" section, which seems to be
> the slowest part. In a usual MCMC, this part is executed via data augmentation, which is obviously impossible in Stan.

You can tell this will be the slow part because it's N * K
exp() evals, plus N size-K sums.

...

Your code looks pretty clean (if oddly R-like in spacing
to my eyes; pardon my reversion to C++ style).

Using a sum() of an explicitly calculated vector should work.

You can make this faster:

  for (k in 1:(K-2))
    for (d in 1:D)
      gamma[k,d] ~ cauchy(0, 2.5);

by vectorizing. Same for the normally distributed beta.

  for (k in 1:(K-2))
    gamma[k] ~ cauchy(0, 2.5);

And this is most efficient as an array of vectors (what
you have for beta, but not gamma). We'll eventually vectorize
this all more efficiently so you can just write gamma ~ cauchy(0,2.5)
and then the most efficient version will be a matrix. (Sorry
it's a moving target for efficiency issues.)

> it should be a little faster if you use the softmax() function. What you have here relies on the
> auto-differentiation mechanism to do the derivatives, while most named functions in Stan have analytic derivatives. The
> auto-differentiation requires memory allocations and so forth.

I'm afraid softmax() isn't one of them, so you're unlikely
to see any speedup at all.

But it is built in, so it should be a little less error prone
than coding your own softmax in the model.

We should add faster auto-diff for softmax --- I was too lazy to
compute and code the derivatives when I coded it initially.

And you're actually taking the log of a softmax, which would
be another op that itself could be sped up and made more arithmetically
stable. Do you really need the normalization supplied by softmax?
We only need probabilities up to a proportion, and it looks like you
get the same denominator in each term of the if either way.
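(On the arithmetic-stability point, for reference: the usual trick is to pull out the max before exponentiating, since log(sum(exp(x))) = max(x) + log(sum(exp(x - max(x)))). A small Python illustration, separate from the Stan code:)

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)  # subtracting the max keeps every exp() argument <= 0
    return m + math.log(sum(math.exp(x - m) for x in xs))

# The naive form overflows for large inputs:
#   math.log(sum(math.exp(x) for x in [1000.0, 1000.0]))  # OverflowError
# while the stable version is fine:
print(log_sum_exp([1000.0, 1000.0]))  # ~ 1000 + log(2)
```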

For simplicity, you could pull this term

  if (cat[n] == 1)
    log_p_vec[n] <- log(theta[n,1])
                    + normal_log(y[n], beta[1] * x[n], sigma[1]);

Note that you can put the plus sign on the next line, unlike in
R, which conforms to standard math typesetting practice, which
is aimed at making the formulas easier to scan.

...

  log_p_vec[n] <- log(theta[n,cat[n]])
                  + normal_log(y[n], beta[cat[n]] * x[n], sigma[cat[n]]);

  if (cat[n] == 3)
    log_p_vec[n] <- log_sum_exp(log_p_vec[n],
                                ... other big term ...);
  else if (cat[n] == 4)
    log_p_vec[n] <- log_sum_exp(log_p_vec[n],
                                ... other term ...);

And given the form, you could combine the cat[n] == 3 and cat[n] == 4
cases. But this might not generalize to your richer model.

- Bob

Avi Feller

Jul 19, 2013, 8:53:22 AM
to stan-...@googlegroups.com
Hi all,

Thanks so much for the thoughtful responses. I've cleaned up my code and have found a decent speed increase---so much appreciated.

On the various recommendations:
  • Vectorizing priors. Absolutely. This was a silly oversight on my part and was easy to fix.
  • Combining categories. This was an incredibly helpful suggestion and dramatically streamlined my code. Much appreciated.
  • Softmax. Curious that there's no speed increase from using this (I'll be on the lookout for when this makes it into a subsequent version). As a practical matter, I want the flexibility to add several terms together in each element of the vector, and this proved clunky with the current parameterization of the softmax function. Of course, it's possible that I'm just implementing it wrong...
  • Not normalizing probability vectors. I absolutely agree that, in theory, we only need to know probabilities up to a normalizing constant. But when I dropped the normalizing step, I kept getting nonsensical results. Normalizing the vector seems to lead to only a slight speed decrease (and the estimates themselves are substantively useful), so it's not a huge concern---though I'm wondering what exactly is going wrong.
Finally, I'm interested in what Ben has in mind on data augmentation. I know I've talked this through a bit with some of the Stan folks and we never really arrived at a sensible solution. 

Thanks again for the help---and for the reassurance that my code isn't totally embarrassing (if a bit R-like; guilty as charged on that one).

Best,
Avi

Bob Carpenter

Jul 19, 2013, 1:12:53 PM
to stan-...@googlegroups.com


On 7/19/13 8:53 AM, Avi Feller wrote:
> Hi all,
>
> Thanks so much for the thoughtful responses. I've cleaned up my code and have found a decent speed increase---so much
> appreciated.
>
> On the various recommendations:
>
> * *Vectorizing priors.* Absolutely. This was a silly oversight on my part and was easy to fix.
>
> * *Combining categories.* This was an incredibly helpful suggestion and dramatically streamlined my code. Much
> appreciated.
>
> * *Softmax.* Curious that there's no speed increase from using this (I'll be on the lookout for when this makes it in
> a subsequent version). As a practical matter, I want the flexibility to add several terms together in each element
> in the vector, and this proved clunky via the current parameterization of the softmax function. Of course, it's
> possible that I'm just implementing it wrong...

There shouldn't be any improvement in softmax with the
way it's currently implemented. When we unfold the
derivatives for softmax, we'll get a speed improvement.

I'll see if anyone has any time to do it for the next
release, which is slated to be Stan 2.0 and hopefully
not too many weeks in the future.

> * *Not normalizing probability vectors.* I absolutely agree that, in theory, we only need to know probabilities up to
> a normalizing constant. But when I dropped the normalizing step, I kept getting non-sensical results. Normalizing
> the vector seems to only lead to a slight speed decrease (and the estimates themselves are substantively useful), so
> it's not a huge concern---though I'm wondering what exactly is going wrong.

The problem is that sum(exp(theta)) varies as theta varies.
I was pretty tired last night when I was looking at it. It's
the same denominator in each term, but it's not constant overall!
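To spell that out with a tiny Python check (not Stan; `etas` stands in for the linear predictors gamma[k] * x[n] for one observation): the denominator is shared across the K terms for a given n, but it is a function of the parameters, so it cannot be dropped as a constant of the log posterior:

```python
import math

def denom(etas):
    """The softmax normalizer: sum of exp over the linear predictors."""
    return sum(math.exp(e) for e in etas)

# For one observation, all K mixture terms share this denominator...
etas_before = [0.5, -0.2, 0.0]
# ...but a new draw of the coefficients gives a new denominator:
etas_after = [1.5, -0.2, 0.0]

print(denom(etas_before) == denom(etas_after))  # False: not a constant
```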

> Finally, I'm interested in what Ben has in mind on data augmentation. I know I've talked this through a bit with some of
> the Stan folks and we never really arrived at a sensible solution.
>
> Thanks again for the help---and for the reassurance that my code isn't totally embarrassing (if a bit R-like; guilty as
> charged on that one).

I just meant the syntactic conventions, like "for(". In R,
everyone writes everything like it's a function, even return.

- Bob

Leo Bastos

Apr 3, 2014, 7:27:07 AM
to stan-...@googlegroups.com
Hi,

I know this is an old thread, but I am facing a problem where data augmentation could in theory improve my MCMC: in my model (depending on the prior) I would have very nice full conditional distributions that I could sample from directly, whereas without the data augmentation step I must sample from proposals instead.

Anyway, my code seems to work nicely in Stan without the data augmentation step, but I am puzzled by the sentence in the email: "this part is executed via data augmentation, which is obviously impossible in Stan." I wonder why it is obviously impossible? Could someone give me a hint? I tried to do the data augmentation in Stan and failed, mainly because I would need an integer parameter. Is that the reason, or is there something else that explains why data augmentation cannot be done in Stan?

Cheers,
Leo Bastos

Bob Carpenter

Apr 3, 2014, 7:35:12 AM
to stan-...@googlegroups.com
I don't see the context of that quote on data augmentation
being impossible in Stan. You can certainly add parameters
in Stan, but as you note, not if they're integers (though often
you can marginalize out the integer parts).
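As a sketch of the marginalization route (a hypothetical model, not the one in this thread: a two-component normal mixture with mixing weight lambda and a latent integer indicator per observation), instead of declaring the indicator as a parameter, sum over its two values:

```stan
model {
  for (n in 1:N)
    increment_log_prob(log_sum_exp(
      log(lambda) + normal_log(y[n], mu[1], sigma[1]),     // indicator == 1
      log1m(lambda) + normal_log(y[n], mu[2], sigma[2]))); // indicator == 2
}
```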

- Bob
