Stan sampling speed with large dataset


Tom Wallis

Jan 8, 2013, 11:47:39 AM
to stan-...@googlegroups.com
I am attempting to model binary responses of 3 human subjects under a number of conditions using a GLM, with additional upper (lambda) and lower (gamma) asymptotes. There is quite a lot of data (16740 responses) and I would like to test for interactions amongst a number of factors, meaning that the linear model design matrix is quite large (37 columns). This model takes longer than I expected to sample -- still running in parallel on a 2.3 GHz quad core i7 drawing 2000 iterations from each of 4 chains, after 8 or 9 hours. The model file and data are attached.

In general, how will Stan's sampling speed scale with the size of the dataset? Are datasets / models this big simply going to require a lot of time to sample? Should I try to reduce the number of coefficients I'm trying to estimate (columns in the design matrix) to make this more feasible? Sorry for these newbie questions; I am not a computer scientist.

I've done my best to vectorise the actual model where possible, but if anyone has further suggestions please let me know.

Thanks as always for your support.

Tom
model.txt
test_data.txt

Ben Goodrich

Jan 8, 2013, 4:07:17 PM
to stan-...@googlegroups.com
The wall time is going to increase with the number of observations but it isn't the dominant factor. Reducing the number of parameters would help with speed, but then you might be estimating a model that you don't like.

One thing you can do as of v1.1.0 is replace

  for (s in 1:S) {
    row(e_beta, s) ~ normal(0, 1);
  }

with just

e_beta ~ normal(0,1);

I think we found before with your model that it was somewhat faster to use arrays of vectors rather than a matrix because the leading index of the former can be accessed more efficiently. Is there some way to sum over the levels of ss and use the binomial density rather than a Bernoulli (see section 13.3 of the manual; it depends on whether X varies by individual or only by level of ss)?
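[Editor's note: the Bernoulli-to-binomial collapse works because the likelihood factorizes. For n trials at the same probability p with k successes, the summed Bernoulli log likelihood differs from the binomial log likelihood only by log C(n, k), a constant that doesn't depend on p and so doesn't affect sampling. A quick Python sanity check of that identity, illustrative only and not from the thread:]

```python
import math
import random

random.seed(1)
p = 0.3
trials = [1 if random.random() < p else 0 for _ in range(50)]
n = len(trials)
k = sum(trials)

# Summed Bernoulli log likelihood: each trial contributes log p or log(1 - p).
bernoulli_ll = sum(math.log(p) if t else math.log(1 - p) for t in trials)

# Binomial log likelihood: same terms plus the combinatorial constant
# log C(n, k), which does not depend on p.
binomial_ll = (math.log(math.comb(n, k))
               + k * math.log(p) + (n - k) * math.log(1 - p))

# The two differ exactly by log C(n, k).
assert abs((binomial_ll - bernoulli_ll) - math.log(math.comb(n, k))) < 1e-9
```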

Ben

Bob Carpenter

Jan 8, 2013, 4:23:05 PM
to stan-...@googlegroups.com
Are you running optimized code (O=3)? Are you running
from R or from the command line, and if so, how is the
memory on the machine doing overall? It can be a problem
in general if you get close to swapping and a big problem
if you start swapping (memory into and out of files).

I'm afraid the simple answer is yes, bigger models
take longer to sample. The time that it takes to
calculate a log probability and gradient
for a model with a fixed set of parameters and priors
will be proportional to the data size plus a constant
for the work done for the priors. Now if the data's
much larger than the set of parameters, the data size
should dominate the calculations.

The issue is with the posterior geometry. In some
cases, more data can make the problem faster by more
tightly constraining the posterior. In other problems,
the hierarchical structure can present problems for the
sampler getting stuck in regions of low hierarchical
variance (see the discussion of Neal's funnel in the manual).

The code's pretty tight as you wrote it. The only suggestions
I have for improving its speed are to:

1. Further vectorize this operation:

for (d in 1:D) {
  for (s in 1:S) {
    beta[s,d] <- mu[d] + e_beta[s,d] * sigma[d]; # "Matt trick" beta.
  }
}

to

for (s in 1:S)
  beta[s] <- mu + e_beta[s] .* sigma;

The ".*" is componentwise multiplication as in MATLAB.
This may not even be a noticeable speed improvement, but
it simplifies the model.
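[Editor's note: the componentwise line implements the "Matt trick" (non-centered parameterization): shifting and scaling a standard normal e_beta yields a draw from normal(mu, sigma). A plain-Python sanity check of that identity, illustrative only and not Stan code:]

```python
import random
import statistics

random.seed(0)
mu, sigma = 2.0, 0.5

# Draw standard normals, then shift and scale them,
# as in beta = mu + e_beta .* sigma.
e = [random.gauss(0.0, 1.0) for _ in range(100_000)]
beta = [mu + e_s * sigma for e_s in e]

# The transformed draws should have mean ~mu and standard deviation ~sigma.
assert abs(statistics.mean(beta) - mu) < 0.02
assert abs(statistics.stdev(beta) - sigma) < 0.01
```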

2. Reparameterize your gamma prior:

sigma ~ gamma(1,0.5);

which is equivalently

sigma ~ exponential(0.5);

in Stan's notation. This won't save you much time if any
because the gamma implementation's clever enough not to
compute the Gamma function for fixed alpha. This won't
help much, but then you can go to this:

parameters {
  vector<lower=exp(0),upper=exp(20)>[D] sigma_raw;
  ...
transformed parameters {
  vector<lower=0,upper=20>[D] sigma;
  for (d in 1:D) sigma[d] <- log(sigma_raw[d]);
  ...
model {
  sigma_raw ~ uniform(exp(0),exp(20));
  ...

This probably also won't make much of a difference because sigma
isn't a very long vector. It might help with the geometry, though.
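[Editor's note: the gamma-to-exponential identity in point 2 is easy to confirm numerically. With shape 1, the gamma density rate^a * x^(a-1) * exp(-rate * x) / Gamma(a) collapses to rate * exp(-rate * x), the exponential density. A quick Python sketch, illustrative only:]

```python
import math

rate = 0.5

def gamma_pdf(x, shape, rate):
    # Gamma density in shape/rate parameterization, as Stan uses.
    return (rate ** shape) * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

def exponential_pdf(x, rate):
    return rate * math.exp(-rate * x)

# With shape = 1 the two densities agree everywhere on x > 0.
for x in [0.1, 1.0, 2.5, 10.0]:
    assert abs(gamma_pdf(x, 1.0, rate) - exponential_pdf(x, rate)) < 1e-12
```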

3. Convert x and beta to types:

vector[D] x[N];
vector[D] beta[S];

That is make them arrays of vectors. The reason is that it
doesn't require any copying or allocation in the loops to pull
out a vector member of an array. Calling row(x,n) has to
do two bad things: access the values of x in non-local order [because
the matrices are stored column major], and create an intermediate
row vector data structure to hold the results.

4. Switch to a logistic regression from a probit. The
cumulative distribution function Phi() is a lot less efficient
than the inverse logit.


5. Aggregate the data in y and replace the Bernoulli
with a binomial. There's an example in the manual section
of optimization on this. It looks possible because it looks like
x has many repeated rows, and we know ss[n] only takes on
three values for the three subjects.

- Bob

Ben Goodrich

Jan 8, 2013, 4:38:21 PM
to stan-...@googlegroups.com
On Tuesday, January 8, 2013 4:23:05 PM UTC-5, Bob Carpenter wrote:
4.  Switch to a logistic regression from a probit.  The
cumulative distribution function Phi() is a lot less efficient
than the inverse logit.

There is a paper here

http://www.jiem.org/index.php/jiem/article/download/60/27

that approximates the normal CDF with, in Stan notation,

inv_logit(0.07056 * pow(z, 3) + 1.5976 * z)

where z is the linear predictor in your case. That would be essentially as fast as logit but still essentially a probit model.
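[Editor's note: the quality of that approximation is easy to check against the exact normal CDF, computable via the error function; its maximum absolute error over a wide range of z is on the order of 1e-4. A Python check, illustrative only:]

```python
import math

def phi(z):
    # Exact standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def phi_approx(z):
    # The logistic approximation quoted above.
    return inv_logit(0.07056 * z ** 3 + 1.5976 * z)

# Maximum absolute error over a grid of z values in [-4, 4].
zs = [i / 100.0 for i in range(-400, 401)]
max_err = max(abs(phi(z) - phi_approx(z)) for z in zs)
assert max_err < 1e-3
```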

Ben

 

Ben Goodrich

Jan 8, 2013, 4:43:29 PM
to stan-...@googlegroups.com
On Tuesday, January 8, 2013 4:07:17 PM UTC-5, Ben Goodrich wrote:
One thing you can do as of v1.1.0 is replace

  for (s in 1:S) {
    row(e_beta, s) ~ normal(0, 1);
  }

with just

e_beta ~ normal(0,1);

Actually, this doesn't parse in Stan v1.1.0. Never mind.

Ben

Tom Wallis

Jan 8, 2013, 5:19:28 PM
to stan-...@googlegroups.com
Thanks for your suggestions Bob and Ben. 

I'm running with O=3 optimisation through R. Memory didn't seem to be an issue.

I've implemented all except for Bob's second suggestion regarding reparameterising the gamma (model and data files attached). This compiles, and hopefully runs much faster (will report back). Thanks for your help! The collapsing into a binomial is a good suggestion and works here, but for some other datasets I have where there are many variable combinations it won't work so well.

Are there plans in the new version to vectorise the binomial function?

Thanks again!


On Tuesday, January 8, 2013 4:23:05 PM UTC-5, Bob Carpenter wrote:
model.txt
test_data.txt

Ben Goodrich

Jan 8, 2013, 5:29:05 PM
to stan-...@googlegroups.com
On Tuesday, January 8, 2013 5:19:28 PM UTC-5, Tom Wallis wrote:
Are there plans in the new version to vectorise the binomial function?

The binomial density accepts vector arguments now.

Ben

Bob Carpenter

Jan 8, 2013, 5:30:32 PM
to stan-...@googlegroups.com

On 1/8/13 5:19 PM, Tom Wallis wrote:
> Thanks for your suggestions Bob and Ben.
>
> I'm running with O=3 optimisation through R. Memory didn't seem to be an issue.
>
> I've implemented all except for Bob's second suggestion regarding reparameterising the gamma (model and data files
> attached)

Let us know if it works better.

Given that it's a regression, I should've also mentioned
initialization. I don't know how long it's taking your
models to converge, but it can be problematic if you start
with our default over-dispersed initializations. You might
want to try your own inits, or at least try init=0 to use
all-zero inits, which are usually more stable for regression
problems.

Also, those upper-and-lower-bounded variables will all start
with values around the middle of their ranges.

> This compiles, and hopefully runs much faster (will report back). Thanks for your help! The collapsing into a
> binomial is a good suggestion and works here, but for some other datasets I have where there are many variable
> combinations it won't work so well.
>
> Are there plans in the new version to vectorise the binomial function?

Yes, we'll have fully vectorized binomial and binomial_logit
in the next release. I've been considering adding
a binomial_probit and bernoulli_probit, too.

That should make things much faster.

- Bob

Tom Wallis

Jan 8, 2013, 5:51:14 PM
to stan-...@googlegroups.com
Hi Ben,

Instead of

for (n in 1:N) {
    #eta[n] <- gamma + (1 - gamma - lambda[ss[n]]) * inv_logit( dot_product(x[n],beta[ss[n]]) );
    y[n] ~ binomial(n_trials[n], (gamma + (1 - gamma - lambda[ss[n]]) * inv_logit( dot_product(x[n],beta[ss[n]]) )) ); 
  }

I tried writing the last lines of the model as

for (n in 1:N) {
    eta[n] <- gamma + (1 - gamma - lambda[ss[n]]) * inv_logit( dot_product(x[n],beta[ss[n]]) );
  }

y ~ binomial(n_trials,eta); 

But got an error about "couldn't find the function binomial_log". Are you sure it's vectorised, or did I make a mistake there?

Ben Goodrich

Jan 8, 2013, 6:10:09 PM
to stan-...@googlegroups.com
On Tuesday, January 8, 2013 5:51:14 PM UTC-5, Tom Wallis wrote:
Hi Ben,

Instead of

for (n in 1:N) {
    #eta[n] <- gamma + (1 - gamma - lambda[ss[n]]) * inv_logit( dot_product(x[n],beta[ss[n]]) );
    y[n] ~ binomial(n_trials[n], (gamma + (1 - gamma - lambda[ss[n]]) * inv_logit( dot_product(x[n],beta[ss[n]]) )) ); 
  }

I tried writing the last lines of the model as

for (n in 1:N) {
    eta[n] <- gamma + (1 - gamma - lambda[ss[n]]) * inv_logit( dot_product(x[n],beta[ss[n]]) );
  }

y ~ binomial(n_trials,eta); 

But got an error about "couldn't find the function binomial_log". Are you sure it's vectorised, or did I make a mistake there?

The C++ implementation of the binomial density in Stan looks like it should accept vector arguments, but maybe that hadn't been exposed to the parser as of v1.1.0.

But if I understand your model correctly, you shouldn't have to loop over N once you redefine y as the number of successes and use binomial(). You would just loop over the 3 subjects, or maybe the 3-by-something subject-condition combinations, or whatever is the unit of analysis that experiences multiple independent trials with the same probability.
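[Editor's note: the aggregation Ben describes amounts to grouping trials by their (subject, condition) cell and counting trials and successes, collapsing thousands of Bernoulli rows into a handful of binomial rows. A Python sketch with hypothetical toy data, just to illustrate the collapse:]

```python
from collections import defaultdict

# Hypothetical raw data: one row per trial, with a subject id,
# a condition label, and a 0/1 response.
trials = [
    ("s1", "A", 1), ("s1", "A", 0), ("s1", "A", 1),
    ("s1", "B", 0), ("s2", "A", 1), ("s2", "A", 1),
]

# Collapse to binomial form: per (subject, condition) cell,
# record the number of trials and the number of successes.
counts = defaultdict(lambda: [0, 0])  # cell -> [n_trials, successes]
for subj, cond, y in trials:
    counts[(subj, cond)][0] += 1
    counts[(subj, cond)][1] += y

for cell, (n, k) in sorted(counts.items()):
    print(cell, n, k)
# ('s1', 'A') 3 2
# ('s1', 'B') 1 0
# ('s2', 'A') 2 2
```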

Ben
 

Tom Wallis

Jan 8, 2013, 6:48:26 PM
to stan-...@googlegroups.com
Ah, right you are. I could loop over the three subjects.

That's running much faster. Did 10,000 iterations of 4 chains in around 45 minutes. Thanks!



Bob Carpenter

Jan 8, 2013, 6:53:40 PM
to stan-...@googlegroups.com


On 1/8/13 6:48 PM, Tom Wallis wrote:
> Ah, right you are. I could loop over the three subjects.
>
> That's running much faster. Did 10,000 iterations of 4 chains in around 45 minutes. Thanks!

Glad to hear it and thanks for reporting back.

I'm guessing most of the speed gain comes from
the binomial conversion.

- Bob