step function in Stan


Linas Mockus

Dec 19, 2013, 4:58:07 PM
to stan-...@googlegroups.com
rstan users:

Is it possible to implement the attached WinBUGS model in RStan?

model {
    for (i in 1 : N) {
        y[i]~dnorm(mu[i],tau)
        mu[i]<-(alpha1*x[i]+beta1)*step(x[i]-gamma)+(alpha2*x[i]+beta2)*(1-step(x[i]-gamma))
    }
    alpha1~dnorm(0,1.0E-6)
    alpha2~dnorm(0,1.0E-6)
    beta1~dnorm(0,1.0E-6)
    gamma~dunif(0,1)
    beta2<-alpha1*gamma+beta1-alpha2*gamma
    tau~dgamma(1.0E-3,1.0E-3)
    sigma<-1/sqrt(tau)
}
piecewise.odc

Michael Betancourt

Dec 19, 2013, 5:04:45 PM
to stan-...@googlegroups.com
Step functions are discontinuous and hence not appropriate for the HMC sampler underneath Stan.
> --
> You received this message because you are subscribed to the Google Groups "stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

Andrew Gelman

Dec 19, 2013, 5:09:51 PM
to stan-...@googlegroups.com
Michael:

Maybe that reaction is too strong? After all, to the extent that HMC fails, it turns into Metropolis, so Stan shouldn't really perform much worse than a traditional MCMC algorithm, right? And I think the quick answer to the question is that the model _can_ be written in Stan easily enough; we do actually have the ability to implement a step function.

Linas:

Finally, channeling Bob channeling me, let me say:
(a) You should probably use stronger priors for alpha1, alpha2, and beta1. These super-flat priors can allow your sampler to waste its time moving around in parts of parameter space that you have no interest in.
(b) Don't use that gamma(.001,.001) prior. That's bad bad news (see my 2006 paper).

Bob Carpenter

Dec 19, 2013, 5:11:44 PM
to stan-...@googlegroups.com


On 12/19/13, 4:58 PM, Linas Mockus wrote:
> rstan users:
>
> Is it possible to implement the attached WinBUGS model in RStan?

You could translate this one line-for-line into Stan.
Presumably x[i] == gamma won't come up, but if it did,
you'd wind up using both terms.

Conditionals are more efficient in Stan, and arguably easier
to understand:

for (i in 1:N) {
  if (x[i] > gamma)
    y[i] ~ normal(alpha1 * x[i] + beta1, sigma);
  else
    y[i] ~ normal(alpha2 * x[i] + beta2, sigma);
}

Note that I'm skipping the case where x[i] == gamma, because
presumably you didn't want both terms used in that case.

Note that Stan's normal() takes a standard deviation parameter,
not a precision parameter.

For maximum efficiency, you'd vectorize:

for (i in 1:N) {
  if (x[i] > gamma)
    mu[i] <- alpha1 * x[i] + beta1;
  else
    mu[i] <- alpha2 * x[i] + beta2;
}
y ~ normal(mu, sigma);

We've already translated all of the BUGS models if you
want more examples. And there's an appendix in the manual comparing
BUGS's and Stan's modeling languages.

We'd recommend at least weakly informative priors instead
of the gamma and normals you're using. BUGS has changed their
examples away from those gammas. Stan supports improper priors,
so you can also just drop them as long as the posterior is proper.

- Bob

>
> model {
> for (i in 1 : N) {
> y[i]~dnorm(mu[i],tau)
> mu[i]<-(alpha1*x[i]+beta1)*step(x[i]-gamma)+(alpha2*x[i]+beta2)*(1-step(x[i]-gamma))
> }
> alpha1~dnorm(0,1.0E-6)
> alpha2~dnorm(0,1.0E-6)
> beta1~dnorm(0,1.0E-6)
> gamma~dunif(0,1)
> beta2<-alpha1*gamma+beta1-alpha2*gamma
> tau~dgamma(1.0E-3,1.0E-3)
> sigma<-1/sqrt(tau)
> }
>

Michael Betancourt

Dec 19, 2013, 5:21:38 PM
to stan-...@googlegroups.com
Yeah, but then we have a false expectation for HMC-like performance and I don't want
anyone thinking ill of Stan because of that. Such models can also significantly stress
adaptation (namely the step size), upping the potential for poor results.

Marcus Brubaker

Dec 19, 2013, 7:54:37 PM
to stan-...@googlegroups.com
Because this comes up a lot (people asking about fitting discontinuous models with Stan) I thought I'd expand on *why* it will be bad.

If the posterior is very discontinuous or has no gradient, then Stan (well, HMC/NUTS specifically) is going to be, at the very least, extremely inefficient.  To be clear, the Markov chain that Stan simulates will technically mix eventually for most such models...it's just that none of us are likely to be alive when that finally happens.  You'd be better off actually using random-walk Metropolis (one reason we should have something like a good adaptive RWM implemented in Stan).  The inefficiency comes from at least two places:

- Unnecessary use of autodiff.  Considering this is probably 95% or more of the time spent sampling, that's big.
- Unnecessary evaluations of the log probability caused by unnecessarily long trajectories. If the gradient is useless, then your best bet is to just take a single step and accept or reject.

A model like this would be one where the parameters are, e.g., rounded to the nearest integer or piped through a step function, so that they have no gradient.
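To make that concrete, here's a tiny Python sketch (not Stan code; all names here are made up for illustration) showing why a step-like transform starves the sampler of gradient information:

```python
def finite_diff_grad(f, x, eps=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# A parameter piped through round() is flat almost everywhere,
# so its gradient is zero and tells HMC nothing about where to go.
step_like = lambda x: float(round(x))
smooth = lambda x: x * x  # for contrast: gradient 2x

grad_step = finite_diff_grad(step_like, 0.3)  # 0.0 -- no information
grad_smooth = finite_diff_grad(smooth, 0.3)   # ~0.6 -- points uphill
```

Away from the jump the gradient is exactly zero, so every leapfrog step is flying blind.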

Now, maybe your posterior is *mostly* continuous but has a few discontinuities (for example, a mixture of a Gaussian and a uniform over a small region).  Such a model can work OK with Stan (using HMC/NUTS); however, it may cause adaptation to break or result in inefficient sampling in some parts of the space (for example, you may see a lot of parameter "sticking" in your trace plots).

At any rate, Linas's model is somewhere between the extremes.  It's possible that Stan would work OK, but for all such models you can't expect the same level of performance and should approach the samples with a healthy amount of skepticism.  As the WinBUGS manual says, "Beware: MCMC sampling can be dangerous!"  Stan is no exception to that, although we like to think that maybe it's a little bit less so.

Cheers,
Marcus



Linas Mockus

Dec 19, 2013, 8:55:28 PM
to stan-...@googlegroups.com
Thank you all.  Technically, such functions model actual chemical processes, and in reality the step is not so sharp.  One way is to add some artificial parameter to the underlying model to "soften" the step so it becomes "acceptable" to the algorithm.  Another way is to implement the step function as a "soft" step rather than a mathematical step.  Because all measurements have error, perhaps it doesn't matter which way the discontinuity in the derivative is approached.

Marcus Brubaker

Dec 19, 2013, 8:59:15 PM
to stan-...@googlegroups.com
If you soften the step (e.g., use a sigmoid or tanh type function) it should work a bit better (although adaptation may be problematic due to huge curvature variations).  You will still have similar issues if the soft step function becomes very steep because the derivatives become numerically (if not analytically) zero everywhere.
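As a sketch of that softening (Python rather than Stan, and the function names are hypothetical):

```python
import math

def soft_step(z, k=50.0):
    """Logistic approximation to step(z); larger k gives a sharper step,
    at the cost of larger curvature around the jump."""
    return 1.0 / (1.0 + math.exp(-k * z))

def mu(x, alpha1, beta1, alpha2, beta2, gamma, k=50.0):
    """Softened version of the piecewise mean: blends the two lines
    smoothly around x = gamma instead of switching discontinuously."""
    s = soft_step(x - gamma, k)
    return (alpha1 * x + beta1) * s + (alpha2 * x + beta2) * (1.0 - s)
```

With the model's continuity constraint beta2 = alpha1*gamma + beta1 - alpha2*gamma, both lines agree at x = gamma, so the softened mean passes through the same point as the hard-step version.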

Cheers,
Marcus




Bob Carpenter

Dec 19, 2013, 10:05:48 PM
to stan-...@googlegroups.com
I added warning sections in the manual for if_else(), step(),
and the rounding functions round(), floor(), and ceil(), indicating
that they break derivatives and hence can reduce the
efficiency of HMC sampling.

I don't have very concrete expectations of speed for HMC sampling,
and I suspect most of our users are in the same boat.
In Stan, efficiency is tied up with HMC through the geometry of the
unconstrained parameter space induced by the model (including the
effect of the size and shape of the data). On top of that, the time
to evaluate the log probability function depends on the size of the
expression tree implied by the model and data, and on the efficiency
of auto-diff on the particular formulation of the model.

- Bob

Bob Carpenter

Dec 19, 2013, 10:21:15 PM
to stan-...@googlegroups.com


On 12/19/13, 7:54 PM, Marcus Brubaker wrote:

...

> You'd be better
> off actually using Random-Walk Metropolis (one reason we should have something like a good adaptive RWM implemented in
> Stan).

This would be a great project for somebody to pick up.

The infrastructure's all in place now to add new sampling
techniques.

I think Peter got stuck on which form of adaptation
to use, and on the fact that it was so slow that it was
hard to test whether it was getting the right answers.

Do you have a suggestion for what kind of adaptation to
use?

- Bob

P.S. Another good project: add ensemble approaches like Goodman/Weare
and differential evolution/DREAM, though those are harder on the
infrastructure side and require some more thought on posterior
analysis.


Bob Carpenter

Dec 19, 2013, 10:28:20 PM
to stan-...@googlegroups.com


On 12/19/13, 7:54 PM, Marcus Brubaker wrote:
> ...
> Now, maybe your posterior is *mostly* continuous but has a few discontinuities (for example, a mixture of a Gaussian
> and a uniform over a small region).

Given that we don't have discrete parameters, users have
to marginalize out the mixture indicator.

So rather than

  z[n] ~ bernoulli(theta);
  if (z[n])
    y[n] ~ uniform(L,U);
  else
    y[n] ~ normal(mu,sigma);

we're suggesting writing a new distribution

p(y) =def= theta * uniform(y|L,U) + (1 - theta) * normal(y|mu,sigma)

implemented on the log scale as

  increment_log_prob(log_sum_exp(log(theta) + uniform_log(y[n],L,U),
                     log1m(theta) + normal_log(y[n],mu,sigma)));

I'm pretty sure the result's continuous.
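For anyone wanting to check the arithmetic, here is the same log-scale mixture in plain Python (a sketch with hypothetical helper names; math.log1p(-theta) plays the role of Stan's log1m(theta)):

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_mix(theta, lp1, lp2):
    """log(theta * exp(lp1) + (1 - theta) * exp(lp2)) without underflow."""
    return log_sum_exp(math.log(theta) + lp1, math.log1p(-theta) + lp2)

# Stays finite even where exp(lp) would underflow to zero:
val = log_mix(0.3, -1000.0, -1001.0)
```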

(And relevant to your message in the other thread on CAR, you'd want
to cache the log(theta) and log1m(theta) values into a local variable
and reuse in the loop over n.)

- Bob

Marcus Brubaker

Dec 20, 2013, 7:56:52 PM
to stan-...@googlegroups.com
Indeed, both would be a very good thing for someone to work on, although the ensemble samplers would be much harder from an engineering perspective.



Marcus Brubaker

Dec 20, 2013, 7:59:23 PM
to stan-...@googlegroups.com
On Thu, Dec 19, 2013 at 10:28 PM, Bob Carpenter <ca...@alias-i.com> wrote:


On 12/19/13, 7:54 PM, Marcus Brubaker wrote:
...

Now, maybe your posterior is *mostly* continuous but has a few discontinuities (for example, a mixture of a Gaussian
and a uniform over a small region).

Given that we don't have discrete parameters, users have
to marginalize out the mixture indicator.

So rather than

  z[n] ~ bernoulli(theta);
  if (z[n])
    y[n] ~ uniform(L,U);
  else
    y[n] ~ normal(mu,sigma);

we're suggesting writing a new distribution

  p(y) =def= theta * uniform(y|L,U) + (1 - theta) * normal(y|mu,sigma)

implemented on the log scale as

  increment_log_prob(log_sum_exp(log(theta) + uniform_log(y[n],L,U),
                     log1m(theta) + normal_log(y[n],mu,sigma)));

I'm pretty sure the result's continuous.


The resulting posterior is not continuous, because the uniform distribution has finite support, but I would expect NUTS/Stan to work OK on it.
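A quick numeric check of that jump (a Python sketch with made-up names, not Stan): the marginal density drops by theta / (U - L) as y crosses the uniform's boundary.

```python
import math

def mix_density(y, theta, lo, hi, mu, sigma):
    """Density of theta * Uniform(lo, hi) + (1 - theta) * Normal(mu, sigma).
    The uniform term vanishes outside [lo, hi], so the density jumps
    by theta / (hi - lo) at the endpoints."""
    norm = (math.exp(-0.5 * ((y - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))
    unif = 1.0 / (hi - lo) if lo <= y <= hi else 0.0
    return theta * unif + (1.0 - theta) * norm

inside = mix_density(0.999, 0.3, -1.0, 1.0, 0.0, 1.0)
outside = mix_density(1.001, 0.3, -1.0, 1.0, 0.0, 1.0)
jump = inside - outside  # ~ 0.3 / 2.0 = 0.15
```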
 
(And relevant to your message in the other thread on CAR, you'd want
to cache the log(theta) and log1m(theta) values into a local variable
and reuse in the loop over n.)

Yup, caching like that is always worth it in Stan.

Cheers,
Marcus

Bob Carpenter

Dec 21, 2013, 12:56:56 PM
to stan-...@googlegroups.com


On 12/20/13, 7:59 PM, Marcus Brubaker wrote:
> On Thu, Dec 19, 2013 at 10:28 PM, Marcus Brubaker wrote:
...

> increment_log_prob(log_sum_exp(log(theta) + uniform_log(y[n],L,U),
>                    log1m(theta) + normal_log(y[n],mu,sigma)));
>
> I'm pretty sure the result's continuous.
>
>
>
> The resulting posterior is not continuous because the uniform distribution has finite support, but I would expect
> NUTS/Stan to work ok on it.

Thanks for the clarification. Forgot about that
aspect of it.

- Bob
