Questions before I present shinyStan at nyhackR meetup


Jonah Sol Gabry

Feb 21, 2015, 4:00:16 PM
to stan...@googlegroups.com
Hi everyone, 

I was hoping to get your opinions on a few things before I present the new version of shinyStan at the nyhackR meetup on Wednesday. (There are already 60 people signed up!). 

1) Which model (or models) to use for demonstration  
I'm hoping to use one with at least somewhat broad appeal. Are there any Stan demo models that you think would work well or does anyone have a non-demo model that they think would be a good choice? 

I was also thinking it could be nice to use two versions of the same model but with different parameterizations (e.g. no Matt trick vs. Matt trick) where shinyStan reveals problems with the first one and shows much better results for the second. 
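For anyone unfamiliar with the trick, here is a quick illustrative sketch (Python, purely for exposition, not Stan code) of what the reparameterization does: the two forms define the same distribution for theta, but in a hierarchical model the sampler sees much friendlier geometry in the second.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mu, tau = 2.0, 0.5

# Centered parameterization: draw theta ~ normal(mu, tau) directly.
theta_centered = rng.normal(mu, tau, size=n)

# Non-centered ("Matt trick"): draw a standard normal, then shift and scale.
eta = rng.normal(0.0, 1.0, size=n)
theta_noncentered = mu + tau * eta

# Both define the same distribution for theta; the sampler just
# explores (eta, mu, tau) instead of (theta, mu, tau).
print(round(theta_centered.mean(), 2), round(theta_noncentered.mean(), 2))
print(round(theta_centered.std(), 2), round(theta_noncentered.std(), 2))
```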

2) What to say about Stan 
I’m assuming that not everyone there will be familiar with Stan, although many will. So I think some time should be spent describing (and pitching) Stan so that anyone who’s not familiar can catch up a bit before I dive into shinyStan. Two possibilities: I see that Daniel and Rob are signed up (thanks guys!); would either of you want to speak about Stan for a few minutes before I present shinyStan? If not then I’m happy to do it, but I could use some suggestions, i.e. what would you say if you had 5 minutes to talk about Stan? 

So… fire away. Thanks!

Jonah

p.s. I’m going to make a separate post later with instructions for trying out the new version. Of course, I’d love some feedback on that too. 

Ben Goodrich

Feb 21, 2015, 5:44:12 PM
to stan...@googlegroups.com
If we are pro-horseshoe now,

http://andrewgelman.com/2015/02/17/bayesian-survival-analysis-horseshoe-priors/

then Allen's example

https://ariddell.org/horseshoe-prior-with-stan.html

might be good. It doesn't converge by anyone's standard with the plain horseshoe, but you fail to reject the null hypothesis that all the chains have the same distribution if you Matt trick the prior. However, it still diverges a lot unless you set the acceptance target very close to 1. See attached, but you would have to run it in advance because it is slow.

With only 5 minutes, I would just say that Stan implements a new MCMC algorithm but that shinyStan can work with any mcmc.list.

Ben

horeshoe.R

Bob Carpenter

Feb 22, 2015, 4:29:14 AM
to stan...@googlegroups.com
I'd like to stress that I think ShinyStan should
be (a) decoupled from RStan, and (b) renamed.
I think it would be OK to have RStan depend on ShinyStan,
but I wouldn't want to insist on it being the only way to
look at posteriors.

The "competing" piece of software is Coda, maintained by
Martyn Plummer (the inventor of JAGS). Jonah's a musician,
so maybe he can come up with some other name that's related. I'd
suggest "refrain", but I don't like its non-musical connotations.

Or we could try to team up with Martyn. The key is that
we want to get in Split R-hat (which Martyn said he was going
to use for Coda), and the n_eff calcs based on cross-chain
means and we want Andrew to be able to regularly update his
thinking on what the defaults should be.

And if it wasn't clear from the above, I think the coda-like
bits of RStan should be moved into this package, or yet a third
package. The point being that analyzing the posterior doesn't
need to be part of RStan itself --- it should be flexible enough
to be used with anyone's MCMC.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <horeshoe.R>

Jonah

Feb 22, 2015, 2:41:19 PM
to stan...@googlegroups.com
Hi Bob,

Some comments inline, below.

On Sunday, February 22, 2015 at 4:29:14 AM UTC-5, Bob Carpenter wrote:
> I'd like to stress that I think ShinyStan should
> be (a) uncoupled from depending on RStan, and (b) renamed.
> I think it would be OK to have RStan depend on ShinyStan,
> but I wouldn't want to insist on it being the only way to
> look at posteriors.

From an R package perspective, shinyStan no longer depends on RStan (RStan is listed under 'Suggests' rather than 'Depends' or 'Imports'), so the user doesn't need RStan installed to use shinyStan. That said, there are a few features of shinyStan that are only for stanfit objects (e.g. the summaries of HMC/NUTS sampler parameters), and I'd like to add some more HMC-related content. In general, though, shinyStan works great with Coda's mcmc.list objects and, more generally, with any 3D array (iterations x chains x parameters) or a list of 2D arrays (iterations x parameters) where each list element is a different chain.

The only other thing that's just for RStan users (or soon enough for rstanDemo users) is that in the new shinyStan the launch_shinystan_demo function gives the user the choice of launching the default demo (no need for RStan) or running any of the Stan demo models and exploring the results with shinyStan.
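To make the accepted layouts concrete, here is a small sketch (numpy, for illustration only; shinyStan itself is R) of how the list-of-matrices input corresponds to the 3D-array input:

```python
import numpy as np

rng = np.random.default_rng(1)
iterations, chains, params = 500, 4, 3

# One (iterations x parameters) matrix per chain: the list-of-2D-arrays input.
chain_list = [rng.normal(size=(iterations, params)) for _ in range(chains)]

# Stacking along a new middle axis gives the equivalent
# (iterations x chains x parameters) 3D-array input.
sims = np.stack(chain_list, axis=1)
print(sims.shape)  # (500, 4, 3)
```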

>
> The "competing" piece of software is Coda, maintained by
> Martyn Plummer (the inventor of JAGS). Jonah's a musician,
> so maybe he can come up some other name that's related. I'd
> suggest "refrain", but I don't like it's non-musical connotations.

I'm all for changing the name if we can come up with something good. I like "refrain" but I agree about the other connotations.

>
> Or we could try to team up with Martyn. The key is that
> we want to get in Split R-hat (which Martyn said he was going
> to use for Coda), and the n_eff calcs based on cross-chain
> means and we want Andrew to be able to regularly update his
> thinking on what the defaults should be.

I think shinyStan already does more or less everything Coda does plus a ton more and is much easier to use. It'd be awesome to collaborate in some way with Martyn, but I wonder what that collaboration would involve. Did you have something in mind?

shinyStan already uses the same Rhat and n_eff formulas as RStan. And that can easily be updated if Andrew has new ideas for defaults. The CRAN version of Coda doesn't look like it's been updated since 2012, but it'd be good if it were doing the same calculations (both for statistical reasons and to reduce confusion for people who might be transitioning to Stan from JAGS or BUGS).

>
> And if it wasn't clear from the above, I think the coda-like
> bits of RStan should be moved into this package, or yet a third
> package. The point being that analyzing the posterior doesn't
> need to be part of RStan itself --- it should be flexible enough
> to be used with anyone's MCMC.

I completely agree, which is why shinyStan can accept output in standard R formats rather than just stanfit objects. Maybe there should be an R package that includes the few posterior analysis functions from RStan, any other useful convenience functions, and also includes shinyStan (or whatever it ends up being called).

Jonah

Andrew Gelman

Feb 22, 2015, 4:15:24 PM
to stan...@googlegroups.com
Here I am, joining in on the silliest parts of the conversation. I’d just like to say that I’d like to avoid nerdy puns and I would prefer the programs to be either descriptive (“ShinyStan”, “StanDisplay”, “SimOutput”, whatever) or neutral (“R-hat”, etc). I am not averse to a descriptive name that is clever, but I _don’t_ like the idea of naming it “refrain” or whatever.
A

Jonah

Feb 22, 2015, 6:12:49 PM
to stan...@googlegroups.com
On Saturday, February 21, 2015 at 5:44:12 PM UTC-5, Ben Goodrich wrote:
> If we are pro-horseshoe now,
>
> http://andrewgelman.com/2015/02/17/bayesian-survival-analysis-horseshoe-priors/
>
> then Allen's example
>
> https://ariddell.org/horseshoe-prior-with-stan.html
>
> might be good. It doesn't converge by anyone's standard with the plain horseshoe, but you fail to reject the null hypothesis that all the chains have the same distribution if you Matt trick the prior. However, it still diverges a lot unless you set the acceptance target very close to 1.

Hey Ben, I like the horseshoe example, thanks. I noticed that if you set the acceptance target close to 1 like you did it does help cut down on the diverging errors but even still there are a few during sampling (post-warmup). Are we comfortable saying it's converged unless there are exactly zero?

Ben Goodrich

Feb 22, 2015, 6:22:20 PM
to stan...@googlegroups.com
On Sunday, February 22, 2015 at 6:12:49 PM UTC-5, Jonah wrote:
Hey Ben, I like the horseshoe example, thanks. I noticed that if you set the acceptance target close to 1 like you did it does help cut down on the diverging errors but even still there are a few during sampling (post-warmup). Are we comfortable saying it's converged unless there are exactly zero?

Not really. Also the p-value of a convergence test is close to 0.05 and whether it is above or below that depends on how much you thin. It would probably be better to jack up the acceptance rate even more or to inverse CDF into those Cauchies.

Ben

Jonah

Feb 22, 2015, 6:40:59 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Sunday, February 22, 2015 at 4:15:24 PM UTC-5, Andrew Gelman wrote:
> Here I am, joining in on the silliest parts of the conversation. I’d just like to say that I’d like to avoid nerdy puns and I would prefer the programs to be either descriptive (“ShinyStan”, “StanDisplay”, “SimOutput”, whatever) or neutral (“R-hat”, etc). I am not averse to a descriptive name that is clever, but I _don’t_ like the idea of naming it “refrain” or whatever.
> A

Yeah, you're probably right about going for descriptive. And I do like the fact that if you google shinyStan we have the top 5 results, and even if you google 'shiny stan' as two separate words we have the top result. I'm sure that with a nerdy pun for a name we'd do much worse on google.

Andrew Gelman

Feb 22, 2015, 6:47:39 PM
to stan...@googlegroups.com
Umm, no p-values, please.

Ben Goodrich

Feb 22, 2015, 6:54:05 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
Umm, no p-values, please.

Then please specify a value of B / W at which to decide that the chains do not have the same distribution.

Andrew Gelman

Feb 22, 2015, 7:13:08 PM
to Ben Goodrich, stan...@googlegroups.com
split R-hat < 1.1

Ben Goodrich

Feb 22, 2015, 7:32:28 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Sunday, February 22, 2015 at 7:13:08 PM UTC-5, Andrew Gelman wrote:
split R-hat < 1.1

Okay, that criterion will be satisfied for many Stan runs where the tails of the chains are different, the posterior standard deviations and other measures of uncertainty are wrong, the MCMC-CLT may not hold, and / or the user would be well-advised to respecify the Stan program to get better performance. It would even be satisfied in Allen's original horseshoe example, which I suggested Jonah use as an illustration of what can go wrong when you don't Matt trick the horseshoe prior.

On the other hand, if a user were only interested in the posterior median, then it is possible that chains --- which differ in distribution --- have similar medians that coincide with the posterior median. That could hold for the mean also, although it is harder to say without a MCMC-CLT.

This is essentially changing the question of whether the chains have converged in distribution to the posterior distribution to whether the median / mean has converged in probability to the posterior median / mean. Why do you find that preferable?

I find it unpreferable, in large part, because if you write a good Stan program for a dataset / model that is amenable to NUTS, then you can usually fail to reject the stricter criterion that all the chains have the same distribution. And I think we should be putting in mechanisms that force users to learn how to write good Stan programs, instead of watering down the convergence criteria so they can happily move on to making inferences.

Ben

Andrew Gelman

Feb 22, 2015, 7:42:04 PM
to stan...@googlegroups.com
I don’t want to water down criteria, and I’m fine with distributional measures.  I just don’t want to express them as significance tests.  The issue is not “Do the different chains have different distributions?”; it’s “How far apart are these distributions?”  In education research we talk about treatment effects in standard-deviation scales.  Similarly, if the potential scale reduction factor is less than 1.1, then not much is to be gained by further simulations.

I accept your point, Ben, that R-hat, or even split R-hat, does not always give a good estimate of the potential scale reduction factor.  It’s possible to have split R-hat < 1.1 but still to have poor mixing.  So I’m fine with other summaries as well.  I just would like our rule to be based on an absolute or proportional scale of mixing or misfit or whatever, not a p-value or a significance test.
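For concreteness, here is a simplified sketch of the split R-hat computation (Python, illustrative only; the production code in RStan handles more edge cases): each chain is split in half, and the classic potential scale reduction factor is computed on the half-chains, so poor mixing within a chain also inflates the statistic.

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat: split each chain in half, then compute the
    classic potential scale reduction factor on the half-chains."""
    chains = np.asarray(chains)                 # shape: (n_chains, n_iter)
    n = chains.shape[1] // 2
    halves = np.vstack([chains[:, :n], chains[:, n:2 * n]])
    m, n = halves.shape
    chain_means = halves.mean(axis=1)
    W = halves.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chain_means.var(ddof=1)             # between-chain variance
    var_plus = (n - 1) / n * W + B / n          # marginal variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(2)
good = rng.normal(size=(4, 1000))                    # well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one stray chain
print(split_rhat(good), split_rhat(bad))
```

With four well-mixed chains the ratio is essentially 1; shifting one chain by three standard deviations pushes it well past the 1.1 threshold.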

Ben Goodrich

Feb 22, 2015, 10:22:33 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
The B / W ratios pop up along with the p-values. So, if that ratio is greater than 1, then the distances among draws from different thinned chains are greater than the distances among draws within the same chain. I just don't know any way of saying that a B / W ratio <= 1 + t is acceptable but if the B / W ratio is > 1 + t, then you should not make inferences with those draws. I can say how often the B / W ratio is > 1 + t under the null hypothesis that the thinned chains are exchangeable and have the same distribution. I can say a B / W ratio > 1.25 will typically result in a p-value < 0.05. I can say a B / W ratio under 1 will always result in a p-value > 0.05. But when measured with energy distance, B / W is not a potential scale reduction factor for anything in particular, so I don't know any other way to use it for decision-making.
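To make that concrete, here is a toy version of a between/within distance ratio (Python, illustrative only; the actual statistic discussed above uses energy distance on the full multivariate draws):

```python
import numpy as np

def bw_ratio(chains):
    """Ratio of mean between-chain to mean within-chain pairwise
    absolute distances among draws: a toy stand-in for the
    energy-distance B / W statistic."""
    chains = [np.asarray(c) for c in chains]
    within = np.mean([np.abs(c[:, None] - c[None, :]).mean() for c in chains])
    between = np.mean([np.abs(a[:, None] - b[None, :]).mean()
                       for i, a in enumerate(chains)
                       for j, b in enumerate(chains) if i < j])
    return between / within

rng = np.random.default_rng(3)
same = [rng.normal(size=200) for _ in range(4)]
shifted = [rng.normal(size=200) for _ in range(3)] + [rng.normal(2.0, 1.0, 200)]
print(bw_ratio(same), bw_ratio(shifted))
```

Chains from the same distribution give a ratio near 1; one shifted chain pushes the ratio well above 1.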

Ben

Bob Carpenter

Feb 22, 2015, 11:30:15 PM
to stan...@googlegroups.com
I thought "Stan" itself was a nerdy pun (Eminem song
and Stanislaw Ulam).

Having Stan named "Stan" is a pain --- I like the
name, but it's an impossible search. RStan and
PyStan and CmdStan are all OK.

So whatever we name it, I would strongly recommend
something that's (a) easy to pronounce, and (b) currently
has zero hits on Google.

- Bob

Jonah

Feb 23, 2015, 12:25:36 AM
to stan...@googlegroups.com
> Not really. Also the p-value of a convergence test is close to 0.05 and whether it is above or below that depends on how much you thin. It would probably be better to jack up the acceptance rate even more or to inverse CDF into those Cauchies.
>
> Ben

I tried the inverse CDF method with the Cauchys and increased the acceptance rate and it seems to have made it worse for some reason. Unless I made a typo somewhere, doing this

parameters {
  vector<lower=0>[p] lambda_raw;
  real<lower=0> tau_raw;
}
transformed parameters {
  vector<lower=0>[p] lambda;
  real<lower=0> tau;
  for (j in 1:p)
    lambda[j] <- tan(pi() * (Phi_approx(lambda_raw[j]) - 0.5));
  tau <- tan(pi() * (Phi_approx(tau_raw) - 0.5));
}

with lambda_raw and tau_raw standard normal leads to a larger percentage of divergent transitions and a lot more dependence in terms of distance correlation. A bit surprising.
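As a sanity check on the transform itself (Python, illustrative only): applied to a half-normal, the Phi-then-Cauchy-quantile composition does produce half-Cauchy draws, which suggests the bad behavior is about posterior geometry rather than a coding error in the transform.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
z = np.abs(rng.normal(size=200_000))   # <lower=0> constraint: half-normal

# Normal CDF, then the Cauchy inverse CDF, mirroring
# tan(pi() * (Phi_approx(z) - 0.5)) from the Stan block above.
u = np.array([0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in z])
half_cauchy = np.tan(np.pi * (u - 0.5))

# A standard half-Cauchy has median tan(pi/4) = 1 and
# quartiles tan(pi/8) ~ 0.414 and tan(3*pi/8) ~ 2.414.
q25, q50, q75 = np.quantile(half_cauchy, [0.25, 0.5, 0.75])
print(round(q25, 2), round(q50, 2), round(q75, 2))
```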

Ben Goodrich

Feb 23, 2015, 12:34:35 AM
to stan...@googlegroups.com

I noticed that too (the whole posterior blows up for some chains). I even tried it with plain Phi() in case the fast approximation was not accurate enough, and it still blew up. This is definitely a concerning aspect of the horseshoe prior, since I have never seen the geometry get worse when you do the normal -> uniform -> inverse CDF dance.

Ben

Ben Goodrich

Feb 23, 2015, 12:48:49 AM
to stan...@googlegroups.com
On Monday, February 23, 2015 at 12:34:35 AM UTC-5, Ben Goodrich wrote:
I noticed that too (the whole posterior blows up for some chains). I even tried it with plain Phi() in case the fast approximation was not accurate enough, and it still blew up. This is definitely a concerning aspect of the horseshoe prior, since I have never seen the geometry get worse when you do the normal -> uniform -> inverse CDF dance.

The ratio-of-normals approach works better (although there is still divergence ...)

mc2 <-
'
data {
  int<lower=0> n;
  int<lower=0> p;
  matrix[n,p] X;
  vector[n] y;
}
parameters {
  vector[p] z;
  vector<lower=0>[p] lambda_num;
  vector<lower=0>[p] lambda_denom;
  real<lower=0> z_tau;
  real<lower=0> sigma;
}
transformed parameters {
  vector[p] beta;
  real<lower=0> tau;
  tau <- tan(pi() * (Phi_approx(z_tau) - 0.5));
  beta <- z .* lambda_num ./ lambda_denom * tau;
}
model {
  lambda_num ~ normal(0, 1);
  lambda_denom ~ normal(0,1);
  z_tau ~ normal(0, 1);
  z ~ normal(0,1);
  y ~ normal(X * beta, sigma);
}
'
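A quick check that the ratio construction targets the right prior (Python, illustrative only): the ratio of two independent half-normals, as in lambda_num ./ lambda_denom above, is indeed standard half-Cauchy.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Ratio of two independent half-normals, as in lambda_num ./ lambda_denom.
num = np.abs(rng.normal(size=n))
denom = np.abs(rng.normal(size=n))
ratio = num / denom

# The ratio of independent standard normals is Cauchy, so the ratio of
# their absolute values is half-Cauchy: median 1, quartiles
# tan(pi/8) ~ 0.414 and tan(3*pi/8) ~ 2.414.
q25, q50, q75 = np.quantile(ratio, [0.25, 0.5, 0.75])
print(round(q25, 2), round(q50, 2), round(q75, 2))
```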


Ben

Aki Vehtari

Feb 23, 2015, 10:13:39 AM
to stan...@googlegroups.com
How about Tomi's version?
From http://becs.aalto.fi/en/research/bayes/diabcvd/wei_hs.stan

beta_biom <- hs_prior_lp(tau_s1_biom_raw, tau_s2_biom_raw, tau1_biom_raw, tau2_biom_raw) .* beta_biom_raw;

where

beta_biom_raw ~ normal(0.0, 1.0);

and

vector hs_prior_lp(real r1_global, real r2_global, vector r1_local, vector r2_local) {
  r1_global ~ normal(0.0, 1.0);
  r2_global ~ inv_gamma(0.5, 0.5);

  r1_local ~ normal(0.0, 1.0);
  r2_local ~ inv_gamma(0.5, 0.5);

  return (r1_global * sqrt(r2_global)) * r1_local .* sqrt_vec(r2_local);
}
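The half-Cauchy here comes from a scale-mixture identity: if r1 ~ normal(0, 1) and r2 ~ inv_gamma(0.5, 0.5), then r1 * sqrt(r2) is a Student-t with 1 degree of freedom, i.e. a standard Cauchy. A quick numerical check (Python, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# r1 ~ normal(0, 1), r2 ~ inv_gamma(0.5, 0.5), as in hs_prior_lp.
# InvGamma(0.5, 0.5) is the reciprocal of Gamma(shape=0.5, rate=0.5),
# i.e. scale=2 in numpy's parameterization.
r1 = rng.normal(size=n)
r2 = 1.0 / rng.gamma(shape=0.5, scale=2.0, size=n)
draws = r1 * np.sqrt(r2)

# A standard Cauchy has quartiles at -1 and +1.
q25, q75 = np.quantile(draws, [0.25, 0.75])
print(round(q25, 2), round(q75, 2))
```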

Aki

Ben Goodrich

Feb 25, 2015, 2:29:56 AM
to stan...@googlegroups.com
On Monday, February 23, 2015 at 10:13:39 AM UTC-5, Aki Vehtari wrote:
How about Tomi's version?
From http://becs.aalto.fi/en/research/bayes/diabcvd/wei_hs.stan

I'm seeing about the same thing. To summarize for these data (attached),
  • With the literal horseshoe prior in terms of a normal scale-mixture, NUTS doesn't converge. Basically, there is a funnel problem.
  • With the Matt-trick horseshoe prior but literal Cauchy priors on the positive parameters, NUTS does OK, but there are a few divergent transitions even after the warmup period. This looks akin to the funnel-reparameterization example.
  • With the Matt-trick horseshoe prior but indirect Cauchy priors on the positive parameters (using the Cauchy inverse CDF to transform the standard normal CDF of an a priori standard half-normal), NUTS blows up for some chains and returns parameter values on the order of 10^20.
  • With the Matt-trick horseshoe prior but indirect Cauchy priors on the positive parameters (using the ratio-of-half-normals approach), NUTS seems to yield reasonable posterior draws, but there are a lot of divergent transitions.
  • With the Tomi-trick horseshoe prior, NUTS again seems to yield reasonable posterior draws, but there are a lot of divergent transitions.

We talked about this briefly with Mike at the Stan meeting today. Without any data, the last three should be fine because the priors amount to products of independent standard normals (or some inverse gammas in Tomi's parameterization). But with the data, the posterior distribution is less amenable to NUTS than the prior distribution.

One issue might be that a horseshoe prior may not be reasonable for an intercept, since there is no obvious reason to concentrate prior beliefs about it near zero. But even if I center the outcome variable, the above holds. Another issue might be that some prior is needed on the standard deviation of the errors, but I have tried a Jeffreys prior and an exponential prior for that without success.

Under the ratio-of-half-normals approach to tau, you get this pairs plot (with non-divergent transitions below the diagonal and divergent transitions above). It seems that for small (but not too small) values of the denominator, there is a concentration of divergent transitions that coincides with above-average values of lp__. I think this suggests that the curvature near the mode along this dimension is severe while the curvature away from the mode is minimal, and thus NUTS cannot adapt well.

Thus, I think we need to see whether this is a more general issue before we recommend that horseshoe priors be used in something like stan_lm, particularly if we are not expecting users of stan_lm to do much in the way of diagnostics.

Ben


lars.R.dump

Bob Carpenter

Feb 25, 2015, 6:13:16 AM
to stan...@googlegroups.com

> On Feb 25, 2015, at 6:29 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> On Monday, February 23, 2015 at 10:13:39 AM UTC-5, Aki Vehtari wrote:
> ...
> Under the ratio of half-normals approach to tau, you get this pairs plot (with non-divergent transitions below the diagonal and divergent transitions above.

What exactly gets plotted in the divergent transitions case? Is it
just where the parameters started before running into numerical
issues? Or where they wind up if there was some progress in half
the binary tree when there was a divergence in the second half?

Given how far a single transition can move in HMC, neither of these
may be very informative to where the values were at just before
the divergent step.

A side question for Michael: why are these called "divergent"?
I believe it means we get non-finite values (NaN, +inf, -inf) on
the unconstrained scale, which can mean +inf or -inf on the
constrained scale, but not necessarily (e.g., positive constrained
-inf values mean underflow to 0 on the constrained scale).

> It seems that for small (but not too small) values of the denominator, there is a concentration of divergent transitions that coincides with above average values of lp__. I think this suggests that the curvature near the mode along this dimension is great while the curvature is minimal away from the mode, and thus NUTS cannot adapt well.

That seems like what it's suggesting. Are the scales the same on
the vertical and horizontal? It's hard to tell with R's default
labels.

> Thus, I think we need to see if this is a more general issue with situations like this before we recommend that horseshoe priors be used in something like stan_lm, particularly if we are not expecting users of stan_lm to do much in the way of diagnostics.

Agreed, but then it's always easy to agree to more research!

- Bob


Michael Betancourt

Feb 25, 2015, 6:32:25 AM
to stan...@googlegroups.com
> What exactly gets plotted in the divergent transitions case? Is it
> just where the parameters started before running into numerical
> issues? Or where they wind up if there was some progress in half
> the binary tree when there was a divergence in the second half?
>
> Given how far a single transition can move in HMC, neither of these
> may be very informative to where the values were at just before
> the divergent step.

Once a divergence is detected the tree building stops and the
current sample is returned — that sample could be from anywhere
in the tree built up to that point.

The difference between the sampled point and the points that
actually diverged is something to be wary of, but given how
concentrated the divergent samples are here I think it’s safe
to say that we’ve identified the pathological region.

> A side question for Michael: why are these called "divergent"?
> I believe it means we get non-finite values (NaN, +inf, -inf) on
> the unconstrained scale, which can mean +inf or -inf on the
> constrained scale, but not necessarily (e.g., positive constrained
> -inf values mean underflow to 0 on the constrained scale).

Technically a divergence is defined not as the state reaching infinity
or NaN but rather as the state diverging away from the true level set.
This is diagnosed as the change in Hamiltonian exceeding the
sampled slice, log_u + max_delta. The relevant code is

double h = this->hamiltonian_.H(this->z_);
if (boost::math::isnan(h))
  h = std::numeric_limits<double>::infinity();

util.criterion = util.log_u + (h - util.H0) < this->max_delta_;
if (!util.criterion) ++(this->n_divergent_);
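A stripped-down version of that check (Python, illustrative only; Stan's version compares against the sampled slice during tree building): run leapfrog on a standard normal target and flag a divergence once the Hamiltonian error blows past max_delta, treating NaN as infinite as in the C++ above.

```python
import numpy as np

def diverges(q0, p0, eps, steps, max_delta=1000.0):
    """Leapfrog on H = q^2/2 + p^2/2, flagging a divergence when the
    Hamiltonian error exceeds max_delta (NaN treated as infinity)."""
    def H(q, p):
        return 0.5 * q * q + 0.5 * p * p
    H0 = H(q0, p0)
    q, p = q0, p0
    for _ in range(steps):
        p -= 0.5 * eps * q    # half step on momentum (grad of q^2/2 is q)
        q += eps * p          # full step on position
        p -= 0.5 * eps * q    # half step on momentum
        h = H(q, p)
        if np.isnan(h):
            h = np.inf
        if h - H0 > max_delta:
            return True
    return False

print(diverges(1.0, 1.0, 0.1, 100))   # stable step size
print(diverges(1.0, 1.0, 2.5, 100))   # unstable step size: blows up fast
```

For a unit Gaussian, leapfrog is stable for step sizes below 2 and exponentially unstable above, so the second call diverges within a few steps.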


Andrew Gelman

Feb 25, 2015, 4:02:32 PM
to stan...@googlegroups.com
This is interesting, the interaction between modeling choices and computation.


Ben Goodrich

Feb 25, 2015, 4:20:57 PM
to stan...@googlegroups.com
On Wednesday, February 25, 2015 at 6:13:16 AM UTC-5, Bob Carpenter wrote:
> It seems that for small (but not too small) values of the denominator, there is a concentration of divergent transitions that coincides with above average values of lp__. I think this suggests that the curvature near the mode along this dimension is great while the curvature is minimal away from the mode, and thus NUTS cannot adapt well.

That seems like what it's suggesting.  Are the scales the same on
the vertical and horizontal?  It's hard to tell with R's default
labels.

The axis values are printed in the external margins. They seem to be the same in both triangles, although I don't recall forcing them to be the same.

Ben