stopping rule for sampling


Luca Campanelli

Nov 2, 2016, 6:19:19 PM
to Stan users mailing list
Hello all,
I'm planning to preregister an experiment and to use the "sequential Bayes Factor" approach as the sampling plan. In a few words, for those who are not familiar with it: participants keep being recruited and tested until an a priori decided level of evidence is reached (e.g., a BF of 10 or 1/10).
I don't know if it matters, but the model that I will fit is a mixed-effects model with crossed random effects.

I'm aware of some of the limitations of BFs and that they're not easy to compute. But, as you know, I can easily get the WAIC index from Stan outputs.
My question is whether you are aware of any method based on the WAIC index (or some other method) that I could use to quantify the degree of evidence for one model over another. That degree of evidence would then be used as the stopping criterion.

Thank you for any suggestion you may have.
Luca

Aki Vehtari

Nov 3, 2016, 4:45:26 AM
to Stan users mailing list
You could examine the posterior distribution of the effect directly, without the need for model comparison. For example, you could compute the probability that you can infer the sign of the effect (keeping the type-S error small).
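To make this concrete, here is a minimal sketch of such a sign-probability stopping check (my own illustration, not from the thread; the function names and the 0.975 threshold are made up):

```python
# Sketch of a sign-probability stopping check, assuming `draws` holds
# posterior draws of the effect (e.g. extracted from a fitted Stan model).
import numpy as np

def sign_probability(draws):
    """Posterior probability that the effect is positive."""
    return float((np.asarray(draws) > 0).mean())

def keep_sampling(draws, threshold=0.975):
    """Recruit more participants until the sign is resolved either way."""
    p = sign_probability(draws)
    return max(p, 1 - p) < threshold

draws = [0.5, 0.2, -0.1, 0.4, 0.3, 0.1, -0.05, 0.25]
print(sign_probability(draws))  # → 0.75
print(keep_sampling(draws))     # → True (0.75 < 0.975, so keep going)
```

In a preregistered design the threshold would of course be fixed in advance, not tuned along the way.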

WAIC and LOO are asymptotically equal (if n -> infinity while p_eff stays finite). In the finite case they mostly differ in that the computational approximation PSIS-LOO is more reliable than WAIC (see Vehtari, Gelman and Gabry (2016). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, doi:10.1007/s11222-016-9696-4. Preprint: http://arxiv.org/abs/1507.04544).

For a small number of observations, LOO and WAIC have large variance (also shown in the above-mentioned paper; see also, e.g., Piironen and Vehtari (2016). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, doi:10.1007/s11222-016-9649-y. http://link.springer.com/article/10.1007/s11222-016-9649-y), and it is difficult to estimate this variance reliably for n << 100, which makes LOO and WAIC model comparison more difficult for n << 100. Currently I can't recommend using LOO or WAIC for your sequential decision making unless n > 100. We are looking into the n < 100 case, too.

Aki

Avraham Adler

Nov 3, 2016, 10:22:39 AM
to Stan users mailing list
On Wednesday, November 2, 2016 at 6:19:19 PM UTC-4, Luca Campanelli wrote:
> My question is if you are aware of any method based on the WAIC index (or some other method) that I can use to quantify the degree of evidence of one model over another. That degree of evidence would then be used as stopping criterion.

As WAIC/LOO are estimates on the same scale as AIC/DIC and are estimators of the same K-L distance, to the best of my understanding, perhaps you could use what Burnham and Anderson call 'Akaike weights'. Let d_i be the difference between the XAIC measure of model i and the minimal one (so d_min = 0); then w_i = exp(-d_i / 2) / SUM_{over models} exp(-d_i / 2). For example, if the WAICs (on the deviance scale) for two models are 250 and 252, then the weight for model 1 is ~73.1% (exp(0) / [exp(0) + exp(-1)]). Their specific language (2002, p. 75) is that a "given w_i is considered as the weight of evidence in favor of model i being the actual K-L best model for the situation at hand _given_ that one of the R models must be the K-L best model of that set of R models." Perhaps that will help you.
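A quick sketch of that computation (my own transcription of the formula above, not Burnham & Anderson's code):

```python
# Akaike-style weights from a list of information-criterion values
# (AIC, DIC, or WAIC on the deviance scale).
import math

def akaike_weights(xaic):
    d = [x - min(xaic) for x in xaic]        # differences d_i from the best model
    terms = [math.exp(-di / 2) for di in d]  # relative likelihoods exp(-d_i / 2)
    total = sum(terms)
    return [t / total for t in terms]

print(akaike_weights([250.0, 252.0]))  # ≈ [0.731, 0.269]
```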

Thanks,

Avi

Michael Betancourt

Nov 3, 2016, 1:08:40 PM
to stan-...@googlegroups.com
I agree with Aki — what you really want to do for any kind of adaptive
trial is stop based on some posterior expectation, not a posterior
predictive expectation.  

In addition to the variability issues with WAIC and LOO that Aki notes,
there’s a deeper problem.  These measures do not have a natural
calibration, so it’s hard to assign any meaning to the explicit values
outside of their ordering.  In particular, while a larger difference between
the WAIC or LOO of two models means that one is doing better than
the other, it doesn't tell you how much better.

Bayes factors are actually similar in that any threshold decision is
arbitrary.  The advantage of Bayes factors is that they are probabilities 
(or at least proportional to probabilities) and that helps in informing that
decision.  


Stephen Martin

Nov 4, 2016, 12:19:44 AM
to Stan users mailing list
I felt compelled to reply to this with regard to sequential BFs. In the psych methods (or psychMAP?) Facebook group, someone explored whether terminating collection upon reaching a BF threshold would bias results just as it would using an NHST criterion, and it does indeed bias results. This is *especially* true with a threshold of 3, but it still occurs with BF thresholds of 10.
All that to say: using these thresholds as stopping rules can, in the long run, bias your inference. I would instead recommend using a stopping rule based on the posterior credible interval width (analogous to using SE size as an NHST stopping criterion); it will still bias effect size estimates a bit, but not as much.
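As a concrete (hypothetical) version of that rule, one could preregister a maximum 95% interval width and stop once the posterior interval is narrower than it; the names and the 0.2 width below are made up:

```python
# Sketch of a credible-interval-width stopping rule; the 0.2 maximum
# width is an arbitrary, problem-specific choice.
import numpy as np

def ci_width(draws, prob=0.95):
    """Width of the central `prob` posterior credible interval."""
    lo, hi = np.quantile(draws, [(1 - prob) / 2, 1 - (1 - prob) / 2])
    return float(hi - lo)

def stop_collecting(draws, max_width=0.2):
    return ci_width(draws) < max_width
```

What counts as "narrow enough" depends entirely on the scale of the outcome, which is exactly the problem raised later in the thread.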

Luca Campanelli

Nov 4, 2016, 2:30:23 PM
to Stan users mailing list
Thank you all for your very helpful answers.

The fact that WAIC and LOO are not recommended for n < 100 is not ideal as we may have samples smaller than that, in particular when working on clinical populations.

Michael, the "Akaike weights" mentioned by Avi have some meaningful interpretation; in my understanding they range from 0 to 1 and can be interpreted as the probability that a model is the best model. Does this address part of your concern about the meaning of WAIC values?

Aki and Michael, thank you for your suggestion to examine the posterior distribution directly, instead of a posterior predictive expectation. That would be easy to compute and interpret. Are you aware of any article that used that approach as a stopping rule in sequential testing?

Stephen, I looked at that Facebook post, thank you. I like the idea of using the width of the posterior credible interval as a stopping rule, but I'm not sure how I would quantify it; that is, what would a "small enough CI" be? Is there any publication you can suggest on this?

Thank you
Luca

Stephen Martin

Nov 4, 2016, 5:03:27 PM
to Stan users mailing list
I don't know of any publication on it; I'm fairly certain it would be problem-specific. How much error are you, the decision maker, willing to accept before making a decision? That's really the question. It's scale-specific, so given your scale, how wide do you want the 95% credible interval to be?

Michael Betancourt

Nov 4, 2016, 7:35:59 PM
to stan-...@googlegroups.com
>
> Michael, the "Akaike weights", mentioned by Avi, have some meaningful interpretation; in my understanding they range from 0 to 1 and can be interpreted as the probability that a model is the best model. Does this answer part of your question about the meaning of WAIC values?

No, they absolutely cannot be interpreted that way. It’s easy to see why —
w_{n} \propto exp(-LOO_{n}) is a valid weight, but so is w_{n} \propto exp(-alpha LOO_{n}).
Every choice of alpha yields a valid set of weights, and each gives the models
different influence. This is just a manifestation of the scaling ambiguity I mentioned
before.
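To put numbers on the ambiguity (my own toy example, not from the thread): rescaling the exponent changes the weights, even though the model ordering stays the same.

```python
# w_n ∝ exp(-alpha * LOO_n) is a valid set of weights for any alpha > 0,
# but the resulting "evidence" changes with the arbitrary choice of alpha.
import math

def loo_weights(loo, alpha):
    terms = [math.exp(-alpha * x) for x in loo]
    total = sum(terms)
    return [t / total for t in terms]

loo = [250.0, 252.0]  # two models' LOO values on the deviance scale
for alpha in (0.25, 0.5, 1.0):
    print(alpha, loo_weights(loo, alpha))
# alpha = 0.5 reproduces the Akaike-weight convention exp(-d/2);
# the first model's weight moves from ~0.62 to ~0.88 as alpha grows.
```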

> Aki and Michael, thank you for your suggestion to examine directly the posterior distribution, instead of posterior predictive expectation. That would be easy to compute and interpret. Are you aware of any article that used that approach as stopping rule in sequential testing?

There is a huge literature on Bayesian adaptive clinical trials. Just google
“adaptive clinical trial”. The stuff from M. D. Anderson was particularly
influential.

Avraham Adler

Nov 6, 2016, 12:38:55 AM
to Stan users mailing list


On Friday, November 4, 2016 at 7:35:59 PM UTC-4, Michael Betancourt wrote:
> No, they absolutely cannot be interpreted that way. It’s easy to see why —
> w_{n} \propto exp(-LOO_{n}) is a valid weight, but so is w_{n} \propto exp(-alpha LOO_{n}).

I cannot hold a candle to your understanding, Michael, but are you saying Burnham & Anderson are wrong? That's a rather strong claim to make :)

Avi

Bob Carpenter

Nov 6, 2016, 11:42:29 AM
to stan-...@googlegroups.com
You might want to check out the chapter on model comparison
in Gelman et al.'s Bayesian Data Analysis, 3rd edition. We
really need to get working on an open-access version of this
material in the Stan book. We're getting serious about getting
down to writing it.

One basic problem is that you only wind up comparing a subset
of models---if you add more models, those probabilities change.

- Bob

Aki Vehtari

Nov 6, 2016, 2:38:55 PM
to Stan users mailing list
Burnham and Anderson don't give a theoretical justification for Akaike weights (and thus never claimed they were right in the first place).

Aki

Michael Betancourt

Nov 6, 2016, 4:48:08 PM
to stan-...@googlegroups.com
+1

Avraham Adler

Nov 7, 2016, 11:13:27 AM
to stan-...@googlegroups.com
On Sunday, November 6, 2016 at 2:38:55 PM UTC-5, Aki Vehtari wrote:
> Burnham and Anderson don't give a theoretical justification for Akaike weights (and thus never claimed they were right in the first place).

In their defense (not that I could really offer one, nor that they need it), they (at least Burnham) provide some justification for these weights in Buckland et al. (1997) (linked below), but your point is clear. Thanks for the explanation!

Avi

Reference:

Luca Campanelli

Nov 9, 2016, 8:35:24 PM
to Stan users mailing list
Thank you all for your comments.
The literature on "adaptive clinical trials" looks relevant; I wasn't aware of it.

Luca

Aki Vehtari

Nov 10, 2016, 3:23:10 PM
to Stan users mailing list
Thanks for the reference; I just checked it. The paper has a similar justification to other AIC-weight references, i.e.:
1) the Bayes factor is a ratio of marginal likelihoods;
2) the marginal likelihood can be approximated with BIC, which has the form -2 log(L) + q (with q = p log(n));
3) AIC has the form -2 log(L) + q (with q = 2p), and thus it is plausible to use AIC weights.

I don't think this analogy between BIC and AIC is a sufficient theoretical justification.
The same problem holds for LOO weights. I haven't seen a theoretical justification for replacing BMA with LOO weighting
(replacing Bayes factors with pseudo Bayes factors). LOO weighting may work, but it hasn't been studied sufficiently (even empirically).
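To see how the differing penalties play out, here is a toy numeric example (made-up log-likelihoods and parameter counts, not from any of the cited papers):

```python
# Both criteria have the form -2*log(L) + q, but AIC uses q = 2p while
# BIC uses q = p*log(n), so the resulting weights can disagree.
import math

def ic(log_lik, q):
    return -2.0 * log_lik + q

def ic_weights(ics):
    d = [x - min(ics) for x in ics]
    terms = [math.exp(-di / 2) for di in d]
    total = sum(terms)
    return [t / total for t in terms]

n = 50                               # number of observations
models = [(-120.0, 3), (-117.0, 5)]  # (log-likelihood, #parameters)
aic = [ic(ll, 2 * p) for ll, p in models]
bic = [ic(ll, p * math.log(n)) for ll, p in models]
print(ic_weights(aic))  # the larger model is favoured (weight ≈ 0.73)
print(ic_weights(bic))  # the smaller model is favoured (weight ≈ 0.71)
```

With these numbers the two penalties point in opposite directions, which is one way to see why treating the analogy as a justification is shaky.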

Aki