Chain length, burn-in, and thinning


Philippe Domenech

Nov 23, 2012, 6:16:17 AM
to hddm-...@googlegroups.com
Hi,

I was wondering: what sampling parameters do you use, relative to the complexity of the model you estimate, to achieve convergence when sampling from the parameter posteriors?
I think it would be interesting to pool user experiences here in a single thread.

Best,

Philippe

Thomas Wiecki

Nov 23, 2012, 9:28:27 AM
to hddm-...@googlegroups.com
Good idea!

As a first try I sample 20,000 with 15,000 burn-in and no thinning (autocorrelation is often not that big a problem, and with Gibbs sampling it is much better too). This often leads to satisfying results.
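
(Just to illustrate the autocorrelation point: a minimal, HDDM-independent sketch using a synthetic AR(1) chain as a stand-in for an MCMC trace; the burn-in and thinning numbers mirror the ones above.)

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D chain."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

rng = np.random.default_rng(0)

# Synthetic trace: AR(1) with coefficient 0.9, mimicking a sticky chain.
n, rho = 20000, 0.9
chain = np.empty(n)
chain[0] = 0.0
noise = rng.standard_normal(n)
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + noise[t]

burned = chain[15000:]   # discard 15,000 samples as burn-in
thinned = burned[::2]    # thin=2 keeps every second sample

# Thinning lowers the autocorrelation (roughly rho -> rho**2 here),
# at the cost of keeping only half the samples.
print(lag1_autocorr(burned))
print(lag1_autocorr(thinned))
```

Whether that trade-off is worth it mostly depends on storage; for the posterior summaries it is the effective number of independent samples that matters.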

What I found critical for complex models, like regressions or inter-trial variabilities, is to put those parameters into group_only_nodes (see the howto), which helps a lot with convergence.

Guido Biele

Nov 23, 2012, 9:42:48 AM
to hddm-...@googlegroups.com
hi,

I work with large data sets (30 participants) and models with a 3-by-3 design for v, a, z, and I also include sv, st, sz (which in my experience are useful).

Using fewer than 30,000 or 40,000 samples for burn-in doesn't seem to be enough for such models, so I use 70,000 for burn-in and another 15,000 after that with a thinning of 2.

These models take a while (4+ days), but at the moment that's OK for me...

cheers - guido
--
------------------------------------------------------------------------

Guido Biele
Email: g.p....@psykologi.uio.no
Phone: +47 228 45172
Website <https://sites.google.com/a/neuro-cognition.org/guido/home>

Visiting Address
Psykologisk Institutt
Forskningsveien 3 A
0373 OSLO

Mailing Address
Psykologisk Institutt
Postboks 1094
Blindern 0317 OSLO

Thomas Wiecki

Nov 23, 2012, 9:53:26 AM
to hddm-...@googlegroups.com
Btw., has anyone tried the new outlier model (setting p_outlier=.05 or including it; see the howto), and does anyone have success (or failure) stories to share?

DunovanK

Mar 18, 2013, 12:04:52 PM
to hddm-...@googlegroups.com
Hi Guido,

I just came across this post and was wondering what your typical DIC info looks like. I run similarly complex models on sample sizes similar to yours, and was curious what your DIC range looks like from model to model.

Best,

Kyle

Guido Biele

Mar 20, 2013, 5:24:52 AM
to hddm-...@googlegroups.com
Hi Dunovan,

Here are my DIC values (for hierarchical models fit to data from 30 participants, each with 900 trials):

DIC: -4390.248172
DIC: -4390.655998
DIC: -4385.025212
DIC: -4395.612273
DIC: -4384.519836
DIC: -4401.738157
DIC: -4395.389531
DIC: -4395.995150
DIC: -1773.622434
DIC: -7650.505652
DIC: -4402.018514
DIC: -1698.706209
DIC: -445.140448
DIC: -1205.503123
DIC: -5645.927820
DIC: -3842.773749
DIC: -4461.878244
DIC: -2692.211549
DIC: 6153.742797
DIC: 8585.168133
DIC: 9609.054522
DIC: 8837.099278
DIC: -2459.955351
DIC: 9327.937881

I am surprised by how much they differ (I haven't experienced this with AIC and BIC) and will look into that later.

cheers - Guido

DunovanK

Mar 20, 2013, 7:41:27 PM
to hddm-...@googlegroups.com
Hi Guido,

Yes, that is interesting. Are all of those DIC values from the same model configuration, or do they have different design matrices (depends_on)? I'm fitting hierarchical models to 22 subjects, each with 500 trials, and am seeing a very different range of DIC values, usually between 32,000 and 35,000. But I've also noticed a great deal of variation in DIC estimates across models with different sampling parameters (same design matrix). Unfortunately, fitting complex models to this many subjects takes quite a long time, so I don't really have the liberty of experimenting with too many different sampling parameters.

A related question: is it possible to OVERsample? Say 300k samples, 250k burn-in, thin=4? I noticed in a previous post you mentioned using around 95,000 with 15k burn-in. Any particular reason you settled on those values?

Anyway, I do appreciate the feedback.  And I'd be very curious to hear about anything interesting you find related to the DIC variation.

Very best,

Kyle

Michael J Frank

Mar 20, 2013, 9:10:57 PM
to hddm-...@googlegroups.com
Hi Kyle, Guido - 

Absolute values of DIC are not all that informative: they are highly dependent on the number of subjects, trials, and the task (conditions, difficulty, number of parameters, etc.), so it is difficult to compare values between experiments. Also, while relative DIC values (for the same dataset) are meant to be used for model comparison, it is known that the standard form of DIC (reported by PyMC) is a little biased in favor of more complex models, though there are alternative forms that are less biased; see, e.g., Plummer (2008), where he uses "popt" to more appropriately penalize complexity.

DIC values between alternative models can vary a lot more than e.g. AIC or BIC, because the DIC is computed over the entire posterior distribution rather than just at the ML value. The deviance is a function of probability density, so if the posterior is very peaked, the density can be > 1 and hence the DIC is negative (as in many of Guido's cases). It may be surprising that some of the values for alternative models of the same dataset are massively different, but this can occur if the posterior is much more spread out across a range of parameter values. 
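
(To make the negative-DIC point concrete: deviance is -2 times the log of a density, and a density can exceed 1. A tiny plain-Python illustration, not tied to HDDM:)

```python
import math

def deviance(density):
    # Deviance is -2 * log(likelihood); with continuous data the
    # likelihood is a density and can exceed 1.
    return -2.0 * math.log(density)

def normal_peak_density(sigma):
    # Density of a normal pdf at its mode: 1 / (sigma * sqrt(2*pi)).
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

print(deviance(normal_peak_density(1.0)))   # positive: peak density < 1
print(deviance(normal_peak_density(0.05)))  # negative: peak density ~ 8
```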

On the other hand, pD itself (the term in DIC used to penalize complexity) can also be negative. This is because standard DIC assumes that the mean of the posterior is a good estimator, i.e., an optimistic measure of the degree to which the model fits the data, where the difference between the mean deviance across the posterior and the deviance at the mean value is meant to indicate the degree to which we may have overfit the data by using the mean as the estimator. But the mean is not always a good estimator. (The easiest way to see this: if the posterior is bimodal, the mean between the two modes may have near-zero likelihood, and hence the deviance at the mean value would be very large, leading to negative pD and potentially negative DIC.) Bimodality is not the only case in which this can occur, though.
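
(A toy numeric illustration of negative pD; the y ~ Normal(theta^2, sigma) likelihood below is invented purely to produce a symmetric bimodal posterior whose mean fits badly:)

```python
import math

SIGMA = 0.2
Y = 1.0  # single observation

def deviance(theta):
    # Likelihood: y ~ Normal(theta**2, SIGMA); deviance = -2 log L.
    mu = theta ** 2
    logpdf = (-0.5 * math.log(2 * math.pi * SIGMA ** 2)
              - (Y - mu) ** 2 / (2 * SIGMA ** 2))
    return -2.0 * logpdf

# A bimodal posterior: mass split evenly between theta = -1 and theta = +1.
samples = [-1.0] * 500 + [1.0] * 500

mean_dev = sum(deviance(t) for t in samples) / len(samples)
dev_at_mean = deviance(sum(samples) / len(samples))  # posterior mean = 0

pD = mean_dev - dev_at_mean
print(pD)  # negative: the posterior mean fits far worse than either mode
```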

So in general, for models more complex than simple linear models, the DIC should be used with caution. One can still use it after checking some of the above issues, but I think it should be supplemented with posterior predictive checks, or with the Bayes factor. (There are a few tricks to get the Bayes factor from MCMC, though this is not currently supported directly in HDDM.) The BF is not immune to problems either, as it can be sensitive to the priors, but informative priors can help a lot (e.g., from other datasets or from a training sample within the current dataset).

Michael

guido biele

Mar 20, 2013, 10:26:45 PM
to hddm-...@googlegroups.com, DunovanK
hi Dunovan,
I think I misunderstood you. The DICs were from models with different designs (I used depends_on for z and/or v, and also varied inclusion of sv).
guido
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

DunovanK

Mar 22, 2013, 3:05:26 PM
to hddm-...@googlegroups.com, michae...@brown.edu
Hi Michael, 

Thank you for the explanation. I found it extremely helpful, and it made the concepts of pD and deviance much more intuitive for me.

One issue that remains a bit puzzling is that, for a single model, I get one of two outcomes depending on how many MCMC samples I draw from the posterior. If I sample around 100K (burn 70K-75K, thin=2) multiple times, I get consistent parameter means and what appear to be well-converged posteriors. However, if I increase the number of samples to 250K (burn 200K-225K, tried with thin=2 and thin=4) and run the same model, I get a set of parameter means that is quite different from those estimated with 100K samples. Again, if I run the model with these sampling parameters (250K, etc.) multiple times, the estimated parameter means are consistent with all the other 250K runs, but different from the estimates from the 100K runs. The MCMC chains also show higher levels of autocorrelation when sampled with 250K, and generally have higher variance.

It occurred to me that maybe I'm oversampling with 250K samples, and the MCMC chain is getting bumped out of the initial stationary distribution and into some other state (i.e., like getting stuck in a local minimum). I don't have the MCMC background to know whether that's plausible; it's just the only explanation I could come up with. I should note that the posterior predictive analyses seem to support the parameter values estimated from the 100K models, despite a lower DIC for the 250K models (usually 6000-9000 points lower). Do you have any thoughts about why this might be occurring? The reason I ran the models with 250K in the first place was just to make sure I was getting parameter means similar to those estimated with my usual sampling parameters (100K, burn=75K, thin=2). Now that I see this is not the case, it's got me a bit worried.

I apologize for the length, please only respond if you have the time.

Very best,

Kyle

Michael J Frank

Mar 23, 2013, 10:28:42 PM
to hddm-...@googlegroups.com
Hi Kyle, that does sound like an odd state of affairs: the means are similar across runs with 100K samples, and similar to each other across runs with 250K samples, yet the two n_samples settings lead to divergent results. Did you check the Gelman-Rubin R statistic? 250K samples is really a lot, but if the DIC is substantially better for the very same model using more samples, that would suggest you just weren't sampling from the posterior with fewer samples, even though 100K is also a lot. It is technically possible for the posterior predictive to look better for a model that doesn't fit as well according to DIC or BF, because the posterior predictive tends to focus on a certain feature of the data (perhaps one that you care more about), whereas the other measures reflect the overall fit of the model to all of the data, and certain parameter values that work well for the PPC could produce a poor fit to other aspects of the data (or to outliers - did you use the outlier model?).
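
(For anyone wanting to check it by hand, here is a minimal numpy sketch of the classic Gelman-Rubin statistic on synthetic chains; for real fits you would run it on multiple chains of the same model:)

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for a list of
    equal-length 1-D chains (classic Gelman & Rubin form)."""
    chains = np.asarray(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(1)
# Converged: all chains sample the same distribution -> R-hat near 1.
good = [rng.normal(0, 1, 2000) for _ in range(4)]
# Not converged: chains stuck in different regions -> R-hat well above 1.
bad = [rng.normal(mu, 1, 2000) for mu in (0, 0, 3, 3)]

print(gelman_rubin(good))
print(gelman_rubin(bad))
```

A common rule of thumb is to treat R-hat below about 1.1 as acceptable.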

Another possibility is that your model parameters are collinear or difficult to identify given your experimental conditions. In my view, one of the most useful exercises in model fitting is to first generate data from your model with reasonable sets of parameters, where the data correspond to the same experimental conditions you use for fitting (number of conditions, trials, subjects, etc.). Then see how well you can recover those same parameters, given that you know the true parameters and the true model (since you generated the data). This can really help establish whether the model is identifiable given the task, and it would also allow you to test whether n_samples matters in that case as well. If this doesn't work well, then there is a problem, since this is the idealized case where you know the true model.
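
(A library-agnostic sketch of that generate-and-recover exercise, with a toy Gaussian model standing in for the DDM; with HDDM you would generate DDM data under the real design and refit the full model:)

```python
import numpy as np

rng = np.random.default_rng(2)

# 1. Pick "true" parameters and generate synthetic data that mirrors the
#    real design (here: 30 subjects x 900 trials of a toy Gaussian "RT"
#    model; the real exercise would use the DDM).
true_mu, true_sigma = 0.8, 0.3
data = rng.normal(true_mu, true_sigma, size=(30, 900))

# 2. Refit the model to the synthetic data (closed-form MLE here; with a
#    hierarchical DDM this would be a full MCMC fit).
fit_mu = data.mean()
fit_sigma = data.std()

# 3. Compare the recovered estimates to the known truth.
print(abs(fit_mu - true_mu))        # should be small
print(abs(fit_sigma - true_sigma))  # should be small
```

If recovery fails even in this idealized setting, no amount of extra sampling on the real data will rescue the inference.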

Michael

DunovanK

Mar 24, 2013, 12:45:25 AM
to hddm-...@googlegroups.com, michae...@brown.edu
Hi Michael,

Thank you for the insight. If I understand your last point correctly, I should fit a model to the data, simulate a replicate dataset using the estimated parameters, and then fit the model to the simulated data? I just want to make sure I'm following you correctly; if so, that will certainly be my next mode of attack. I haven't computed Gelman's R yet, but I always check that the Geweke statistic is < 2.

As for the differences between the 100K and 250K models, it occurred to me today that there was another difference between them: p_outlier was included in all of the 250K models and none of the 100K models. I was using include='all' in the 250K models and manually listing all the inter-trial variance parameters in the 100K models. After looking through the documentation today, I realized that p_outlier is now included along with all the other optional parameters when using that convenience argument.

To check whether p_outlier was the critical difference, I ran a model (one that I have run quite a lot and have a fairly lengthy record of DIC info for, using both 100K and 250K samples) sampling 100K (burn=75K, thin=2), and sure enough, the parameter estimates were in the ballpark of what I was seeing in the 250K models earlier. Also, there was only about a 50-100 point difference in DIC between the 100K and 250K models that both included p_outlier, whereas models not including p_outlier are all consistently about 6000-9000 points higher.

In the models including p_outlier, I'm seeing high levels of autocorrelation and skewed, occasionally even bimodal, posteriors, whereas models excluding p_outlier return nice normal posteriors and minimal autocorrelation. An alternative way I've been assessing model fit is to simulate a dataset for each subject using their individual parameter estimates (from a hierarchical model) and running the same standard behavioral analyses on those simulations as on my empirical data (i.e., comparing RT and accuracy means and SDs across subjects, running an ANOVA, etc.). In doing this, I see much more similarity between the simulated and empirical datasets when using parameters estimated in models excluding p_outlier. Does it make sense that using the outlier model would produce these differences?

Again, I'm sorry about the length. And I really do appreciate your feedback, it's been tremendously helpful.  

Very best,

Kyle

Thomas Wiecki

Mar 24, 2013, 10:45:48 AM
to hddm-...@googlegroups.com
Hi Kyle,

Your results with the p_outlier model are a little concerning; after that many samples the chains should converge just fine (unless the posterior really is bimodal...). 

What is ter estimated to be in both models?

Do the posterior predictive pdfs look better in the p_outlier model?

You can also set p_outlier=0.05 manually and see whether that leads to better results.
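
(For intuition, the outlier model amounts to a contaminant mixture likelihood, so a trial the DDM density calls nearly impossible gets its log-likelihood floored instead of dominating the fit. A toy sketch; the uniform outlier density value below is an assumption, not HDDM's exact implementation:)

```python
import math

def mixture_loglik(rt_density, p_outlier, outlier_density=0.25):
    # Contaminant mixture: with probability p_outlier the RT comes from
    # a broad uniform; otherwise from the model density.
    return math.log((1 - p_outlier) * rt_density + p_outlier * outlier_density)

# A fast-guess trial where the DDM density is essentially zero:
print(mixture_loglik(1e-12, 0.0))   # hugely negative -> dominates the fit
print(mixture_loglik(1e-12, 0.05))  # floored near log(0.05 * outlier_density)
```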

Thomas

DunovanK

Mar 24, 2013, 12:46:29 PM
to hddm-...@googlegroups.com
Hi Thomas,

The mean ter is typically lower for models including p_outlier (~.5-.69) compared to ~.8-.9 for models without it (mean RTs up to 3.0 s). Estimates of boundary separation show the largest difference between models (usually high 2s/low 3s when p_outlier is excluded, and high 4s/low 5s when included). This increase in 'a' for outlier models is what I assume leads them to overestimate the accuracy in each condition, although they do a decent job of recovering the empirical RT range. This is based on the subject-level qp-plots as well as comparing mean RT and accuracy (averaged over subjects) for simulated and empirical data. All in all, the posterior predictive quantiles look much better for the models not including p_outlier.

One reason I'm hesitant to include p_outlier is that I've already removed fast outliers based on DMAT's EWMA procedure; I believe it detects the RT at which accuracy hits chance level, and I removed trials below that criterion for each subject. Would you recommend using my original data, including the trials I removed, and instead using p_outlier to account for contaminants?

I've only seen the bimodal posteriors on one occasion. However, with the outlier models it is quite common to see severe autocorrelation and generally more irregular-looking posteriors, compared to the symmetrical, normal posteriors I'm seeing in the models excluding p_outlier.

Best,

Kyle