posterior predictive check for hddmRegression model


Guido Biele

Jul 3, 2013, 5:31:31 PM
to hddm-...@googlegroups.com
Hi,

I tried to implement a simple posterior predictive check for a simple HDDM regression model (using patsy to implement contrast coding for a within-subjects 3x3 design).
However, the results don't look as expected, so I'd be happy to get feedback on whether my general approach is roughly on the right path.

Here are my steps:
(1) As contrast coding does not directly return DDM parameter distributions for the 9 conditions of my 3x3 design, I first generate these from the traces (which hold the parameter distributions for the contrasts), using the contrast matrix I initially used to set up the patsy model. I am confident this step works, so I now have, for each of the parameters a, v, and z, a parameter distribution for each of the 9 conditions (t was fixed across conditions). Note that the distribution for one condition is simply a new trace, which I obtained by multiplying the model traces with the relevant column of my contrast matrix.
(2) To generate predicted data, I
a) randomly sample from the just-generated DDM parameters*
b) use hddm.generate.gen_rand_data to generate the predicted data. I submit the parameters for all 9 conditions simultaneously, so that one data frame with data from all conditions is returned. (A rough sketch of both steps follows below.)
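To make the two steps concrete, this is roughly what I am doing. The node names, the contrast matrix, and the condition labels are placeholders for whatever patsy produced in the actual model; only v is shown, a and z are handled the same way:

import numpy as np
import hddm

# Step (1), as a sketch: recover condition-level parameter traces from the
# contrast-coded regression traces.
traces = m.get_traces()                      # DataFrame, one column per model node
v_coefs = ['v_Intercept', 'v_C(cond)[T.2]', 'v_C(cond)[T.3]']   # hypothetical names
X_v = np.array([[1, 0, 0],                   # condition 1
                [1, 1, 0],                   # condition 2
                [1, 0, 1]])                  # condition 3
# each row of X_v combined with the coefficient traces gives one condition's v-trace
v_cond = np.dot(traces[v_coefs].values, X_v.T)   # shape: (n_mcmc_samples, n_conditions)

# Step (2), as a sketch: draw one joint posterior sample and simulate data
# for all conditions at once with gen_rand_data.
i = np.random.randint(len(traces))           # same sample index for every parameter
params = {'cond%d' % (c + 1): {'v': v_cond[i, c],
                               'a': traces['a'].iloc[i],   # assuming a single 'a' node here
                               't': traces['t'].iloc[i]}
          for c in range(X_v.shape[0])}
sim_data, sim_params = hddm.generate.gen_rand_data(params, size=900)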

However, when I compare predicted and observed data, they don't look similar (the predicted RTs are much faster).
Before I check the code once more for bugs, it would be nice to know whether you agree or disagree with the general approach.


cheers-guido

* such that the parameters for all conditions come from the same sample of the original traces. This would be important if the parameters were correlated (which they are not...).

Thomas Wiecki

Jul 3, 2013, 5:35:57 PM
to hddm-...@googlegroups.com
Hi Guido,

I just added support for regression models to post_pred_check.
To generate data, try:
hddm.utils.post_pred_check(m, compute_stats=False, append_data=True)

Make sure to sample the regression model using keep_regressor_trace=True (this will use more memory but is necessary for PPC).
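Roughly, the whole workflow would look like this (the formula and sample counts are just placeholders; I am passing keep_regressor_trace when the model is built):

import hddm

m = hddm.HDDMRegressor(data, 'v ~ C(cond1) * C(cond2)',
                       keep_regressor_trace=True)
m.sample(5000, burn=1000)

# generate posterior predictive data; append_data=True attaches the observed
# data to each simulated sample so the two can be compared directly
ppc_data = hddm.utils.post_pred_check(m, samples=50,
                                      compute_stats=False, append_data=True)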

Let me know if it produces the same results and I'll take a closer look at your protocol.

Thomas



Guido Biele

Jul 4, 2013, 9:12:11 AM
to hddm-...@googlegroups.com
Hi Thomas,

Thanks for including the PPC for the regression model!
I could start the PPC with post_pred_check(m, compute_stats=False, append_data=True), but it is taking a while, so I have two questions in the meantime:
- Is there a way to estimate how long the sampling will take?
- Where will I find the generated data once the sampling has completed (I couldn't figure this out from reading the code)?

cheers - guido

Guido Biele

Jul 4, 2013, 10:45:58 AM
to hddm-...@googlegroups.com
Hi Thomas,
My last questions are answered now.

ppc_data = post_pred_check(m, compute_stats=False, append_data=True)

returns a pandas DataFrame containing both the observed data (rt) and the predicted data (rt_sampled).
Running the PPC with 50 samples and 10 bins took approximately an hour (for an experiment with ca. 25 participants and 900 trials per participant, on a 3.5 GHz processor).
So sampling definitely is time-consuming.
I'll run it with more samples now. For the quick run I just did, sampled RTs are nearly twice as slow as observed RTs, and accuracy is higher (85% compared to 80%). I hope this is only an effect of my coarse sampling, as I know from running the same model with depends_on that observed and predicted data are very similar.
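For reference, this is roughly how I computed the comparison. It is only a sketch and assumes errors are coded as negative RTs in both the observed and the simulated column; adjust the accuracy lines if the response coding differs:

# compare observed and simulated data in ppc_data
obs_rt   = ppc_data['rt'].abs().mean()
pred_rt  = ppc_data['rt_sampled'].abs().mean()
obs_acc  = (ppc_data['rt'] > 0).mean()
pred_acc = (ppc_data['rt_sampled'] > 0).mean()
print('mean RT  observed: %.3f  predicted: %.3f' % (obs_rt, pred_rt))
print('accuracy observed: %.3f  predicted: %.3f' % (obs_acc, pred_acc))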


Some clarification questions:
- When I submit the flag samples=50, that means 50 datasets will be generated, each with a different theta drawn from the posterior, correct?
- In your code for the posterior predictive plots, do you construct the predicted RT distribution (including "confidence intervals") as the mean of 50 histograms (and their 2.5 and 97.5 percentiles), or do you use another approach?
- Relatedly, when you calculate the percentiles of mean RTs and other statistics, do you calculate the statistic for each sample and then take the percentiles over the sample statistics?

cheers - guido

PS: The pandas frame has a name only for the first of its three index variables. I would have understood the DataFrame faster if levels 2 and 3 were labeled "sample" and "trial" or similar.

Thomas Wiecki

Jul 6, 2013, 6:54:09 PM
to hddm-...@googlegroups.com
On Thu, Jul 4, 2013 at 10:45 AM, Guido Biele <g.p....@psykologi.uio.no> wrote:
> Hi Thomas,
> My last questions are answered now.
>
> ppc_data = post_pred_check(m, compute_stats=False, append_data=True)
>
> returns a pandas DataFrame containing both the observed data (rt) and the predicted data (rt_sampled).
> Running the PPC with 50 samples and 10 bins took approximately an hour (for an experiment with ca. 25 participants and 900 trials per participant, on a 3.5 GHz processor).
> So sampling definitely is time-consuming.
> I'll run it with more samples now. For the quick run I just did, sampled RTs are nearly twice as slow as observed RTs, and accuracy is higher (85% compared to 80%). I hope this is only an effect of my coarse sampling, as I know from running the same model with depends_on that observed and predicted data are very similar.

Hm, or it might imply a problem with the model. Is this for the CPT? 


> Some clarification questions:
> - When I submit the flag samples=50, that means 50 datasets will be generated, each with a different theta drawn from the posterior, correct?

Correct.
 
> - In your code for the posterior predictive plots, do you construct the predicted RT distribution (including "confidence intervals") as the mean of 50 histograms (and their 2.5 and 97.5 percentiles), or do you use another approach?

Actually, those use the likelihood function instead of sampling, which might not be the best approach. I think sampling, as you describe, is probably the best way.

> - Relatedly, when you calculate the percentiles of mean RTs and other statistics, do you calculate the statistic for each sample and then take the percentiles over the sample statistics?
 
Yes, exactly.
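I.e., something along these lines (just a sketch; it assumes the second, unnamed index level of ppc_data identifies the posterior sample):

import numpy as np

# one summary statistic per simulated dataset, then percentiles over those
mean_rt_per_sample = ppc_data['rt_sampled'].abs().groupby(level=1).mean()
lower, upper = np.percentile(mean_rt_per_sample, [2.5, 97.5])
print('95%% interval of the predicted mean RT: [%.3f, %.3f]' % (lower, upper))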

> PS: The pandas frame has a name only for the first of its three index variables. I would have understood the DataFrame faster if levels 2 and 3 were labeled "sample" and "trial" or similar.

Good point, totally agree. 

Thomas

Guido Biele

Jul 18, 2013, 5:32:46 PM
to hddm-...@googlegroups.com
Hi Thomas,

My initial problems stemmed from the fact that I had set most parameters to be estimated at the group level only, so it's no wonder the posterior predictive plots didn't look great for individuals.
I've now re-run it with individual-level parameter estimates, and it looks much better (not perfect yet, because I didn't run the chain very long).
In case you are interested, here are the plots: https://www.dropbox.com/s/3fk2ry6pty5zd7w/PPP.png
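For what it's worth, the difference between the two fits is roughly the following; the formula and the list of group-only parameters are only illustrative:

import hddm

# earlier fit (sketch): several parameters restricted to the group level
m_group = hddm.HDDMRegressor(data, 'v ~ C(cond1) * C(cond2)',
                             keep_regressor_trace=True,
                             group_only_nodes=['a', 't'])

# new fit (sketch): the same parameters also get subject-level estimates,
# which is what improved the posterior predictive plots
m_subj = hddm.HDDMRegressor(data, 'v ~ C(cond1) * C(cond2)',
                            keep_regressor_trace=True)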

cheers-guido

Thomas Wiecki

Jul 19, 2013, 1:25:00 PM
to hddm-...@googlegroups.com
Hi Guido,

Some of those do look quite poor. Do you run with p_outlier? I suppose that's not supported for the regression model... but you could run a regular model with p_outlier and remove the trials it identifies as outliers.

It's kinda hard to see so I'm not sure if it's really an outlier problem.
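Something like this for the regular model, for instance (the condition columns and the numbers are placeholders):

import hddm

# a regular (non-regression) model with a fixed outlier probability,
# here assuming 5% of trials come from a uniform outlier distribution
m_out = hddm.HDDM(data, depends_on={'v': ['cond1', 'cond2']}, p_outlier=0.05)
m_out.sample(2000, burn=200)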

Thomas

guido biele

Jul 19, 2013, 1:51:27 PM
to hddm-...@googlegroups.com
Hi Thomas,
I think the problem really is more about convergence.
I previously estimated the model with depends_on, and the posterior predictive plots look much better there (but I ultimately have to use the regression model to accurately estimate the variance of the contrasts in our within-subjects design).
It seems that I need 5000 or more burn-in samples with the regression model, which takes ca. 2 days. I haven't done this yet with a model that saves the information I need for the posterior predictive check.
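Concretely, the run I still need to do would look roughly like this (formula and sample counts are placeholders):

import hddm

# a long burn-in, with the traces saved to a database on disk
m = hddm.HDDMRegressor(data, 'v ~ C(cond1) * C(cond2)',
                       keep_regressor_trace=True)
m.sample(15000, burn=5000, dbname='regression_traces.db', db='pickle')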
More generally, I am having difficulties fitting my complex model.
I'll start a separate thread for that when I have exhausted the possibilities I know of.
cheers-guido
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.