Using HDDMRegressor to fit between-person effects


Nate Hall

Jan 11, 2017, 1:52:05 PM
to hddm-users
Hi Thomas et al.,

I'm a new user of your package and would like to use the HDDMRegressor function to fit a between-person effect. Essentially, my advisor and I would like to examine whether DDM parameters vary as a function of a within-subject stimulus condition (congruent vs. incongruent on standard flanker and go/no-go tasks) and a continuous self-report measure of personality (which is between-subject). We found some discussion of related questions in the group archives, but are uncertain whether we've identified the correct solution. A number of the HDDMRegressor examples use a continuous measure of interest that is within-person (e.g., from your group, frontal theta on a trial-by-trial basis). In our case, however, it makes sense to treat our continuous variable of interest (e.g., level of disinhibition) at the between-person level. To implement this in your package, we set up the following syntax:

model_reg = hddm.HDDMRegressor(data, "a ~ DISINH", depends_on={'a': 'stim'}, group_only_nodes=['a_DISINH'])

Does this seem correct to you if we want to see how disinhibition (self-report) influences the threshold parameter? 
After sampling, a plot of the DISINH posterior indicates a negative parametric relationship between disinhibition and threshold, which is what we had predicted (lower threshold in more impulsive individuals). Are we on the right track here or does our syntax need to be reworked? Related: is the intercept of the regression implicitly a within-person effect (akin to a random intercept per subject in standard multilevel models)?
 
Thanks, 
Nate

 
Attachment: Screen Shot 2017-01-11 at 1.47.48 PM.png

Sam Mathias

Jan 11, 2017, 2:42:04 PM
to hddm-users
Hi Nate,

I'm puzzled how you generated that posterior plot using your code, which should fail because `depends_on` cannot be used within HDDMRegressor.

--
Samuel R. Mathias, Ph.D.
Associate Research Scientist (ARS)
Neurocognition, Neurocomputation and Neurogenetics (n3) Division
Yale University School of Medicine
40 Temple Street, Room 694
New Haven CT 06511
http://www.srmathias.com


 


Nate Hall

Jan 11, 2017, 6:20:41 PM
to hddm-users
Hi Sam,
Thanks for your comment. Do you mean that `group_only_nodes` can't be used in HDDMRegressor? The `depends_on` call seems to work fine in the syntax from the 2013 HDDM paper and the online demo materials. Maybe I'm missing something, but I thought I'd be safe using depends_on; I'm more hesitant about whether we are successfully treating DISINH as a between-subjects effect by using group_only_nodes within HDDMRegressor (which ran without an error). The subsequent syntax that yielded the figure I shared is:

disinh = model_reg.nodes_db.node[['a_DISINH']]
hddm.analyze.plot_posterior_nodes(disinh)

This also seemed to run without a problem. I'm not sure, however, if I'm off base with this.

Thanks,
Nate

Michael J Frank

Jan 11, 2017, 6:48:37 PM
to hddm-...@googlegroups.com
Hi, "depends_on" is technically ok within HDDMRegressor but is less ideal because it doesn't capitalize on the power you get for a within-subject design. i.e., with depends_on,  a separate posterior will be estimated for each stim condition, with each of these drawn from the group distribution, but overall between subject variability would then make it harder to detect reliable within-subject differences. see http://ski.clps.brown.edu/hddm_docs/tutorial_python.html#within-subject-effects


Michael

Michael J Frank, PhD, Professor
Laboratory for Neural Computation and Cognition
Brown University
http://ski.clps.brown.edu
(401)-863-6872


Nate Hall

Jan 11, 2017, 10:12:11 PM
to hddm-users, Michae...@brown.edu
Hi Michael, 

Thanks very much for clarifying the difference between depends_on (separate posteriors per condition) and a dummy-coding scheme that estimates a single intercept per person (with greater accuracy) plus offsets for the conditions. In terms of including a between-subjects regressor, would the best way to combine these into a single specification be:

model_reg = hddm.HDDMRegressor(data, "a ~ C(stim, Treatment('Congruent')) + DISINH", group_only_nodes=['a_DISINH'])

And would this specification correctly model that DISINH only influences threshold at the between-person level?

Thanks,
Nate

Michael J Frank

Jan 14, 2017, 8:42:26 AM
to hddm-...@googlegroups.com
Hi Nate, group_only_nodes indicates that the model estimates a single posterior for an effect across the whole group, but usually this is used for a factor that still varies within subject (e.g., EEG activity that varies trial to trial; group_only_nodes then means that the regression coefficient for that factor is estimated as a fixed effect). The problem is that the model would still estimate a per-subject intercept in the regression, and that would prevent you from exploring the effect of the between-subjects factor (DISINH).

It is certainly desirable to have a between-subjects factor that could be estimated to parametrically affect the group distribution. You can try 

a ~ 0 + C(stim, Treatment('Congruent')) + DISINH

The 0 forces the intercept to be 0, so the only factors driving changes in threshold would be the stimulus and the between-subjects factor, and you would then set (probably both of) these to be group_only_nodes.

My suggestion is to first generate data from the DDM with a between-subjects factor that linearly alters the parameter (i.e., generate data from 100 simulated subjects where you sample the thresholds from a Gaussian whose mean varies with another factor). Then fit the generated data with the same regression code and check whether it recovers the between-subjects factor properly. I would start with the simple case of just a between-subjects factor and only a single stimulus condition per subject, and then add the within-subjects factor and see if both are recovered properly. Would be great to know how this works out! (To my knowledge there has been little attempt to model continuous between-subjects factors in HDDM.)

Michael 


Nate Hall

Jan 17, 2017, 11:23:01 AM
to hddm-users, Michae...@brown.edu
Hi Michael,
Per your advice we simulated data in the code below (100 subjects per bin across 11 bins of the between-subjects factor), with a between-subjects scaling factor (beta_a) that has a simple linear effect on threshold. I wanted to see if you think we are on the right track with this:

###########################################################
import hddm
import pandas as pd

# simulate model data for proof of concept:
# test param 'a' in 11 bins ranging 0-10 by 1, with 100 subjs/bin (thus 1100 total subjs)
# and 150 trials/subj/bin; the output is a dataframe with 165,000 rows
beta_a = 0.1          # between-subjects scaling factor
a_int = 0.5
x_range = range(11)
trials_per_level = 150
n_conditions = 1
subjs_per_bin = 100
data_group = pd.DataFrame()   # empty df to append to

for x in x_range:
    a = a_int + beta_a*(x - 5)
    parvec = {'v': .3, 'a': a, 't': .3, 'sv': 0, 'z': .5, 'sz': 0, 'st': 0}

    data_a, params_a = hddm.generate.gen_rand_data({'level1': parvec}, size=trials_per_level, subjs=subjs_per_bin)
    # concatenate data_a with the group data
    data_group = data_group.append(data_a, ignore_index=True)

path_d = '/Users/nth7/Box Sync/EF BPD HDDM/'
data_group.to_csv(path_d + 'data_group')
###########################################################

If this makes sense to you, our plan is to fit the HDDM to these data using the a ~ 0 + X syntax to see whether we recover an estimate of 0.1 for the between-subjects a effect. Then we were considering adding in a within-person effect and examining whether both are recovered.

Thanks,
Nate

Michael J Frank

Jan 17, 2017, 2:56:18 PM
to hddm-...@googlegroups.com
Looks good to me (though you might also try recovering the regression coefficient using the same sample size, number of trials, etc that you are using empirically). 

I'm still not sure about the added within-subject factor, because the intercept on that factor (e.g., the congruent condition) could be enough to capture the individual-difference effect, so I'm not sure how the 0 + will work with both the within and between components; but as before, I agree it is best to first try the between-subjects factor alone.
 


Nate Hall

Jan 18, 2017, 10:43:22 AM
to hddm-users, Michae...@brown.edu
We seem to be running into an issue when we sample:

ZeroProbability: Stochastic wfpt.0's value is outside its support, or it forbids its parents' current values.
 
Has anyone else run into this issue? I did a bit of digging but haven't been able to come up with much. I've included the code here:

###########################################################
beta_a = 0.1          # between-subjects scaling factor
a_int = 0.5
x_range = range(11)
trials_per_level = 150
n_conditions = 1
subjs_per_bin = 100

data_group = pd.DataFrame()   # empty df to append to

for x in x_range:
    a = a_int + beta_a*(x - 5)
    parvec = {'v': .3, 'a': a, 't': .3, 'sv': 0, 'z': .5, 'sz': 0, 'st': 0}

    data_a, params_a = hddm.generate.gen_rand_data({'level1': parvec}, size=trials_per_level, subjs=subjs_per_bin)
    # concatenate data_a with group data
    data_a_df = pd.DataFrame(data=data_a)
    data_group = data_group.append(data_a, ignore_index=True)

path_d = '/Users/nth7/Box Sync/EF BPD HDDM/'
data_group.to_csv(path_d + 'data_group')

a_reg = {'model': 'a ~ 0 + X', 'link_func': lambda x: x}
v_reg = {'model': 'v ~ 1', 'link_func': lambda x: x}
reg_comb = [a_reg, v_reg]
m_reg = hddm.HDDMRegressor(data_group, reg_comb, group_only_nodes=['a_x'])

m_reg.find_starting_values()
m_reg.sample(5000, burn=200)
###########################################################

Thanks,
Nate

Michael J Frank

Jan 18, 2017, 11:52:00 AM
to hddm-...@googlegroups.com
This error sometimes comes up when there are outliers in the data, which would be odd if you generated the data from the DDM. Can you verify that the stats of the generated RTs make sense (e.g., that no RT is < t)?
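
For example, something along these lines (a quick sketch; the column name follows gen_rand_data's output, and 0.3 is the generative t in your script):

import numpy as np

rts = np.abs(data_group['rt'])     # decision times (abs in case errors are coded as negative RTs)
print(rts.describe())              # min / quartiles / max of the generated RTs
print('min RT:', rts.min(), '(should not be below t = 0.3)')
print('non-finite RTs:', (~np.isfinite(rts)).sum())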
 



Michael J Frank

Jan 26, 2017, 2:06:59 PM
to hddm-...@googlegroups.com
 Nate, and other users interested in individual differences: 

We've been interested in adding the capability to estimate regression coefficients for between-subjects effects, e.g., to see whether we can estimate within HDDM the impact of a continuous measure x (where x can be age, IQ, etc.) on DDM params. The upshot is that the answer is yes, at least in theory: I ran some simulations and could accurately recover the coefficients, including when there are additional within-subject condition effects, which are also accurately recovered in the same model. (I reiterate a point from a previous post, though, based on this article: it is actually fine to fit the model without x as a regressor at all and then to correlate the individual subject intercepts with the between-subjects factor after the fact -- but that involves a classical correlation test, whereas the regression approach here estimates the effect and its uncertainty within the Bayesian parameter estimation itself.)

Here are some issues to consider when doing this (followed by code, modified from Nate's earlier post); we will also add this to the hddm docs.

1. A central issue that came up (and, I believe, one factor driving the error Nate was getting) is that when estimating parameters, they should still be informed by, and in the range of, the priors. The intercept parameters for the main DDM parameters are informed by empirical priors (see the supplement of the 2013 paper), so my earlier recommendation to try a ~ 0 + X was misguided: it would force the intercept to be 0, which has very low prior density for a. (This is not a problem for v ~ 0 + X, given that 0 is in the range of the priors for drift, but it turns out there is no need to force intercepts to be 0 after all -- which is good, because one would usually still like to estimate individual-subject parameters that reflect variability beyond that explained by the single x factor.)

2. Relatedly, a key recommendation is to first z-score (or at least mean-center) the between-subjects factor x and then estimate, e.g., a ~ 1 + X. This way every subject still has an intercept drawn from an overall group distribution, with priors of that distribution centered at the empirical priors, and we can also estimate the degree to which this intercept is shifted up or down by the (z-scored / mean-centered) between-subjects factor.
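
As a sketch of that preprocessing step (assuming the raw score lives in a per-trial column called 'x', one constant value per subject; the column name is illustrative):

import hddm

# z-score across subjects rather than across trials, so that subjects with
# more trials do not dominate the mean and sd of the between-subjects factor
subj_x = data.groupby('subj_idx')['x'].first()
data['x_z'] = (data['x'] - subj_x.mean()) / subj_x.std()

# every subject keeps its own intercept (centered on the empirical priors),
# plus a single group-level slope on the z-scored between-subjects factor
a_reg = {'model': 'a ~ 1 + x_z', 'link_func': lambda v: v}
m = hddm.HDDMRegressor(data, a_reg, group_only_regressors=True)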

I ran generate-and-recover simulations (code below) and confirmed that the beta coefficient on the x factor is recovered quite well, at least for the various cases I tried. (In the generative data the intercepts should also be within the range of the priors for this to work reliably, e.g., a_int > 0, which makes sense.) I also confirmed that this works for both threshold and drift (and I imagine t should be fine too); when allowing both to vary freely during fitting, the model correctly estimated the magnitude of the specific factor that varied in the generative data and pinned the other one at ~zero.


* Note one limitation of the implementation: while we allow x to affect the mean of the group distribution on the intercept, we assume the variance of that distribution is constant across levels of x, i.e., there is currently no way to allow a_std to vary with x. But that's not such a terrible limitation (it just means that a_std for the overall group parameters will need to be large enough to accommodate the most variable group -- it should still go *down* relative to not allowing x to influence the mean at all).

Anyway, here is example code:

## Generate data

import hddm
import numpy as np
import pandas as pd

beta_a = 0.1            # between-subjects scaling factor - adjust this and you should be able to recover its true value
a_int = 1               # set intercept within range of empirical priors
v_int = 0.3
x_range = range(11)     # set values of the between-subjects measure, here from 0 to 10
trials_per_level = 100
subjs_per_bin = 10

data_group = pd.DataFrame()   # empty df to append to

for x in x_range:

    xx = (x - np.mean(x_range)) / np.std(x_range)   # z-score the x factor
    a = a_int + beta_a*xx    # indiv subj param values centered on the intercept but shifted up or down by the z-scored x
    # v = v_int + beta_a*xx  # can also do for drift, here using the same beta coeff

    # parvec = {'v':.3, 'a':a, 't':.3, 'sv':0, 'z':.5, 'sz':0, 'st':0}
    # parvec2 = {'v':.3, 'a':a+.5, 't':.3, 'sv':0, 'z':.5, 'sz':0, 'st':0}
    parvec = {'v': .3, 'a': a, 't': .3}   # set a to the value given by the regression; here v is held constant

    # note that for subjs_per_bin > 1 these are just the mean values of the parameters; indiv subjs within
    # a bin are sampled from distributions with the given means, but can still differ within bin around
    # those means. Not including sv, sz, st in the statement ensures those are actually 0.

    data_a, params_a = hddm.generate.gen_rand_data({'level1': parvec}, size=trials_per_level, subjs=subjs_per_bin)

    # can also do with two levels of within-subj conditions:
    # data_a, params_a = hddm.generate.gen_rand_data({'level1': parvec, 'level2': parvec2}, size=trials_per_level, subjs=subjs_per_bin)

    # store the (z-scored) between-subjects factor in the data frame, same value for every row for each subject in the bin
    data_a['x'] = pd.Series(xx*np.ones(len(data_a)), index=data_a.index)

    # store the correct subj_idx when iterating over multiple subjs per bin
    data_a['subj_idx'] = pd.Series(x*subjs_per_bin + data_a['subj_idx']*np.ones(len(data_a)), index=data_a.index)

    # concatenate data_a with the group data
    data_group = data_group.append(data_a, ignore_index=True)

## Recover

a_reg = {'model': 'a ~ 1 + x', 'link_func': lambda x: x}
# a_reg_within = {'model': 'a ~ 1 + x + C(condition)', 'link_func': lambda x: x}
# for including and estimating within-subject effects of condition

v_reg = {'model': 'v ~ 1 + x', 'link_func': lambda x: x}
reg_comb = [a_reg, v_reg]
# m_reg = hddm.HDDMRegressor(data_group, reg_comb, group_only_regressors=True)

m_reg = hddm.HDDMRegressor(data_group, a_reg, group_only_regressors=True)
m_reg.find_starting_values()
m_reg.sample(1500, burn=200)

m_reg.print_stats()   # check the values of the reg coefficients against the generated ones


Note also that in addition to checking the regression coefficient, if the recovery works, then all of the variance due to X should be soaked up by the regression coefficient, and there should be no residual correlation between the individual estimated subject intercepts and x. 


## check for residual correlation with x
import matplotlib.pyplot as plt
from scipy import stats

p = []    # posterior mean of each subject's intercept
pp = []   # that subject's x bin
for i in range((1 + max(x_range)) * subjs_per_bin):
    node_name = 'a_Intercept_subj.' + str(i) + '.0'
    xx = i // subjs_per_bin
    p1 = m_reg.nodes_db.node[node_name]
    p.append(p1.trace().mean())
    pp.append(xx)

plt.scatter(pp, p)
stats.pearsonr(pp, p)   # should be ~zero correlation if entirely accounted for by the x regressor;
                        # this correlation should instead be positive if x is removed from the model fit


--

Nate, here are a couple of other specific issues in the code from the earlier post that I had missed initially:

In your code for generating data, you loop over levels of x and for each level you have multiple subjects per bin, but the data frame re-uses the subject indices for each level of x (e.g., you have subjects 1 through 100 for x=0 and then subjects 1 through 100 again for each of levels x=1 to 10). So when the model fits the data, multiple subjects that were generated with different params are all labeled as the same subject (e.g., "1").

This was fixed in the generate code above with the line editing 'subj_idx' in data_a.
   

Michael

Michael Hallquist

Feb 10, 2017, 11:00:22 AM
to hddm-users, Michae...@brown.edu
Hi Michael,

Thanks very much for extending Nate's code and testing the parameter recovery for between-subjects effects! We're excited to see how well the between-subjects effects are recovered in the simulations (I was getting correlations approaching 1.0 under several conditions!). I'm Nate's advisor on the project and had a chance to go over your code and his. Mean-centering the between-subjects regressor makes very good sense to me in order to allow for individual differences in the parameter intercept that are centered around the empirical priors. Thanks for your note about the assumed homoscedasticity in this parameterization (where the variance of a is assumed constant across values of x) -- I agree that is fine in most cases, including the ones we plan to test.

Anyhow, I don't have any major conceptual innovations to add at the moment, but I did amend the script somewhat to save various model outputs and to use the simulated data when comparing the recovered parameters to the population values. I also added a scatter plot of simulated versus recovered a values for the simple test case (and it's a very pretty straight line). Attached is the modified script in case it is helpful to other users or as you update the documentation.
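
For readers without the attachment, the comparison is along these lines (a rough sketch rather than the attached script; the node names follow Michael's 'a ~ 1 + x' example above and are assumptions about that fit):

import numpy as np
import matplotlib.pyplot as plt

n_subjs = (1 + max(x_range)) * subjs_per_bin
beta_hat = m_reg.nodes_db.node['a_x'].trace().mean()   # group-level slope on x

simulated, recovered = [], []
for i in range(n_subjs):
    xx = (i // subjs_per_bin - np.mean(x_range)) / np.std(x_range)   # this subject's z-scored x
    intercept = m_reg.nodes_db.node['a_Intercept_subj.' + str(i) + '.0'].trace().mean()
    simulated.append(a_int + beta_a * xx)        # generative (population) threshold for the bin
    recovered.append(intercept + beta_hat * xx)  # recovered threshold = subject intercept + slope * x

plt.scatter(simulated, recovered)
plt.xlabel('simulated a'); plt.ylabel('recovered a')
print('r =', np.corrcoef(simulated, recovered)[0, 1])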

Best wishes,
Michael
Attachment: hddm_bwsubs_simulation.py

Michael J Frank

Feb 10, 2017, 1:12:57 PM
to hddm-...@googlegroups.com
Thanks Michael!! We'll add it to the hddm_docs.

M

 

Thomas Wiecki

Feb 14, 2017, 3:46:53 AM
to hddm-...@googlegroups.com
To add this to the docs, it would preferably be a Jupyter Notebook with some text explaining things.

hcp...@gmail.com

Dec 3, 2020, 9:56:09 PM
to hddm-users
Thank you all for this great thread, it's really informative to me!

But I have an issue when trying to extend the simulation to a more complicated situation: parameter recovery was not satisfactory when I simulated an interaction between the between-subjects regressor and the within-subject condition. I am not sure whether I did the simulation correctly; please see below for the details.

The example code showed how to conduct a parameter recovery for the main effect of the between-subjects regressor, but in some situations we may be interested in the interaction between the model parameter(s) and the between-subjects regressor. For example, the regression coefficient linking a between-subjects regressor and a parameter (e.g., a) may vary across different within-subject conditions. I tried to simulate this situation but found that the parameter recovery is not so good.

Here is the code I used:

```
import hddm
import numpy as np
import pandas as pd

beta_a1 = 0.4   # between-subjects scaling factor for condition level1 - adjust this and you should be able to recover its true value
beta_a2 = 0.1   # between-subjects scaling factor for condition level2

a_int = 1       # set intercept within range of empirical priors
v_int = 0.3
x_range = range(11)       # set values of the between-subjects measure, here from 0 to 10
trials_per_level = 100
subjs_per_bin = 10

data_group = pd.DataFrame()   # empty df to append to

for x in x_range:
    xx = (x - np.mean(x_range)) / np.std(x_range)   # z-score the x factor
    a1 = a_int + beta_a1 * xx   # indiv subj param values centered on the intercept but shifted by the z-scored x
    a2 = a_int + beta_a2 * xx
    # v = v_int + beta_a*xx     # can also do for drift

    parvec = {'v': .3, 'a': a1, 't': .3, 'sv': 0, 'z': .5, 'sz': 0, 'st': 0}
    parvec2 = {'v': .3, 'a': a2, 't': .3, 'sv': 0, 'z': .5, 'sz': 0, 'st': 0}

    # note that for subjs_per_bin > 1 these are just the mean values of the parameters; indiv subjs within
    # a bin are sampled from distributions with the given means, but can still differ within bin around those means.

    # two levels of within-subj conditions, each with its own between-subjects slope on a
    data_a, params_a = hddm.generate.gen_rand_data({'level1': parvec, 'level2': parvec2}, size=trials_per_level, subjs=subjs_per_bin)

    data_a['z_x'] = pd.Series(xx * np.ones(len(data_a)), index=data_a.index)   # z-scored between-subjects factor, same value for every row of each subject in the bin
    data_a['x'] = pd.Series(x * np.ones(len(data_a)), index=data_a.index)      # raw (unscored) between-subjects factor

    data_a['a_population'] = pd.Series(a1 * np.ones(len(data_a)), index=data_a.index)   # generative (population) a per condition
    data_a.loc[data_a['condition'] == 'level2', 'a_population'] = a2
    data_a['subj_idx'] = pd.Series(x * subjs_per_bin + data_a['subj_idx'] * np.ones(len(data_a)), index=data_a.index)   # correct subj_idx when iterating over multiple subjs per bin

    # concatenate data_a with the group data
    data_group = data_group.append(data_a, ignore_index=True)

# write the original simulated data to file
data_group.to_csv('data_group_int.csv')

# main effect of the within-subject condition plus its interaction with the between-subjects factor
a_reg = {'model': 'a ~ 1 + z_x:condition + C(condition)', 'link_func': lambda x: x}
v_reg = {'model': 'v ~ 1 + z_x:condition + C(condition)', 'link_func': lambda x: x}
reg_comb = [a_reg, v_reg]

# do not estimate individual subject parameters for the regressors
m_reg = hddm.HDDMRegressor(data_group, a_reg, group_only_regressors=True)
m_reg.find_starting_values()
m_reg.sample(2000, burn=500, dbname='a_bwsubs_int.db', db='pickle')
m_reg.save('a_bwsubs_int')

m_reg.print_stats()  # check the values of the reg coefficients against the generated ones
```

I found that for level 1 (beta_a1 = 0.4) the parameter recovery is quite good: the correlation between a_population and a_pred was 0.962. But for level 2 (beta_a2 = 0.1), the correlation between a_population and a_pred was much lower: 0.634.

I have thought about this issue for a while but have no clue how to proceed. I've uploaded the Jupyter notebook here: https://github.com/hcp4715/SalientGoodSelf/blob/master/HDDM/hddm_bwsubs_interact_simulation.ipynb

Thanks in advance.






Michael J Frank

Dec 4, 2020, 7:24:07 AM
to hddm-...@googlegroups.com
Hi Chuan-Peng, 

It looks to me like this might simply be a restriction-of-range issue. The recovered regression coefficient is fairly accurate (.8 and includes 0.1 in the tail of the posterior density), which means you can recover the average influence of that between-subjects factor on the parameter for this condition (and for the other condition). But since the level 2 coefficient is small, the Pearson correlation of recovered and simulated values across the group is smaller, simply because there is always some error in recovery at the individual-subject level, and even if that error is the same across level 1 and level 2, it has a relatively greater impact on the correlation coefficient for level 2. To my eye the mean error is about the same in your plots. Have you tried redoing this but making level 2 have a larger coefficient?
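
To illustrate the point outside HDDM, here is a quick toy example (the numbers are purely illustrative): the same per-subject recovery error yields a noticeably lower correlation when the true slope is small.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.repeat(np.linspace(-1.5, 1.5, 11), 10)   # z-scored between-subjects factor, 10 subjects per bin
noise = rng.normal(0, 0.05, size=x.size)        # identical recovery error in both cases

for beta in (0.4, 0.1):
    a_true = 1.0 + beta * x
    a_recovered = a_true + noise
    print('beta = %.1f -> r(true, recovered) = %.3f' % (beta, stats.pearsonr(a_true, a_recovered)[0]))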
 
Michael

Michael J Frank, PhD | Edgar L. Marston Professor
Brown University
website 



hcp...@gmail.com

Dec 4, 2020, 9:49:25 PM
to hddm-users
Hi, Michael,

Thanks for your reply!
Yes, you're right: when I increased the beta for level 2 from 0.1 to 0.2, the correlation between a_pred and a_population increased to 0.89!
But I have a further question: can we change some parameters of the simulation so as to reduce the error in parameter recovery? That is, to increase the "power", in frequentist terms, of the model we build?

Best,
Chuan-Peng

Michael J Frank

Dec 4, 2020, 9:56:51 PM
to hddm-...@googlegroups.com
Yes, the error in any given subject's estimate will go down with both the number of trials for that subject and the number of other subjects (assuming they come from the same distribution). See the original Wiecki et al. 2013 HDDM paper for quantification.

M
 


Pushkarskaya, Helen

Dec 5, 2020, 1:29:20 PM
to hddm-...@googlegroups.com

Thank you.

Was any study done specifically on the trade-off between the number of participants and the number of trials?

 


Michael J Frank

Dec 5, 2020, 1:35:49 PM
to hddm-...@googlegroups.com
The Wiecki study showed effects of both subjects and trials, but exactly how they trade off in any one study will depend on, for example, the variance across subject parameters. In one extreme, if all participants are the same, you could have one trial per subject and 1000 subjects. In the other extreme, if participants are completely unrelated to each other, adding subjects doesn't help reduce the error due to having fewer trials.

Pushkarskaya, Helen

Dec 5, 2020, 1:40:22 PM
to hddm-...@googlegroups.com

So, basically – the key is to specify the groups/covariates well, right? Then large samples are likely to be helpful.

Michael J Frank

Dec 5, 2020, 1:43:47 PM
to hddm-...@googlegroups.com
What I mean is that large samples are helpful to the extent that participants in those samples are similar to each other, at least in some of the parameters. But exactly how many subjects you need to compensate for fewer trials will depend on the similarity across the group. I suppose one could collect a sub-sample first to estimate the group variances and covariances and use that to inform a power analysis of how many subjects you need. But ideally you still collect enough trials per individual to observe sufficient responses at both boundaries and with the within-subject manipulations.

Pushkarskaya, Helen

Dec 5, 2020, 1:55:31 PM
to hddm-...@googlegroups.com

Yes, thank you. I understand that.

We are in the process of finalizing the design for a large-scale data collection aiming at ~700-800 people. My goal is also to reduce the subjects' fatigue. We started with 80 trials in each of 4 blocks (slightly different conditions in each) in an initial sample of 125 people matched to Census demographics (aiming for a heterogeneous sample).

I am currently running several models to derive estimates from 50, 60, 70, and the full 80 trials per block; this probably gives some sense of how robust the estimates are. But I would feel more comfortable also using something more theoretically supported.

 

Do I use a standard power analysis to estimate effect sizes for group differences? What is a better approach with Bayesian estimation?

 

Thank you.

Michael J Frank

Dec 5, 2020, 3:24:34 PM
to hddm-...@googlegroups.com
You can basically use something similar to the posterior predictive check, but with the prior predictive instead: your hypotheses are formulated as a prior on the model parameters, and you sample data from that prior (including your uncertainty about the prior). You could also do this after collecting a sub-sample, simulating from the posterior of that sub-sample (retrospective power*). Then you would fit the model to those simulated data and see how likely you are to replicate the effect (i.e., how often the posterior parameters fall within the same range). I think Kruschke has a section on this in his Bayesian data analysis book.

* I don't advocate doing this after collecting data for the purpose of a power analysis on replicating the effect to put in a paper; that is post-hoc power and is redundant with the effect itself. But it should be fine for making predictions about new experimental data.
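
As a rough sketch of what that could look like, reusing the generate-and-recover scaffold from earlier in this thread (all settings here -- effect size, trial counts, number of simulations -- are placeholders to be replaced by draws from your prior or from a sub-sample posterior):

import hddm
import numpy as np
import pandas as pd

n_sims, hits = 20, 0
for s in range(n_sims):
    beta_a = 0.1                                   # hypothesised between-subjects effect on a
    data_group = pd.DataFrame()
    for x in range(11):
        xx = (x - np.mean(range(11))) / np.std(range(11))
        d, _ = hddm.generate.gen_rand_data({'level1': {'v': .3, 'a': 1 + beta_a * xx, 't': .3}},
                                           size=80, subjs=10)
        d['x'] = xx
        d['subj_idx'] = x * 10 + d['subj_idx']
        data_group = data_group.append(d, ignore_index=True)

    m = hddm.HDDMRegressor(data_group,
                           {'model': 'a ~ 1 + x', 'link_func': lambda v: v},
                           group_only_regressors=True)
    m.find_starting_values()
    m.sample(1500, burn=200)
    trace = m.nodes_db.node['a_x'].trace()
    hits += int(np.percentile(trace, 2.5) > 0)     # does the 95% interval exclude zero?

print('estimated "power": %.2f' % (hits / float(n_sims)))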
 

Pushkarskaya, Helen

Dec 5, 2020, 3:27:11 PM
to hddm-...@googlegroups.com