posterior predictive for multiple drift rates

240 views
Skip to first unread message

adrianas...@gmail.com

unread,
May 3, 2021, 12:03:28 PM5/3/21
to hddm-users
Hi Everyone,
Does anyone know how to generate simulated data via post_pred_gen when there are multiple groupby values?

I've run the following model using normal HDDM:
m_av_acc = hddm.HDDM(av_cong_interaction, depends_on={'v': ['Vis_Coh','Aud_Coh']},include=('a','v','z'))
m_av_acc.sample(n_samples,burn=n_burn,thin=n_thin, dbname='traces.db',db='pickle')

When I try to simulate data from the model like so:
av_acc_ppc_data = hddm.utils.post_pred_gen(m_av_acc,groupby=['Vis_Coh','Aud_Coh'])

I get the following error:
[-------------------112%-------------------] 28 of 25 complete in 349.5 sec
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-aeabbc6047c9> in <module>()
----> 1 av_acc_ppc_data = hddm.utils.post_pred_gen(m_av_acc,groupby=['Vis_Coh','Aud_Coh'])

/opt/conda/lib/python3.6/site-packages/kabuki/analyze.py in post_pred_gen(model, groupby, samples, append_data, progress_bar)
    345         bar.update(bar_iter)
    346 
--> 347     return pd.concat(results, names=['node'])
    348 
    349 

/opt/conda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()
    227 

/opt/conda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    376         self.copy = copy
    377 
--> 378         self.new_axes = self._get_new_axes()
    379 
    380     def get_result(self):

/opt/conda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in _get_new_axes(self)
    456                 new_axes[i] = ax
    457 
--> 458         new_axes[self.axis] = self._get_concat_axis()
    459         return new_axes
    460 

/opt/conda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in _get_concat_axis(self)
    512         else:
    513             concat_axis = _make_concat_multiindex(indexes, self.keys,
--> 514                                                   self.levels, self.names)
    515 
    516         self._maybe_check_integrity(concat_axis)

/opt/conda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    593 
    594         return MultiIndex(levels=levels, labels=label_list, names=names,
--> 595                           verify_integrity=False)
    596 
    597     new_index = indexes[0]

/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, labels, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    232         if names is not None:
    233             # handles name validation
--> 234             result._set_names(names)
    235 
    236         if sortorder is not None:

/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in _set_names(self, names, level, validate)
    670             raise ValueError('Length of names must match length of level.')
    671         if validate and level is None and len(names) != self.nlevels:
--> 672             raise ValueError('Length of names must match number of levels in '
    673                              'MultiIndex.')
    674 

ValueError: Length of names must match number of levels in MultiIndex.

I've run other models that don't have drift rate depend on multiple conditions without issues with post_pred_gen, so I assume the issue is because I have multiple dependencies in the groupby function of post_pred_gen. Does anyone know how to generate simulated data via post_pred_gen when there are multiple groupby values?

Thanks,
Adriana Schoenhaut 

hcp...@gmail.com

unread,
May 4, 2021, 11:27:10 PM5/4/21
to hddm-users
Hi, Adriana,
If I understand the tutorial correctly (see http://ski.clps.brown.edu/hddm_docs/tutorial_post_pred.html), you don't need `groupby` when using `post_pred_gen()` in your case, because:
"groupby : list Alternative grouping of the data. If not supplied, uses splitting of the model (as provided by depends_on)."

I guess the error was generated because in `post_pred_gen()` the `groupby` input is not specific to the drift rate, while in your model only the drift rate is grouped by the two conditions.

best,
Chuan-Peng

Adriana Schoenhaut

unread,
May 5, 2021, 6:22:19 PM5/5/21
to hddm-...@googlegroups.com
Hi Chuan-Peng,
Thanks for your thoughtful reply. Unfortunately, I still have a problem when I don’t use groupby, which is why I tried using it. 
When I don’t use it, I get the following error:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _get_list_axis(self, key, axis)
   2069         try:
-> 2070             return self.obj._take(key, axis=axis)
   2071         except IndexError:

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _take(self, indices, axis, is_copy)
   2788                                    axis=self._get_block_manager_axis(axis),
-> 2789                                    verify=True)
   2790         result = self._constructor(new_data).__finalize__(self)

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in take(self, indexer, axis, verify, convert)
   4529         if convert:
-> 4530             indexer = maybe_convert_indices(indexer, n)
   4531 

/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in maybe_convert_indices(indices, n)
   2479     if mask.any():
-> 2480         raise IndexError("indices are out-of-bounds")
   2481     return indices

IndexError: indices are out-of-bounds

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-15-7fe111934f22> in <module>()
----> 1 av_acc_ppc_data = hddm.utils.post_pred_gen(m_av_acc2)

/opt/conda/lib/python3.6/site-packages/kabuki/analyze.py in post_pred_gen(model, groupby, samples, append_data, progress_bar)
    326         iter_data = model.data.groupby(groupby)
    327 
--> 328     for name, data in iter_data:
    329         node = model.get_data_nodes(data.index)
    330 

/opt/conda/lib/python3.6/site-packages/kabuki/analyze.py in <genexpr>(.0)
    322 
    323     if groupby is None:
--> 324         iter_data = ((name, model.data.iloc[obs['node'].value.index]) for name, obs in model.iter_observeds())
    325     else:
    326         iter_data = model.data.groupby(groupby)

/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1476 
   1477             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1478             return self._getitem_axis(maybe_callable, axis=axis)
   1479 
   1480     def _is_scalar_access(self, key):

/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   2089         # a list of integers
   2090         elif is_list_like_indexer(key):
-> 2091             return self._get_list_axis(key, axis=axis)
   2092 
   2093         # a single integer

/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _get_list_axis(self, key, axis)
   2071         except IndexError:
   2072             # re-raise with different error message
-> 2073             raise IndexError("positional indexers are out-of-bounds")
   2074 
   2075     def _getitem_axis(self, key, axis=None):

IndexError: positional indexers are out-of-bounds

Any thoughts?
Thanks again, 
Adriana

-- 
You received this message because you are subscribed to a topic in the Google Groups "hddm-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/hddm-users/Is6AM7eN0fo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to hddm-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hddm-users/1f83a101-f01a-4c92-810f-b2ac3f150cb0n%40googlegroups.com.

hcp...@gmail.com

unread,
May 5, 2021, 9:55:03 PM5/5/21
to hddm-users
Hi, Adriana,

I played around with the function `utils.post_pred_gen()` a bit, and found that the error 'positional indexers are out-of-bounds' might be caused by `iloc` of pandas in line 325 of analyze.py of the kabuki package (https://github.com/hddm-devs/kabuki/blob/master/kabuki/analyze.py#L325).  The difference between `loc` and `iloc` is explained here: https://stackoverflow.com/questions/44123056/indexerror-positional-indexers-are-out-of-bounds-when-theyre-demonstrably-no/44123390

I copy-pasted the necessary functions into a jupyter notebook and tried to reproduce your model; this time there was no error (but that doesn't mean the results are correct — I haven't checked them yet), see below for the details:

HDDM version: 0.8 in docker image (tag: hcp4715/hddm:viz, https://hub.docker.com/r/hcp4715/hddm)
Data: example data included in that docker image

Code in jupyter notebook:

exp_name = 'example'

# USE the absolute directory in docker.
fname = '/home/jovyan/example/df_' + exp_name + '.csv'
df = hddm.load_csv(fname)

# Unique subject identifiers present in the data set.
df_subj = df['subj_idx'].unique()

# Randomly select 10 distinct subjects. NOTE: the original loop drew
# indices with random.randint, which can pick the same subject more than
# once despite the "without repetition" intent; random.sample draws
# without replacement, matching that intent.
random.seed(10)
df_test_list = random.sample(list(df_subj), 10)

# Restrict the data to the sampled subjects.
df_test = df[df['subj_idx'].isin(df_test_list)]

# Fit an HDDM with drift rate depending on three conditions,
# estimating a, v and z, with 5% outlier probability.
m = hddm.HDDM(df_test,
              depends_on={'v': ['match', 'val', 'id']},
              include=('a', 'v', 'z'),
              p_outlier=0.05)
m.find_starting_values()
m.sample(2000, burn=1000, dbname='example', db='pickle')

####### re-define functions we need for convenience  #######

def _parents_to_random_posterior_sample(bottom_node, pos=None):
    """Walks through parents and sets them to pos sample.

    :Arguments:
        bottom_node : pymc node
            Observed node whose (extended) stochastic parents are set to
            the values they took at one posterior sample.
        pos : int or None
            Trace index to use; if None, a single random posterior
            position is drawn and then shared across all parents.
    """
    import pymc as pm
    import numpy as np
    for parent in bottom_node.extended_parents:
        if not isinstance(parent, pm.Node): # Skip non-stochastic nodes
            continue

        if pos is None:
            # Set to random posterior position (0 .. len(trace)-1)
            pos = np.random.randint(0, len(parent.trace()))

        # `pos` is a 0-based index, so it must be strictly smaller than the
        # trace length. The original `>=` comparison allowed pos == len(trace),
        # which would pass the assert but raise IndexError on the next line.
        assert len(parent.trace()) > pos, "pos larger than posterior sample size"
        parent.value = parent.trace()[pos]

# https://github.com/hddm-devs/kabuki/blob/master/kabuki/analyze.py#L271
def _post_pred_generate(bottom_node, samples=500, data=None, append_data=False):
    """Generate posterior predictive data from a single observed node."""
    generated = []

    ##############################
    # Repeatedly set the parents to a random posterior draw and simulate
    # one replicated data set from the observed node.
    for _ in range(samples):
        _parents_to_random_posterior_sample(bottom_node)
        # Generate data from bottom node
        simulated = bottom_node.random()
        if append_data and data is not None:
            simulated = simulated.join(data.reset_index(), lsuffix='_sampled')
        generated.append(simulated)

    return generated

def post_pred_gen(model, groupby=None, samples=500, append_data=False, progress_bar=True):
    """Run posterior predictive check on a model.
    :Arguments:
        model : kabuki.Hierarchical
            Kabuki model over which to compute the ppc on.
    :Optional:
        samples : int
            How many samples to generate for each node.

        groupby : list
            Alternative grouping of the data. If not supplied, uses splitting
            of the model (as provided by depends_on).
        append_data : bool (default=False)
            Whether to append the observed data of each node to the replicatons.
        progress_bar : bool (default=True)
            Display progress bar
    :Returns:
        Hierarchical pandas.DataFrame with multiple sampled RT data sets.
        1st level: wfpt node
        2nd level: posterior predictive sample
        3rd level: original data index
    :See also:
        post_pred_stats
    """
    import pymc.progressbar as pbar
    results = {}

    # Progress bar
    if progress_bar:
        n_iter = len(model.get_observeds())
        bar = pbar.progress_bar(n_iter)
        bar_iter = 0
    else:
        print("Sampling...")

    if groupby is None:
        # `.loc` (label-based) instead of `.iloc` (position-based): the node's
        # value index carries the original data labels, which are not valid
        # positions when the data was subset — `.iloc` raised
        # "positional indexers are out-of-bounds".
        iter_data = ((name, model.data.loc[obs['node'].value.index]) for name, obs in model.iter_observeds())
    else:
        iter_data = model.data.groupby(groupby)

    for name, data in iter_data:
        node = model.get_data_nodes(data.index)

        if progress_bar:
            bar_iter += 1
            bar.update(bar_iter)

        if node is None or not hasattr(node, 'random'):
            continue # Skip

        ##############################
        # Sample and generate stats
        datasets = _post_pred_generate(node, samples=samples, data=data, append_data=append_data)
        results[name] = pd.concat(datasets, names=['sample'], keys=list(range(len(datasets))))

    if progress_bar:
        # Finalize the bar at 100%. The original code incremented bar_iter
        # once more here, over-counting the total (the thread shows
        # "28 of 25 complete ... 112%").
        bar.update(n_iter)

    # When grouping by multiple columns, the keys of `results` are tuples and
    # concat builds one MultiIndex level per tuple element; the `names` list
    # must match that number of levels, otherwise pandas raises
    # "Length of names must match number of levels in MultiIndex."
    if results:
        first_key = next(iter(results))
        if isinstance(first_key, tuple):
            level_names = list(groupby) if groupby is not None else \
                ['node_%d' % i for i in range(len(first_key))]
        else:
            level_names = ['node']
    else:
        level_names = ['node']
    return pd.concat(results, names=level_names)

####### End of re-defining functions #######

# Generate posterior predictive samples with the post_pred_gen defined above,
# then compare them against the observed data.
ppc_samples = post_pred_gen(m)

ppc_summary = hddm.utils.post_pred_stats(df_test, ppc_samples)
print(ppc_summary)
Reply all
Reply to author
Forward
0 new messages