Bug fix for posterior predictive checks using regression


jae-yo...@brown.edu

Aug 21, 2017, 7:27:18 PM
to hddm-users

As discussed in another thread on this forum, I was having trouble attaching trial-by-trial measures to my simulated PPC data using hddm.utils.post_pred_gen(m, append_data=True). Below is representative output:


                      rt_sampled  index  subj_idx   rt  response  revNum
node    sample
wfpt.10 0      1103     1.533658    NaN       NaN  NaN       NaN     NaN
               1104     0.908446    NaN       NaN  NaN       NaN     NaN
               1105    -1.542960    NaN       NaN  NaN       NaN     NaN
               1106     1.115957    NaN       NaN  NaN       NaN     NaN
               1107     0.708157    NaN       NaN  NaN       NaN     NaN
               1108     2.129046    NaN       NaN  NaN       NaN     NaN
               1109     1.220014    NaN       NaN  NaN       NaN     NaN
               1110     0.765557    NaN       NaN  NaN       NaN     NaN
               1111     0.674214    NaN       NaN  NaN       NaN     NaN
               1112     3.590046    NaN       NaN  NaN       NaN     NaN



This caused post_pred_stats to fail spectacularly. Below is representative output:

           accuracy   mean_ub    std_ub    10q_ub    30q_ub    50q_ub  \
wfpt.10 0  0.000000       NaN       NaN       NaN       NaN       NaN   
        1  0.000000       NaN       NaN       NaN       NaN       NaN   
wfpt.11 0  0.000000       NaN       NaN       NaN       NaN       NaN   
        1  0.000000       NaN       NaN       NaN       NaN       NaN   
wfpt.12 0  0.000000       NaN       NaN       NaN       NaN       NaN   
        1  0.000000       NaN       NaN       NaN       NaN       NaN   
wfpt.13 0  0.000000       NaN       NaN       NaN       NaN       NaN   
        1  0.000000       NaN       NaN       NaN       NaN       NaN



To remedy this problem, I wrote some code that re-attaches the original dataframe's columns to the PPC dataframe.
I should disclaim that I'm a Python novice, and I don't pretend to have an especially good handle on working with pandas.
The following code is horribly inefficient (for N posterior samples, runtime grows by a factor of N, on top of the per-row loop)... but it works. If anyone who's better with Python wants to revise my code, that'd be great.

Hope people find this helpful!

# Import requisite packages
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import hddm
import pickle
from patsy import dmatrix

# Get around a problem with saving regression outputs in Python 3
def savePatch(self, fname):
    import pickle
    with open(fname, 'wb') as f:
        pickle.dump(self, f)
hddm.HDDM.savePatch = savePatch


# Load data from csv file into a pandas DataFrame
data = hddm.load_csv('../../data.csv')

# Idiot check: make sure you're working with the right dataset
data.head(10)


# Build and run regression model
mA = hddm.HDDMRegressor(data,
                        ["v ~ C(revNum, Treatment('Alone'))",
                         "a ~ C(revNum, Treatment('Alone'))",
                         "z ~ C(revNum, Treatment('Alone'))"],
                        group_only_regressors=True,
                        p_outlier=.05,
                        include={'z'})
mA.find_starting_values()
mA.sample(100, burn=20, dbname='mA.db', db='pickle')
mA.savePatch('mA')


# Generate PPC simulated data
ppc_data = hddm.utils.post_pred_gen(mA, samples=2, append_data=True)


# Take a peek at the immediate output
# Some/all of the appended columns will only have values of NaN
ppc_data.head(10)


# Set these as the column numbers in your ppc_data
subj_idx_columnNum = 2  #i.e. subj_idx is third column (Python starts numbering at 0)
rt_columnNum = 3
response_columnNum = 4
revNum_columnNum = 5  # A trial-by-trial measure

# Not an efficient solution! ...but it works
# Depending on your simulation & dataset sizes, this will take a while...

for j in range(0, len(ppc_data)):
    
    # Get current row number of ppc_data
    rowNum = ppc_data.iloc[j].name[2]
    
    # Copies index numbers
    ppc_data.iloc[j, 1] = rowNum
    
    # Copies other columns
    ppc_data.iloc[j, subj_idx_columnNum] = data.iloc[rowNum].subj_idx
    ppc_data.iloc[j, rt_columnNum] = data.iloc[rowNum].rt
    ppc_data.iloc[j, response_columnNum] = data.iloc[rowNum].response
    ppc_data.iloc[j, revNum_columnNum] = data.iloc[rowNum].revNum


# Take another peek at your PPC data
# Your appended columns should all be filled in now!
ppc_data.head(10)


# Download entire simulated dataset
# Ensure that all of your appended columns match your original data!
ppc_data.to_csv('PPC_sim_mA.csv')


# Take a look at each of your PPC nodes before looking at the aggregate stats
# This ensures that the appropriate stats were computed for each node/subject
ppc_compare = hddm.utils.post_pred_stats(data, ppc_data, call_compare=False)
print(ppc_compare)


# Once you're positive that everything's worked the way it was supposed to...
ppc_compare = hddm.utils.post_pred_stats(data, ppc_data)
print(ppc_compare)
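A quick aside for later readers: the per-trial loop above can be collapsed into one vectorized lookup per column on the trial index, which avoids the row-by-row overhead entirely. This is only a sketch with a toy frame standing in for the real ppc_data (the column names mirror the output shown earlier; nothing here is HDDM output):

```python
import pandas as pd

# Original data: trial-level columns, indexed by trial number
data = pd.DataFrame(
    {'subj_idx': [10, 10], 'rt': [0.9, 1.2],
     'response': [1.0, 0.0], 'revNum': ['Alone', 'Group']},
    index=[1103, 1104])

# Simulated PPC output: MultiIndex (node, sample, trial); rt_sampled is
# filled in, the appended columns are all NaN, as in the broken output above
idx = pd.MultiIndex.from_product(
    [['wfpt.10'], [0, 1], [1103, 1104]], names=['node', 'sample', None])
ppc_data = pd.DataFrame({'rt_sampled': [1.5, 0.9, 1.1, 0.7]}, index=idx)

# One vectorized lookup per column replaces the whole per-row loop:
# pull the trial number out of the index, then reindex the original data
trial = ppc_data.index.get_level_values(2)
for col in ['subj_idx', 'rt', 'response', 'revNum']:
    ppc_data[col] = data[col].reindex(trial).to_numpy()
```

This does the same job as the loop (copying subj_idx, rt, response, and revNum from the original data), but in a handful of whole-column operations instead of one assignment per cell.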


jae-yo...@brown.edu

Aug 30, 2017, 1:34:48 PM
to hddm-users
Apologies for the double-tap on this thread. I was finding my previous solution unbearably slow, so I dove into the pandas documentation and wrote a proper "patch" based on the original kabuki code. I run the following code in a cell towards the top of my Jupyter notebook and call post_pred_gen directly.

I am now working on writing a function that can return more granular statistics for simple categorical regressions, and will post a solution once it's completed...


import pymc as pm
import numpy as np
import pandas as pd
import pymc.progressbar as pbar

def _parents_to_random_posterior_sample(bottom_node, pos=None):
    """Walks through parents and sets them to pos sample."""
    for i, parent in enumerate(bottom_node.extended_parents):
        if not isinstance(parent, pm.Node): # Skip non-stochastic nodes
            continue

        if pos is None:
            # Set to random posterior position
            pos = np.random.randint(0, len(parent.trace()))

        assert len(parent.trace()) >= pos, "pos larger than posterior sample size"
        parent.value = parent.trace()[pos]

def _post_pred_generate(bottom_node, samples=500, data=None, append_data=True):
    """Generate posterior predictive data from a single observed node."""
    datasets = []
    ##############################
    # Sample and generate stats
    for sample in range(samples):
        _parents_to_random_posterior_sample(bottom_node)
        # Generate data from bottom node
        sampled_data = bottom_node.random()
        if append_data and data is not None:
            sampled_data.reset_index(inplace=True)  # Only modification of original kabuki code
            sampled_data = sampled_data.join(data.reset_index(), lsuffix='_sampled')
        datasets.append(sampled_data)
    return datasets

def post_pred_gen(model, groupby=None, samples=500, append_data=False, progress_bar=True):
    results = {}

    # Progress bar
    if progress_bar:
        n_iter = len(model.get_observeds())
        bar = pbar.progress_bar(n_iter)
        bar_iter = 0
    else:
        print("Sampling...")

    if groupby is None:
        iter_data = ((name, model.data.ix[obs['node'].value.index]) for name, obs in model.iter_observeds())
    else:
        iter_data = model.data.groupby(groupby)

    for name, data in iter_data:
        node = model.get_data_nodes(data.index)

        if progress_bar:
            bar_iter += 1
            bar.update(bar_iter)

        if node is None or not hasattr(node, 'random'):
            continue # Skip

        ##############################
        # Sample and generate stats
        datasets = _post_pred_generate(node, samples=samples, data=data, append_data=append_data)
        results[name] = pd.concat(datasets, names=['sample'], keys=list(range(len(datasets))))

    if progress_bar:
        bar_iter += 1
        bar.update(bar_iter)

    return pd.concat(results, names=['node'])
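To see why that one-line reset_index change fixes the NaNs: pandas join aligns rows on index labels, and the freshly simulated data comes back with a fresh 0..n-1 index while the observed data keeps its original trial labels, so nothing lines up. A minimal, self-contained illustration (toy frames, not HDDM output):

```python
import pandas as pd

# Simulated data comes back with a fresh 0..n-1 index...
sampled = pd.DataFrame({'rt_sampled': [1.5, 0.9]})
# ...while the observed data keeps its original trial labels
observed = pd.DataFrame({'rt': [0.9, 1.2]}, index=[1103, 1104])

broken = sampled.join(observed)               # labels don't match -> all NaN
fixed = sampled.join(observed.reset_index())  # both 0..n-1 -> values line up
```

Resetting both indices to the default range before joining makes the alignment positional, which is exactly what the patch does.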

Thomas Wiecki

Aug 30, 2017, 2:56:59 PM
to hddm-...@googlegroups.com
Very cool, thanks for the update! Would certainly be useful in kabuki.

--
You received this message because you are subscribed to the Google Groups "hddm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hddm-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Bang

Nov 28, 2017, 7:03:46 AM
to hddm-users
Sorry for piggybacking on this thread. I ran your post_pred_gen patch before fitting the regression model etc., but when I run ppc_data.head(2000), the appended columns still have NaNs. Am I meant to also run the more laborious patch that you listed in your earlier email?

jae-yo...@brown.edu

Nov 29, 2017, 11:00:27 AM
to hddm-users
Hi Dan,

You should disregard the patch I wrote in my first post. Frankly, it's unusably slow.

Just to verify, when you're calling the function, are you using syntax that looks like this?
ppc_data = post_pred_gen(m, samples=1000, append_data=True)

If you're calling hddm.utils.post_pred_gen, you're getting the original library function rather than the patched one.

If that isn't the problem, then I'm not sure off the top of my head what else could be the issue.

Dan Bang

Nov 30, 2017, 10:45:10 AM
to hddm-users
Thanks for the quick reply!

Well spotted: I did indeed call hddm.utils.post_pred_gen.

I'll post here if I discover any problems after trying again.

Ian Ballard

May 6, 2020, 8:39:29 PM
to hddm-users
Hi there,

I know it's been a while since this post, but I'm trying to run posterior predictive checks on an HDDMRegressor model and am running into this same difficulty. I used your patch and was able to get posterior predictive data from each subject (yes!). However, I'm hoping to get posterior predictive data by the condition used in the HDDMRegressor model, so that I can make some simulated-data plots. When I supply a groupby parameter, I get the following error:

(note that 'exp' was a condition factor included in the HDDMRegressor model on drift rate)

res = post_pred_gen(m_within_subj, groupby = 'exp', append_data = True, samples = 2)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-58-842590cc16f1> in <module>()
      3 for name, data in iter_data:
      4     print(name,data)
----> 5     node = m_within_subj.get_data_nodes(data.index)

/opt/conda/lib/python3.6/site-packages/kabuki/hierarchical.py in get_data_nodes(self, idx)
    942 
    943         if len(data_nodes) != 1:
--> 944             raise NotImplementedError("Supply a grouping so that at most 1 observed node codes for each group.")
    945 
    946         return data_nodes[0]

NotImplementedError: Supply a grouping so that at most 1 observed node codes for each group.

I've tried a bit of debugging, but the issue appears to lie within get_data_nodes(), which is a little opaque to me.

Thanks for any advice you might have!

-Ian

Jae-Young Son

May 6, 2020, 9:16:33 PM
to hddm-users
Hi Ian,

Hope you're doing well, given the circumstances.

I don't know off the top of my head what the issue might be (get_data_nodes is really opaque to me too), but here are some informed guesses and workarounds...

I'm guessing that exp (experiment ID?) is a between-subjects variable, i.e., the same subject didn't complete multiple experiments? And if so, are you trying to check whether your within-subject parameter estimates differ across experiments? If that's the case, I suspect the predictive check's groupby argument might only work on models fit using hddm.HDDM, not hddm.HDDMRegressor, for much the same reason that (at least back when I was using it a few years ago) HDDMRegressor generally struggles with models that have both within- and between-subjects predictors. I talked to Mads about this when he was putting together the RL-DDM release, and it seems a workaround is to estimate a separate model for each experiment plus a model that pools all experiments, and then compare the estimates/PPCs.

I think you'll lose some power from not having partial pooling in your estimates, and it's somewhat annoying to implement, but it's better than nothing. Hopefully someone else can chime in with a better solution, but this is what I recall doing.

I think I put some of the details in this manuscript (or the supplement? honestly, I don't remember), and you might also have some luck referencing my scripts in this OSF repository:

Ian Ballard

May 6, 2020, 9:46:06 PM
to hddm-...@googlegroups.com
Hi Jae-Young Son,

Thanks for the quick response! Sorry, I should have supplied more details: "exp" is actually a within-subjects manipulation; each subject completed different variants of the same basic task in different blocks. But thank you for the information, and I'll check out the paper and the OSF code to see if I can track down a solution. For the record, I probably should have also mentioned that I'm using the Docker installation of HDDM 0.7.1.

Best,

Ian


Jae-Young Son

May 6, 2020, 10:02:58 PM
to hddm-users
Hi Ian,

Ah, I see. Was exp included in your regression? e.g., v ~ exp * covariate? If you just want a sense of whether exp has any effect, that'd be the easiest way to check.

If instead you're interested in testing whether a model performs better when you include exp as a predictor, I think you'll probably have to run two models and then manually compare their PPCs.

Coincidentally, I think you were my TA for a class I took in undergrad five years ago – The Nervous System, taught in the med school? What a small world.

All best,
Jae

Ian Ballard

May 7, 2020, 2:10:09 PM
to hddm-...@googlegroups.com
Hi Jae-Young,

Ha! I remember you :). Glad to see you are doing so well.

For the record, I solved the problem, and it was a 'Doh!' moment. All I needed to do was include append_data=True and remove the groupby argument, and the predicted RT dataframe then includes a condition column.
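For anyone landing here later: once append_data=True has attached the condition column, per-condition summaries are a plain groupby away. A minimal sketch with a toy frame standing in for the real output (the 'exp' column here represents whatever condition variable was appended):

```python
import pandas as pd

# Toy PPC output with the condition column already appended
idx = pd.MultiIndex.from_product(
    [['wfpt.0'], [0], range(4)], names=['node', 'sample', None])
ppc = pd.DataFrame(
    {'rt_sampled': [0.8, 1.4, 0.7, 1.6],
     'exp': ['A', 'B', 'A', 'B']}, index=idx)

# Mean simulated RT per condition
cond_means = ppc.groupby('exp')['rt_sampled'].mean()
print(cond_means)
```

The same pattern works for quantiles or accuracy, and the grouped subsets can be plotted directly for simulated-data figures.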

-Ian


Blair Shevlin

Jun 4, 2020, 10:58:36 AM
to hddm-users
Hi all,

For those of us working with the newest version of pandas, this patch (post_pred_gen) needs a patch of its own, because ".ix" has been deprecated.

I've found I can get the patch to work by replacing the line

iter_data = ((name, model.data.ix[obs['node'].value.index]) for name, obs in model.iter_observeds())

with the following:

iter_data = ((name, model.data.iloc[obs['node'].value.index]) for name, obs in model.iter_observeds())
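A caveat worth flagging: .iloc is purely positional, while the old .ix preferred labels, so this substitution is exact only when model.data carries the default 0..n-1 index; with any other index, .loc is the faithful replacement. A toy illustration of the difference:

```python
import pandas as pd

# A frame whose index labels are NOT the default 0..n-1 positions
df = pd.DataFrame({'rt': [0.9, 1.2, 0.7]}, index=[5, 6, 7])

by_label = df.loc[[5, 6]]       # label-based: rows labelled 5 and 6
# df.iloc[[5, 6]] would raise IndexError here: positions 5 and 6 don't exist

# Positional access matches only after resetting to the default index
by_position = df.reset_index(drop=True).iloc[[0, 1]]
```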

Sincerely,
Blair

Thomas Wiecki

Jun 4, 2020, 12:10:56 PM
to hddm-...@googlegroups.com
As of today this shouldn't be required if you update kabuki to the most recent version.


valer...@gmail.com

Oct 8, 2021, 9:49:20 AM
to hddm-users
Hi Thomas,
The hddm.utils.post_pred_gen() still outputs NaNs when using append_data=True. This happens even with the latest version of kabuki (0.6.3).
Anyhow, Blair's patch seems to work after the appropriate update, at least with pandas 1.0.4 and hddm 0.8.0.

Best wishes,
Valerio


