Help Specifying a Multilevel Model for Partially Clustered Data


Scott B

May 12, 2011, 1:29:03 AM
to PyMC
Hi All,

New to this group and to PyMC (and mostly new to Python). In any case,
I'm writing to ask for help in specifying a multilevel model (mixed
model) for what I call partially clustered designs. An example of a
partially clustered design is a 2-condition randomized psychotherapy
study where subjects are randomly assigned to conditions. In condition
1, subjects are treated in small groups of, say, 8 people apiece. In
condition 2, usually a control, subjects are only assessed on the
outcome (they don't receive an intervention). Thus you have clustering
in condition 1 but not condition 2.

The model for a 2-condition study looks like (just a single time point
to keep things simpler):

y_ij = b_0 + b_1*Tx + u_j*Tx + e_ij

where y_ij is the outcome for the ith person in cluster j (in most
multilevel modeling software and in PROC MCMC in SAS, subjects in the
unclustered condition are all in clusters of just 1 person), b_0 is
the overall intercept, b_1 is the treatment effect, Tx is a dummy
coded variable coded as 1 for the clustered condition and 0 for the
unclustered condition, u_j is the random effect for cluster j, and
e_ij is the residual error. The variance among clusters is \sigma^2_u
and the residual term is \sigma^2_e (ideally you would estimate a
unique residual by condition).

Because u_j is multiplied by Tx, the random effect contributes only
to the clustered condition.
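To make that interaction concrete, here is a minimal NumPy sketch (toy numbers I made up, not the study data) showing that u_j*Tx zeroes the cluster effect for control subjects:

```python
import numpy as np

# Toy illustration (made-up values): 4 subjects, first two in the
# clustered condition (Tx = 1), last two in the control (Tx = 0).
tx = np.array([1, 1, 0, 0])            # treatment indicator
u = np.array([0.5, -0.3, 0.0, 0.0])    # cluster effect mapped to each subject
b0, b1 = 10.0, 2.0

# Multiplying u by tx removes the random effect from the control arm
mu = b0 + b1 * tx + u * tx
# Treated subjects get 10 + 2 plus their cluster effect; controls get exactly 10
```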

In my PyMC model, I expressed the model in matrix form - I find it
easier to work with, especially for handling the cluster effects.
Namely:

Y = XB + ZU + R

where X is an n x 2 design matrix for the overall intercept and
intervention effect, B is a 1 x 2 matrix of regression coefficients, Z
is an n x c design matrix for the cluster effects (where c is equal to
the number of clusters in the clustered condition), and U is a c x 1
matrix of cluster effects. The way I've written the model below, I
have R as an n x n diagonal matrix with \sigma^2_e on the diagonal and
0's on the off-diagonal.
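As a quick check on those dimensions, here is a toy sketch (a hypothetical 6-subject design, not the 200-subject data) building X and Z the same way:

```python
import numpy as np

# Hypothetical toy design: 4 clustered subjects (2 clusters of 2) and
# 2 unclustered subjects, so n = 6 and c = 2.
tx = np.array([1, 1, 1, 1, 0, 0])
cluster = np.array([1, 1, 2, 2, 0, 0])   # 0 = not in any cluster

X = np.column_stack([np.ones_like(tx), tx])              # n x 2
Z = (cluster[:, None] == np.array([1, 2])).astype(int)   # n x c

# Unclustered rows of Z are all zero, so ZU contributes nothing for them
```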

All priors below are based on a model fit in PROC MCMC in SAS. I'm
trying to replicate the analyses in PyMC so I don't want to change the
priors.

The data for my code can be downloaded here: http://dl.dropbox.com/u/613463/run1.csv

The data consist of 200 total subjects. 100 in the clustered condition
and 100 in the unclustered. In the clustered condition there are 10
clusters of 10 people each. There is a single outcome variable.

The PyMC set-up file can be downloaded here (also pasted below):
http://dl.dropbox.com/u/613463/analysis1.py

The file for running the PyMC set-up file can be downloaded here:
http://dl.dropbox.com/u/613463/run_analysis1.py

I have 3 specific questions about the model:

1 - Given the description of the model, have I successfully specified
the model? The results are quite similar to PROC MCMC, though the
cluster variance (\sigma^2_u) differs more than I would expect due to
Monte Carlo error. The differences make me wonder if I haven't got it
quite right in PyMC.

2 - Is there a better (or more efficient) way to set up the model? The
model runs quickly but I am trying to learn Python and to get better
at optimizing how to set up models (especially multilevel models).

3 - How can I change my specification so that I can estimate unique
residual variances for the clustered and unclustered conditions? Right now
I've estimated just a single residual variance. But I typically want
separate estimates for the residual variances per intervention
condition.

Thanks so much for your help. My code follows my signature.

Best,
Scott


###Code###

import numpy as np
import pymc

pn_data = np.genfromtxt('run1.csv', delimiter=",", unpack=True,
                        skip_header=1)

pid = pn_data[0] #subject id
gid = pn_data[1].astype(int) #group (cluster) id
tx = pn_data[2].reshape(200,1) #treatment condition (1 = clustered, 0 = unclustered)
y = pn_data[3] #outcome variable

##Design matrix for intercept and intervention effect
inter = np.ones((len(pid),1), dtype=int)
X = np.hstack((inter,tx))

##Design matrix for cluster effect
Z = (gid[:,None] == np.unique(gid)).astype(int)
##Getting rid of the columns for the unclustered "clusters of 1"
Z = Z[:, :10]

## Priors
var_u = pymc.Gamma('var_u', alpha=1, beta=1)
tau_u = pymc.Lambda('tau_u', lambda v=var_u: v**-1)

b0 = pymc.Normal('b0', mu=0, tau=10000**-1)
b1 = pymc.Normal('b1', mu=0, tau=10000**-1)

U = pymc.Normal('u', mu=0, tau=tau_u, value=np.zeros(10))

var_e1 = pymc.Uniform('var_e1', lower=0, upper=100)
tau_e1 = pymc.Lambda('tau_e1', lambda v=var_e1: v**-1)

@pymc.deterministic
def y_hat(b0=b0, b1=b1, X=X, Z=Z, U=U):
    B = np.array((b0, b1))
    return np.dot(X, B) + np.dot(Z, U)

@pymc.stochastic(observed=True)
def y_i(value=y, mu=y_hat, tau=tau_e1):
    return pymc.normal_like(value, mu, tau)

###End Code###

A. Flaxman

May 12, 2011, 8:03:15 PM
to PyMC
Hi Scott,

This is a really nice example, and very clearly written up. I put it in a git repository to make it easy to keep track of changes and suggestions, in case others also speak up.
https://github.com/aflaxman/pymc-multilevel-example

It seems like you can make it 2x faster by messing around with step methods and initial values. I didn't have enough procrastination time to push the issue of how many steps of MCMC are really needed, though. Some additional notes are in the git repo.

--Abie



Scott Baldwin

May 13, 2011, 11:32:59 AM
to py...@googlegroups.com
Hi Abie,

Thank you for working on the code. The README file in the git repo is instructive. I'll contribute any changes I make to the repo going forward.

A couple of follow-up questions (to you and the list):

1 - You mention in the README that the data wrangling section of the model is opaque. I too find this to be the case. Having come from doing data analysis only in stats packages such as Stata, where datasets are usually thought of as a rectangular object, I have a hard time figuring out how to manage the data in Python. In fact, that is why I ultimately used the matrix form of the model in the @deterministic section, because then I knew what data were getting used in the analysis. So my question is: is there anything in particular about my model specification that appeared problematic? My Python skills are growing quickly but I still feel a little "wobbly."

2 - This dovetails with 1. Any thoughts on how to specify the stochastic section so that it uses a different tau depending on whether y_i is in the clustered condition or the unclustered condition? This is where the Python way of handling data is not clear to me, and I suspect that is simply because I am used to traditional stat packages. In Stata or SAS, I would use an if-then type statement. My attempts at if-then statements in Python haven't worked out.

Thanks again for the insights!

Best,
Scott

A. Flaxman

May 16, 2011, 7:49:07 AM
to py...@googlegroups.com
Hi Scott,

I had some time to try the changes in data wrangling and model specification that respond to your questions. I tried to track the changes with git commits and to update the README file, so it's all here:

https://github.com/aflaxman/pymc-multilevel-example
https://github.com/aflaxman/pymc-multilevel-example/commits/master/

The big change is using csv2rec to load the csv, which I think leads to much clearer data manipulation code:

data = pl.csv2rec('data.csv')

## Design matrix for intercept and intervention effect
X = [[1., t_ij] for t_ij in data.treat]

## Design matrix for cluster effect
Z = pl.zeros([len(data), 10])
for row, j in enumerate(data.j):
    if j <= 10:
        Z[row, j-1] = 1.

Then the change to do different taus for the treatment and control is just changes to two lines:

var_e1 = mc.Uniform('var_e1', lower=0, upper=100, value=[1., 1.])
tau_e1 = mc.Lambda('tau_e1', lambda v=var_e1: v**-1, trace=False)

@mc.stochastic(observed=True)
def y_i(value=data.y, mu=y_hat, tau=tau_e1):
    return mc.normal_like(value, mu, tau[data.treat])
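The key move is the fancy indexing tau[data.treat]: indexing a length-2 array with the 0/1 treatment vector expands it to one tau per subject. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up precisions: tau[0] for the control arm, tau[1] for treatment
tau = np.array([4.0, 0.25])
treat = np.array([1, 1, 0, 0, 1])   # 0/1 condition indicator per subject

# Integer (fancy) indexing picks the right precision for each observation
per_subject_tau = tau[treat]
```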

--Abie


Scott Baldwin

May 16, 2011, 6:51:31 PM
to py...@googlegroups.com
Hi Abie,

Thank you very much. Your changes are very instructive. The pylab csv change leads to an object that is much more like a dataframe. Likewise, I didn't realize that I'd only need to index the tau in the stochastic call.

You mentioned Stata's xtmixed in your README file. If you've got Stata 11 or greater, you can fit this model with REML or ML with the following code:

xtmixed y tx || j:treat, nocons residuals(independent, by(tx))

In the end, I am with you -- I like the flexibility of the Bayesian approach in PyMC. Further, the confidence interval coverage will be below the nominal rate in Stata (they use z-tests) but the Bayesian approach will give us appropriate coverage.

Thanks a million. I'll keep watching the github repository and will let you know if I make any other changes.

Best,
Scott
