general linear mixed model with nested random effects

Brad Buran

unread,

Nov 16, 2015, 12:44:50 PM11/16/15

to pystat...@googlegroups.com

In R, I could specify an equation similar to:

value~timepoint*dose+(1|region/subject)

Is it possible to do so in statsmodels? I could not find anything in Patsy about specifying random effects, so perhaps this isn't implemented.

Brad

josef...@gmail.com

unread,

Nov 16, 2015, 1:24:56 PM11/16/15

to pystatsmodels

I'm not sure about nested effects and let Kerby or Saket answer that.

patsy and we don't use combination formulas (at least not yet). The version in master uses 3 formulas to specify the different terms, fixed effects, random effects within group and variance components

http://www.statsmodels.org/dev/generated/statsmodels.regression.mixed_linear_model.MixedLM.from_formula.html

I don't find a notebook with variance components for master right now, but this notebook has several example, but was written for a branch and is not completely consistent with statsmodels master

http://nbviewer.ipython.org/github/saketkc/statsmodels/blob/kerby_mixedlm_notebooks/examples/notebooks/MixedLM_compare_lr_test.ipynb

(anova/compare_lr_tests for mixed models doesn't exist yet, AFAIR)

Josef

Brad

Kerby Shedden

unread,

Nov 16, 2015, 8:57:44 PM11/16/15

to pystatsmodels, bbu...@alum.mit.edu

Yes, it should be possible to do this, I think you need the statsmodels master off github, not the last released version.

For the main formula just use "value ~ timepoint*dose"

Then specify "region" as the groups variable.

Then specify vc_var = {'subject' : '0 + C(subject)'}

This assumes that the subject id's are nested within regions. If the same subject is in different regions you have a crossed model. We can do that too but it takes a different syntax.

I haven't looked at this in a while but I think this will work. If not let us know.

Kerby

Kerby Shedden

unread,

Nov 16, 2015, 9:01:19 PM11/16/15

to pystatsmodels, bbu...@alum.mit.edu

You can also do something similar with GEE, using the Nested covariance structure, or the Equivalence covariance structure.

Kerby

Nathaniel Smith

unread,

Nov 16, 2015, 9:51:02 PM11/16/15

to pystatsmodels

On Mon, Nov 16, 2015 at 5:57 PM, Kerby Shedden <kshe...@umich.edu> wrote:
> Yes, it should be possible to do this, I think you need the statsmodels
> master off github, not the last released version.
>
> For the main formula just use "value ~ timepoint*dose"
>
> Then specify "region" as the groups variable.
>
> Then specify vc_var = {'subject' : '0 + C(subject)'}
>
> This assumes that the subject id's are nested within regions. If the same
> subject is in different regions you have a crossed model. We can do that
> too but it takes a different syntax.

For reference, the original R / lme4 syntax that Brad posted actually
enforces nesting by using the concatenation of region id + subject id
as the effective subject id.

(1 | subject/region) is equivalent to (1 | subject + subject:region),
which describes a random effects structure where each subject has a
unique draw from a gaussian, plus each subject:region combination has
a unique draw from a gaussian.

-n

--
Nathaniel J. Smith -- http://vorpus.org

josef...@gmail.com

unread,

Nov 16, 2015, 10:30:21 PM11/16/15

to pystatsmodels

1|region/subject

doesn't subject have to be nested inside region, i.e. a subject can only be in one region?

subject:region is the same as subject

and we just have a variance component for region and one for subject

?

(formulas are to dense for my taste)

Josef

josef...@gmail.com

unread,

Nov 16, 2015, 10:36:04 PM11/16/15

to pystatsmodels

Or maybe not, when I moved from the US to Canada I changed the realization of my random effect.

Josef

josef...@gmail.com

unread,

Nov 16, 2015, 10:47:00 PM11/16/15

to pystatsmodels

(The tax rate is different (region effect), and I start to like cold and snow (random effect for subject nested in region), in contrast to boring southern Californian weather. :)

To continue Kerby's example

instead of

vc_var = {'subject' : '0 + C(subject)'}

it would be

vc_var = {'subject' : '0 + C(subject):C(region'}

I guess.

But this would blow up with many zero columns for all the regions an individual is not in.

What's the patsy formula for nested?

"a / b is shorthand for a + a:b"

Josef

Nathaniel Smith

unread,

Nov 16, 2015, 10:47:09 PM11/16/15

to pystatsmodels

On Mon, Nov 16, 2015 at 7:30 PM, <josef...@gmail.com> wrote:
>
>
> On Mon, Nov 16, 2015 at 9:50 PM, Nathaniel Smith <n...@vorpus.org> wrote:
>>
>> On Mon, Nov 16, 2015 at 5:57 PM, Kerby Shedden <kshe...@umich.edu> wrote:
>> > Yes, it should be possible to do this, I think you need the statsmodels
>> > master off github, not the last released version.
>> >
>> > For the main formula just use "value ~ timepoint*dose"
>> >
>> > Then specify "region" as the groups variable.
>> >
>> > Then specify vc_var = {'subject' : '0 + C(subject)'}
>> >
>> > This assumes that the subject id's are nested within regions. If the
>> > same
>> > subject is in different regions you have a crossed model. We can do
>> > that
>> > too but it takes a different syntax.
>>
>> For reference, the original R / lme4 syntax that Brad posted actually
>> enforces nesting by using the concatenation of region id + subject id
>> as the effective subject id.
>>
>> (1 | subject/region) is equivalent to (1 | subject + subject:region),
>> which describes a random effects structure where each subject has a
>> unique draw from a gaussian, plus each subject:region combination has
>> a unique draw from a gaussian.
>
>
>
> 1|region/subject
>
> doesn't subject have to be nested inside region, i.e. a subject can only be
> in one region?
>
> subject:region is the same as subject

I think you're overthinking it :-). I'm just describing mechanically
what R does with this kind of thing. It won't like raise an error if
the same subject code appears in two different regions or anything
(which is what I'd expect if subject "had to be nested"). Literally if
you write region/subject, then in R that is a shorthand for region +
region:subject, and if you write region:subject here then what R /
lme4 will do with that is that first it will create a new set of
categories which are (region code, subject code) pairs, and then it
will use those categories for your analysis.

So *if* every subject has a unique id, and every subject is in only
one region, then (in R) region:subject is equivalent to just plain
subject (but with different labels). So if this is how your data is
coded to start with, then region/subject will give equivalent results
to just writing region+subject.

If subjects don't have unique ids, like e.g. if the way you coded your
data is that each region has its own subject 0, it's own subject 1,
etc., but the different "subject 0"s are actually different subjects,
then -- well, then this is probably a bad way to code your data :-).
But if you did it anyway, then R provides the "/" syntax as a
convenient shorthand for recoding it the way you should have coded it
in the first place. Basically that's all that "/" does. It's actually
pretty trivial, and honestly not terribly useful, but if you ever need
to interpret some R formula that uses it then at least you know what's
going on.

Kerby Shedden

unread,

Nov 16, 2015, 10:48:46 PM11/16/15

to pystatsmodels

In R, if the subject id's are unique then you can just use subject on its own. If the subject ids are recycled in each region then nesting treats them as if they were unique. Otherwise R would think you had a crossed model.

Unlike R our approach is group-based, which means that everything is independent across groups, regardless of how things are coded.

Kerby

josef...@gmail.com

unread,

Nov 16, 2015, 11:02:09 PM11/16/15

to pystatsmodels

On Mon, Nov 16, 2015 at 10:48 PM, Kerby Shedden <kshe...@umich.edu> wrote:

In R, if the subject id's are unique then you can just use subject on its own. If the subject ids are recycled in each region then nesting treats them as if they were unique. Otherwise R would think you had a crossed model.

Thanks, that clarifies it.

If nesting works like a multi-index, then it is a lot clearer why there is a need for extra notation. I never really understood that part.

Josef

josef...@gmail.com

unread,

Nov 16, 2015, 11:18:05 PM11/16/15

to pystatsmodels

Thanks for the clarification. I never thought of the case where there are non-unique IDs.

I didn't see your message when I replied to Kerby.

The patsy question still remains:

How many columns with zeros do we get with `region/subject` in patsy?

Josef

Brad Buran

unread,

Nov 17, 2015, 1:29:34 PM11/17/15

to pystat...@googlegroups.com

Hi Nathaniel,

This was an extremely informative post. I was told that I need to use
the / syntax because I have subjects from multiple regions; however,
each subject has a unique ID regardless of region.

Given this, I can just do:

smf.MixedLM.from_formula('value~timepoint*dose', groups='region')

I updated to the master version of statsmodels and compared the
results of the MixedLM with R's lme4. The results are very similar.

That said, it's good to know about passing in details regarding the
structure of random effects via the vc_var structure. I had not been
aware of this and did not fully understand what it was for when I came
across it in the API docs.

Brad

josef...@gmail.com

unread,

Nov 17, 2015, 2:05:28 PM11/17/15

to pystatsmodels

On Tue, Nov 17, 2015 at 1:29 PM, Brad Buran <bbu...@alum.mit.edu> wrote:

Hi Nathaniel,

This was an extremely informative post. I was told that I need to use
the / syntax because I have subjects from multiple regions; however,
each subject has a unique ID regardless of region.

Given this, I can just do:

smf.MixedLM.from_formula('value~timepoint*dose', groups='region')

This will use a random intercept per region. My interpretation was that your original R formula has random intercept on the `subject` level (within region).

I updated to the master version of statsmodels and compared the
results of the MixedLM with R's lme4. The results are very similar.

That said, it's good to know about passing in details regarding the
structure of random effects via the vc_var structure. I had not been
aware of this and did not fully understand what it was for when I came
across it in the API docs.

In terms of implementation:

Variance components construct the full dummy matrix, and can also be used for crossed effects within groups, and there could be only a single group. The original one-way mixed effects defined by `groups` never construct the full individual random effects matrix during optimization and should be more memory efficient if there are a large number of groups.

Josef

Kerby Shedden

unread,

Nov 17, 2015, 3:23:14 PM11/17/15

to pystatsmodels

If you have repeated measurements per subject then you generally would want to include a subject random effect, using vc_formula (and also use groups='region'). Otherwise just use groups='region' without a vc_formula.

Kerby

Nathaniel Smith

unread,

Nov 17, 2015, 3:25:39 PM11/17/15

to pystatsmodels

The number you are expecting ;-). (number of regions + number of subjects - 1)

I believe that in lme4's custom formula syntax, they don't actually compute a design matrix for the stuff on the right hand side of their | operator. I think the logic is that they first expand out formulas operators like /, and then they treat it as a list of categorical variables separated by + signs (with : taking on its regular meaning in R, which is the row-by-row concatenation thing I mentioned, rather than its related but slightly different formula meaning). I haven't checked the code, though. They definitely don't do anything involving redundancy elimination there (it makes no sense in the context), though, and I'm pretty sure they treat anything they find as categorical regardless of whether it's marked with C().

-n

Brad Buran

unread,

Nov 18, 2015, 11:18:57 AM11/18/15

to pystat...@googlegroups.com

On Tue, Nov 17, 2015 at 11:05 AM, <josef...@gmail.com> wrote:
> This will use a random intercept per region. My interpretation was that your
> original R formula has random intercept on the `subject` level (within
> region).

On Tue, Nov 17, 2015 at 12:23 PM, Kerby Shedden <kshe...@umich.edu> wrote:
> If you have repeated measurements per subject then you generally would want
> to include a subject random effect, using vc_formula (and also use
> groups='region'). Otherwise just use groups='region' without a vc_formula.

You're right. This is a typo, it should be groups='subject'. Sorry
about that! BTW, do you mean re_formula, not vc_formula?

Brad

josef...@gmail.com

unread,

Nov 18, 2015, 1:27:19 PM11/18/15

to pystatsmodels

I think I was till confused: nesting has a main and the interaction effect

see Kerby's original answer

groups='region' is needed to get the region random effect (main effect in nested)

vc is needed to get the subject random effect (interaction or additive effect).

re_formula is for random slopes in groups, where each component gets a different variance, plus random slopes are correlated.

`vc` are necessary to specify a second "group" (`groups` can only be one-way, correlated errors within, no correlation across groups)

This uses the dummy variable representation of the individual intercepts and imposes that they are normally distributed with the same variance. This cannot be represented by a re_formula because it doesn't impose that the `subject` random effects (realization of intercepts) are identically and independently distributed.

Josef

Brad

Brad Buran

unread,

Apr 22, 2016, 11:14:03 AM4/22/16

to pystat...@googlegroups.com

Sorry to dig up an old thread but I haven't been able to find good
documentation on vc_formula. Should the following formula work?

{'cell:target': '0+cell:target', 'cell': '0+cell'}

In this case, the 'cell:target' RE is 0.

However, if I manually combine:

data['cell_target'] = data.apply(lambda x: '{}_{}'.format(x['cell'],
x['target']), axis=1)

Then specify vc_formula as

{'cell_target': '0+cell_target', 'cell': '0+cell'}

It works. Is this expected?

Kerby Shedden

unread,

Apr 25, 2016, 9:09:41 AM4/25/16

to pystatsmodels, bbu...@alum.mit.edu

Are your `cell` and `target` variables being interpreted by Patsy as categoricals?

Maybe try something like this to be sure:

{'cell:target': '0+C(cell):C(target)', 'cell': '0+C(cell)'}

The two approaches you listed should be equivalent if cell and target are categoricals, but not if either is quantitative.

Kerby

Brad Buran

unread,

Apr 26, 2016, 10:53:32 AM4/26/16

to pystat...@googlegroups.com

On Mon, Apr 25, 2016 at 6:09 AM, Kerby Shedden <kshe...@umich.edu> wrote:

{'cell:target': '0+C(cell):C(target)', 'cell': '0+C(cell)'}

This worked. I forgot that if target is numeric it would be treated as continuous instead of a categorical.