mixedlm syntax

Brent Pedersen

Aug 26, 2014, 11:02:05 AM
to pystat...@googlegroups.com
Hi, I'm trying to recreate this lme4 fit in MixedLM (unrelated random intercepts):

    print(summary(lmer('methylation ~ asthma + age + gender + (1|id) + (1|CpG)', df)))

Here's one variation of what I have tried:

    m = MixedLM.from_formula('methylation ~ asthma + age + gender', groups='id', re_formula="id + CpG", data=df)
    r = m.fit(free=(np.ones(2), np.eye(m.k_re)))
    print r.summary()

I'm not clear on how the groups and the re_formula interact and this particular incantation runs without terminating. What am I missing?

BTW, the docstring for MixedLM mentions set_random() which seems to no longer exist.

thanks,
-Brent

josef...@gmail.com

Aug 26, 2014, 11:21:10 AM
to pystatsmodels
On Tue, Aug 26, 2014 at 11:02 AM, Brent Pedersen <bped...@gmail.com> wrote:
> Hi, I'm trying to recreate this lme4 fit in MixedLM (unrelated random
> intercepts):
>
> print(summary(lmer('methylation ~ asthma + age + gender + (1|id) +
> (1|CpG)', df)))
>
> Here's one variation of what I have tried:
>
> m = MixedLM.from_formula('methylation ~ asthma + age + gender',
> groups='id', re_formula="id + CpG", data=df)
> r = m.fit(free=(np.ones(2), np.eye(m.k_re)))
> print r.summary()
>
> I'm not clear on how the groups and the re_formula interact and this
> particular incantation runs without terminating. What am I missing?

I'll take a guess, to see whether I understand formulas and MixedLM well enough now, and let Kerby correct or confirm.

`methylation ~ asthma + age + gender + (1|id) + (1|CpG)`
is a two-factor or crossed random effects model, which is currently not yet covered in MixedLM.
Aside: constants only, as in `(1|id) + (1|CpG)` here, would be covered by the PanelModel PR, which allows for crossed/two-factor random constants but not random slope coefficients.

What MixedLM currently implements is

`methylation ~ asthma + age + gender + (1 + CpG | id)`

which would correspond to

    m = MixedLM.from_formula('methylation ~ asthma + age + gender',
                             groups='id', re_formula="1 + CpG", data=df)

(the `1 +` is most likely redundant)

That is, the constant and the slope coefficients for `CpG` vary by `id` as random effects (distributed as bivariate normal).


Running without terminating is always a bug and needs to be fixed with `prio-high`.

>
> BTW, the docstring for MixedLM mentions set_random() which seems to no
> longer exist.

This became obsolete with the inclusion of `re_formula` in `from_formula`, AFAIR.

Thanks for reporting back.

Josef

>
> thanks,
> -Brent

Kerby Shedden

Aug 26, 2014, 1:23:12 PM
to pystat...@googlegroups.com
Hi Brent,

You probably don't want id in the re_formula.  You will automatically get a random intercept for each group just by specifying id as the grouping variable.  If CpG is a quantitative variable and you want a random slope for CpG, then the re_formula would just be 'CpG'.

If id is a categorical variable with many levels, then pandas will create a big indicator matrix for it.  That may explain the long run time using your current syntax.
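
For illustration, a quick way to see the size of that indicator matrix (a sketch, assuming patsy is installed and `df` is your data frame):

    import patsy

    # A categorical variable expands into one indicator column per level
    # (minus the reference level), so the matrix gets wide quickly.
    X = patsy.dmatrix("C(id)", df, return_type='dataframe')
    print(X.shape)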

Kerby

Kerby Shedden

Aug 26, 2014, 1:28:09 PM
to pystat...@googlegroups.com
A bit more detail -- formally, re_formula='1' would give you a random intercept for each group, and re_formula='1+CpG' will give you a random intercept for each group and a random slope for CpG within each group (and these two random effects would by default be correlated, except that you are specifying them as independent through your use of 'free').

Patsy adds the intercept automatically, so re_formula='CpG' is equivalent to re_formula='1+CpG', and we treat an unspecified re_formula as equivalent to re_formula='1'.  If you really only want the random slope and no random intercept, then you would use re_formula='0+CpG'.
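
For concreteness, a sketch of the three specifications (using the variables from your model; the `free` argument for forcing independence is a separate issue):

    import statsmodels.api as sm

    # Random intercept per group only (same as leaving re_formula unspecified):
    m1 = sm.MixedLM.from_formula('methylation ~ asthma + age + gender',
                                 groups='id', re_formula='1', data=df)

    # Random intercept plus a random slope for CpG within each group
    # (patsy adds the intercept, so re_formula='CpG' would be equivalent):
    m2 = sm.MixedLM.from_formula('methylation ~ asthma + age + gender',
                                 groups='id', re_formula='1 + CpG', data=df)

    # Random slope only, no random intercept:
    m3 = sm.MixedLM.from_formula('methylation ~ asthma + age + gender',
                                 groups='id', re_formula='0 + CpG', data=df)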

Kerby

Brent Pedersen

Aug 26, 2014, 2:33:38 PM
to pystat...@googlegroups.com
On Tue, Aug 26, 2014 at 11:28 AM, Kerby Shedden <kshe...@umich.edu> wrote:
> A bit more detail -- formally, re_formula='1' would give you a random
> intercept for each group, and re_formula='1+CpG' will give you a random
> intercept for each group and a random slope for CpG within each group (and
> these two random effects would by default be correlated, except that you are
> specifying them as independent through your use of 'free').
>

Thanks for the clarification and the work on MixedLM. I want a random intercept for CpG and for group; both are categorical variables. Yes, it's a big indicator matrix -- though lme4 handles it instantly. I'm attaching the csv in case it's useful.
-Brent
(attachment: m.csv)

Kerby Shedden

Aug 26, 2014, 7:22:37 PM
to pystat...@googlegroups.com
I see now, Josef was right: this is a crossed model that we can't handle now.

I started putting together the code to allow us to handle crossed models, but it won't be ready for several months at least.

Kerby

josef...@gmail.com

Aug 26, 2014, 10:08:01 PM
to pystatsmodels
On Tue, Aug 26, 2014 at 7:22 PM, Kerby Shedden <kshe...@umich.edu> wrote:
> I see now, Josef was right, this is a crossed model that we can't handle
> now.

This is aside from whether the specific model given by the formula can be run:

Your CpG variable has only 6 levels, AFAICS from the csv.
Is there a specific reason why you want to assume CpG has a random
effect, which additionally assumes that the random effect/random
constant of the 6 levels of CpG is normally distributed?

I would just assume fixed effects for CpG, which makes fewer
distributional assumptions; your dataset looks large enough.

(This would also allow it to be compared with GEE: an individual-specific
random constant should be the same as equicorrelated errors, but I
haven't tried it yet. GEE uses a different estimator and doesn't assume
that the within-group correlation structure is correctly specified,
i.e. robust cov_params or standard errors of the parameters.) (Just
some advertising for GEE.)

Josef
too many parentheses.

Brent Pedersen

Aug 28, 2014, 11:04:02 AM
to pystat...@googlegroups.com
On Tue, Aug 26, 2014 at 8:07 PM, <josef...@gmail.com> wrote:
> On Tue, Aug 26, 2014 at 7:22 PM, Kerby Shedden <kshe...@umich.edu> wrote:
>> I see now, Josef was right, this is a crossed model that we can't handle
>> now.
>
> Not about being able to run the specific model given by the formula:
>
> Your CpG variable has only 6 levels, AFAICS from the csv.
> Is there a specific reason why you want to assume CpG has a random
> effect, which additionally assumes that the random effect/random
> constant of the 6 levels of CpG is normally distributed?
>
> I would just assume fixed effects for CpG which makes fewer
> distributional assumptions, your dataset looks large enough.
>
> (Which would also allow it to be compared with GEE, individual
> specific random coefficient should be the same as equicorrelated, but
> I haven't tried yet. GEE uses a different estimator and doesn't assume
> that the within correlation structure is correctly specified (robust
> cov_params or standard errors of the parameters).) (just some
> advertising for GEE)
>
> Josef
> too many parenthesis.


I have been comparing GEE to MixedLM with this setup on the file I
sent before (using CpG as a fixed effect as you suggest):

    df = pd.read_csv('m.csv')
    m = MixedLM.from_formula('methylation ~ asthma + age + gender + CpG',
                             groups='id', data=df)
    g = GEE.from_formula('methylation ~ asthma + age + gender + CpG',
                         groups='id', data=df)

The coefficient estimates and p-values are similar enough, but
MixedLM is *much* slower:

In [9]: timeit m.fit()
1 loops, best of 3: 3.85 s per loop

In [10]: timeit g.fit()
10 loops, best of 3: 26.5 ms per loop

I'm not sure if there is something in niter_* that I can tweak, but
that is with the defaults.
I'm running this for hundreds of thousands of sites, so GEE is looking
more attractive.

thanks for the ideas,
-Brent

josef...@gmail.com

Aug 28, 2014, 11:59:29 AM
to pystatsmodels
You need `cov_struct=Exchangeable` to correspond to a random intercept
(equicorrelated; I didn't remember the name earlier):
http://nbviewer.ipython.org/urls/umich.box.com/shared/static/zyajjg1dxf2nmamztheg.ipynb

The default is `Independence`, which in this case is essentially OLS
with cluster-robust standard errors, but using the GEE estimation
loop.

Exchangeable will be a bit slower than Independence because the
correlation needs to be estimated.
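
A sketch of the Exchangeable version (assuming the import locations in current master):

    from statsmodels.genmod.generalized_estimating_equations import GEE
    from statsmodels.genmod.cov_struct import Exchangeable

    # Exchangeable working correlation: equicorrelated within each id,
    # the GEE analogue of a random intercept.
    g = GEE.from_formula('methylation ~ asthma + age + gender + CpG',
                         groups='id', cov_struct=Exchangeable(), data=df)
    res = g.fit()
    print(res.summary())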


>
> the coefficient estimates and p-values are similar enough, but the
> MixedLM is *much* slower:
>
> In [9]: timeit m.fit()
> 1 loops, best of 3: 3.85 s per loop
>
> In [10]: timeit g.fit()
> 10 loops, best of 3: 26.5 ms per loop
>
> I'm not sure if there is something in niter_* that I can tweak, but
> that is with the default.
> I'm running this for hundreds of thousands of sites so GEE looking
> more attractive.


GEE has been in master for about a year, with several refactorings by Kerby.
MixedLM is "fresh", and yours is the first real feedback we have gotten.
GEE also still has some slack; for example, it estimates all 3
covariance types in fit (as I figured out last week).

I have no experience with how to choose optimization options in MixedLM;
I haven't played much with it yet.
`do_cg` uses scipy's bfgs, which is usually pretty fast if the starting
values are not awful.


One general comment about running estimation problems in a loop:

Choosing the starting values based on previous estimates can
considerably speed up the optimization if the problems are similar.

In the extreme case, choosing a common correlation structure
(distribution of the random effect) would remove repeated estimation
of the distribution, which can be set with `free` if I interpret the
docstring and code correctly.
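
As a sketch of the warm-starting idea (the packed layout that `start_params` expects is an assumption here and may differ across versions):

    import numpy as np
    import statsmodels.api as sm

    # Fit OLS once to get starting values for the fixed effects, then
    # append a guess for the random-intercept variance.  The exact
    # packing expected by start_params is an assumption.
    ols = sm.OLS.from_formula('methylation ~ asthma + age + gender + CpG',
                              data=df).fit()
    start = np.concatenate([ols.params.values, [1.0]])

    m = sm.MixedLM.from_formula('methylation ~ asthma + age + gender + CpG',
                                groups='id', data=df)
    r = m.fit(start_params=start)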

Josef

Kerby Shedden

Aug 29, 2014, 12:31:51 AM
to pystat...@googlegroups.com
GEE is about 65 times faster than MixedLM for me.

I haven't spent much effort to optimize either of these.  I've been focusing on making sure that they give the right answer.

Some observations:

MixedLM is doing 31 iterations; GEE is doing only 2 iterations.

There was a bug in MixedLM that was preventing it from returning the fit history.  I fixed that and will make a PR on it (also fixed the docstring issue that Brent found).

Below is the gradient norm at each iteration. It does not appear to be stopping later than needed. However, I need to look more closely into how our convergence conditions relate to those in lme4. Probably the main slowness for us is the time spent in each iteration, rather than the number of iterations.

It does not appear that pre-estimation or post-estimation take up much time; the time is in the fitting itself.

I will try to do some further profiling within a few days.



In [47]: [np.sqrt(np.sum(mod.score(x)**2)) for x in rslt.hist[0]['allvecs']]
Out[47]: 
[1765.6874250551575,
 605.74385314963615,
 242.1857365665546,
 274.48755122122509,
 79.353589354469435,
 110.50931577599006,
 41.726352640901162,
 182.7634021761541,
 79.113410108399279,
 3.9395198046947875,
 14.698845406868134,
 24.507786352943004,
 32.371573622762369,
 37.796488764327307,
 37.395395623161349,
 30.226171676612012,
 16.748223897156084,
 14.68023860382705,
 19.69659331185952,
 11.922440662546805,
 4.6840148532877963,
 1.5523320169990367,
 3.0592133317556844,
 1.5091390973144858,
 0.27130415404902952,
 0.18563620236633904,
 0.27346548385484021,
 0.11695120988702028,
 0.022753952448470883,
 0.004298082865833542]

Kerby Shedden

Aug 29, 2014, 1:09:58 AM
to pystat...@googlegroups.com
One more comment -- lme4 seems to be taking 14 iterations to converge on this dataset, with the same model.

Kerby Shedden

Aug 29, 2014, 11:34:38 PM
to pystat...@googlegroups.com
You can cut the run time by about 1/3 by using reml=False as an argument to `fit`.
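
For example, a minimal change to the earlier call:

    r = m.fit(reml=False)  # ML instead of REML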

I have added a few optimizations to this branch:

https://github.com/statsmodels/statsmodels/pull/1940

The optimizations help a fair amount (up to around 40% faster) for more complicated random effects structures. But here, where we are only using a random intercept, the improvement barely registers.

I am continuing to look for additional relatively easy optimizations.

josef...@gmail.com

Aug 31, 2014, 1:44:52 PM
to pystatsmodels
On Fri, Aug 29, 2014 at 11:34 PM, Kerby Shedden <kshe...@umich.edu> wrote:
> You can cut the run time by about 1/3 by using reml=False as an argument to `fit`.
>
> I have added a few optimizations to this branch:
>
> https://github.com/statsmodels/statsmodels/pull/1940
>
> The optimizations help a fair amount (up to around 40% faster) for more complicated random effects structures. But here we are only using a random intercept, the improvement barely registers.
>
> I am continuing to look for additional relatively easy optimizations.


I also went through some of the MixedLM code for the first time. One of the main time wasters was a redundant inherited call to the numerical Hessian.

I'm currently down to around 0.23 seconds for the MLE (reml=False) in this case (still without Kerby's changes).
(Not all savings will generalize, because I use OLS as starting values, which are already the right fixed-effects parameters in this case.)

Josef

Brent Pedersen

Sep 4, 2014, 5:16:15 PM
to pystat...@googlegroups.com
On Sun, Aug 31, 2014 at 11:44 AM, <josef...@gmail.com> wrote:
>
>
>
> On Fri, Aug 29, 2014 at 11:34 PM, Kerby Shedden <kshe...@umich.edu> wrote:
>>
>> You can cut the run time by about 1/3 by using reml=False as an argument
>> to `fit`.
>>
>> I have added a few optimizations to this branch:
>>
>> https://github.com/statsmodels/statsmodels/pull/1940
>>
>> The optimizations help a fair amount (up to around 40% faster) for more
>> complicated random effects structures. But here we are only using a random
>> intercept, the improvement barely registers.
>>
>> I am continuing to look for additional relatively easy optimizations.
>
>
>
> I also went through some of the MixedLM code for the first time. One of the
> main time wasters was a redundant inherited call to the numerical hessian.
>
> I'm currently down to around 0.23 seconds for the MLE (reml=False) in this
> case (still without Kerby's changes).
> (not all savings will generalize, because I use OLS as starting values,
> which are already the right fixed effect parameters in this case.)
>
> Josef


Thanks to both of you for looking into this. Using the OLS estimates
as the start_params does make it quite a bit faster. I'm trying to see
how using MLE changes things genome-wide.

Brent Pedersen

Sep 8, 2014, 2:22:48 PM
to pystat...@googlegroups.com
FYI I put up a notebook here:
http://nbviewer.ipython.org/github/brentp/crystal/blob/master/notebooks/crystal-methods-evaluation.ipynb

It shows that GEE does very well, and cluster-OLS actually even better, for
modeling methylation data given my evaluation criteria and test data.

thanks again for the help.
-Brent

josef...@gmail.com

Sep 8, 2014, 3:05:05 PM
to pystatsmodels
Thanks for the link, and the feedback.
I didn't try to figure out the background enough to understand what it means, but the results and plots look good.

It's good to see these methods working. I don't have much of an overview of the applications in which these methods work well and where we will run into numerical or small-sample problems.

subtitle "Modern Statistics and Modern Econometrics have arrived."  (*)

(*) at least in parts

Josef

 
