GSoC - Linear Mixed Models

Alexej Gossmann

unread,

May 27, 2015, 3:01:51 PM5/27/15

to pjot...@thebird.nl, sciru...@googlegroups.com

Hi Pjotr,

I have set up a github repository for my GSoC project:
https://github.com/agisga/MixedModels.git

Currently, it only includes some matrix methods which are going to be useful to me, and a prototype implementation of a deviance/REML function class. The deviance function is the objective function for optimization.

Also, I have sent the following pull requests to NMatrix yesterday, which are related to my project:
https://github.com/SciRuby/nmatrix/pull/366
https://github.com/SciRuby/nmatrix/pull/365

Alexej

Pjotr Prins

unread,

May 29, 2015, 8:34:06 PM5/29/15

to sciru...@googlegroups.com

Hi Alexej,

I had some interesting discussions today at the department of
biostatistics in Madison, and one thing that came up was that it would
be extremely useful if we have a readable version of R/lme4 - so the
algorithms are clearly explained. The R version of R/lme4 (yes, this
is getting recursive) is not that clear either. Your Ruby library, if
it is done in a clear and clean way could double as an explanation of
an implementation. I.e., a reference implementation.

Pj.

On Wed, May 27, 2015 at 02:01:50PM -0500, Alexej Gossmann wrote:
> Hi Pjotr,
>
> I have set up a github repository for my GSoC project:

> [1]https://github.com/agisga/MixedModels.git

>
> Currently, it only includes some matrix methods which are going to be
> useful to me, and a prototype implementation of a deviance/REML function
> class. The deviance function is the objective function for optimization.
>
> Also, I have sent the following pull requests to NMatrix yesterday, which
> are related to my project:

> [2]https://github.com/SciRuby/nmatrix/pull/366
> [3]https://github.com/SciRuby/nmatrix/pull/365
>
> Alexej
>
> --
> You received this message because you are subscribed to the Google Groups
> "SciRuby Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [4]sciruby-dev...@googlegroups.com.
> For more options, visit [5]https://groups.google.com/d/optout.
>
> References
>
> Visible links
> 1. https://github.com/agisga/MixedModels.git
> 2. https://github.com/SciRuby/nmatrix/pull/366
> 3. https://github.com/SciRuby/nmatrix/pull/365
> 4. mailto:sciruby-dev...@googlegroups.com
> 5. https://groups.google.com/d/optout

--

Alexej Gossmann

unread,

May 31, 2015, 12:27:17 AM5/31/15

to sciru...@googlegroups.com

Pjotr,

In fact, I find that the algorithms of lme4 are already clearly explained in the paper http://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf The utilized algorithms are very optimized and not that straight forward. So, I think it would be in any case hard to understand the algorithms from code alone, and Bates et. al. paper provides a very nice explanation with relevant R code snippets.

But I agree that the R code is quite confusing (even the lme4pureR version of it), mostly because of all the sparse matrix tricks involved and the use of R environments. I will try to write cleaner code, and I will try to find a good balance between code readability and performance. I hope that you can give me some good suggestions regarding that.

Btw, I have restructured the deviance/REML calculation as we have discussed today (and pushed to github), and I will move on to prototyping the optimization module now.

Alexej

> 4. mailto:sciruby-dev+unsub...@googlegroups.com
> 5. https://groups.google.com/d/optout

--

Pjotr Prins

unread,

May 31, 2015, 12:40:20 AM5/31/15

to sciru...@googlegroups.com

The new Ruby-esque code looks great :)

https://github.com/agisga/MixedModels/tree/master/lib/MixedModels

I have a feeling we'll end up with a short and clear implementation -
which is what I call a reference implementation.

Pj.

On Sat, May 30, 2015 at 09:27:17PM -0700, Alexej Gossmann wrote:
> Pjotr,
>
> In fact, I find that the algorithms of lme4 are already clearly explained
> in the paper

> [1]http://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf The

> > [1][2]https://github.com/agisga/MixedModels.git

> >
> > Currently, it only includes some matrix methods which are going to
> be
> > useful to me, and a prototype implementation of a deviance/REML
> function
> > class. The deviance function is the objective function for
> optimization.
> >
> > Also, I have sent the following pull requests to NMatrix yesterday,
> which
> > are related to my project:

> > [2][3]https://github.com/SciRuby/nmatrix/pull/366
> > [3][4]https://github.com/SciRuby/nmatrix/pull/365

> >
> > Alexej
> >
> > --
> > You received this message because you are subscribed to the Google
> Groups
> > "SciRuby Development" group.
> > To unsubscribe from this group and stop receiving emails from it,
> send an

> > email to [4][5]sciruby-dev...@googlegroups.com.
> > For more options, visit [5][6]https://groups.google.com/d/optout.
> >
> > References
> >
> > Visible links
> > 1. [7]https://github.com/agisga/MixedModels.git
> > 2. [8]https://github.com/SciRuby/nmatrix/pull/366
> > 3. [9]https://github.com/SciRuby/nmatrix/pull/365
> > 4. mailto:[10]sciruby-dev...@googlegroups.com
> > 5. [11]https://groups.google.com/d/optout
>
> --

>
> --
> You received this message because you are subscribed to the Google Groups
> "SciRuby Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an

> email to [12]sciruby-dev...@googlegroups.com.
> For more options, visit [13]https://groups.google.com/d/optout.
>
> References
>
> Visible links
> 1. http://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
> 2. https://github.com/agisga/MixedModels.git
> 3. https://github.com/SciRuby/nmatrix/pull/366
> 4. https://github.com/SciRuby/nmatrix/pull/365
> 5. javascript:
> 6. https://groups.google.com/d/optout
> 7. https://github.com/agisga/MixedModels.git
> 8. https://github.com/SciRuby/nmatrix/pull/366
> 9. https://github.com/SciRuby/nmatrix/pull/365
> 10. javascript:
> 11. https://groups.google.com/d/optout
> 12. mailto:sciruby-dev...@googlegroups.com
> 13. https://groups.google.com/d/optout

--

Alexej Gossmann

unread,

Jun 5, 2015, 12:08:52 AM6/5/15

to sciru...@googlegroups.com

I have written a new blog post.
An example of a first linear mixed model fit via MixedModels in Ruby, with a comparison of the output values to the results obtained in R via lme4: http://agisga.github.io/First-linear-mixed-model-fit/

Alexej

> > 4. mailto:[10]sciruby-dev+unsub...@googlegroups.com

> > 5. [11]https://groups.google.com/d/optout
>
> --
>
> --
> You received this message because you are subscribed to the Google Groups
> "SciRuby Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [12]sciruby-dev...@googlegroups.com.
> For more options, visit [13]https://groups.google.com/d/optout.
>
> References
>
> Visible links
> 1. http://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
> 2. https://github.com/agisga/MixedModels.git
> 3. https://github.com/SciRuby/nmatrix/pull/366
> 4. https://github.com/SciRuby/nmatrix/pull/365
> 5. javascript:
> 6. https://groups.google.com/d/optout
> 7. https://github.com/agisga/MixedModels.git
> 8. https://github.com/SciRuby/nmatrix/pull/366
> 9. https://github.com/SciRuby/nmatrix/pull/365
> 10. javascript:
> 11. https://groups.google.com/d/optout

> 12. mailto:sciruby-dev+unsub...@googlegroups.com
> 13. https://groups.google.com/d/optout

--

Alexej Gossmann

unread,

Jun 5, 2015, 5:45:12 PM6/5/15

to sciru...@googlegroups.com, Pjotr Prins

Hi Pjotr,

Right now I am thinking about how to parse an R-like formula in Ruby. An example of what I want is:
Say, I have a data frame dat_frm with n rows corresponding to n patients, and p columns "Drug_effectiveness", "Age", "Weight", "Height", "Dose", "Clinic", etc. I want to model the drug effectiveness as a linear function of the drug dose, patient age, weight and height, while assuming that there is a random fluctuation of the effect of each parameter due to the clinic that a patient is treated in. Then I have several ways to represent it in a formula (each formula defines a slightly different correlation structure between the random effects):

In any case, the algorithm needs to build the design matrix X out of columns "Age", "Weight", "Dose", and construct the appropriate random effects matrix Z according to the group structure given in the column "Clinic" of dat_frm.

Additionally, I might want to consider interaction effects, say between "Weight" and "Dose", which can be expressed as:

(5) "Effect ~ Age + Weight + Dose + Weight:Dose + (Age + Weight + Dose | Clinic)"
(6) "Effect ~ Age + Weight * Dose + (Age + Weight + Dose | Clinic)"
(7) ...

Do you have an idea how I can begin with setting up such a formula parsing module (I have no experience with parsing strings in Ruby or similar languages).

Also, I will look if I can use daru for the input data frames. What to you think?

I have also thought about the output module. Since I don't know in what context users will fit LMMs in Ruby, I decided not to worry about nice presentation of the output in fancy tables (like in R), but instead to give outputs as Hashes. For example, #fix_ef would return a Hash with all information about the estimated fixed effects coefficients; something like:

> model _ fit.fix_ef # => { "Intercept" => 3.32, "Intercept: SD" => 0.2, "Intercept: TS" => 2.54, "Intercept: p-value" => 0.02, "Intercept: 95%CI" => [3.13, 3.45], "Age" => 2.2, "Age: SD" => 1.13, "Age: TS" => 1.54, "Age: p-value" => 0.24, "Age: 95%CI" => [1.13, 3.45], ................. }

Do you think that's okay?

Alexej

> > 4. mailto:[10]sciruby-dev...@googlegroups.com

> > 5. [11]https://groups.google.com/d/optout
>
> --
>
> --
> You received this message because you are subscribed to the Google Groups
> "SciRuby Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [12]sciruby-dev...@googlegroups.com.
> For more options, visit [13]https://groups.google.com/d/optout.
>
> References
>
> Visible links
> 1. http://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
> 2. https://github.com/agisga/MixedModels.git
> 3. https://github.com/SciRuby/nmatrix/pull/366
> 4. https://github.com/SciRuby/nmatrix/pull/365
> 5. javascript:
> 6. https://groups.google.com/d/optout
> 7. https://github.com/agisga/MixedModels.git
> 8. https://github.com/SciRuby/nmatrix/pull/366
> 9. https://github.com/SciRuby/nmatrix/pull/365
> 10. javascript:
> 11. https://groups.google.com/d/optout

> 12. mailto:sciruby-dev...@googlegroups.com
> 13. https://groups.google.com/d/optout

--

--
You received this message because you are subscribed to the Google Groups "SciRuby Development" group.

To unsubscribe from this group and stop receiving emails from it, send an email to sciruby-dev...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pjotr Prins

unread,

Jun 5, 2015, 8:43:53 PM6/5/15

to Alexej Gossmann, sciru...@googlegroups.com

On Fri, Jun 05, 2015 at 04:45:11PM -0500, Alexej Gossmann wrote:
> Hi Pjotr,
>

It appears to me that a lexer/parser would be the way to go unless you
somehow can make it using Ruby as a DSL. Ragel we have used to good
effect as part of Sambamba and bio-vcf, e.g.,

https://github.com/pjotrp/bioruby-vcf/blob/master/ragel/gen_vcfheaderline_parser.rl

(tests at the bottom)

Ragel is well documented, fast and generates pure Ruby. It takes a few
days to master, but the principle is straightforward, you can create a
call-back at any time of the parsing and generate output.

> Also, I will look if I can use [1]daru for the input data frames. What to
> you think?

Probably worth trying. I have not used it myself.

> I have also thought about the output module. Since I don't know in what
> context users will fit LMMs in Ruby, I decided not to worry about nice
> presentation of the output in fancy tables (like in R), but instead to
> give outputs as Hashes. For example, #fix_ef would return a Hash with all
> information about the estimated fixed effects coefficients; something
> like:
>
> > model _ fit.fix_ef # => { "Intercept" => 3.32, "Intercept: SD" => 0.2,
> "Intercept: TS" => 2.54, "Intercept: p-value" => 0.02, "Intercept: 95%CI"
> => [3.13, 3.45], "Age" => 2.2, "Age: SD" => 1.13, "Age: TS" => 1.54,
> "Age: p-value" => 0.24, "Age: 95%CI" => [1.13, 3.45], ................. }
>
> Do you think that's okay?

Perfect. Allow for JSON output and anyone can transform it to
something else.

Pj.

Pjotr Prins

unread,

Jun 6, 2015, 2:24:04 PM6/6/15

to Pjotr Prins, Alexej Gossmann, sciru...@googlegroups.com

Maybe I am overcomplicating things. Effect is a vector, right? And the
rest is basically matrix notation. What do the NMatrix people say? Can
we come up with a pure Ruby notation that will actually work? The R
notation is nice, but I think under the hood it must be syntactic
sugar. We should be able to do something similar in Ruby (DSL).

A lexer/parser in combination with a DSL is still an option.

Pj.

Alexej Gossmann

unread,

Jun 6, 2015, 3:03:56 PM6/6/15

to sciru...@googlegroups.com, alex...@googlemail.com, pjotr...@gmail.com

Maybe I am overcomplicating things. Effect is a vector, right? And the
rest is basically matrix notation. What do the NMatrix people say?

Though it is matrix notation, the problematic is how to tell the model which columns of the matrix will be multiplied by the fixed effects coefficients, which ones will be multiplied by the random effects coefficients, and which columns represent the group structures for which of the random effects. That is, the specification of a classical linear mixed models, where groups of the observations share common random effects values.

The formula language in lme4 has a more or less clear way of specifying these relationships, and it is well documented. Therefore, I wanted to mimic it in Ruby.
R's nlme and Python's statmodels have similar formula interfaces too, differing from lme4 in the requirement of two formulas, one formula specifying the fixed effects and another the random effects (quite confusing in nlme, and I don't have experience with python's implementation).

Basically, such formula interfaces relieve the user from the specification of the model matrices X and Z, and the covariance matrix for the random effects. Also, they relieve the user from the need to know the exact mathematical matrix formulation of the statistical model (which is important because I don't think that many people who apply mixed models in medical and social sciences can actually write down the model as a matrix equation).

So, I cannot really see a way of how to avoid using an R-like formula interface for model specification (setting up all model matrices by hand is one alternative for people who understand the math behind their model, but it's lengthy as you can see in my example here).

Alexej

Alexej Gossmann

unread,

Jun 12, 2015, 3:35:11 PM6/12/15

to sciru...@googlegroups.com, Artem Tarasov, alex...@googlemail.com

I have written another blog post. It describes a prototype for the user interface of the linear mixed model algorithm:
http://agisga.github.io/LMM-model-specification/

Will Levine

unread,

Jun 12, 2015, 5:41:44 PM6/12/15

to sciru...@googlegroups.com

Hi Alexej,
You probably can't exactly replicate the R syntax in ruby (at least
not without writing a parser yourself) for at least two reasons.
First, to get "(1|variable)" to work, you would have to monkeypatch
one of the core classes for numbers, which is probably a bad idea.
Secondly, "~" doesn't work as a binary operator in ruby.

However, you might be able to approximate the R syntax. I am not sure
if this an adequate solution since I don't totally understand the
problem specification or even what an LMM is. I put together a really
quick example here:
https://gist.github.com/wlevine/444fe02b145f56558db1 Like I said, I
don't know if this is a good idea or even if it will work.

Will

Alexej Gossmann

unread,

Jun 12, 2015, 6:46:12 PM6/12/15

to sciru...@googlegroups.com

Thanks, Will! I like your idea :)

It might be somewhat inconvenient to define every variable with lmm_variable before constructing the formula. But I guess I can define a function that transforms a string "z ~ x*y + (1|u)" into another string like
formula_string = "z.is_modeled_as(lmm_variable(:x) * lmm_variable(:y) + (lmm_variable(:one) | lmm_variable(:u)))",

and then I could do,

formula = eval(formula_string)

to get

#<LMM_expression:0x007f09bbedf960 @content=[[:z], [:x, :mult, :y], [:one, :pipe, :u]]>

Finally, I could pass formula.content into LMM#from_daru.

All of that would happen inside a method LMM#from_formula (e.g. LMM.from_formula("z ~ x*y + (1|u)", dataset)).

Does that seem doable in Ruby?

The remaining problem is that the meaning of x*y or x:y is difficult if one or both variables are categorical.

Pjotr, what do you think?

Alexej

Pjotr Prins

unread,

Jun 13, 2015, 2:25:04 AM6/13/15

to 'Alexej Gossmann' via SciRuby Development

On Fri, Jun 12, 2015 at 05:46:11PM -0500, 'Alexej Gossmann' via SciRuby Development wrote:
> Thanks, Will! I like your idea :)
> It might be somewhat inconvenient to define every variable with
> lmm_variable before constructing the formula. But I guess I can define a
> function that transforms a string "z ~ x*y + (1|u)" into another string
> like
> formula_string = "z.is_modeled_as(lmm_variable(:x) * lmm_variable(:y) +
> (lmm_variable(:one) | lmm_variable(:u)))",
> and then I could do,
> formula = eval(formula_string)
> to get
>
> #<LMM_expression:0x007f09bbedf960 @content=[[:z], [:x, :mult, :y], [:one, :pipe, :u]]>
>
> Finally, I could pass formula.content into LMM#from_daru.
> All of that would happen inside a method LMM#from_formula (e.g.
> LMM.from_formula("z ~ x*y + (1|u)", dataset)).
> Does that seem doable in Ruby?
> The remaining problem is that the meaning of x*y or x:y is difficult if
> one or both variables are categorical.
> Pjotr, what do you think?

That looks rather clever :). How about declaring x and/or y to be
categorical when declaring the variable?

Creating an R-like parser to generate this declaration may be possible
too.

Pj.

Alexej Gossmann

unread,

Jun 17, 2015, 3:11:20 PM6/17/15

to sciru...@googlegroups.com

I made a user-friendly formula interface to MixedModels.
It also supports categorical variables and interaction effects (though currently only between two numeric variables).
This blog post gives a usage example and briefly describes the implementation: http://agisga.github.io/MixedModels_from_formula/

Alexej

Alexej Gossmann

unread,

Jul 3, 2015, 2:57:15 PM7/3/15

to sciru...@googlegroups.com

I implemented hypotheses tests and confidence intervals for the coefficient estimates of a linear mixed model. Here is a blog post describing the implementation and some of the underlying theory and controversies:

http://agisga.github.io/MixedModels_p_values_and_CI/

There is much debate in the scientific community on what is the right way to compute p-values and other diagnostics for linear mixed models. For a good summary see here or the links in my blog post.
Please let me know if you have an opinion on how we should handle it in Ruby (use the convenient and computationally light Wald tests only, or try to implement slow and heavy but more accurate MCMC and bootstrap methods, or not return any p-values as lme4 does, ...).

Alexej

Claudio Bustos

unread,

Jul 5, 2015, 10:18:17 PM7/5/15

to sciruby-dev

Personally, I love alternatives. A good implementation of bootstrap and MCMC p-values will be a must to have.

--
You received this message because you are subscribed to the Google Groups "SciRuby Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sciruby-dev...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Claudio Bustos
Psicólogo
clbu...@gmail.com

Alexej Gossmann

unread,

Jul 6, 2015, 2:24:31 PM7/6/15

to sciru...@googlegroups.com

> Personally, I love alternatives. A good implementation of bootstrap and MCMC p-values will be a must to have.

Claudio, I aggree. Maybe I will implement the approach described http://www.r-project.org/conferences/useR-2009/slides/SanchezEspigares+Ocana.pdf, but I will also look for more references before starting the work (let me know if you know of any).

Alexej

Alexej Gossmann

unread,

Aug 7, 2015, 2:38:00 AM8/7/15

to SciRuby Development, pjot...@thebird.nl

Hello,

Recently I have implemented bootstrap methods (including confidence intervals) for linear mixed models in mixed_models.
Today, I have written a blog post with some underlying theory, some implementation details, and examples:
http://agisga.github.io/bootstap_confidence_intervals/

Best regards,
Alexej

Pjotr Prins

unread,

Aug 7, 2015, 3:42:46 AM8/7/15

to Alexej Gossmann, SciRuby Development, pjot...@thebird.nl

Hi Alexej,

Great stuff.

It is no accident you have 2x speedup. It can probably be explained
by your laptop using hyperthreading (so 2 cores look like 4). For CPU
intensive tasks hyperthreading usually gives a 10-20% speedup
depending on IO requirements.

Pj.

On Thu, Aug 06, 2015 at 11:38:00PM -0700, Alexej Gossmann wrote:
> Hello,
>
> Recently I have implemented bootstrap methods (including confidence

> intervals) for linear mixed models in [1]mixed_models.

> Today, I have written a blog post with some underlying theory, some
> implementation details, and examples:

> [2]http://agisga.github.io/bootstap_confidence_intervals/

>
> Best regards,
> Alexej
>
> On Wednesday, May 27, 2015 at 2:01:51 PM UTC-5, Alexej Gossmann wrote:
>
> Hi Pjotr,
>
> I have set up a github repository for my GSoC project:

> [3]https://github.com/agisga/MixedModels.git

>
> Currently, it only includes some matrix methods which are going to be
> useful to me, and a prototype implementation of a deviance/REML function
> class. The deviance function is the objective function for optimization.
>
> Also, I have sent the following pull requests to NMatrix yesterday,
> which are related to my project:

> [4]https://github.com/SciRuby/nmatrix/pull/366
> [5]https://github.com/SciRuby/nmatrix/pull/365
>
> Alexej
>
> References
>
> Visible links
> 1. https://github.com/agisga/mixed_models
> 2. http://agisga.github.io/bootstap_confidence_intervals/
> 3. https://github.com/agisga/MixedModels.git
> 4. https://github.com/SciRuby/nmatrix/pull/366
> 5. https://github.com/SciRuby/nmatrix/pull/365

--

Alexej Gossmann

unread,

Aug 7, 2015, 5:09:08 PM8/7/15

to SciRuby Development, pjot...@thebird.nl

It is no accident you have 2x speedup. It can probably be explained
by your laptop using hyperthreading (so 2 cores look like 4). For CPU
intensive tasks hyperthreading usually gives a 10-20% speedup
depending on IO requirements.

Thanks Pjotr. I checked, you are absolutely right about the CPU of my laptop. I corrected the blog post.

I have also updated my blog post about application of mixed models to genetic data (which I have actually never posted in the sciruby mailing list) with an application of bootstrap confidence intervals: http://agisga.github.io/mixed_models_applied_to_family_SNP_data/

I am working now on bootstrap and Monte Carlo based hypothesis tests (need to do some reading).

Best,

Alexej

On Wednesday, May 27, 2015 at 2:01:51 PM UTC-5, Alexej Gossmann wrote:

Reply all

Reply to author

Forward