R-like formulas


terhorst

Feb 10, 2011, 11:30:12 AM
to pystatsmodels
Hi all,

One of the last slides of http://conference.scipy.org/scipy2010/slides/skipper_seabold_statsmodels.pdf
mentions an R-like formula framework. I was writing to see if anyone
is / was working on such a thing. I might be interested in taking a
crack at it if not. Any pointers or advice would be welcome. Thanks!

Jonathan

josef...@gmail.com

Feb 10, 2011, 1:20:16 PM
to pystat...@googlegroups.com

That would be great. It would be very welcome, especially by the many
users who come from the R side.

Nothing much has happened except for looooong discussions.

http://groups.google.com/group/pystatsmodels/browse_thread/thread/9636cb2f8a0d37cf/12d738ba991be696?q=nipy#12d738ba991be696

and more recently as part of a thread on the nipy list

http://mail.scipy.org/pipermail/nipy-devel/2011-January/005296.html

nipy uses sympy to work with formulas, but it doesn't use a string parser.

The sandbox of statsmodels contains the original version of the
formula framework by Jonathan Taylor. Some of it is working fine,
other parts are incomplete or are not fully working.

Nathaniel Smith posted his formula parser on the mailing list in the
first thread, but I have to look for it.

In case this sounds scary, I would recommend starting with one part and
slowly building up a working version. One possibility would be to use
Nathaniel's parser as the front end and formula classes like the
current ones by Jonathan as a backend, which then can delegate the
estimation and testing to the model and result classes in statsmodels.

I would be very happy if we could get a version that is able to handle
the easier cases initially, and gives us a way to expand in the
future.

If you have questions about the current code or other issues, I will
be glad to help.

So please go ahead, take a crack at it, and keep us updated.

Josef

Nathaniel Smith

Feb 10, 2011, 1:38:53 PM
to pystat...@googlegroups.com

Uh, *cough* funny you should ask. I wasn't planning to announce this
for a few days yet, but take a peek here: https://github.com/charlton

I need to write docs before I can get the public API right, but try
something like:

>>> from charlton.spec import ModelSpec
>>> data = {"y": [1, 2, 3], "x": [4, 5, 6], "a": ["high", "low", "high"]}
>>> model_spec = ModelSpec.from_desc_and_data("y ~ x", data)
>>> exog, endog = model_spec.make_matrices(data)
>>> print endog.column_info.column_names
['Intercept', 'x']
>>> print endog
[[ 1.  4.]
 [ 1.  5.]
 [ 1.  6.]]
>>> model_spec = ModelSpec.from_desc_and_data("y ~ np.log(x) * a", data)
>>> exog, endog = model_spec.make_matrices(data)
>>> print endog.column_info.column_names
['Intercept', 'a[T.low]', 'np.log(x)', 'np.log(x):a[T.low]']
>>> print endog
[[ 1.          0.          1.38629436  0.        ]
 [ 1.          1.          1.60943791  1.60943791]
 [ 1.          0.          1.79175947  0.        ]]

Also see the TODO file ;-)

-- Nathaniel

josef...@gmail.com

Feb 10, 2011, 1:51:31 PM
to pystat...@googlegroups.com

Nice, I like secret projects that say "OMG it works!"

Josef

Wes McKinney

Feb 10, 2011, 10:46:27 PM
to pystat...@googlegroups.com

I've only had a brief look through and what you've done looks really
fantastic-- I'll be very excited to see it integrated into
statsmodels.

- Wes

Nathaniel Smith

Feb 11, 2011, 1:32:41 PM
to pystat...@googlegroups.com
On Thu, Feb 10, 2011 at 10:51 AM, <josef...@gmail.com> wrote:
> Nice, I like secret projects that say "OMG it works!"

Yeah, I realized about 70% of the way in that I'd bitten off more than
I realized... :-) But it does do the job, and handles a number of
rather subtle issues. Now I just need to get that API right...

Speaking of which, what sort of API would *you* want for the model
matrix building stuff, to make it easy to integrate into your code? I
assume the ultimate user API you'd want to support would be something
like

MyModel(model_info, data=None, **otherargs)

where model_info might be any of: string formula ("y ~ x"), an object
describing that formula, or a raw model matrix?

-- Nathaniel

josef...@gmail.com

Feb 11, 2011, 2:38:25 PM
to pystat...@googlegroups.com

Given some of our previous discussions, my current thoughts are that
I would like to work with several layers, so we have a good
separation of the code without losing any info on the way.

* wrapper classes or functions, that could take model info or formula
as the main interface, like MyModel. This would use the formula
framework to generate the design matrix and additional information
like variable names, and any info we need about the columns. This
could be class methods of models, Model.fromformula or shorter, or
dispatch functions or separate classes.
Overloading the constructor, __init__, of the statistical models with
several different patterns with different signatures, would be
possible, but it doesn't look nice, and I have seen recommendations
against doing it. And I think it will make inheritance and super calls
to __init__ more complicated.

* current models (similar to R's lm.fit I guess) and result classes:
as we discussed when privatizing methods, we could attach the meta info
that contains the model definition in some form, but keep the
current endog and exog as numpy arrays.
The new public methods (fittedvalues, error, ...) can access the
metainfo to wrap the output.

* post-estimation support for testing: here I would like to use the
information from the formula framework, or the formula framework
itself to (semi-)automatically create for example contrast matrices,
f_test, ttest and multipletests. For example f_test in pandas uses
variable names or string parsing to define the contrast matrix, and it
would be nice to have an optional way where we can use the formula
instance again to help us.
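A minimal sketch of what formula-assisted contrast building could look like. `contrast_for` is a hypothetical helper (not an existing statsmodels or charlton function); it only shows how design-matrix column names could be turned into the restriction matrix that f_test expects:

```python
import numpy as np

# Hypothetical helper: given the design-matrix column names, build a
# restriction matrix R for testing R @ beta = 0 on the named
# coefficients (the shape f_test works with).
def contrast_for(column_names, tested_names):
    R = np.zeros((len(tested_names), len(column_names)))
    for i, name in enumerate(tested_names):
        R[i, column_names.index(name)] = 1.0
    return R

names = ["Intercept", "a[T.low]", "np.log(x)", "np.log(x):a[T.low]"]
R = contrast_for(names, ["a[T.low]", "np.log(x):a[T.low]"])
# R selects the two coefficients involving the factor a:
# [[0. 1. 0. 0.]
#  [0. 0. 0. 1.]]
```

The point is that once the formula instance hands back the column names, building such matrices is mechanical string lookup rather than manual index bookkeeping.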

The result classes now have a compare method, but this is currently
the only support for quickly comparing different models. So somewhere
there should be support for quickly creating changed models out of
previous models (which seems to be your working style).

I haven't looked in detail at your code yet (since it's secret). One
thought I had is how we can also benefit from the design matrix and
contrast matrix construction code programmatically, i.e.
bypassing the "string" part and using the underlying functions
directly. This would make, for example, writing some tests easier
because they need the same machinery.

One question for statsmodels internally will be what we put in the
metainfo of a model instance: should we hold on to a formula instance
with all associated information and reuse it, or should we just keep
the information and recreate a formula or whatever if we need further
processing?
(As an aside: I just saw a discussion on stackoverflow triggered
because scipy.stats.distribution instances are not picklable. One
answer was: pickle the state (parameters) instead of the class
instance.)

This is roughly where I would like to go. It looks like you have
done all the heavy work for constructing design and contrast matrices
from a formula description, so we could try to wire it up to the
linear regression models and flesh out the details as we make it work.

Thanks for taking a big bite,

Josef



josef...@gmail.com

Feb 12, 2011, 3:25:34 AM
to pystat...@googlegroups.com

I tried out charlton for a few hours starting with your example, and
it looks pretty good. I will have a few questions on specific things
later.

>nosetests --all-modules \charlton\charlton
still shows a few failures, but it also looks pretty good.

Python introspection works pretty well for finding my way around a bit.

From what I have read about R's formula implementation, it defines two
APIs: one for users and one for developers.

I'm glad to leave the user side (mostly) to the statisticians who have
experience with it. (Besides adding the formula interface.)

From the development side, it looks like much of it is inside charlton;
I'm not sure where the boundaries in R are compared to this.

What we can use immediately is the construction of the design matrix
with make_matrices, and the column_names. I haven't looked enough at the
contrast code yet.

term_name_to_columns and similar sound useful. I haven't figured them
out yet, but would be very interested.
For example for the multiple comparisons test:
I was basing it on a dummy encoding, and then constructing contrast
matrices for all pairwise comparisons, comparisons to control, and
similar for conditional interaction tests. The information I would
need for a general design matrix is the location of the relevant
dummies (column indices), the number of levels for each factor that is
involved in interaction effects, and (barely touched yet) whether the
factors are nested. I'm also using or constructing the associated
names. (I only looked at one and two factors so far.)

Another use case is marginal or average (treatment) effects, like what
Skipper did in discretemod. It's not possible to do this correctly for
interaction effects without the relevant information, and for categorical
effects it's also difficult. (Too late in the night, I don't
remember the details for categorical variables in this.)

So one of the internal API question is how to pull this information
out of the formula framework.

Going through the code in charlton, it really looks like a lot of
work that you did (and I'm glad I didn't have to figure this out by
myself).

Josef


Nathaniel Smith

Feb 13, 2011, 11:52:08 PM
to pystat...@googlegroups.com
On Fri, Feb 11, 2011 at 11:38 AM, <josef...@gmail.com> wrote:
> Given some of our previous discussions, my current thoughts are that
> I would like to work with several layers, so we have a good
> separation of the code without losing any info on the way.
>
> * wrapper classes or functions, that could take model info or formula
> as the main interface, like MyModel. This would use the formula
> framework to generate the design matrix and additional information
> like variable names, and any info we need about the columns. This
> could be class methods of models, Model.fromformula or shorter, or
> dispatch functions or separate classes.
> Overloading the constructor, __init__, of the statistical models with
> several different patterns with different signatures, would be
> possible, but it doesn't look nice, and I have seen recommendations
> against doing it. And I think it will make inheritance and super calls
> to __init__ more complicated.
>
> * current models (similar to R's lm.fit I guess) and result classes:
> as we discussed in privatizing methods, we could attach the meta info,
> that contains the model definition in some form, but we keep the
> current endog and exog as numpy arrays.
> the new public methods (fittedvalues, error, ...) can access the
> metainfo to wrap the output

I think it would be simplest to have a slightly different interface,
consisting of a single class that does something like:

def __init__(self, design, data=None, ...):
    self.exog, self.endog = charlton.make_model_matrices(design, data)
    # ...

Where make_model_matrices is a function that supports all of these
calling conventions and always does the right thing:
make_model_matrices("y ~ x", {"y": ..., "x": ...})
make_model_matrices(my_model_desc_object, {"y": ..., "x": ...})
make_model_matrices((np.asarray(y), np.asarray(X)))

I don't see the advantage of taking that last case and sticking it
into a whole other class. Also, this way we get consistent handling of
metadata -- in the last case I'd expect make_model_matrices to make up
some column names ("column0", "column1", etc.) and attach the
appropriate metadata, so that all the later code can assume that
metadata exists and is in the standard format.
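A sketch of that dispatch idea. The function and method names here are assumptions about the eventual API (not charlton's published interface); only the raw-array branch is filled in:

```python
import numpy as np

def make_model_matrices(design, data=None):
    """Hypothetical single entry point: accept a formula string, a
    model-description object, or a raw (y, X) pair, and always return
    matrices plus consistent column-name metadata."""
    if isinstance(design, str):
        # A real implementation would parse the formula and build the
        # matrices here (this is the part charlton provides).
        raise NotImplementedError("formula parsing not sketched")
    if hasattr(design, "make_matrices"):
        # A model-description object knows how to build its own matrices.
        return design.make_matrices(data)
    # Otherwise treat design as a raw (y, X) pair and invent column
    # names, so downstream code can always rely on metadata existing.
    y, X = (np.asarray(a) for a in design)
    names = ["column%d" % i for i in range(X.shape[1])]
    return y, X, names

y, X, names = make_model_matrices((np.arange(3), np.ones((3, 2))))
# names == ['column0', 'column1']
```

The design choice being illustrated: the fallback branch manufactures the same kind of metadata the formula branch would produce, so later code never has to special-case "no names available".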

> * post-estimation support for testing: here I would like to use the
> information from the formula framework, or the formula framework
> itself to (semi-)automatically create for example contrast matrices,
> f_test, ttest and multipletests. For example f_test in pandas uses
> variable names or string parsing to define the contrast matrix, and it
> would be nice to have an optional way where we can use the formula
> instance again to help us.

Yes, that would be lovely. I haven't thought about this much yet, but
I bet we could re-use charlton's parser. (The parser itself is very
generic and just spits out an AST; there's an evaluator that runs over
that AST to produce model descriptions, but writing a different
evaluator would be trivial.)

> The result classes have now a compare method, but this is currently
> the only support for quickly comparing different models. So somewhere
> there should be support for quickly creating changed models out of
> previous models (which seems to be your working style).

Well, I actually mostly tweak models by hitting up-arrow to find them
in my command history and then changing their specification, which
requires a good high-level notation but not much more... But here are
some other things you can do if you have some richer information than
just the model matrix:

-- What R calls "drop1": for each term in the model: fit a model
without that term, so you can test that term's unique contribution to
the model via likelihood ratio test.
-- Calculate model predictions on new data vectors. (Both for its own
sake, and as a primitive needed for many other operations -- e.g., if
you have a non-linear model, then plotting on the response scale or
bootstrapping confidence intervals on the response scale both require
the ability to evaluate the fitted model on new data)
-- What R calls "update": fit a new model that's *like* that old model
except... R has a magic syntax that looks like ". ~ . - x1 + x2",
where "." means "what the old model had here". This will be easy to
add to charlton, I just haven't gotten around to it yet.

> I haven't looked yet in details at your code (since it's secret). One
> thought I had is how we can benefit from the design matrix and
> contrast matrix construction code also programmatically, i.e.
> bypassing the "string" part and using the underlying functions
> directly. This would make, for example, writing some tests easier
> because they need the same machinery.

Yes, that was a design goal (partly to satisfy some people's fears
about parsing strings, but it's a good idea). See
charlton.desc.ModelDesc, which is the Python object equivalent of a
formula string.

> From the development side, it looks like much is inside charlton, I'm
> not sure where the boundaries in R are compared to this.

AFAICT (I haven't actually used them "in anger", so to speak), R's
formula handling is pretty opaque from the outside. There are a few
standard operations and people treat them as a black box. I think
Charlton is actually somewhat better in this regard -- e.g., it should
be easy to extend the syntax if you have special needs. (The 'lme4'
package in R does something similar to let you specify random effects
alongside your fixed effects, but almost no-one else does this,
probably because it requires such complicated magic.)

> term_name_to_columns and similar sound useful. I haven't figured them
> out yet, but would be very interested.

Basically, a "term" is one item in your formula, like "x" or "a:b". It
might correspond to multiple columns, either because it's a numeric
predictor that turns out to be a matrix with multiple columns (this is
convenient for something like "spline(x)", where spline is a function
that returns a spline basis for x as a matrix), or because it's a
categorical variable that needed coding.

term_name_to_columns just says, "x" ended up in columns 2-4, while
"a:b" ended up in 5-12.

I figured it'd be especially useful for something like classic ANOVA
tests, which involve fitting a model like "1 + a + b + a:b" and then
testing each *term* for significance, but I assume there are other
uses too.
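The mapping can be sketched in a few lines (a hypothetical reimplementation of the idea, not charlton's actual code):

```python
def term_name_to_columns(column_terms):
    """Given, for each design-matrix column in order, the name of the
    term that produced it, return {term_name: (start, stop)} slices."""
    out = {}
    for i, term in enumerate(column_terms):
        start, _ = out.get(term, (i, i))
        out[term] = (start, i + 1)
    return out

# "1 + a + b + a:b" with a 3-level a and 2-level b might lay out as:
cols = ["Intercept", "a", "a", "b", "a:b", "a:b"]
print(term_name_to_columns(cols))
# {'Intercept': (0, 1), 'a': (1, 3), 'b': (3, 4), 'a:b': (4, 6)}
```

A term-wise ANOVA test would then grab the slice for "a:b" and test all of those coefficients jointly.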

> For example for the multiple comparisons test:
> I was basing it on a dummy encoding, and then constructing contrast
> matrices for all pairwise comparisons, comparisons to control, and
> similar for conditional interaction tests. The information I would
> need for a general design matrix is the location of the relevant
> dummies (column indices), the number of levels for each factor that is
> involved in interaction effects, and (barely touched yet) whether the
> factors are nested. I'm also using or constructing the associated
> names. (I only looked at one and two factors so far.)

You can get dummy coding out of Charlton (e.g., "y ~ 0 + a:b"), but it
isn't the usual case. The full coding of something like an
interaction, possibly involving both categorical and numeric
variables, in a model that fits main effects and interaction effects
separately, can be... complicated. And depends on various
user-specified things, like what kind of coding they want to use.

Right now the only information Charlton exposes about these
complications is in the column names -- e.g., a column named
"a[T.low]:b[red]" is the pointwise multiplication of the
treatment-coded "low" level of factor a, and the dummy-coded "red"
level of factor b.

Maybe if we also exposed the contrast matrices then that'd be enough
to do things like estimate arbitrary new contrasts (like the pairwise
comparisons, comparisons to control, etc. that you mention) from a
fitted model? It seems likely to me, but I'd have to work through the
math to figure out how feasible it is in the general case.

Of course, just having the ability to control contrast coding gives you
some of what you want (like, treatment-coded estimates basically are
comparisons to control).
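For the pairwise comparisons mentioned above, the contrast rows themselves are mechanical once you know which columns hold the cell means. A minimal sketch, assuming full dummy (cell-mean) coding of one factor; `all_pairwise_contrasts` is a hypothetical helper:

```python
import numpy as np
from itertools import combinations

def all_pairwise_contrasts(k):
    """All pairwise comparison contrasts for a k-level factor under
    full dummy (cell-mean) coding: row (i, j) tests mean_i - mean_j = 0."""
    rows = []
    for i, j in combinations(range(k), 2):
        row = np.zeros(k)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
    return np.array(rows)

C = all_pairwise_contrasts(3)
# 3 levels -> 3 pairwise comparisons:
# [[ 1. -1.  0.]
#  [ 1.  0. -1.]
#  [ 0.  1. -1.]]
```

The hard part the formula framework would supply is exactly what Josef listed: which columns the dummies live in and how many levels each factor has; after that, rows like these drop out.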

-- Nathaniel

josef...@gmail.com

Feb 14, 2011, 1:30:54 AM
to pystat...@googlegroups.com


One of the main problems I have with the above is that I want to have
a shortcut to the models and not spend the time building
consistent metadata. The main reason is that, as we extend resampling,
bootstrap, cross-validation and outlier detection, we create the model
instance and run through this maybe thousands of times. WLS is used by
both glm and rlm in the iterative solver. (But we still don't have any
profiling for statsmodels, and for non-linear models this might be
small compared to the estimation.)

What is the split between what goes into the formula and what stays in
additional arguments, when models define additional arguments: sigma,
weights in GLS and WLS, families, links, and various other things? I
never figured out a pattern in R.

If "design" above just contains endog and exog then it would work
across many models, but if we stick more into it, then it might get
messy. For instrumental variables, are the instruments part of the
formula or separate? Do we get two-component formulas, e.g.
zero-inflated poisson, one model for the mixture distribution, one
model for the component part; GARCH, one model for the mean, one model
for the variance?
Currently we just have python signatures that are usually pretty
informative. Signatures in R with Model(formula, ...) have a much
steeper learning curve, at least for me.

If signature overloading is generally desired, I might be in favor of
just overloading endog and keeping the remaining arguments optional,
and then delegating if endog is not a basic ndarray; maybe rename the
argument to "endog_or_formula".
The signature wouldn't indicate anymore which other arguments are
required, but it would still show them. And it would leave the way
open to move additional arguments into the formula.

Most of the things below are relatively clear and uncontroversial, and
I will have to go over them in detail and try them out (especially the
term_to_columns, I'd like to see how I can use them for the tests).

(Just as a note: in econometrics I never learned any codings besides
dummy coding, and, if I remember STATA correctly, they wrote in the
release notes that they only have dummy coding and will add other
codings later.)

Thanks,

Josef

Nathaniel Smith

Feb 14, 2011, 2:54:35 PM
to pystat...@googlegroups.com
On Sun, Feb 13, 2011 at 10:30 PM, <josef...@gmail.com> wrote:
> On Sun, Feb 13, 2011 at 11:52 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> I don't see the advantage of taking that last case and sticking it
>> into a whole other class. Also, this way we get consistent handling of
>> metadata -- in the last case I'd expect make_model_matrices to make up
>> some column names ("column0", "column1", etc.) and attach the
>> appropriate metadata, so that all the later code can assume that
>> metadata exists and is in the standard format.
>
> One of the main problems I have with the above is that I want to have
> a shortcut to the models and not spend the time building
> consistent metadata. The main reason is that, as we extend resampling,
> bootstrap, cross-validation and outlier detection, we create the model
> instance and run through this maybe thousands of times. WLS is used by
> both glm and rlm in the iterative solver. (But we still don't have any
> profiling for statsmodels, and for non-linear models this might be
> small compared to the estimation.)

Here 'a' is a matrix with 100 columns:

In [6]: timeit column_names = ["column%s" % (i,) for i in xrange(a.shape[1])]
10000 loops, best of 3: 68.3 us per loop

70 microseconds * 1000s of runs = a few tenths of a second total
overhead. Profile first, *then* worry about optimization :-).

> What is the splitup between what's put in the formula and what is in
> additional arguments, if models define additional arguments, sigma,
> weight in GLS, WLS, families, links, and various other things. I never
> figured out a pattern in R.
>
> If "design" above just contains endog and exog then it would work
> across many models, but if we stick more in it, then it might get
> messy. For instrumental variables, are the instruments part of the
> formula or separate. Do we get two component formulas, e.g.
> zero-inflated poisson, one model for the mixture distribution, one
> model for the component part, GARCH, one model for the mean, one model
> for the variance?

A formula in charlton is just an object that lets you turn data into
either one or two matrices (the former for formulas like "x1 + x2"
without any "y ~" part). You won't go far wrong if you just take the
signature you would otherwise use, and for each argument that's a
data-dependent matrix with one row per observation, replace it with a
formula. I certainly wouldn't put weights, families, links, etc. in
there.

There are places where it might be nice to extend the basic formula
syntax to handle more exotic situations (like I mentioned lme4 does);
in those cases you probably do want both a high-level and low-level
API. But those cases are relatively rare, I think.

> (Just as a note: in econometrics I never learned any codings besides
> dummy coding, and, if I remember STATA correctly, they wrote in the
> release notes that they only have dummy coding and will add other
> codings later.)

It's amazing how many different statistics cultures there are, each
with their own terminology and everything. I, by comparison, have only
the haziest idea how an instrumental variable is used :-).

-- Nathaniel

Simon Byrne

Feb 14, 2011, 5:23:03 PM
to pystatsmodels
On Feb 10, 6:20 pm, josef.p...@gmail.com wrote:
> Nothing much has happened except for looooong discussions.

I noticed that part of those discussions focused on how the R formula
syntax actually works. I recently learnt that it is called
"Wilkinson-Rogers notation":

https://stat.ethz.ch/pipermail/r-help/2002-July/023036.html

The original article, which details how expansions and whatnot are
supposed to work, is here (for those with JSTOR access):

http://www.jstor.org/stable/2346786

Simon

Jonathan Taylor

Feb 21, 2011, 1:23:27 PM
to pystat...@googlegroups.com
Just adding my 2 cents. I agree with Josef: I think it would be wise to keep the model specification in terms of arrays
and use formulas separately to generate design matrices and contrast matrices.

R's use of things like "lm(Y~., data)" makes it easy for users but can hide a lot of assumptions. That is,
I think the most consistent base function is something like R's "lm.fit" rather than "lm".

Using things like "term_name_to_columns" is really an interactive tool.
It is useful for "deconstructing" what R or some "formula" tool has done to your "formula", but it
is not really useful unless you already know what the term names are, i.e. how the formula parser takes a string like
"poly(x, 3) * factor(a) * factor(b) + poly(x,2) * poly(y,3) * factor(b) * factor(c)" and constructs the column names.
I think these issues should be dealt with separately from the actual fitting of the model.

Also, this has the advantage of allowing different specifications of "formula". I don't think it's much more work for a novice
user to use the template

>>> design = construct_matrices(model_spec)
>>> lm = OLS(Y, design)

as compared to

>>> lm = OLS(Y, model_spec)

It shows the novice user that an OLS model is really specified by matrices, not by a formula.

R's functions like "drop1" and "update" are useful for some automated purposes (like a stepwise version of AIC/BIC-based model selection) but are a little inflexible. For instance, it seems natural to me to expect that drop1(lm(Y ~ poly(X,3))) would try fitting a quadratic model instead of the cubic specified by poly(X,3), but it doesn't:

> lm(Y~poly(X,3))

Call:
lm(formula = Y ~ poly(X, 3))

Coefficients:
(Intercept)  poly(X, 3)1  poly(X, 3)2  poly(X, 3)3 
   -0.07689      0.03371     -1.98151     -0.13904 

> drop1(lm(Y~poly(X,3)))
Single term deletions

Model:
Y ~ poly(X, 3)
           Df Sum of Sq    RSS     AIC
<none>                  31.159 -1.9906
poly(X, 3)  3    3.9469 35.106 -3.2201
>

Or, it might be nice to try to drop the cubic using "update". No such luck, even if
we use the actual string assigned to the column name in the fitted "lm".

> update(lm(Y~poly(X,3)), . ~ . - X^3)

Call:
lm(formula = Y ~ poly(X, 3))

Coefficients:
(Intercept)  poly(X, 3)1  poly(X, 3)2  poly(X, 3)3 
   -0.07689      0.03371     -1.98151     -0.13904 

> update(lm(Y~poly(X,3)), . ~ . - I(X^3))

Call:
lm(formula = Y ~ poly(X, 3))

Coefficients:
(Intercept)  poly(X, 3)1  poly(X, 3)2  poly(X, 3)3 
   -0.07689      0.03371     -1.98151     -0.13904 

> update(lm(Y~poly(X,3)), . ~ . - poly(X, 3)3)
Error: unexpected numeric constant in "update(lm(Y~poly(X,3)), . ~ . - poly(X, 3)3"

I don't think the above are bugs in R's implementation of formulae; it's just that they
don't do what I might consider natural. And I just don't think it's worth creating a more complex OLS class that will try to do things that R
does not do natively in a natural way (at least in my opinion).

I would like to think of the formula as more than just a tool that creates matrices, and there
are exotic use cases of formulas that are not really rare. A natural one is an additive model, which is not really easy to specify in R except for two specific smoothers: smoothing splines
and loess.

> Z = rnorm(40)
> gam(Y ~ s(X) + lo(Z))
Call:
gam(formula = Y ~ s(X) + lo(Z))

Degrees of Freedom: 39 total; 30.3881 Residual
Residual Deviance: 25.19731
>

It can be done by parsing the strings "s" and "lo", keeping them as reserved keywords, but what if I have my own favorite
smoother that is not a smoothing spline or a loess smoother?

Random effects models are also not that rare...

Jonathan
--
Jonathan Taylor                         
Dept. of Statistics                     
Sequoia Hall, 137                         
390 Serra Mall
Stanford, CA 94305
Tel:   650.723.9230
Fax:   650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo

Jonathan Taylor

Feb 21, 2011, 4:03:46 PM
to pystat...@googlegroups.com
Not sure I sent this from the right email address. Apologize for the duplication, if any.

josef...@gmail.com

Feb 21, 2011, 8:26:38 PM
to pystat...@googlegroups.com

Since I just learned roughly what R's poly is:

If my interpretation of it as PCA on polynomials is correct, then I'm not
sure what drop would mean. In PLS and PCA regression one of the main
points is to choose the number of PCA components, and I can imagine
having something like a drop method that reduces the number of
components, but not the order of the underlying polynomial.
(In the sandbox code for PCA regression I just report AIC, BIC, ...
for varying number of components.)

Independent of the formulas, it looks like we need more helper
functions like add_constant, categorical (which is partially like
Factor) and np.vander.
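As an illustration of the kind of helpers meant here (the `categorical` stand-in below is a sketch, not statsmodels' actual function):

```python
import numpy as np

# np.vander gives a quick polynomial design matrix -- raw powers, not
# the orthogonal basis R's poly() produces:
x = np.array([1.0, 2.0, 3.0])
X_poly = np.vander(x, 3)      # columns: x**2, x**1, constant

# A minimal stand-in for a `categorical` helper: full dummy encoding
# of a factor (hypothetical; statsmodels' helper may differ).
def categorical(values):
    levels = sorted(set(values))
    dummies = np.array([[float(v == lev) for lev in levels] for v in values])
    return dummies, levels

D, levels = categorical(["high", "low", "high"])
# levels == ['high', 'low']; D == [[1, 0], [0, 1], [1, 0]]
```

What these helpers don't do, and what the formula layer adds, is exactly the name creation and column bookkeeping discussed next.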

Although the models don't care what the meaning of the design matrix
is (and I like it that way), for additional tasks (e.g. tests) I see
now that it will be useful to have the second layer of formulas,
terms, and so on, which knows how to construct and interpret the
design matrix, at least for the most common cases. And the formulas
will provide the additional infrastructure (like automatic name
creation and keeping track of columns) which np.vander, for example,
does not provide.


> I would like to think of the formula as more than just a tool that creates
> matrices and there
> are exotic use cases of formula that are not really rare. A natural one is
> an additive model that is not really easy to specify in R except for two
> specific smoothers (smoothing spline)
> and loess.
>
>  Z = rnorm(40)
>> gam(Y ~ s(X) + lo(Z))
> Call:
> gam(formula = Y ~ s(X) + lo(Z))
>
> Degrees of Freedom: 39 total; 30.3881 Residual
> Residual Deviance: 25.19731
>>
>
> It can be done by parsing the string "s" and "lo" keeping "s" as reserved
> keywords but I may have my own favorite
> smoother that is not a smoothing spline or a loess smoother?

It should be possible to define an interface also for non-linear
components, maybe similar to link functions in glm, that makes it
possible to plug in user-defined smoothers: for example, a
python class with some required methods and attributes.
For the string interface it might be possible to use the literal I() (?).

Interfaces for non-linear models is still pretty much an open question
(although I like subclasses).

>
> Random effects models are also not that rare...

I wish they were not so rare in statsmodels.

Thanks for the explanations, I'm always learning something new.

Josef

Nathaniel Smith

Feb 22, 2011, 12:42:25 PM
to pystat...@googlegroups.com, Jonathan Taylor
On Mon, Feb 21, 2011 at 10:23 AM, Jonathan Taylor
<jonatha...@stanford.edu> wrote:
> Just adding my 2 cents. I agree with Josef in that I think that it would be
> wise to keep the model specification in terms of arrays
> and use formula separately to generate design matrices and contrast
> matrices.
>
> R's use of things like "lm(Y~., data)" makes it easy for users but can hide
> a lot of assumptions. That is,
> I think the most consistent base function is like R's "lm.fit" rather than
> "lm".

I think we all agree that model specification in terms of arrays is
fundamental; we just have different ideas for what exactly the
interface should look like. If I'm understanding correctly, the
proposals are (using R terminology):
Josef: have 'lm' and 'lm.fit' as two separate functions
Me: have one function 'lm' that can take either a formula or matrices
You: have 'model.matrix' and 'lm.fit' as two separate functions (no 'lm')

> Using things like "term_name_to_columns" is really an interactive tool
> which is useful for "deconstructing" what R or some "formula" tool has done
> to your "formula" but it
> is not really useful unless you already know what the term names are, i.e.
> how the formula parser takes a string like
> "poly(x, 3) * factor(a) * factor(b) + poly(x,2) * poly(y,3) * factor(b) *
> factor(c)" and constructs the column names.
> I think these issues should be dealt with separate from the actual fitting
> of the model.

Absolutely. But the model fitting code needs a minimal awareness of
this kind of metadata, simply so that it can pass it through in a
useful way. Consider, in R:
model <- lm(y ~ foo + bar + another.thing, data=data)
summary(model) # labels each coefficient
coef(model)["bar"] # extracts the relevant coefficient

> Also, this has the advantage of allowing different specifications of
> "formula". I don't think it's much more work for a novice
> user to use the template
>
>>>> design = construct_matrices(model_spec)
>>>> lm = OLS(Y, design)
>
> as compared to
>
>>>> lm = OLS(Y, model_spec)
>
> It shows the novice user that an OLS model is really a specified by matrices
> not by a formula.

That's great for the novice user, but do you really want to be typing
two lines every time you fit a model? Do you often fit models
interactively? It strikes me as an annoying hassle.

(NB: Y is generally also produced by the formula. This allows things
like "log(y) ~ x" or (for a multivariate model) "y1 + y2 ~ x".)

> R's functions like "drop1" and "update" are useful for some automated
> purposes (like a stepwise version of AIC/BIC-based model selection) but are
> a little unflexible.

Sure. Convenience functions are convenient, not comprehensive; if you
want comprehensive, well, you have Python :-).

> For instance it seems natural to me to expect that
> drop1(lm(Y ~ poly(X,3))) would try fitting a quadratic model instead of the
> cubic specified by poly(X,3), but it doesn't:

I thought you disliked formulas because they were *too* magic, not
because they weren't magic enough ;-).

Seriously, I take your point, but this is pretty predictable -- a
formula contains zero or more terms; each term produces 1 or more
columns. poly(X, 3) has no "+" in it, so it's one term (which produces
3 columns, because poly(X, 3) evaluates to an n x 3 matrix). I can
imagine extending the formula syntax so that it "knows" about all
kinds of things like polynomial bases and stuff, but the current
system is very easy to explain and implement, very predictable, and
still handles a *lot* of the low-hanging fruit very well.
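That terms-to-columns bookkeeping is simple enough to sketch (hypothetical code; eval is used purely for illustration, a real parser would not):

```python
import numpy as np

def poly(x, degree):
    # raw polynomial basis: one term that expands to `degree` columns
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def build_design(formula_rhs, env):
    """Toy version of the rule above: split the right-hand side on '+',
    evaluate each term, and record which columns each term produced."""
    terms = [t.strip() for t in formula_rhs.split("+")]
    blocks, spans, start = [], {}, 0
    for term in terms:
        block = np.asarray(eval(term, {"poly": poly}, env))
        if block.ndim == 1:  # a plain variable contributes one column
            block = block[:, None]
        blocks.append(block)
        spans[term] = slice(start, start + block.shape[1])
        start += block.shape[1]
    return np.hstack(blocks), spans
```

Here "x + poly(x, 3)" is two terms producing four columns total, exactly as described: poly(X, 3) has no "+" in it, so it stays a single term.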

> I would like to think of the formula as more than just a tool that creates
> matrices and there
> are exotic use cases of formula that are not really rare. A natural one is
> an additive model that is not really easy to specify in R except for two
> specific smoothers (smoothing spline)
> and loess.

[...]


> It can be done by parsing the strings "s" and "lo", keeping them as
> reserved keywords, but what if I have my own favorite
> smoother that is not a smoothing spline or a loess smoother?

I'm not sure where you're going here. The tricky part about smoothing
is that you need a richer interface between the smoother and the model
fitter. In R, this is accomplished without any messing about with
reserved keywords; "s" and "lo" are just functions that return some
special objects that the fitter knows how to interpret. So there's no
obstacle to defining your own smoother, except that your fitter may
need to be modified as well... but the 'mgcv' package, for instance,
supports this fully (see ?mgcv::user.defined.smooth).

I definitely plan to support this kind of use case in charlton,
presumably using the same strategy that R does. For these kinds of
extensions, you'll need to use charlton's "native" API, instead of the
high-level "just give me some model matrices please!" API, so that you
can hook in at the right places. But that's all straightforward. I'm
starting with the easy case (that AFAIK also covers every model in
statsmodels!) where you just need the matrices, but that's not a
limitation of the design.

Probably the trickiest bit is figuring out what the non-formula
interface to these fitters should look like. Is that what's worrying
you?

> Random effects models are also not that rare...

Ditto, except for random effects models it's probably more obvious how
to describe them without formulas.

-- Nathaniel

Jonathan Taylor

Feb 22, 2011, 8:47:51 PM2/22/11
to Nathaniel Smith, pystat...@googlegroups.com
> Absolutely. But the model fitting code needs a minimal awareness of
> this kind of metadata, simply so that it can pass it through in a
> useful way. Consider, in R:
>  model <- lm(y ~ foo + bar + another.thing, data=data)
>  summary(model) # labels each coefficient
>  coef(model)["bar"] # extracts the relevant coefficient

Summary and coef do not need the formula, just the column names, which can be carried via a dtype or something like larry or DataArray (something Fernando Perez was toying with to give arrays axes). Neither does resid, etc. A proper ANOVA table
needs some contrasts based on factors generated by the formula, but "summary" does not.

> Josef: have 'lm' and 'lm.fit' as two separate functions
> Me: have one function 'lm' that can take either a formula or matrices
> You: have 'model.matrix' and 'lm.fit' as two separate functions (no 'lm')



I would be OK with either "Josef" or "Me" for the sake of having a consistent way of specifying a model: having complicated __init__ methods worries me from experience with nipy. I used to think it was a good idea, now I don't. If someone wants to write "lm"
then they can use it.

> I'm not sure where you're going here. The tricky part about smoothing
> is that you need a richer interface between the smoother and the model
> fitter. In R, this is accomplished without any messing about with
> reserved keywords; "s" and "lo" are just functions that return some
> special objects that the fitter knows how to interpret. So there's no
> obstacle to defining your own smoother, except that your fitter may
> need to be modified as well... but the 'mgcv' package, for instance,
> supports this fully (see ?mgcv::user.defined.smooth).

My point about smoothing with gam is that if you want to parse that in terms of strings like "y ~ my.smooth(x)", you are really
going further down the path of writing another language. One way to achieve general smoothers is to make a new sympy function with a "smooth" method that takes a residuals argument: backfitting is then trivial.
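That suggestion can be made concrete even without sympy: a minimal backfitting sketch (all names hypothetical) where each additive component only needs to expose a smooth(residuals) method.

```python
import numpy as np

class RunningMeanSmoother:
    """Stand-in smoother: fits a moving-average curve against one covariate."""
    def __init__(self, x, window=3):
        self.order = np.argsort(x)          # smooth in x-order
        self.window = window
        self.fitted = np.zeros(len(x))

    def smooth(self, residuals):
        r = residuals[self.order]
        kernel = np.ones(self.window) / self.window
        sm = np.convolve(r, kernel, mode="same")
        out = np.empty_like(sm)
        out[self.order] = sm                # undo the sort
        self.fitted = out
        return out

def backfit(y, smoothers, n_iter=20):
    """Backfitting: cycle over components, smoothing each one's partial
    residuals (y minus the intercept and all *other* fitted components)."""
    alpha = y.mean()
    for _ in range(n_iter):
        for s in smoothers:
            partial = y - alpha - sum(t.fitted for t in smoothers if t is not s)
            s.smooth(partial)
    return alpha, [s.fitted for s in smoothers]
```

Any object with the same smooth() method (a spline, a loess, a custom smoother) slots into the same loop, which is the point being made above.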


>> Random effects models are also not that rare...

> Ditto, except for random effects models it's probably more obvious how
> to describe them without formulas.
 
I think that's a function of difficulties with R's formula object. Roughly speaking, there are two canonical ways to describe mixed effects models mathematically

Y = X\beta + Z\gamma

where X_{n \times p} is a "fixed" effects design and Z_{n \times q} is a "random" effects design with Cov(\gamma)=\Sigma. The variance components
part of the mixed effects models estimates (some parametrization) \Sigma while the fixed effects parts estimate \beta. This fits very nicely into a formula framework. The multiple
linear regression model (OLS) is a special case: Z=I_{n \times n}, \Sigma = \sigma^2 I_{n \times n} so there is
only one parameter to estimate in the variance component: \sigma^2.

A second way to describe them is via

Y = X\beta + \epsilon

where Cov(\epsilon) = \sum_{i=1}^k \alpha_i V_i, V_i known symmetric matrices with the constraint \sum_i \alpha_i V_i is non-negative definite. The OLS model is the special case k=1, V_1=I_{n \times n}, \alpha_1=\sigma^2. You can fit the first form into this form by setting k=q(q+1)/2 and setting V_i = Z_jZ_k'  + Z_kZ_j' to be the outer products of the columns of the "random" effects design (and enforcing symmetry).
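That casting from (Z, \Sigma) to the (\alpha_i, V_i) form is mechanical enough to sketch (hypothetical helper; Z is the n x q random-effects design):

```python
import numpy as np

def variance_basis(Z):
    """Build the k = q(q+1)/2 symmetric matrices V_i from the columns of Z,
    so that Z Sigma Z' = sum_i alpha_i V_i, with the alpha_i being the
    entries of Sigma (upper triangle, row by row)."""
    n, q = Z.shape
    Vs = []
    for j in range(q):
        for k in range(j, q):
            zj, zk = Z[:, j:j + 1], Z[:, k:k + 1]
            V = zj @ zk.T + zk @ zj.T       # symmetrized outer product
            if j == k:
                V = V / 2.0                 # diagonal entries appear once, not twice
            Vs.append(V)
    return Vs
```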

I agree with you that it is hard to specify them with R's formula, and I don't have a current working version, but I disagree when you say that they don't fit into the "formula" framework. What is this "more obvious" way you have of describing them?

Jonathan

Nathaniel Smith

Feb 22, 2011, 10:49:09 PM2/22/11
to Jonathan Taylor, pystat...@googlegroups.com
On Tue, Feb 22, 2011 at 5:47 PM, Jonathan Taylor
<jonatha...@stanford.edu> wrote:
> Summary and coef do not need the formula, just the column names which can be
> done via a dtype or something like larry or DataArray (something Fernando
> Perez was toying with to give arrays axes). Neither does resid, etc. A more
> proper ANOVA table
> needs some contrasts based on factors generated by the formula but "summary"
> does not.

Yes, charlton defines a ModelMatrix class which is a subclass of
ndarray with this metadata added (with both column names, a la larry
or DataArray, and term->column mappings, as needed for anova etc.).
Formula evaluation generates these metadata-enhanced matrices by
default, but you could just as well instantiate a ModelMatrix directly
if you prefer.
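A minimal sketch of that kind of metadata-carrying subclass (illustrative only, not charlton's actual ModelMatrix):

```python
import numpy as np

class ModelMatrixSketch(np.ndarray):
    """ndarray subclass carrying column names and a term -> column-slice
    mapping, so summary/coef/anova code can label results."""
    def __new__(cls, data, column_names, term_slices):
        obj = np.asarray(data, dtype=float).view(cls)
        obj.column_names = list(column_names)
        obj.term_slices = dict(term_slices)
        return obj

    def __array_finalize__(self, obj):
        # propagate metadata through views and slices
        self.column_names = getattr(obj, "column_names", None)
        self.term_slices = getattr(obj, "term_slices", None)

    def term(self, name):
        """Return the block of columns belonging to one term."""
        return np.asarray(self)[:, self.term_slices[name]]
```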

I've been meaning to look at the various named-array attempts in more
detail. I'd be just as happy to start with one of them, but it wasn't
clear to me whether any are ready for real usage yet.

>> I'm not sure where you're going here. The tricky part about smoothing
>> is that you need a richer interface between the smoother and the model
>> fitter. In R, this is accomplished without any messing about with
>> reserved keywords; "s" and "lo" are just functions that return some
>> special objects that the fitter knows how to interpret. So there's no
>> obstacle to defining your own smoother, except that your fitter may
>> need to be modified as well... but the 'mgcv' package, for instance,
>> supports this fully (see ?mgcv::user.defined.smooth).
>
> My point about smoothing with gam is that if you want to parse that in terms
> of strings like "y ~ my.smooth(x)" you are really
> going further along the path of writing another language. One way to achieve
> general smoothers is to make a new sympy function with a "smooth" method
> that takes a residuals argument: backfitting is then trivial.

My point was just that you don't need any extra parsing magic to
handle these cases. You do need some special handling of those terms,
because they don't straightforwardly map to columns in a model matrix,
but you can do parsing and term evaluation in exactly the same way.

>>> Random effects models are also not that rare...
>
>> Ditto, except for random effects models it's probably more obvious how
>> to describe them without formulas.
>
> I think that's a function of difficulties with R's formula object. Roughly
> speaking, there are two canonical ways to describe mixed effects models
> mathematically

Gah, I was really unclear here. My bad. What I meant was, one would
like to have both formula and non-formula ways to specify models, and
it's sort of non-obvious (at least to me) what the cleanest
non-formula API for specifying GAMs is, but for mixed-effect models
the non-formula API is reasonably straightforward. Like, for instance,
just specifying the fixed effects model matrix, plus another model
matrix and associated stuff (e.g., desired parametrization for Sigma)
for each random effect group. Sorry for making you type all that.

> I agree with you that it is hard to specify them with R's formula, and I
> don't have a current working version, but I disagree when you say that they
> don't fit into the "formula" framework. What is this "more obvious" way you
> have of describing them?

I agree that they fit into the formula framework, though I also agree
that it isn't clear exactly what the *best* way to fit them in would
be. As far as R-style formulas go, nlme and lme4 both have reasonable
solutions (either specifying one formula per model matrix, or
extending the formula syntax to support formulas like "y ~ fix1 + fix2
+ (ran1 + ran2 | group1)", respectively; charlton's parser has
extensibility built in to allow for exactly this case). Where both
fall down is in specifying more complex parametrizations
of/constraints on the Sigma matrix. It really would be nice to do
better -- I've been frustrated by this limitation of lme4 myself --
but I haven't thought about how.
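As a toy illustration of that extension point (not charlton's actual parser), pulling lme4-style random-effect groups out of a formula string is straightforward:

```python
import re

def split_random_terms(formula):
    """Separate "y ~ fix1 + fix2 + (ran1 + ran2 | group1)" into the
    response, the fixed part, and a list of (random_terms, group) pairs."""
    lhs, rhs = formula.split("~")
    groups = re.findall(r"\(([^|]+)\|([^)]+)\)", rhs)
    fixed = re.sub(r"\+?\s*\([^)]*\|[^)]*\)", "", rhs).strip(" +")
    random = [(r.strip(), g.strip()) for r, g in groups]
    return lhs.strip(), fixed, random
```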

-- Nathaniel

Jonathan Taylor

Feb 23, 2011, 12:51:52 AM2/23/11
to pystat...@googlegroups.com

Agreed. Though I guess there is a little confusion here: when I refer to
a "formula" that is not R's, I mean something low-level like nipy's formula, which
is essentially just a way to group together "columns" and have some "factors", but
not R's factors.



> I agree with you that it is hard to specify them with R's formula, and I
> don't have a current working version, but I disagree when you say that they
> don't fit into the "formula" framework. What is this "more obvious" way you
> have of describing them?

> I agree that they fit into the formula framework, though I also agree
> that it isn't clear exactly what the *best* way to fit them in would
> be. As far as R-style formulas go, nlme and lme4 both have reasonable
> solutions (either specifying one formula per model matrix, or
> extending the formula syntax to support formulas like "y ~ fix1 + fix2
> + (ran1 + ran2 | group1)", respectively; charlton's parser has
> extensibility built in to allow for exactly this case).

To be honest, I have never really understood this notation at all. I don't like it and would prefer to just specify a matrix as well as a parametrization of \Sigma / estimation
method to find the parameters of that parametrization. I sincerely hope that I never have to use that notation to specify a mixed effects model in python.
 
> Where both
> fall down is in specifying more complex parametrizations
> of/constraints on the Sigma matrix. It really would be nice to do
> better -- I've been frustrated by this limitation of lme4 myself --
> but I haven't thought about how.

I think this problem should be thought about at the low-level API first before thinking
about putting it into a parser. In principle, for many constraints, it may be very difficult
to specify them in a parser -- simply having a function to fit the corresponding
variance component model that has a standard signature like

def fit_variance_component(Y, Xbeta, Z, metadata):
    ...  # do something
    return Sigma  # or some parameters in Sigma

seems like a good start.
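To make that signature concrete, here is a sketch of the OLS special case described earlier (k=1, V_1=I, \Sigma = \sigma^2 I), where the single variance parameter is just the residual mean square; illustrative only, not statsmodels code:

```python
import numpy as np

def fit_variance_component(Y, Xbeta, Z, metadata=None):
    """OLS special case of the standard signature: Z = I and
    Sigma = sigma^2 I, so estimating Sigma reduces to estimating
    sigma^2 from the OLS residuals."""
    beta, *_ = np.linalg.lstsq(Xbeta, Y, rcond=None)
    resid = Y - Xbeta @ beta
    n, p = Xbeta.shape
    sigma2 = float(resid @ resid) / (n - p)
    return sigma2 * np.eye(Z.shape[1])  # Sigma (here a multiple of identity)
```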

One would also perhaps like to be able to cast the problem in the form: Y = X\beta + Z\gamma, to the form Y = X\beta + \epsilon, Cov(\epsilon) = \sum_i \alpha_i V_i. This latter form is (often) easier to solve algorithmically but less natural to use for model specification. This "casting" is where many equality constraints are imposed (like equality of the variance of the random effects across subjects), and often these constraints are very "simple", i.e. \Sigma_{jk}=0, j\neq k. \Sigma_{jj} = \sigma^2_{random}.

This latter form is also reminiscent of many time-series estimation problems where Y=X\beta + \epsilon, Cov(\epsilon) = f(\alpha_i, V_i)
for some parameters \alpha_i, V_i and a non-linear function f: think of ARMA process and the V_i's being some function of the back-shift operator and its adjoint. Any effort spent in generically setting up the estimation of time series parameters should perhaps also be used to estimate variance components.

I just think we should solve the low-level API before trying to specify them using strings.

-- Jonathan
