That would be great. It would be very welcome, especially by many
users who come from the R side.
Nothing much has happened except for looooong discussions, and more
recently as part of a thread on the nipy list:
http://mail.scipy.org/pipermail/nipy-devel/2011-January/005296.html
nipy uses sympy to work with formulas but it doesn't use a string parser
The sandbox of statsmodels contains the original version of the
formula framework by Jonathan Taylor. Some of it is working fine,
other parts are incomplete or are not fully working.
Nathaniel Smith posted his formula parser on the mailing list in the
first thread, but I have to look for it.
In case this sounds scary, I would recommend starting with one part and
slowly building up a working version. One possibility would be to use
Nathaniel's parser as the front end and formula classes like the
current ones by Jonathan as a backend, which then can delegate the
estimation and testing to the model and result classes in statsmodels.
I would be very happy if we could get a version that is able to handle
the easier cases initially, and gives us a way to expand in the
future.
If you have questions about the current code or other issues, I will
be glad to help.
So please go ahead, take a crack at it, and keep us updated.
Josef
Uh, *cough* funny you should ask. I wasn't planning to announce this
for a few days yet, but take a peek here: https://github.com/charlton
I need to write docs before I can get the public API right, but try
something like:
>>> from charlton.spec import ModelSpec
>>> data = {"y": [1, 2, 3], "x": [4, 5, 6], "a": ["high", "low", "high"]}
>>> model_spec = ModelSpec.from_desc_and_data("y ~ x", data)
>>> endog, exog = model_spec.make_matrices(data)
>>> print exog.column_info.column_names
['Intercept', 'x']
>>> print exog
[[ 1.  4.]
 [ 1.  5.]
 [ 1.  6.]]
>>> model_spec = ModelSpec.from_desc_and_data("y ~ a + np.log(x) + np.log(x):a", data)
>>> endog, exog = model_spec.make_matrices(data)
>>> print exog.column_info.column_names
['Intercept', 'a[T.low]', 'np.log(x)', 'np.log(x):a[T.low]']
>>> print exog
[[ 1.          0.          1.38629436  0.        ]
 [ 1.          1.          1.60943791  1.60943791]
 [ 1.          0.          1.79175947  0.        ]]
Also see the TODO file ;-)
-- Nathaniel
Nice, I like secret projects that say "OMG it works!"
Josef
I've only had a brief look through and what you've done looks really
fantastic-- I'll be very excited to see it integrated into
statsmodels.
- Wes
Yeah, I realized about 70% of the way in that I'd bitten off more than
I realized... :-) But it does do the job, and handles a number of
rather subtle issues. Now I just need to get that API right...
Speaking of which, what sort of API would *you* want for the model
matrix building stuff, to make it easy to integrate into your code? I
assume the ultimate user API you'd want to support would be something
like
MyModel(model_info, data=None, **otherargs)
where model_info might be any of: string formula ("y ~ x"), an object
describing that formula, or a raw model matrix?
-- Nathaniel
Given some of our previous discussions, my current thoughts are: I
would like to work with several layers, so we have a good separation
of the code without losing any info along the way.
* wrapper classes or functions that could take the model info or
formula as the main interface, like MyModel. These would use the
formula framework to generate the design matrix and additional
information like variable names, and any info we need about the
columns. They could be class methods of the models, Model.fromformula
or shorter, or dispatch functions, or separate classes (a rough sketch
of the classmethod idea follows after these points).
Overloading the constructor, __init__, of the statistical models with
several different patterns and signatures would be possible, but it
doesn't look nice, and I have seen recommendations against doing it. I
also think it would make inheritance and super calls to __init__ more
complicated.
* current models (similar to R's lm.fit, I guess) and result classes:
as we discussed when privatizing methods, we could attach the meta
info that contains the model definition in some form, but keep the
current endog and exog as numpy arrays. The new public methods
(fittedvalues, error, ...) can access the meta info to wrap the
output.
* post-estimation support for testing: here I would like to use the
information from the formula framework, or the formula framework
itself, to (semi-)automatically create, for example, contrast
matrices, f_test, ttest and multipletests. For example, f_test in
pandas uses variable names or string parsing to define the contrast
matrix, and it would be nice to have an optional way where we can use
the formula instance again to help us.
The result classes now have a compare method, but this is currently
the only support for quickly comparing different models. So somewhere
there should be support for quickly creating changed models out of
previous models (which seems to be your working style).
I haven't yet looked in detail at your code (since it's secret). One
thought I had is how we can benefit from the design matrix and
contrast matrix construction code also programmatically, i.e.
bypassing the "string" part and using the underlying functions
directly. This would make, for example, writing some tests easier
because they need the same machinery.
One question for statsmodels internally will be what we put in the
meta info of a model instance: should we hold on to a formula instance
with all associated information and reuse it, or should we just keep
the information and recreate a formula, or whatever, if we need
further processing?
(As an aside: I just saw a discussion on stackoverflow triggered
because scipy.stats.distribution instances are not picklable. One
answer was, pickle the state (parameters) instead of the class
instance.)
This is roughly where I would like to go. It looks like you have done
all the heavy work of constructing design and contrast matrices from a
formula description, so we could try to wire it up to the linear
regression models and flesh out the details as we make it work.
Thanks for taking a big bite,
Josef
I tried out charlton for a few hours starting with your example, and
it looks pretty good. I will have a few questions on specific things
later.
>nosetests --all-modules \charlton\charlton
still shows a few bugs, but otherwise looks pretty good.
Python introspection works pretty well for finding my way around a bit.
From what I have read about R's formula implementation, it defines two
APIs, one for users and one for developers.
I'm glad to leave the user side (mostly) to the statisticians who have
experience with it. (Besides adding the formula interface.)
From the development side, it looks like much is inside charlton, I'm
not sure where the boundaries in R are compared to this.
What we can immediately use is the construction of the design matrix
with make_matrices and the column_names. I haven't looked enough at the
contrast code yet.
term_name_to_columns and similar sound useful. I haven't figured them
out yet, but would be very interested.
For example for the multiple comparisons test:
I was basing it on a dummy encoding, and then constructing contrast
matrices for all pairwise comparisons, comparisons to control, and
similar for conditional interaction tests. The information I would
need for a general design matrix is the location of the relevant
dummies (column indices), the number of levels for each factor that is
involved in interaction effects, and (barely touched yet) whether the
factors are nested. I'm also using or constructing the associated
names. (I only looked at one and two factors so far.)
Another use case is marginal or average (treatment) effects, like what
Skipper did in discretemod. It's not possible to do this correctly for
interaction effects without the relevant information; for categorical
effects it's also difficult to do. (Too late in the night, I don't
remember the details for categorical variables in this.)
So one of the internal API questions is how to pull this information
out of the formula framework.
Going through the code in charlton, it really looks like it is a lot
of work that you did (and I'm glad I didn't have to figure this out by
myself.)
Josef
I think it would be simplest to have a slightly different interface,
consisting of a single class that does something like:
def __init__(self, design, data=None, ...):
    self.endog, self.exog = charlton.make_model_matrices(design, data)
    # ...
Where make_model_matrices is a function that supports all of these
calling conventions and always does the right thing:
make_model_matrices("y ~ x", {"y": ..., "x": ...})
make_model_matrices(my_model_desc_object, {"y": ..., "x": ...})
make_model_matrices((np.asarray(y), np.asarray(X)))
I don't see the advantage of taking that last case and sticking it
into a whole other class. Also, this way we get consistent handling of
metadata -- in the last case I'd expect make_model_matrices to make up
some column names ("column0", "column1", etc.) and attach the
appropriate metadata, so that all the later code can assume that
metadata exists and is in the standard format.
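To illustrate what I mean by "always does the right thing", here is a
rough sketch of the dispatch (not charlton's actual code; ModelMatrix
here is just a stand-in for a metadata-carrying matrix):

import numpy as np
from collections import namedtuple

# stand-in for a matrix that carries its column names around
ModelMatrix = namedtuple("ModelMatrix", ["values", "column_names"])

def make_model_matrices(design, data=None):
    if isinstance(design, tuple):
        # raw (y, X) matrices: make up standard column names so that all
        # later code can assume the metadata exists in the standard format
        y, X = (np.asarray(a, dtype=float) for a in design)
        names = ["column%d" % (i,) for i in range(X.shape[1])]
        return ModelMatrix(y, ["y"]), ModelMatrix(X, names)
    # a formula string or a model description object: hand off to the
    # formula machinery (parser + evaluator), which produces the same
    # metadata-carrying matrices
    raise NotImplementedError("delegate to the formula framework")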
> * post-estimation support for testing: here I would like to use the
> information from the formula framework, or the formula framework
> itself to (semi-)automatically create for example contrast matrices,
> f_test, ttest and multipletests. For example f_test in pandas uses
> variable names or string parsing to define the contrast matrix, and it
> would be nice to have an optional way where we can use the formula
> instance again to help us.
Yes, that would be lovely. I haven't thought about this much yet, but
I bet we could re-use charlton's parser. (The parser itself is very
generic and just spits out an AST; there's an evaluator that runs over
that AST to produce model descriptions, but writing a different
evaluator would be trivial.)
> The result classes have now a compare method, but this is currently
> the only support for quickly comparing different models. So somewhere
> there should be support for quickly creating changed models out of
> previous models (which seems to be your working style).
Well, I actually mostly tweak models by hitting up-arrow to find them
in my command history and then changing their specification, which
requires a good high-level notation but not much more... But here are
some other things you can do if you have some richer information than
just the model matrix:
-- What R calls "drop1": for each term in the model, fit a model
without that term, so you can test that term's unique contribution to
the model via a likelihood ratio test (a rough sketch follows after
this list).
-- Calculate model predictions on new data vectors. (Both for its own
sake, and as a primitive needed for many other operations -- e.g., if
you have a non-linear model, then plotting on the response scale or
bootstrapping confidence intervals on the response scale both require
the ability to evaluate the fitted model on new data)
-- What R calls "update": fit a new model that's *like* that old model
except... R has a magic syntax that looks like ". ~ . - x1 + x2",
where "." means "what the old model had here". This will be easy to
add to charlton, I just haven't gotten around to it yet.
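Here is a rough sketch of the drop1 idea, assuming we already have a
term -> column-indices mapping and using a plain least-squares refit
just for illustration:

import numpy as np

def drop1_rss(y, X, term_to_columns):
    """For each term, refit without that term's columns and compare residual
    sums of squares (a real version would do a likelihood-ratio or F test)."""
    def rss(design):
        beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
        return float(np.sum((y - design.dot(beta)) ** 2))

    full = rss(X)
    results = {}
    for term, cols in term_to_columns.items():
        excluded = set(cols)
        keep = [j for j in range(X.shape[1]) if j not in excluded]
        results[term] = rss(X[:, keep]) - full   # increase in RSS from dropping
    return results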
> I haven't looked yet in details at your code (since it's secret). One
> thought I had is how we can benefit from the design matrix and
> contrast matrix construction code also programmatically, i.e.
> bypassing the "string" part and using the underlying functions
> directly. This would make, for example, writing some tests easier
> because they need the same machinery.
Yes, that was a design goal (partly to satisfy some people's fears
about parsing strings, but it's a good idea). See
charlton.desc.ModelDesc, which is the Python object equivalent of a
formula string.
> From the development side, it looks like much is inside charlton, I'm
> not sure where the boundaries in R are compared to this.
AFAICT (I haven't actually used them "in anger", so to speak), R's
formula handling is pretty opaque from the outside. There are a few
standard operations and people treat them as a black box. I think
Charlton is actually somewhat better in this regard -- e.g., it should
be easy to extend the syntax if you have special needs. (The 'lme4'
package in R does something similar to let you specify random effects
alongside your fixed effects, but almost no-one else does this,
probably because it requires such complicated magic.)
> term_name_to_columns and similar sound useful. I haven't figured them
> out yet, but would be very interested.
Basically, a "term" is one item in your formula, like "x" or "a:b". It
might correspond to multiple columns, either because it's a numeric
predictor that turns out to be a matrix with multiple columns (this is
convenient for something like "spline(x)", where spline is a function
that returns a spline basis for x as a matrix), or because it's a
categorical variable that needed coding.
term_name_to_columns just says, "x" ended up in columns 2-4, while
"a:b" ended up in 5-12.
I figured it'd be especially useful for something like classic ANOVA
tests, which involve fitting a model like "1 + a + b + a:b" and then
testing each *term* for significance, but I assume there are other
uses too.
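For instance (hypothetical usage with made-up column indices), the
restriction matrix for testing one whole term against zero would just
be:

import numpy as np

def term_restriction_matrix(n_params, term_columns):
    """R such that H0: R beta = 0 tests that every coefficient belonging to
    one term (given by its column indices) is zero."""
    R = np.zeros((len(term_columns), n_params))
    for row, col in enumerate(term_columns):
        R[row, col] = 1.0
    return R

# if "a:b" ended up in columns 5-12 of a 13-column model matrix:
# R = term_restriction_matrix(13, range(5, 13))  # 8 x 13, ready for an F test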
> For example for the multiple comparisons test:
> I was basing it on a dummy encoding, and then constructing contrast
> matrices for all pairwise comparisons, comparisons to control, and
> similar for conditional interaction tests. The information I would
> need for a general design matrix is the location of the relevant
> dummies (column indices), the number of levels for each factor that is
> involved in interaction effects, and (barely touched yet) whether the
> factors are nested. I'm also using or constructing the associated
> names. (I only looked at one and two factors so far.)
You can get dummy coding out of Charlton (e.g., "y ~ 0 + a:b"), but it
isn't the usual case. The full coding of something like an
interaction, possibly involving both categorical and numeric
variables, in a model that fits main effects and interaction effects
separately, can be... complicated. And depends on various
user-specified things, like what kind of coding they want to use.
Right now the only information Charlton exposes about these
complications is in the column names -- e.g., a column named
"a[T.low]:b[red]" is the pointwise multiplication of the
treatment-coded "low" level of factor a, and the dummy-coded "red"
level of factor b.
Maybe if we also exposed the contrast matrices then that'd be enough
to do things like estimate arbitrary new contrasts (like the pairwise
comparisons, comparisons to control, etc. that you mention) from a
fitted model? It seems likely to me, but I'd have to work through the
math to figure out how feasible it is in the general case.
Of course, just having the ability to control contrast coding gives you
some of what you want (like, treatment-coded estimates basically are
comparisons to control).
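As a sketch (not charlton code -- it assumes we already know which
design-matrix columns hold the treatment-coded dummies and which level
is the reference), all pairwise comparisons for one factor could be
built like this:

import numpy as np
from itertools import combinations

def pairwise_contrasts(n_params, levels, dummy_cols):
    """All pairwise comparisons for one treatment-coded factor.

    levels[0] is the reference level; dummy_cols[i] is the column holding
    the dummy for levels[i + 1]."""
    effect = {levels[0]: np.zeros(n_params)}
    for lvl, col in zip(levels[1:], dummy_cols):
        row = np.zeros(n_params)
        row[col] = 1.0
        effect[lvl] = row
    rows, names = [], []
    for a, b in combinations(levels, 2):
        rows.append(effect[a] - effect[b])     # contrast for E[a] - E[b]
        names.append("%s - %s" % (a, b))
    return np.vstack(rows), names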
-- Nathaniel
One of the main problems I have with the above is that I want to have
a shortcut to the models and not spend the time building consistent
metadata. The main reason is that, as we extend resampling,
bootstrap, cross-validation and outlier detection, we create the model
instance and run through this maybe thousands of times. WLS is used by
both glm and rlm in the iterative solver. (But we still don't have any
profiling for statsmodels, and for non-linear models this might be
small compared to the estimation.)
What is the split between what goes in the formula and what goes in
additional arguments, if models define additional arguments: sigma and
weights in GLS and WLS, families, links, and various other things? I
never figured out a pattern in R.
If "design" above just contains endog and exog then it would work
across many models, but if we stick more in it, then it might get
messy. For instrumental variables, are the instruments part of the
formula or separate. Do we get two component formulas, e.g.
zero-inflated poisson, one model for the mixture distribution, one
model for the component part, GARCH, one model for the mean, one model
for the variance?
Currently we just have python signatures that are usually pretty
informative. Signatures in R with Model(formula, ...) have a much
steeper learning curve, at least for me.
If signature overloading is generally desired, I might be in favor of
just overloading endog, keeping the remaining arguments optional, and
then delegating if endog is not a basic ndarray, maybe renaming the
argument to "endog_or_formula".
The signature wouldn't indicate anymore what other arguments are
required, but it would still show them. And it would leave the way
open to move additional arguments into the formula.
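Roughly what I have in mind (make_model_matrices here is just a
placeholder for whatever the charlton front end ends up being):

import numpy as np

def make_model_matrices(design, data=None):
    """Placeholder for the formula front end."""
    raise NotImplementedError

class Model(object):
    def __init__(self, endog_or_formula, exog=None, data=None, **kwargs):
        if isinstance(endog_or_formula, np.ndarray):
            # plain arrays: exactly the current behavior, no formula involved
            self.endog = endog_or_formula
            self.exog = np.asarray(exog)
        else:
            # a formula string (or formula object): delegate to the framework
            self.endog, self.exog = make_model_matrices(endog_or_formula, data)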
Most of the things below are relatively clear and uncontroversial, and
I will have to go over them in detail and try them out (especially the
term_to_columns, I'd like to see how I can use them for the tests).
(Just as a note: in econometrics I never learned any codings besides
dummy coding, and, if I remember STATA correctly, they wrote in the
release notes that they only have dummy coding and will add other
codings later.)
Thanks,
Josef
Here 'a' is a matrix with 100 columns:
In [6]: timeit column_names = ["column%s" % (i,) for i in xrange(a.shape[1])]
10000 loops, best of 3: 68.3 us per loop
70 microseconds * 1000s of runs = a few tenths of a second total
overhead. Profile first, *then* worry about optimization :-).
> What is the splitup between what's put in the formula and what is in
> additional arguments, if models define additional arguments, sigma,
> weight in GLS, WLS, families, links, and various other things. I never
> figured out a pattern in R.
>
> If "design" above just contains endog and exog then it would work
> across many models, but if we stick more in it, then it might get
> messy. For instrumental variables, are the instruments part of the
> formula or separate. Do we get two component formulas, e.g.
> zero-inflated poisson, one model for the mixture distribution, one
> model for the component part, GARCH, one model for the mean, one model
> for the variance?
A formula in charlton is just an object that lets you turn data into
either one or two matrices (the former for formulas like "x1 + x2"
without any "y ~" part). You won't go far wrong if you just take the
signature you would otherwise use, and for each argument that's a
data-dependent matrix with one row per observation, replace it with a
formula. I certainly wouldn't put weights, families, links, etc. in
there.
There are places where it might be nice to extend the basic formula
syntax to handle more exotic situations (like I mentioned lme4 does);
in those cases you probably do want both a high-level and low-level
API. But those cases are relatively rare, I think.
> (Just as a note: in econometrics I never learned any codings besides
> dummy coding, and, if I remember STATA correctly, they wrote in the
> release notes that they only have dummy coding and will add other
> codings later.)
It's amazing how many different statistics cultures there are, each
with their own terminology and everything. I, by comparison, have only
the haziest idea how an instrumental variable is used :-).
-- Nathaniel
Since I just learned roughly what R's poly is: if my interpretation as
PCA on a polynomial is correct, then I'm not
sure what drop would mean. In PLS and PCA regression one of the main
points is to choose the number of PCA components, and I can imagine
having something like a drop method that reduces the number of
components, but not the order of the underlying polynomial.
(In the sandbox code for PCA regression I just report AIC, BIC, ...
for varying number of components.)
Independent of the formulas, it looks like we need more helper
functions like add_constant, categorical (which is partially like
Factor) and np.vander.
Although the models don't care what the meaning of the design matrix
is (and I like it that way), for additional tasks (e.g. tests) I now
see that it will be useful to have the second layer of formulas,
terms, and so on, which knows how to construct and interpret the
design matrix, at least for the most common cases. And the formulas
will provide the additional infrastructure (like automatic name
creation and keeping track of columns) which np.vander, for example,
does not do.
> I would like to think of the formula as more than just a tool that creates
> matrices and there
> are exotic use cases of formula that are not really rare. A natural one is
> an additive model that is not really easy to specify in R except for two
> specific smoothers (smoothing spline)
> and loess.
>
> Z = rnorm(40)
>> gam(Y ~ s(X) + lo(Z))
> Call:
> gam(formula = Y ~ s(X) + lo(Z))
>
> Degrees of Freedom: 39 total; 30.3881 Residual
> Residual Deviance: 25.19731
>>
>
> It can be done by parsing the string "s" and "lo" keeping "s" as reserved
> keywords but I may have my own favorite
> smoother that is not a smoothing spline or a loess smoother?
It should be possible to define an interface also for non-linear
components, maybe similar to link functions in glm, that makes it
possible to plug in user-defined smoothers: for example, a Python
class with some required methods and attributes.
For the string interface it might be possible to use the literal I() (?).
Interfaces for non-linear models are still pretty much an open question
(although I like subclasses).
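Something like the following is what I have in mind; this is purely
hypothetical, nothing like it exists in statsmodels yet:

class Smoother(object):
    """Hypothetical interface a user-defined smoother would implement so
    that an additive-model fitter can plug it in, similar to link/family
    objects in glm."""

    def basis(self, x):
        """Return the (nobs, k) basis matrix evaluated at x."""
        raise NotImplementedError

    def smooth(self, x, residuals):
        """One backfitting step: smooth the partial residuals against x and
        return the fitted contribution of this component."""
        raise NotImplementedError

    def df(self):
        """Effective degrees of freedom used by this component."""
        raise NotImplementedError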
>
> Random effects models are also not that rare...
I wish they were not so rare in statsmodels.
Thanks for the explanations, I'm always learning something new.
Josef
I think we all agree that model specification in terms of arrays is
fundamental; we just have different ideas for what exactly the
interface should look like. If I'm understanding correctly, the
proposals are (using R terminology):
Josef: have 'lm' and 'lm.fit' as two separate functions
Me: have one function 'lm' that can take either a formula or matrices
You: have 'model.matrix' and 'lm.fit' as two separate functions (no 'lm')
> Using things like "term_name_to_columns" is really an interactive tool
> which is useful for "deconstructing" what R or some "formula" tool has done
> to your "formula" but it
> is not really useful unless you already know what the term names are, i.e.
> how the formula parser takes a string like
> "poly(x, 3) * factor(a) * factor(b) + poly(x,2) * poly(y,3) * factor(b) *
> factor(c)" and constructs the column names.
> I think these issues should be dealt with separate from the actual fitting
> of the model.
Absolutely. But the model fitting code needs a minimal awareness of
this kind of metadata, simply so that it can pass it through in a
useful way. Consider, in R:
model <- lm(y ~ foo + bar + another.thing, data=data)
summary(model) # labels each coefficient
coef(model)["bar"] # extracts the relevant coefficient
> Also, this has the advantage of allowing different specifications of
> "formula". I don't think it's much more work for a novice
> user to use the template
>
>>>> design = construct_matrices(model_spec)
>>>> lm = OLS(Y, design)
>
> as compared to
>
>>>> lm = OLS(Y, model_spec)
>
> It shows the novice user that an OLS model is really specified by matrices
> not by a formula.
That's great for the novice user, but do you really want to be typing
two lines every time you fit a model? Do you often fit models
interactively? It strikes me as an annoying hassle.
(NB: Y is generally also produced by the formula. This allows things
like "log(y) ~ x" or (for a multivariate model) "y1 + y2 ~ x".)
> R's functions like "drop1" and "update" are useful for some automated
> purposes (like a stepwise version of AIC/BIC-based model selection) but are
> a little unflexible.
Sure. Convenience functions are convenient, not comprehensive; if you
want comprehensive, well, you have Python :-).
> For instance it seems natural to me to expect that
> drop1(lm(Y ~ poly(X,3))) would try fitting a quadratic model instead of the
> cubic specified by poly(X,3), but it doesn't:
I thought you disliked formulas because they were *too* magic, not
because they weren't magic enough ;-).
Seriously, I take your point, but this is pretty predictable -- a
formula contains zero or more terms; each term produces 1 or more
columns. poly(X, 3) has no "+" in it, so it's one term (which produces
3 columns, because poly(X, 3) evaluates to an n x 3 matrix). I can
imagine extending the formula syntax so that it "knows" about all
kinds of things like polynomial bases and stuff, but the current
system is very easy to explain and implement, very predictable, and
still handles a *lot* of the low-hanging fruit very well.
> I would like to think of the formula as more than just a tool that creates
> matrices and there
> are exotic use cases of formula that are not really rare. A natural one is
> an additive model that is not really easy to specify in R except for two
> specific smoothers (smoothing spline)
> and loess.
[...]
> It can be done by parsing the string "s" and "lo" keeping "s" as reserved
> keywords but I may have my own favorite
> smoother that is not a smoothing spline or a loess smoother?
I'm not sure where you're going here. The tricky part about smoothing
is that you need a richer interface between the smoother and the model
fitter. In R, this is accomplished without any messing about with
reserved keywords; "s" and "lo" are just functions that return some
special objects that the fitter knows how to interpret. So there's no
obstacle to defining your own smoother, except that your fitter may
need to be modified as well... but the 'mgcv' package, for instance,
supports this fully (see ?mgcv::user.defined.smooth).
I definitely plan to support this kind of use case in charlton,
presumably using the same strategy that R does. For these kinds of
extensions, you'll need to use charlton's "native" API, instead of the
high-level "just give me some model matrices please!" API, so that you
can hook in at the right places. But that's all straightforward. I'm
starting with the easy case (that AFAIK also covers every model in
statsmodels!) where you just need the matrices, but that's not a
limitation of the design.
Probably the trickiest bit is figuring out what the non-formula
interface to these fitters should look like. Is that what's worrying
you?
> Random effects models are also not that rare...
Ditto, except for random effects models it's probably more obvious how
to describe them without formulas.
-- Nathaniel
> Absolutely. But the model fitting code needs a minimal awareness of
> this kind of metadata, simply so that it can pass it through in a
> useful way. Consider, in R:
> model <- lm(y ~ foo + bar + another.thing, data=data)
> summary(model) # labels each coefficient
> coef(model)["bar"] # extracts the relevant coefficient
> Josef: have 'lm' and 'lm.fit' as two separate functions
> Me: have one function 'lm' that can take either a formula or matrices
> You: have 'model.matrix' and 'lm.fit' as two separate functions (no 'lm')
Yes, charlton defines a ModelMatrix class which is a subclass of
ndarray with this metadata added (with both column names, a la larry
or DataArray, and term->column mappings, as needed for anova etc.).
Formula evaluation generates these metadata-enhanced matrices by
default, but you could just as well instantiate a ModelMatrix directly
if you prefer.
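The general pattern is the usual ndarray-subclass-with-metadata one;
this is only an illustration of the idea, not ModelMatrix's actual
code:

import numpy as np

class NamedMatrix(np.ndarray):
    """ndarray that carries column names (and could carry term->column maps)."""

    def __new__(cls, input_array, column_names=None):
        obj = np.asarray(input_array).view(cls)
        obj.column_names = column_names
        return obj

    def __array_finalize__(self, obj):
        # make sure the metadata survives slicing, views, etc.
        if obj is None:
            return
        self.column_names = getattr(obj, "column_names", None)

# X = NamedMatrix(np.ones((3, 2)), column_names=["Intercept", "x"])
# X.column_names  ->  ['Intercept', 'x']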
I've been meaning to look at the various named-array attempts in more
detail. I'd be just as happy to start with one of them, but it wasn't
clear to me whether any are ready for real usage yet.
>> I'm not sure where you're going here. The tricky part about smoothing
>> is that you need a richer interface between the smoother and the model
>> fitter. In R, this is accomplished without any messing about with
>> reserved keywords; "s" and "lo" are just functions that return some
>> special objects that the fitter knows how to interpret. So there's no
>> obstacle to defining your own smoother, except that your fitter may
>> need to be modified as well... but the 'mgcv' package, for instance,
>> supports this fully (see ?mgcv::user.defined.smooth).
>
> My point about smoothing with gam is that if you want to parse that in terms
> of strings like "y ~ my.smooth(x)" you are really
> going further along the path of writing another language. One way to achieve
> general smoothers is to make a new sympy function with a "smooth" method
> that takes a residuals argument: backfitting is then trivial.
My point was just that you don't need any extra parsing magic to
handle these cases. You do need some special handling of those terms,
because they don't straightforwardly map to columns in a model matrix,
but you can do parsing and term evaluation in exactly the same way.
>>> Random effects models are also not that rare...
>
>> Ditto, except for random effects models it's probably more obvious how
>> to describe them without formulas.
>
> I think that's a function of difficulties with R's formula object. Roughly
> speaking, there are two canonical ways to describe mixed effects models
> mathematically
Gah, I was really unclear here. My bad. What I meant was, one would
like to have both formula and non-formula ways to specify models, and
it's sort of non-obvious (at least to me) what the cleanest
non-formula API for specifying GAMs is, but for mixed-effect models
the non-formula API is reasonably straightforward. Like, for instance,
just specifying the fixed effects model matrix, plus another model
matrix and associated stuff (e.g., desired parametrization for Sigma)
for each random effect group. Sorry for making you type all that.
> I agree with you that it is hard to specify them with R's formula, and I
> don't have a current working version, but I disagree when you say that they
> don't fit into the "formula" framework. What is this "more obvious" way you
> have of describing them?
I agree that they fit into the formula framework, though I also agree
that it isn't clear exactly what the *best* way to fit them in would
be. As far as R-style formulas go, nlme and lme4 both have reasonable
solutions (either specifying one formula per model matrix, or
extending the formula syntax to support formulas like "y ~ fix1 + fix2
+ (ran1 + ran2 | group1)", respectively; charlton's parser has
extensibility built in to allow for exactly this case). Where both
fall down is in specifying more complex parametrizations
of/constraints on the Sigma matrix. It really would be nice to do
better -- I've been frustrated by this limitation of lme4 myself --
but I haven't thought about how.
-- Nathaniel