But first, there was some confusion in the last thread ("The Return of
Formula" from a few weeks ago) about what formulas even were, or why
they were useful (e.g., it was postulated that they might be useful
for speed, which I think is incorrect or irrelevant). So to make sure
we're on the same page:
The kind of formula I'm talking about comes from Bell Labs in the
early nineties, when they added them to their language "S"; "R" is an
independent implementation of S, but the formula stuff is basically
unchanged in the last 20 years. Guess it was a good idea :-). They're
basically a domain-specific, very succinct language for describing
statistical models, so in R if you want to regress a variable Y on
variables X1, X2, you just say:
lm(Y ~ X1 + X2) # "lm" = "linear model", R's least-squares function
and it builds an appropriate model matrix for you. Or if you have
factorial variables A, B, C, D and you want to fit a model with main
effects and second-order interactions, but leave out the higher-order
interactions, you just say:
lm(Y ~ (A + B + C + D)^2)
So the idea is that you decouple *describing* the model from actually
building a design matrix etc.
This has lots of advantages:
1) The usual advantages of decoupling: some modeling code can handle
sparse design matrices, some can't. If the user is building the design
matrix, they have to know; if the code is doing it, then it can just
do the right thing. Some models don't even involve a design matrix
(e.g., regression trees). Some models want to build the design matrix
in stages, to save memory (I need this myself). It also allows models
to be smarter, because they have more information about the problem:
there are R packages that 'know' what a spline term looks like and can
handle it specially, e.g. by providing automated control of
overfitting.
2) Saving grunt-work: it's not that hard to slap some numerical
predictor vectors into a matrix, but then when you want to handle
categorical predictors, and you need to calculate out orthogonal
contrasts, nesting, or you want to go from a linear fit to a
spline-based fit, etc., then building the design matrix is a huge
hassle. Better to implement all that stuff once and use it everywhere.
3) By far the most important: doing interactive/exploratory analysis.
If I'm playing around with some data, then it makes a *huge*
difference whether going from an idea to an implementation takes me 5
seconds or 50. (It's like the compile/test cycle in programming.) If I
go "hmm, what if I center these?" or "hmm, what if I add a quadratic
term?" or whatever, then with formulas I have to type about 5
characters and hit enter; if I'm building the matrix by hand, I have
to write several lines of code, and worse, *think* about what they
should be, when I'm already busy thinking about the data.
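To make the contrast concrete, here's roughly what the two workflows look like (numpy is real; the fit() helper in the comments is made up for illustration):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.5, 2.5, 3.5])

# By hand: every tweak means rebuilding the matrix yourself
X = np.column_stack([np.ones(len(x1)), x1, x2])
X_centered = np.column_stack([np.ones(len(x1)), x1 - x1.mean(), x2])

# With a formula, the same tweak is a few characters in a string:
#   fit("y ~ x1 + x2", data)
#   fit("y ~ np.log(x1) + x2", data)    # hypothetical fit() helper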
So, yeah, formulas are awesome -- so awesome that the code I'm working
on now is currently rigged up to call into R via rpy2 *just* to build
design matrices before doing all the actual grunt-work itself in
Python. I would like to change this before I start distributing it,
though, because it is not the most robust of designs :-).
So there are two existing Python implementations that try to solve
this problem that I know of: Jonathan Taylor's formula stuff in the
statsmodels sandbox, and... Jonathan Taylor's other formula stuff in
nipy. The former is stand-alone; the latter depends on sympy. AFAICT,
they both take the same basic approach, of requiring you to explicitly
create Python objects representing individual terms, then combine them
into a formula.
# sandbox formula:
f = Term("x") + Term("y") + Term("z")
# nipy formula:
f = Formula([Term("x"), Term("y"), Term("z")])
So -- again, AFAICT -- this has the disadvantage that for interactive
use it's way more verbose than R, and that the language for expressing
interactions is less rich (no short-hands like R's power operator,
used in the example above), and the language for expressing
transformations is also less rich (I can't tell how to make a term
like, say, log(x)). Also, there have been complaints that the
implementation is complex and confusing.
What I've done instead is implement a simple parser for the S formula
language, in Python, where arbitrary Python expressions (suitable for
'eval') are allowed as the predictors. So for example:
In [120]: formula.parse_formula("x + np.log(x) + (a + b + c)**2")
Out[120]: Formula(intercept=True, lhs=[],
rhs=[InteractionTerm(('x',)), InteractionTerm(('np.log(x)',)),
InteractionTerm(('a',)), InteractionTerm(('b',)),
InteractionTerm(('c',)), InteractionTerm(('a', 'b')),
InteractionTerm(('a', 'c')), InteractionTerm(('b', 'c'))])
The parser code is itself non-trivial, of course, but it's a real
parser and quite robust (actually, the error reporting is better than
R's...), and you can see that it just spits out a very simple
structure (basically a list of terms, where each term is a list of
interacting factors) for further processing. Also, there are no
external dependencies and it should work fine under Python 2.4 or even
earlier.
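(And to give a flavor of the design matrix step that's still to be written, here is a toy sketch of how that term list could be consumed. It assumes the Formula/InteractionTerm attributes shown above -- .intercept, .rhs, .factors -- ignores categorical coding and contrasts entirely, and is emphatically not the real code:)

import numpy as np

def naive_design_matrix(parsed, data):
    # 'parsed' is a Formula as printed above; 'data' maps names to 1-d arrays.
    n = len(next(iter(data.values())))
    columns = []
    if parsed.intercept:
        columns.append(np.ones(n))
    for term in parsed.rhs:
        # Evaluate each factor expression against the data, then multiply
        # the factors together to get the interaction column.
        factors = [np.asarray(eval(code, {"np": np}, dict(data)))
                   for code in term.factors]
        col = factors[0]
        for f in factors[1:]:
            col = col * f
        columns.append(col)
    return np.column_stack(columns)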
So I could just roll this into my yet-to-be-named EEG analysis
package, but it seems to me that even better would be to release it
(plus the not-yet-written code to actually create a design matrix from
the parser output, plus whatever brilliant changes you all suggest) as
a standalone library for use by statsmodels, nipy, pymc, whoever.
What do you think? Is this a sensible design and would people be
interested in actually using it?
-- Nathaniel
On Fri, Jun 4, 2010 at 8:01 PM, Nathaniel Smith <n...@pobox.com> wrote:
> S formula
> language
I'm moderately enthusiastic about the idea, and definitely in favor of
making it available for reuse in packages like statsmodels.
A shorthand notation like R's formula looks very convenient,
especially for those that are familiar with it, and especially for the
categorical regressor case. (And I have been wishing silently for a
formula parser for a while.)
Just brief comments and questions (to avoid my tendency for writing too much)
How closely does the syntax follow the R/S version?
How much can be expressed in formula expression, or how easy can it be
extended to other models than "lm"? e.g. random effects or covariance
matrices for GLS, system of equations, ...
The parser still needs to be wired up, both to get the data from our
DataFrames structures/classes/arrays and to the actual model that does
the calculation.
What would be the API of the formula to interface with the statistical
models/algorithms?
(I think, R has some functions defined that package developers can use
as the outcome of the formula parsing)
formulas (as parsed strings) look convenient for interactive use, but
not for programmatic use of statsmodels as a library. Is it easy to
keep dual access to the models?
I would like to see it implemented for the categorical case (lm with
fixed effects and maybe random effects), so that we have one fully
worked out example to see how well it would work and how easy it will
be to expand it to other models. (which was the main reason for my
"Return of Formula")
Work and contributions in this area, e.g. with "import formula", would
counterbalance a bit the econometrics focus/bias that Skipper and I
introduced since last summer, and could make statsmodels more useful
and attractive for the statisticians.
(I'm very happy if statsmodels is more than two econometricians working on it.)
Josef
The canonical reference is chapter 2 of Chambers & Hastie (1991),
Statistical Models in S. Of course, there are also lots of online docs
for R if you just want to get a feel for it -- or just type
'help(formula)' in R.
-- Nathaniel
Great, thanks. I just wanted to be sure I was looking at the syntax you
were proposing to use.
Thanks
Vincent
>
> -- Nathaniel
>
When I was brainstorming for ideas, this is the idea that I think
makes the most sense.
This on top of the other work on Formula could be the best to take
advantage of everything that's currently out there. The Parser + the
Formula machinery + the estimators almost equals a statistical
package.
It would still be a fair amount of work to glue it all together I
think, so some continued discussion would be great. I still don't
have a clear view of how the (old) Formula framework can be used
without the models themselves being aware of the formula framework
(which I would like to be the case).
Do you have the parser code posted somewhere?
Skipper
Cool.
> Just brief comments and questions (to avoid my tendency for writing too much)
>
> How closely does the syntax follow the R/S version?
It's very, very close. The differences that I'm aware of are:
1) I follow Python and use '**' for exponentiation, where R/S use '^'.
2) My code has somewhat less surprising (I hope) handling of negation
within parentheses. (Negation is used to 'cancel out' terms, so for
instance 'a*b*c' gives all main effects and interactions between those
three factors, and 'a*b*c - a:b:c' gives all main effects and second
order interactions but not the third order interaction.) In R there is
a strange thing where negation can't escape parentheses (or
something?), so we have:
R:
a + b - a --> b
a + (b - b) --> a
a + (b - a) --> a + b # ??
My code:
a + b - a --> b
a + (b - b) --> a
a + (b - a) --> b
This is pretty obscure; presumably no-one uses this 'feature'
intentionally in R.
3) In R/S, there are two forms of formula:
y ~ x
~ x
where the latter is a "right hand side only" formula. The '~' is
always necessary, because that's the syntax that introduces a formula.
In our case, we have a string that we know is supposed to be a
formula, so I allow a bare "x" as equivalent to "~ x".
Basically my theory is that any syntax that has survived a few decades
of pounding by smart people using it for real work is unlikely to be
improved upon by me in an afternoon.
> How much can be expressed in formula expression, or how easy can it be
> extended to other models than "lm"? e.g. random effects or covariance
> matrices for GLS, system of equations, ...
First, a formula is convertible to a design matrix, so any code that
uses a design matrix can trivially use a formula in its place. That
covers a lot of models right there...
Second, the syntax is extensible; the library using the parser can add
new operators. The use case I had in mind for this was exactly the
random effects case -- the lme4 package in R describes random effects
like:
y ~ a + b + (1 | Subject) + (1 | Word)
where the "a + b" part describes the fixed effects, and there is also
a random intercept for each Subject and each Word (which must be
categorical, of course). Parsing such formulas is straightforward with
my code; you can just tell it that there is an additional,
low-precedence operator named "|" and write a function that knows how
to evaluate such things.
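To make that concrete, the extension hook might look something like this -- the keyword name and the precedence number are invented for illustration, not the parser's actual API:

import formula

# Hypothetical: declare an extra binary operator that binds looser than '+',
# then parse as usual; the random-effects library supplies the evaluator
# that turns each '|' node into whatever structure it needs.
extra_operators = {"|": 5}   # operator name -> precedence
parsed = formula.parse_formula("y ~ a + b + (1 | Subject) + (1 | Word)",
                               extra_operators=extra_operators)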
And of course, one can still pass things in "next to" the formula -- like
gls("y ~ a + b", y_cov_matrix)
or whatever. I'm not too worried about it reducing our flexibility.
> The parser still needs to be wired up, both to get the data from our
> DataFrames structures/classes/arrays and to the actual model that does
> the calculation.
> What would be the API of the formula to interface with the statistical
> models/algorithms?
> (I think, R has some functions defined that package developers can use
> as the outcome of the formula parsing)
Yeah, in R the basic API most people use just takes a formula, data
frame, any ancillary arguments (weights, NA handling preferences,
etc.), and spits out a design matrix annotated with various metadata
(e.g., a programmatic description of the mapping from formula terms to
columns in the matrix). Then there are more fine-grained APIs you can
use if you're doing something "clever". Seems like a sensible approach
to me.
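In Python terms, I imagine the user-level entry point looking roughly like this -- a signature-only stub, with placeholder names in the spirit of the formula.design() example further down:

def design(formula_string, data, weights=None, na_action="drop"):
    """Hypothetical user-level API (not implemented).

    Would return (lhs, design_matrix, info), where 'info' records the
    mapping from formula terms to design-matrix columns, the contrast
    coding chosen for each factor, and similar metadata."""
    raise NotImplementedError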
> formulas (as parsed strings) look convenient for interactive use, but
> not for programmatic use of statsmodels as a library. Is it easy to
> keep dual access to the models?
That's a good point, and I haven't thought about it, but...I don't see
why not? You saw the data structure that the parser spits out, it's
very simple; we could, say, accept one of those structures instead of
a string everywhere, and just skip the parsing step if that's what the
caller hands us.
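I.e., something like this (sketch, assuming the attached parser module is importable as 'formula'):

import formula

def as_formula(spec):
    # Dual access: strings for interactive use, pre-built Formula objects
    # (or equivalent term structures) for programmatic use.
    if isinstance(spec, str):
        return formula.parse_formula(spec)
    return spec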
> I would like to see it implemented for the categorical case (lm with
> fixed effects and maybe random effects), so that we have one fully
> worked out example to see how well it would work and how easy it will
> be to expand it to other models. (which was the main reason for my
> "Return of Formula")
Sure.
Is there existing code for mixed-effect regression in Python that
you're aware of? I've just been reading up on that stuff with an eye
towards implementing it in incremental_ls.py, but that's more
complicated than you're talking about and anyway who knows when I'll
actually have time to do it.
Categorical variable handling is another can of worms, of course -- I
know how to do the math, but I'm not sure what the API for "hey look
that variable is categorical!" should look like :-). Any thoughts? R
has it easy here -- there's a built-in type specifically for ordered
and unordered categorical variables. The most elegant solution,
perhaps, would be to add some kind of "enumerated" dtype to numpy (the
hdf5 people could use this too, actually), but that'd be a whole
'nother project. OTOH, we can fake it with a simple implementation
that just takes strings and assumes that 1) all categorical variables
are unordered, unless told otherwise, 2) whatever strings appear,
those are the universe of possible categories.
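A minimal version of that fake-it-with-strings scheme would be something like this (toy sketch -- full-rank dummy coding only, no contrasts, no ordering):

import numpy as np

def dummy_code(values):
    # Treat the distinct strings that appear as the full (unordered) set of
    # levels, and return one 0/1 column per level.
    values = np.asarray(values)
    levels = sorted(set(values.tolist()))
    columns = np.column_stack([(values == lev).astype(float) for lev in levels])
    return columns, levels

cols, levels = dummy_code(["a", "b", "a", "c"])   # cols.shape == (4, 3)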
...Also, I need a name. "formula" is fine for R/S where it's assumed
that everything is about statistics, but in Python-land something more
specific would be nice. scikits.statsmodelspec?
-- Nathaniel
Again - I speak as someone who does not know, but my instinct, often
wrong, is that the parser you suggest is half-way to being a symbolic
language, and that you will soon run into the need for the rest.
> So -- again, AFAICT -- this has the disadvantage that for interactive
> use it's way more verbose than R, and that the language for expressing
> interactions is less rich (no short-hands like R's power operator,
> used in the example above),
Not that particular short-cut, no, but you can express interactions in
the normal way I believe.
> and the language for expressing
> transformations is also less rich (I can't tell how to make a term
> like, say, log(x)).
That is sympy.log(x), I believe. And in general you can use any
sympy-defined function, or indeed any arbitrary function, using the
aliased stuff.
> Also, there have been complaints that the
> implementation is complex and confusing.
Have you found that to be true?
> What I've done instead is implement a simple parser for the S formula
> language, in Python, where arbitrary Python expressions (suitable for
> 'eval') are allowed as the predictors. So for example:
> In [120]: formula.parse_formula("x + np.log(x) + (a + b + c)**2")
> Out[120]: Formula(intercept=True, lhs=[],
> rhs=[InteractionTerm(('x',)), InteractionTerm(('np.log(x)',)),
> InteractionTerm(('a',)), InteractionTerm(('b',)),
> InteractionTerm(('c',)), InteractionTerm(('a', 'b')),
> InteractionTerm(('a', 'c')), InteractionTerm(('b', 'c'))])
What would I do if (as we often find) we want to have an arbitrary
function of x?
Best,
Matthew
Presumably it would just be something like
OLS(formula.design("y ~ x1 + x2", data))
But -- while the formula framework should certainly have such a
"user-level" API -- I am not at all convinced that this is the best
way to do things in general. If you look at my original list of
advantages of formulas, then you miss out on (1) (the models know more
about what they want out of a formula than the user does), and some of
(3) (there's just more typing and things to keep track of when fitting
models interactively). To that I'll add another: the more the model
knows about what it has fitted, the richer the API it can present after
fitting. For instance, when doing predictions it is nice to specify
the new data, not the new design matrix; and when doing, say,
statistical tests, it is useful to be able to express them in terms
rather than columns (or even, main effects and interactions!). Of
course, all of these things are possible anyway -- the model could
only express predictions and tests and such in terms of matrices, and
then the formula object could have methods that people could call to
construct those matrices -- but I'm not sure that this is as nice.
(Or, heck, here's a trivial but very real advantage: if the fitted
model knows the formula, then the user can interrogate it to remind
themselves which fitted model object they're looking at. This matters
a lot if you're fitting lots of models interactively, doing
likelihood-ratio tests, etc.)
What are the advantages to keeping model code unaware of formulas? I
can imagine there could be some advantage in generality ("what if
someone makes up a new and different formula?"), but taken to the
extreme that view just argues for giving the user some matrix codes
and standing back, so obviously we have to take into account other
considerations too :-).
> Do you have the parser code posted somewhere?
No, I really need to get myself a sandbox somewhere for such things
:-). But for now, attached.
-- Nathaniel
I would say, it is a symbolic language. But one with different
trade-offs than something like sympy. Like:
Terseness is more valuable: it's not for extended explorations of
symbolic algebra, it's (in its essence) a DSL for writing down a list
of expressions. So it's just not worth creating multiple symbols (each
with its own function call and assignment), combining them, etc., when
you could just... write the list.
The algebra is totally different: For example,
a * b = a + b + a:b
a + a = a
So we'd be throwing out sympy's algebraic simplification rules and
writing our own anyway...
Since we're basically setting up a numerical problem, not doing
complex manipulations on a symbolic problem, there's very little
benefit to using the special sympy functions like sympy.log. And
there's a significant cost -- users shouldn't need to know numpy and
sympy and why there are two different 'log' functions and when to use
which, nor about 'aliased' functions, just to fit a linear model. I
mean, I can understand all those things, but I don't want to have to
explain them to people who are already struggling with the stats...
>> So -- again, AFAICT -- this has the disadvantage that for interactive
>> use it's way more verbose than R, and that the language for expressing
>> interactions is less rich (no short-hands like R's power operator,
>> used in the example above),
>
> Not that particular short-cut, no, but you can express interactions in
> the normal way I believe.
Not sure what you mean by 'the normal way'. There are two
multiplication-like operators in the S formula language, '*' and ':'.
':' is more fundamental, '*' is what you usually want to get
statistically meaningful results. My impression was that the nipy
formula code uses '*' to mean what S calls ':', and there's no
shortcut for what S calls '*'.
>> Also, there have been complaints that the
>> implementation is complex and confusing.
>
> Have you found that to be true?
Err, well *cough* I haven't actually tracked down the code and read it
yet; my email had all those "AFAICT" because I was working off the
examples and such I could find online. So I dunno. (When I started I
didn't even realize I was duplicating others' work, so once I realized
I wanted to start a discussion before I wasted any more time :-).)
>> What I've done instead is implement a simple parser for the S formula
>> language, in Python, where arbitrary Python expressions (suitable for
>> 'eval') are allowed as the predictors. So for example:
>> In [120]: formula.parse_formula("x + np.log(x) + (a + b + c)**2")
>> Out[120]: Formula(intercept=True, lhs=[],
>> rhs=[InteractionTerm(('x',)), InteractionTerm(('np.log(x)',)),
>> InteractionTerm(('a',)), InteractionTerm(('b',)),
>> InteractionTerm(('c',)), InteractionTerm(('a', 'b')),
>> InteractionTerm(('a', 'c')), InteractionTerm(('b', 'c'))])
>
> What would I do if (as we often find) we want to have an arbitrary
> function of x?
def my_nifty_function(x):
    return x  # What? this is totally nifty, just ask any mathematician...

"y ~ my_nifty_function(x)"
Or am I misunderstanding the question?
-- Nathaniel
Firstly - I hope Jonathan will comment - because I don't know his
thinking well enough to expound it. So, I hope, by saying terrible,
ignorant things, I will jog his feelings of mercy so he comes to my
aid ;)
> The algebra is totally different: For example,
> a * b = a + b + a:b
> a + a = a
Yers - in fact Term does implement a+a = a - that's easy to do by
overriding __add__ in class Term, which is a subclass of Symbol.
> Since we're basically setting up a numerical problem, not doing
> complex manipulations on a symbolic problem, there's very little
> benefit to using the special sympy functions like sympy.log. And
> there's a significant cost -- users shouldn't need to know numpy and
> sympy and why there are two different 'log' functions and when to use
> which, nor about 'aliased' functions, just to fit a linear model.
Well - you can fill a formula namespace for that, that just extends
sympy's namespace with the formula stuff, but I agree that requires a
little practice to get used to.
>> Not that particular short-cut, no, but you can express interactions in
>> the normal way I believe.
>
> Not sure what you mean by 'the normal way'. There are two
> multiplication-like operators in the S formula language, '*' and ':'.
> ':' is more fundamental, '*' is what you usually want to get
> statistically meaningful results. My impression was that the nipy
> formula code uses '*' to mean what S calls ':', and there's no
> shortcut for what S calls '*'.
Yes, that's right - * is just the product, and thence what R/S calls
':' - I believe.
> Err, well *cough* I haven't actually tracked down the code and read it
> yet; my email had all those "AFAICT" because I was working off the
> examples and such I could find online. So I dunno. (When I started I
> didn't even realize I was duplicating others' work, so once I realized
> I wanted to start a discussion before I wasted any more time :-).)
It might be worth having a look at the code - there is really very little:
http://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/formula.py
plus:
http://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/aliased.py
but that's a bit more low-level sympy and not important for the formulae.
>> What would I do if (as we often find) we want to have an arbitrary
>> function of x?
>
> def my_nifty_function(x):
>     return x  # What? this is totally nifty, just ask any mathematician...
> "y ~ my_nifty_function(x)"
>
> Or am I misunderstanding the question?
No - thanks - it was me who didn't understand.
Best,
Matthew
That's more what I had in mind with the API question a few paragraphs
above. Not the API of the formula for the user, but the link between
the results of the model parsing and the actual implementation of the
statistical model.
If the formula spits out a design matrix, then this is easy because
the design matrix can be directly fed to the Models. But for further
statistical analysis, e.g. constructing contrast matrices, the formula
parser needs to provide additional information about the structure of
the model/design matrix or the formula framework needs to construct
the contrast matrices.
eg.
http://developer.r-project.org/model-fitting-functions.txt
(a random link that google found, I don't remember which R docs/books I read.)
currently we have mainly lm.fit()
>
>> I would like to see it implemented for the categorical case (lm with
>> fixed effects and maybe random effects), so that we have one fully
>> worked out example to see how well it would work and how easy it will
>> be to expand it to other models. (which was the main reason for my
>> "Return of Formula")
>
> Sure.
>
> Is there existing code for mixed-effect regression in Python that
> you're aware of? I've just been reading up on that stuff with an eye
> towards implementing it in incremental_ls.py, but that's more
> complicated than you're talking about and anyway who knows when I'll
> actually have time to do it.
The sandbox has the old implementation of mixed, for mixed effects
models with repeated measurements, i.e. a block error covariance
matrix.
We also discussed mixed effects models in the context of panel data
models and we have generic gls, but this is still in planning and
experimental stage and on the todo list for summer.
>
> Categorical variable handling is another can of worms, of course -- I
> know how to do the math, but I'm not sure what the API for "hey look
> that variable is categorical!" should look like :-). Any thoughts? R
> has it easy here -- there's a built-in type specifically for ordered
> and unordered categorical variables. The most elegant solution,
> perhaps, would be to add some kind of "enumerated" dtype to numpy (the
> hdf5 people could use this too, actually), but that'd be a whole
> 'nother project. OTOH, we can fake it with a simple implementation
> that just takes strings and assumes that 1) all categorical variables
> are unordered, unless told otherwise, 2) whatever strings appear,
> those are the universe of possible categories.
Some of this goes into the data handling part that Skipper and Vincent
are working on. But, if I remember correctly, R also has a "Factor"
argument in the formula to indicate factor variables that are not
defined as such in the data frame.
One possibility is to include it as term in the formula "x +
Factor(z)" or as keyword argument
factors = [z]
I briefly looked at parse.py, and it looks well structured (for
extensibility), but I think it will require some time before we can
fit everything together.
Josef
What if we think of it differently: instead of the formula needing to
pass in more info, the results pass back to a formula class instance
that knows the design matrix. Then if other things need to be
calculated, they are passed from the formula to that function and back.
This keeps the models simple as to what is passed in, and all the
convenience stuff is handled in formula if you want to use it.
Vincent
This would keep the model interpretation and "helper" functions
together, but when in doubt about the boundaries, I prefer to keep the
statistics code together.
If parts of the statistical analysis disappear into the "magic black
box" of a formula framework, then the statistical model becomes
difficult to understand and extend (which induced us initially to kick
out formula).
Josef
> So there are two existing Python implementations that try to solve
> this problem that I know of: Jonathan Taylor's formula stuff in the
> statsmodels sandbox, and... Jonathan Taylor's other formula stuff in
> nipy. The former is stand-alone; the latter depends on sympy. AFAICT,
> they both take the same basic approach, of requiring you to explicitly
> create Python objects representing individual terms, then combine them
> into a formula.
> # sandbox formula:
> f = Term("x") + Term("y") + Term("z")
> # nipy formula:
> f = Formula([Term("x"), Term("y"), Term("z")])
> So -- again, AFAICT -- this has the disadvantage that for interactive
> use it's way more verbose than R, and that the language for expressing
> interactions is less rich (no short-hands like R's power operator,
> used in the example above), and the language for expressing
> transformations is also less rich (I can't tell how to make a term
> like, say, log(x)).
> Also, there have been complaints that the
> implementation is complex and confusing.
>
> What I've done instead is implement a simple parser for the S formula
> language, in Python, where arbitrary Python expressions (suitable for
> 'eval') are allowed as the predictors.
> So for example:
> In [120]: formula.parse_formula("x + np.log(x) + (a + b + c)**2")
> Out[120]: Formula(intercept=True, lhs=[],
> rhs=[InteractionTerm(('x',)), InteractionTerm(('np.log(x)',)),
> InteractionTerm(('a',)), InteractionTerm(('b',)),
> InteractionTerm(('c',)), InteractionTerm(('a', 'b')),
> InteractionTerm(('a', 'c')), InteractionTerm(('b', 'c'))])
>
> The parser code is itself non-trivial, of course, but it's a real
> parser and quite robust (actually, the error reporting is better than
> R's...), and you can see that it just spits out a very simple
> structure (basically a list of terms, where each term is a list of
> interacting factors) for further processing. Also, there are no
> external dependencies and it should work fine under Python 2.4 or even
> earlier.
> It would still be a fair amount of work to glue it all together I
> think, so some continued discussion would be great. I still don't
> have a clear view of how the (old) Formula framework can be used
> without the models themselves being aware of the formula framework
> (which I would like to be the case).
However, this was not the case in the original stats.models, and cox
and mixed in the sandbox still use the formula internally in the model
class.
Your messages to pystatsmodels should go through now.
Josef
> For models like random forests making contrast matrices doesn't make sense
> but none of the models in statsmodels
> is meant to handle things like this at the moment. But, you could use a
> Formula object to create a matrix
> of features to be used in a random forest procedure...
>
>
> Another feature of using sympy is that you can plug this fairly easily into
> several
> non-linear regression techniques and potential mixed effects models because
> you can differentiate the expressions with respect to parameters. This has
> not
> been leveraged very much at all at this point, but it is not too complicated
> to do.
>
> -- Jonathan
>
>
> --
> Jonathan Taylor
> Dept. of Statistics
> Sequoia Hall, 137
> 390 Serra Mall
> Stanford, CA 94305
> Tel: 650.723.9230
> Fax: 650.725.8977
> Web: http://www-stat.stanford.edu/~jtaylo
>
Sorry, this was just meant as an explanation of why Skipper and I
mentioned that we would like to keep formula and models separated.
It doesn't look like it's a controversial issue anymore, which wasn't
clear to us at the beginning of this discussion.
Josef
>
> -- Jonathan
I don't think Josef meant it to be contentious or as a slight. Just a
fact about the code as we found it that motivated my original comment.
The code is in the sandbox because I moved it there after removing
the formula dependency from the core code, so it still works if you
use the old Formula framework (although there were not and are not any
tests, so I can't say this with certainty). I didn't have time to
make this work independent of the formula framework as part of the SoC
last summer or during the school year.
If my original worry about separating fitting and formula doesn't make
sense to you, then that's good. As long as whatever form the formula
ends up taking is not meant to be used in the fitting then that's fine
and what I would prefer if at all possible. And to answer Nathaniel's
earlier comment, which I take to be the opposite of what you are
saying:
On Sat, Jun 5, 2010 at 2:56 PM, Nathaniel Smith <n...@pobox.com> wrote:
> What are the advantages to keeping model code unaware of formulas? I
> can imagine there could be some advantage in generality ("what if
> someone makes up a new and different formula?"), but taken to the
> extreme that view just argues for giving the user some matrix codes
> and standing back, so obviously we have to take into account other
> considerations too :-).
>
First, by model code, I mean everything that is not the parser or
handling of the symbolic formula (ie., some sort of pre-cleaning). The
advantage is that it won't take a refactor of each model's fitting if
something changes at the formula level, and it's a lot less work for
me if I don't have to keep up with changes in the formula for the
models (Full disclosure: I very rarely find myself taking advantage of
R's advanced formula features for my own work). Maybe this is asking
too much and in the end they won't be inextricable(?). I see the
advantages of using a formula or some kind of symbolic representation
for the user to give the model what it needs, but at some point (ie.,
fitting for a parametric model) we are just working with data (ie.,
ndarrays). So the model's internals are separate from what the user
sees (ie., the rich and convenient representation of data via the
formula).
My $.02,
Skipper
Ahhh, I see, we were talking past each other. Right, of course I don't
think that the model internals should accomplish their calculations by
munging some sort of extended formula object about, I think they
should build some nice matrices or what-have-you, as appropriate for
the problem at hand, and work with those. I was thinking about APIs,
not implementations -- my argument was that if we draw a line so that
on one side is "code in the statsmodel repository" and on the other is
"code written by the user", then the code that converts a formula to
those matrices should be called from the statsmodels side. Do I have
it straight now?
-- Nathaniel
Ah ok, we're on the same page now, and I think that's reasonable.
That way if someone really, really wants to write their own formula
framework, they can always monkeypatch the code and overwrite
Model.prepare_data or whatever.
Skipper
On Mon, Jun 7, 2010 at 11:21 AM, Jonathan Taylor <jetay...@gmail.com> wrote:
> For log(x), use
>
>>>> x = Term('x')
>>>> logx = sympy.log(x)
>
> For power, this 5 line function does the trick:
[...]
> A lot of these things are missing because not many convenience features have
> been added.
Of course -- and my approach hardly has a leg to stand on there either
:-). The question isn't which approach is more complete now, but which
is more promising in the end.
> Yes, it is a little more verbose than R, but it has the advantage that it is
> python's interpreter / parser that is
> operating on genuine python objects and all decisions about order of
> operations, etc. follow python's rules.
I guess I'm not convinced that that's an advantage for either
approach. My parser uses the Python tokenizer from the standard
library, so that guarantees a certain level of compatibility, and then
its precedence table is identical to Python's (for the operations it
supports, and with the addition of ":" in between "*" and "**"), it
knows about parentheses, unary + and -, and I believe its logic for
identifying the boundaries of embedded Python expressions is correct
for all possible Python (which does not have a very complicated syntax
at the expression level). So the order of operations and such is the
same either way.
>> > What I've done instead is implement a simple parser for the S formula
>> > language, in Python, where arbitrary Python expressions (suitable for
>> > 'eval') are allowed as the predictors.
>
> Using "eval" seems a little non-robust unless you carefully control the
> namespace.
Well, you answer your own question :-). If you define an object with a
__getitem__ method and pass it to 'eval' as the namespace to use, then
your method can determine the value of each variable in the expression
using whatever careful means it likes. It's as deterministic as any
other approach.
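Concretely, something like this (a sketch; CPython lets the locals argument of eval() be any mapping object, so __getitem__ gets to decide how each name in the expression is resolved):

import numpy as np

class VarLookup(object):
    # Resolve data columns first, then fall back to a whitelisted namespace.
    def __init__(self, data, fallback):
        self.data = data
        self.fallback = fallback
    def __getitem__(self, name):
        if name in self.data:
            return self.data[name]
        return self.fallback[name]

data = {"x": np.array([1.0, 2.0, 4.0])}
column = eval("np.log(x)", {}, VarLookup(data, {"np": np}))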
> How does it handle things like R's "s(x)"?
I assume you mean ns(x)? (Base R doesn't have anything called s(x).)
> The current implementation of
> Formula does it like this
>
>>>> spline_x = natural_spline(x, knots=[4,6,8], order=4)
>>>> f5 = f3 + spline_x
>>>> f5
> Formula([x, log(x), a*c, a, c, a*b, b, b*c, ns_1(x), ns_2(x), ns_3(x),
> ns_4(x), ns_5(x), ns_6(x), ns_7(x)])
>>>> import pylab
>>>> basis = [f5['ns_%d(x)' % i] for i in range(1,7)]
>>>> [pylab.plot(x, b.alias(x)) for b in basis]
My plan would be to have a function natural_spline (or whatever) that
returns an Nx7 matrix, with the rule that if a term is Nxk then it
produces k columns in the design matrix. The advantage of this
approach is that you can better keep track of the relationship between
user-specified terms and design matrix columns -- e.g., you can easily
tell the user the significance of the natural spline term as a whole
(using a linear hypothesis test or whatever). (This idea is again
stolen from R; that's just how 'ns(x)' works.)
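As a stand-in (I haven't written natural_spline itself), any term function returning an N x k array would follow the same rule -- e.g. a plain polynomial basis:

import numpy as np

def poly_basis(x, degree=3):
    # Returns an N x degree array; under the Nxk rule each column becomes a
    # design-matrix column, all tagged as belonging to this one term.
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.linspace(0.0, 1.0, 5)
basis = poly_basis(x)            # shape (5, 3)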
But of course that's orthogonal to the question of whether the user
specified that term using a sympy vs. parser based mechanism.
Actually, in addition to the above, I would teach the formula
evaluator that 'natural_spline' was a special sort of function whose
output was not necessarily just a pointwise function of its input, and
invoke a special method for handling it, where at the same time we
constructed the design matrix it also had a chance to 'memorize'
aspects of the input for later use. The idea here would be to avoid
the classic bug in R where if you let ns() or bs() choose their knots
automatically, and then later try to predict through a model using
them, then they will choose different knots the second time and you
end up using the coefficients fit on the first model to make
predictions in a different model. (The same applies to
scaling/centering functions, etc.) This kind of extended functionality
is a little tricky to implement in the parser+eval approach; it might
be an advantage of the sympy approach, where you can more easily
inspect the internal structure of your factors and notice that there's
a call to natural_spline hidden in there. But I'm fairly sure it can
be done either way without too much hassle.
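For a simpler case than splines, centering shows the same "memorize on the first pass, reuse on predict" idea (sketch only; the method names are made up):

import numpy as np

class Center(object):
    # Building the design matrix for the fit calls memorize(); building the
    # matrix for predictions on new data reuses the stored mean.
    def __init__(self):
        self.mean_ = None
    def memorize(self, x):
        self.mean_ = np.mean(x)
        return self
    def transform(self, x):
        return np.asarray(x, dtype=float) - self.mean_

c = Center().memorize([1.0, 2.0, 3.0])
c.transform([10.0, 11.0])        # array([8., 9.]) -- not recentered on new data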
I'm not sure I understand the question -- could you elaborate?
Generally, yes, I had planned to handle contrasts the same way R does
-- with automatic selection of dummy coding versus contrast coding to
achieve model identifiability, and giving the user an option of which
sort of contrast coding they prefer. Convenient selection of, say,
dummy vs. treatment vs. orthogonal polynomial coding is the sort of
thing that any formula specification framework will end up supporting
though, surely?
"a * i * l" expands to "a + i + l + a:i + a:l + i:l + a:i:l"; each of
those terms is easily identifiable as a main/second-order/third-order
effect by simply counting the number of factors (len(term.factors)),
and then each produces multiple columns in the design matrix (and just
like with the splines above, we note the relationship for later use,
so the classic ANOVA hypotheses would have the same API as an omnibus
test of the spline effect).
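With the parser output above, that classification is a one-liner (assuming the attached module imports as 'formula' and terms expose .factors as in the earlier example):

import formula

parsed = formula.parse_formula("y ~ a*i*l")
main_effects = [t for t in parsed.rhs if len(t.factors) == 1]
second_order = [t for t in parsed.rhs if len(t.factors) == 2]
third_order  = [t for t in parsed.rhs if len(t.factors) == 3]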
> Another feature of using sympy is that you can plug this fairly easily into
> several
> non-linear regression techniques and potential mixed effects models because
> you can differentiate the expressions with respect to parameters. This has
> not
> been leveraged very much at all at this point, but it is not too complicated
> to do.
I noticed some comments along those lines when I read the code
yesterday (thanks, Matt, for the pointer), and was intrigued -- the
idea of specifying even novel models in an automatically solvable form
is very neat. But then I was also disappointed after a moment's
thought to realize that the appropriate methods for most models in
practical use cannot be derived on the fly -- e.g., a generic solver
is probably not going to look at a generic model specification and
derive from first principles that this would be a good time to call
np.linalg.qr -- so in practice will we just end up coding all the
standard models in the standard way, and only see an advantage when
calling a generic non-linear optimization routine? How general do you
see this being? I'm not *too* dissatisfied with the current state of
having to derive my derivatives separately (maybe using sympy or
openopt's automatic differentiation or whatever) and pass them to my
CG routine, but then again, if it just worked...
Coalescing replies to other emails:
>> So it's just not worth creating multiple symbols (each
>> with its own function call and assignment), combining them, etc., when
>> you could just... write the list.
> Then why use a parser, why not just have a function that takes a list
> of expressions?
1) Because [Expr("a"), Expr("b")] is still less terse than "a + b".
Python syntax is very nice for writing a list of numbers, not so great
for writing a list of expressions -- horses for courses and all that.
2) At least some shortcuts are very useful -- if you did a classic
factorial experiment crossing AxB, then it seems only polite to let
you run the classic analysis by typing "A*B".
>> The algebra is totally different: For example,
>> a * b = a + b + a:b
>> a + a = a
>
> That is actually a source of frustration. And sometimes, the equality
> a * b = a + b + a:b isn't even true in R. At least it's not true in
> the sense
> that a*b - (a + a:b) != b. I think this is a "bug" of R rather than a
> "feature". It is not true to say that R really uses __algebra__.
I'm not sure what you mean --
> attr(terms(~ b), "term.labels")
[1] "b"
> attr(terms(~ a*b - a - a:b), "term.labels")
[1] "b"
-- they have exactly the same terms.
(There's some weirdness in R where the 'terms' object does still
mention 'a', as a variable that happens not to figure in *any* terms,
but oh well, R is a bit crusty sometimes and probably this has no
observable effect, and anyway, in my code they are equivalent.)
As for whether this counts as algebra, I guess it's a matter of
opinion. I find the rules perfectly clear and predictable, but then, I
did just implement them from scratch. And I doubt muttering about
polynomial fields of order 2 would help...
>> url = 'http://stats191.stanford.edu/data/kidney.table'
>> kidney.table <- read.table(url, header=T)
>> attach(kidney.table)
>> a = factor(kidney.table$Duration)
>> b = factor(kidney.table$Weight)
>> anova(lm(Days ~ a + a:b), lm(Days ~ a*b))
> Analysis of Variance Table
>
> Model 1: Days ~ a + a:b
> Model 2: Days ~ a * b
> Res.Df RSS Df Sum of Sq F Pr(>F)
> 1 54 1564.8
> 2 54 1564.8 0 0
>> # these two models are equal, hence a + a:b = a*b?
Well. Equal to you and me, but people do so persist in caring not just
which model they fit but how it was coded. (This isn't sarcasm, I live
among psychologists and I often find their conceptual models of ANOVA
and friends to be personally incomprehensible, though I guess it works
for them.) In any case, specifying coding *is* a legitimate part of a
formula language's job, IMHO, and that's what the difference in those
formulas is doing.
Oh, oh, wait, sorry, no, I see what your point is about "a*b - (a +
a:b)" -- that equivalence for R formulas is a different concept than
equivalence as a model (defined as a likelihood function, say). True.
I guess I don't see this as as serious a problem as you do. (Surely
there are symbolically distinct nipy-style formulas that nonetheless
produce the same likelihood function?)
> Well, there could be more helper functions, certainly, but the users
> don't need to know what "aliased functions" are to fit your model.
Really? What if they want to try a logit transform and there is no
sympy.logit (not that I've checked, but let's assume) -- to implement
it, don't they have to know some technical details about not just
Python or numpy, but sympy? (Real question)
Yes. I guess we both understand what's going on here -- that ANOVA
hypotheses are traditionally defined in terms of the coding used (as
silly as that sometimes is), and that R is just following that
convention, but I'm not sure what you want to do about it. Both
analyses you've shown above are meaningful, even conventional, in some
cases -- how do you express this difference in your framework?
> That's arguable, ":" is perhaps the one that is used most often in
> statistical
> texts but why are these the most "statistically meaningful"? In any
> case,
> ":" gets confusing and in fact acts a lot like "*" with categorical
> variables because R doesn't force
> you to specify how you want to "orthogonalize" the contrasts. See
> example above.
I just meant that in the way that most practitioners use ANOVAs,
including an interaction without the corresponding main effects is
'wrong', and so it's useful that R's "*" gives you that behavior
without having to keep track all the time.
> My impression was that the nipy
>> formula code uses '*' to mean what S calls ':', and there's no
>> shortcut for what S calls '*'.
>
> Not currently. It could be added by simply doing (1+x)*(1+y) instead
> of just x*y. Yes, it forces you
> to include a 1, but then it follows ordinary rules of operations.
Yes, but, if your shortcut takes more thought and typing than just
writing x + y + x*y, then it isn't a shortcut :-).
In the end, my basic contention is just that a syntax like "a + b" --
if implemented in a robust and reliable way -- is going to be more
readable and more writable than one like "Formula([Term('a'),
Term('b')])", at least as good at handling the exotic cases (painless
specification of "interesting" contrast codes, data transformations,
etc.), and both simpler and more foolproof WRT to classic analyses
like ANOVA.
I kind of get the impression from your email that your argument is as
much with the conceptual framework of classic ANOVA, which was clearly
very much on the S designers' minds, as it is with this question of
syntax. I sympathize, but for me the larger issue is this: I think
it'd be really nice if we could all reach consensus on the best way to
do this in Python, so this bit of complicated and -- well -- *boring*
code only has to get written once, and so users only have to learn one
syntax whether they're using NiPy or PyMC or whatever. (And we can all
write the docs together, and etc. etc.) And supporting everybody means
we probably have to support classic ANOVA. But you don't have to use
it? :-)
-- Nathaniel
P.S. Sorry everybody for the length! Thanks for making it this far...
Right, the goal would be to provide a formula API that's easy for the
model-fitting code to use, but gives it rich information to work with.
> eg.
> http://developer.r-project.org/model-fitting-functions.txt
> (a random link that google found, I don't remember which R docs/books I read.)
>
> currently we have mainly lm.fit()
Excellent link, thanks. I didn't know about makepredictcall --
apparently it's been there since 2002, yet I've still encountered
documentation (not in R's online docs) that hadn't been updated.
Yikes.
> Some of this goes into the data handling part that Skipper and Vincent
> are working on. But, if I remember correctly, R also has a "Factor"
> argument in the formula to indicate factor variables that are not
> defined as such in the data frame.
>
> One possibility is to include it as term in the formula "x +
> Factor(z)" or as keyword argument
> factors = [z]
The Factor(z) approach is basically the same as the approach where we
define a special 'factor' data type -- that would just be a call to
the constructor :-). Something like that is probably the way to go for
now, with some sort of minimal type that is mostly meaningful to the
formula code itself, and hopefully someday it can be replaced by a
more generally useful type.
-- Nathaniel
I find the discussion and examples interesting although I still don't
fully understand this.
To me, if two (linear) models imply the same likelihood function then
they are the same model, independent of the parameterization. Any
tests can be constructed by appropriate choice of the contrast
(restriction) matrix for the f_test.
I think I understand that which results are printed by the anova
command depends on the parameterization.
So, my question is whether the formula and the contrast matrix are
tied together, or are there two stages, in the first stage where the
formula specifies the likelihood function or set of terms, and a
second stage where the user can specify any arbitrary tests on the
main effects or interaction effects.
Manual construction of a contrast matrix is always possible, but does
a formula help to specify a contrast matrix that doesn't correspond to
the initial encoding of the effects?
Or, maybe I'm still missing the point.
Josef
On Mon, Jun 7, 2010 at 9:59 PM, Jonathan Taylor
<jonatha...@stanford.edu> wrote:
> Hi Nathaniel,
Thanks for the detailed response!
>> My plan would be to have a function natural_spline (or whatever) that
>> returns an Nx7 matrix, with the rule that if a term is Nxk then it
>> produces k columns in the design matrix. The advantage of this
>> approach is that you can better keep track of the relationship between
>> user-specified terms and design matrix columns -- e.g., you can easily
>> tell the user the significance of the natural spline term as a whole
>> (using a linear hypothesis test or whatever). (This idea is again
>> stolen from R; that's just how 'ns(x)' works.)
>
> This ties the function "ns" to an instance of ndarray. Formulae can and do
> make perfect sense independent of a data matrix.
I'm afraid I'm lost.
ns is a function that takes an ndarray (plus maybe some extra
arguments specifying knots or whatever) and returns an ndarray. This
is independent of any given formula or data matrix.
Formula("ns(x)") is a formula that specifies that whatever 'x' is, it
should be transformed by the function 'ns' before appearing in the
design matrix. This can and does make sense independent of a data
matrix, AFAICT.
The idea is that when we compute the design matrix, we also keep track
of what knots were used in case the user asks us to 'recompute the
same model with new data' (cf. R's 'predict' method). Obviously they
could also ask us to compute a new model from the same formula, which
would presumably not use the 'memorized' knots.
>> I'm not sure I understand the question -- could you elaborate?
>> Generally, yes, I had planned to handle contrasts the same way R does
>> -- with automatic selection of dummy coding versus contrast coding to
>> achieve model identifiability, and giving the user an option of which
>> sort of contrast coding they prefer. Convenient selection of, say,
>> dummy vs. treatment vs. orthogonal polynomial coding is the sort of
>> thing that any formula specification framework will end up supporting
>> though, surely?
>>
>> "a * i * l" expands to "a + i + l + a:i + a:l + i:l + a:i:l"; each of
>
> Well, that is different from R's implementation for categorical variables.
> See my
> two-way ANOVA example I sent. So now you have created a different type of
> algebraic manipulation that doesn't follow R's syntax and does not follow
> the rules of algebra.
> I don't like the sounds of that.
Well, umm... no? I implement R's syntax and semantics almost exactly;
there's no difference in the implementations at this level. If you
want to check, you can ask R how it expands "a*i*l":
But here I don't quite follow -- surely once you add one nonlinear
term, you do have to reinvent the wheel in the sense that now you have
to use totally different code to fit the model?
Is the advantage you foresee that, hey, at least it'll be quicker to
write down those 20 terms, because you'll be able to re-use the
various convenience methods and shortcuts and such that've already
been developed to work with ordinary formula objects? Or does it go
beyond that?
I think you mean ["a", "b", ("a", "b")], or else we still need some
sort of (simpler) parser to split up the last term (keeping in mind
that a and b could be arbitrary Python expressions, which might
themselves contain embedded "*"s).
But... yeah, you're right, that's not so bad at all. I do like the S
syntax a little better for interactive input -- less bouncing on the
shift key + syntactic conveniences are convenient -- but it's not bad.
(In fact, I used it for specifying the expected outputs in the parser
unit tests.)
Well, yes. But the only way to avoid that is if we require *every*
variable to be marked factor(A) or numeric(B) (or some equivalent
syntax). I don't think most people would be willing to put up with
that all the time.
Again, only if we assume that model equivalence implies formula
equivalence, which I've already admitted I don't care too much about.
>
> I don't think the rules of algebra are a matter of opinion.
>
>> I find the rules perfectly clear and predictable, but then, I
>> did just implement them from scratch. And I doubt muttering about
>> polynomial fields of order 2 would help...
>
> This doesn't really solve the problem either. Because with coefficients in
> field of order 2: a+a = (1+1)*a = 0.
> Also, in fields of order 2, (a+b)^2=a+b != a+b+a:b, though I will grant you
> that a*a=a.
Err, yes, quite, thinko on my part.
What's really going on is this: "+" is set union, "-" is set
difference, ":" is a kind of Cartesian product, and operators are
evaluated with the usual precedence and associativity (i.e., from
left-to-right within levels of equal precedence).
And what's being
calculated is not arithmetic operations on design matrix columns, but
-- at a high level -- the set of terms which should end up in the
model matrix.
This certainly doesn't match the usual rules of arithmetic, but I
meant it's "algebraic" in the sense that it's a simple set of rules
applied consistently; we certainly could axiomatize it if we really
felt the need. If you would rather preserve the word "algebra" for
more conventional groups, fields, etc., then that's fine with me; it
seems to me that the rules and their consistency is the important
part, whatever we call that.
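If it helps, the whole semantics fits in a few lines of toy Python (terms as frozensets of factor names, formulas as sets of terms -- ordering and the intercept are ignored here, and this is not the parser's actual code):

def plus(lhs, rhs):
    return lhs | rhs                           # '+' is set union

def minus(lhs, rhs):
    return lhs - rhs                           # '-' is set difference

def colon(lhs, rhs):
    return {l | r for l in lhs for r in rhs}   # ':' crosses terms

a, b = {frozenset(["a"])}, {frozenset(["b"])}
star = plus(plus(a, b), colon(a, b))           # a*b  ==  a + b + a:b
assert plus(a, a) == a                         # a + a == a
assert minus(star, colon(a, b)) == plus(a, b)  # a*b - a:b == a + b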
> def logit(x):
>     return sympy.log(x / (1 - x))
Okay, fair enough :-) -- though they again have to know the difference
between numpy.log and sympy.log. I guess a better example to make my
point would have been if the user decides they don't like the spline
bases built into the formula framework wants to use the cubic spline
functions from scipy, or something like that -- then they need to
learn about aliased functions, yes?
>> Yes. I guess we both understand what's going on here -- that ANOVA
>> hypotheses are traditionally defined in terms of the coding used (as
>> silly as that sometimes is), and that R is just following that
>> convention, but I'm not sure what you want to do about it.
>
> The nipy formalism makes it explicit, by having a property ("main_effect")
> that will
> generate a coding for a factor. One can add other properties to the Factor
> that
> yield different codings.
So if I'm reading the code right, the idea is that if you have some
Factor f (i.e., an object representing a single categorical variable),
then f.main_effect gives you back a new Formula wherein each term (=
column in the eventual model matrix) is a sympy expression that
evaluates to a contrast-coding of one of the levels of the original
Factor f?
So the R philosophy is that you tell it the variables you want to
include, and it'll figure out whether dummy or contrast coding is
necessary to achieve identifiability. My impression from the above
code is that in nipy formula, the philosophy is that you get dummy
coding by default, and it's the user's job to figure out which
variables need to be contrast-coded in which interactions, and mark
them explicitly? I can certainly see an argument for this way of doing
things, but is that a correct interpretation?
>> Yes, but, if your shortcut takes more thought and typing than just
>> writing x + y + x*y, then it isn't a shortcut :-).
>
> Well, at least I have the rules of algebra to stand on. :)
Point! :-)
Hopefully the above makes it clear that the parser's rules at least
follow something, whether that's algebra or not, and it's the same
something as R.
Where "=" means "parsing the thing on the left and the thing on the
> I suppose you could tell it how to add and subtract itself (can you do
> that), i.e.
> what do these things give you?
>
> "~ x + y + x:y" - "x:y" (=?)
> "~x+a:b" * "~x + y" (=?)
right return identical Formula objects":
"x + y + x:y - x:y" = "x + y"
"x + a:b + x + y" = "x + a:b + y"
However, if/when I implement Formula->design matrix conversion code
using R-style rules, then "x:y" will -- just like in R -- fit the same
model as "x + y + x:y", just with a different parametrization.
>> at least as good at handling the exotic cases (painless
>> specification of "interesting" contrast codes, data transformations,
>> etc.), and both simpler and more foolproof WRT to classic analyses
>> like ANOVA.
>
> I don't think it's foolproof. To me, that simple 2-way ANOVA proves it isn't
> foolproof.
I only claimed *more* foolproof :-). Like I said, I can see advantages
to the explicit entry of every term and explicit marking of how factor
is coded in each (assuming that is what nipy does), but I'm not sure
'more foolproof' is one of them...
Again, only because you are defining algebraic consistency as
dependent on the final model fit, and that model fit brings in the
data frame. Which formulas are formally equivalent does not depend on
the data frame.
In order to meet your requirement that we can determine the final
model by looking only at the formula, not at the 'data.frame', we
would have to
a) require every variable be explicitly marked as either numeric or categorical
b) for every categorical variable, explicitly list all of its levels
in the formula.
This sounds redundant and quite unpleasant to use! In fact, I note
that even nipy formula has a way to pull the levels out of the data
vector... (Factor.fromcol)
(It is true that Factor.fromcol pulls it out of the data vector and
*into* the formula, rather than merely into the 'annotated design
matrix' or whatever object you end up with after combining the formula
and data. (In R this is the 'model.frame'.) So it's true that even
using Factor.fromcol, one can determine what model will be fit from
just the formula *object*. But you can't tell from looking at the code
that constructs that object,
and in the R-style syntax you can tell
from the object you get by combining the formula and the data.)
If we use R's canonical identities:
A) a*b = a+b+a:b
B) a:a = a
C) a+a = a
and the usual rules for a field, i.e. associativity and commutativity for + and *, then:
a**2 - a = a + a + a:a - a = a + a = a
On the other hand, (TYPO fixed below)
a**2 - a = (a + a + a:a) - a = a - a = 0.
Both cases use associativity and commutativity of + and * for a field and arrive at different answers.
Does anyone see anything wrong with this approach? This was more what
I had in mind for a parser to do anyway rather than use eval or even
numexpr. That way we can have a shorthand for using the full symbolic
Formula if need be.
Skipper
This is also my preferred approach, especially because we need
programmatic access to the Terms manipulation.
And in cases like simultaneous equation models without categorical
variables, we could use the parser almost directly for the list of
endog and exog in each equation.
I think we still need well-defined interfaces between
String-Formula-Parser, Terms-Formula-Manipulation and how it is used
by the models. (Especially if we don't want the sympy dependency for
statsmodels and have to maintain our own Terms-Formula.)
I don't know what will be the consequences if there are semantic
differences between parser and terms manipulation.
My testcase would be a rewrite and enhancement of onewaygls.py, which
would have linear (one-way-) ancova, and tests for structural break in
some variables at fixed breakpoints as special cases. Design and
contrast matrices are currently handmade but could be delegated to
formula.
Josef
> In R it is, but in the nipy Formula, "natural_spline(Term('x'),
> knots=[3,4,5], order=3)" returns a Formula whose terms
> are functions based on the knots and are functions of Term('x').
Yes. And in my proposed system, Formula("ns(x)") returns a Formula
whose single term is the unevaluated expression "ns(x)". I think we
must just be talking past each other somehow in here...
>> Formula("ns(x)") is a formula that specifies that whatever 'x' is, it
>> should be transformed by the function 'ns' before appearing in the
>> design matrix. This can and does make sense independent of a data
>> matrix, AFAICT.
>
> But when you say "ns returns a matrix Nxk" that tells me x is an ndarray
> with x.shape[0]=N. Am I wrong?
If x is an ndarray with shape (N,), then *evaluating* ns(x) returns an
ndarray with shape (N, k), yes.
Surely this is equivalent to saying: if you have unevaluated sympy
expressions ns_1(Term("x")), ns_2(Term("x")), ..., ns_k(Term("x")),
and you make "x" an ndarray with shape (N,), lambdify, evaluate, etc.,
then you get k ndarrays with shape (N,)?
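Roughly, in code -- with ns_1, ns_2 as stand-in basis functions, since I'm
not reproducing nipy's actual spline terms here:

import numpy as np
import sympy

x = sympy.Symbol("x")
ns_1 = sympy.Lambda(x, x)        # stand-ins for the k spline basis functions
ns_2 = sympy.Lambda(x, x**2)

data = np.linspace(0.0, 1.0, 7)                               # shape (N,)
cols = [sympy.lambdify(x, f(x), "numpy")(data) for f in (ns_1, ns_2)]
# cols holds k arrays of shape (N,); column-stacking them gives the (N, k) matrix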
>> Well, umm... no? I implement R's syntax and semantics almost exactly;
>> there's no difference in the implementations at this level. If you
>> want to check, you can ask R how it expands "a*i*l":
>
> But when you fit the model, R's formula gives you something different from
> what you have done. That's what my R
> example shows.
No... maybe you have a different interpretation of "a:b" than R? I
don't understand. In R's model of the world, "a:b", where "a" and "b"
are factors with levels "a1", "a2", etc., and "b1", "b2", etc., is a
factor with levels "a1.b1", "a1.b2", "a2.b1", etc. Surely there are
other possible definitions for ":", but I don't see anything
irrational about this one.
It has the consequence, though, that a model like "a + b + a:b" is
over-specified -- each entry in the vector "a:b" is basically a tuple
containing the corresponding entry of the vectors "a" and "b", so
obviously there's redundancy here. So R's interpretation of something
like "a + b + a:b" is that you want the model that allows every pair
(ax, by) to make a different prediction, *and* you want it *in a way
that extends the "a + b" model* to make interpretation easier.
So when you write "a + b + a:b" in R, then R's algorithm says:
1) Okay, I want an intercept, a, b, and a:b
2) I'll allocate 1 column to the intercept
3) Uh-oh -- if I dummy-code "a", then I'll introduce collinearity with
the intercept. I'll contrast code it in k-1 columns.
4) Uh-oh -- if I dummy-code "b", then I'll introduce collinearity with
the intercept. I'll contrast code it in k-1 columns.
5) Uh-oh -- if I dummy-code "a" in this interaction, then that'll
introduce collinearity with my "a" columns. I better contrast code it.
6) Uh-oh -- even then, if I dummy-code "b" in this interaction, then
that'll introduce collinearity with my "b" columns. I better contrast
code it too.
7) Okay, I'll allocate (k-1) * (k-1) columns to the interaction.
Total: 1 + k - 1 + k - 1 + k ** 2 - 2k + 1 = k**2 columns
And let's take the other extreme... if you do "0 + a:b", then R says:
1) Okay, I want *no* intercept, *just* the interaction of a & b
2) Hey, "a" won't be collinear with anything, I'd better dummy code it
rather than throw out a dimension.
3) Hey, "b" won't be collinear with anything either, I'd better dummy
code it rather than throw out a dimension.
4) Okay, so I'll use k * k columns for the interaction.
Total: k**2 columns
So it ends up with a design matrix of the same dimensions, and the
columns span the same space. The "redundant" model is made estimable.
And now the nesting of "a + b" in "a + b + a:b" is more natural, in a
way that makes it easy to allocate predictive "credit" into main
effects and the interaction. And it takes care of a lot of tiresome
bookkeeping for you along the way.
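If you want to check that bookkeeping numerically, here's a trivial sketch,
assuming both factors have k levels:

def columns_with_intercept(k):     # y ~ a + b + a:b
    return 1 + (k - 1) + (k - 1) + (k - 1) ** 2

def columns_without_intercept(k):  # y ~ 0 + a:b
    return k * k

for k in (2, 3, 5):
    assert columns_with_intercept(k) == columns_without_intercept(k) == k ** 2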
So I think my point is that this is all conceptually coherent, just
it's using a different set of consistent rules than the ones you are
thinking of. Or perhaps you knew all this already, but then if so I
don't understand why you say that R is doing something different than
I claim.
>> Is the advantage you foresee that, hey, at least it'll be quicker to
>> write down those 20 terms, because you'll be able to re-use the
>> various convenience methods and shortcuts and such that've already
>> been developed to work with ordinary formula objects? Or does it go
>> beyond that?
>
> They're not quite shortcuts, they are the full symbolic power of sympy.
Right, but I can always use the full symbolic power of sympy for
complicated non-linear models if I want, regardless of what the
interface to my least-squares solver looks like. So I was trying to
see whether consistency also gave us any additional practical
benefits.
Not that there's anything wrong with consistency!
> In terms of interactive input vs. robustness and extensibility, I think we
> should strive
> for something that can (with appropriate helper functions, etc.) meet all
> three rather than
> saving a few keystrokes. For simple interactivity, a parser, like yours,
> could return
> an instance of, say, Formula as currently implemented that can later be
> reused
> in whatever generality Formula allows.
True. Terseness is a funny thing... when we say Perl is terse, then
it's a complaint. When we say Java is verbose, then that's a complaint
too :-). I suppose the ideal is to make simple things short and
complex things long -- I think for people doing categorical data
analysis, A*B is conceptually simple, and in most cases I don't really
want to think carefully about the exact mechanics of achieving the
desired column-rank. And, of course, I don't see where it falls down
on robustness (though I understand your argument about algebra now, I
think), or extensibility (given that it's seemed to handle a lot of
different people's requirements over the last 20 years).
>> > I don't think the rules of algebra are a matter of opinion.
>> >
>> >> I find the rules perfectly clear and predictable, but then, I
>> >> did just implement them from scratch. And I doubt muttering about
>> >> polynomial fields of order 2 would help...
>> >
>> > This doesn't really solve the problem either. Because with coefficients
>> > in
>> > field of order 2: a+a = (1+1)*a = 0.
>> > Also, in fields of order 2, (a+b)^2=a+b != a+b+a:b, though I will grant
>> > you
>> > that a*a=a.
>>
>> Err, yes, quite, thinko on my part.
>
> Pardon?
Oops, sorry, my bad: http://ftp.sunet.se/jargon/html/T/thinko.html
>> This certainly doesn't match the usual rules of arithmetic, but I
>> meant it's "algebraic" in the sense that it's a simple set of rules
>> applied consistently; we certainly could axiomatize it if we really
>> felt the need. If you would rather preserve the word "algebra" for
>> more conventional groups, fields, etc., then that's fine with me; it
>> seems to me that the rules and their consistency is the important
>> part, whatever we call that.
>>
>
> I would appreciate seeing a proper axiomatization of R's formula rules in a
> (formally recognized) form
> of algebra.
Thanks for pushing on this, by the way -- I understand some stuff
better after trying to write the following paragraph for a bit. But
I'm not sure you'll be happy with the result :-).
The semantics, like I said before, are really simple: bare variables
(or Python expressions or whatever) name singleton sets. The operation
"+" is union, "-" is set difference, and ":" a kind of unordered
Cartesian product -- specifically it is the composition of Cartesian
product and a set union. What I mean is, where an ordinary Cartesian
product on a set-of-sets gives:
{{a}, {b}} x {{c}, {d}} = {({a}, {c}), ({a}, {d}), ({b}, {c}), ({b}, {d})}
we have instead:
{{a}, {b}} : {{c}, {d}} = {{a, c}, {a, d}, {b, c}, {b, d}}.
where you can see that for each tuple (foo, bar) we took the union of
foo and bar. Given a set of factors F, the formula language describes
operations in the space 2^(2^F), where 2^ denotes the power set.
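A minimal sketch of those three operations (a toy representation of my own
-- a formula as a set of terms, a term as a frozenset of factor names -- not
any existing package's API):

def union(f, g):          # "+"
    return f | g

def difference(f, g):     # "-"
    return f - g

def interact(f, g):       # ":" -- Cartesian product, then union within each pair
    return {s | t for s in f for t in g}

a, b, c, d = ({frozenset({name})} for name in "abcd")
print(interact(union(a, b), union(c, d)))
# {frozenset({'a', 'c'}), frozenset({'a', 'd'}),
#  frozenset({'b', 'c'}), frozenset({'b', 'd'})}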
But I guess what we're really interested in is how well-behaved the
rules of evaluation are. Here are rules that are true for *both*
ordinary arithmetic and R formulas:
1) Multiplication is commutative: a : b = b : a
2) Multiplication is associative: (a : b) : c = a : (b : c)
3) "+" is commutative: a + b = b + a
4) "-" is not commutative.
5) "-" is not associative; therefore, we need a convention to
determine what something like "a - b - c" means. Our convention is
that "-" is "left-associative", i.e., a - b - c = (a - b) - c, never a
- (b - c).
6) Multiplication distributes over addition: a : (b + c) = a : b + a : c
And here are the ones that are true in R, but violate the expectations
of ordinary arithmetic:
7) "+" is associative by itself, but "+" and "-" are not associative
when mixed; therefore, we need a convention to determine what
something like "a + b + c" (or "a + b - c", etc.) means. Our
convention is that "+" is also left-associative, i.e., a + b + c = (a
+ b) + c, never a + (b + c).
8) Multiplication doesn't distribute over subtraction: a : (b - c) = a : b
You can prove all of these yourself, if you like (well, except for the
conventions, of course) -- they just follow directly from the
definitions of "+", "-", ":" that I gave above.
(Are there any expectations of ordinary arithmetic that I'm forgetting
about? Well, obviously the ones about identities and inverses don't
apply, but if there are others, let me know.)
In addition, of course, R has "*", "/", and "^", but these are defined
in terms of the operations above (and are very well-behaved
algebraically as far as I can tell -- "*" is commutative and
associative and distributive, etc.) so I don't think they introduce
any new horribleness.
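For concreteness, here's how those sugar operators look on the same toy
representation as my sketch above -- again my own names, and just a sketch
of the documented expansions, not anyone's actual implementation:

def union(f, g):
    return f | g

def interact(f, g):
    return {s | t for s in f for t in g}

def star(f, g):           # a*b = a + b + a:b
    return union(union(f, g), interact(f, g))

def nest(f, g):           # a/b = a + a:b  (nesting)
    return union(f, interact(f, g))

def power(f, n):          # (a + b + ...)^n = repeated "*"
    result = f
    for _ in range(n - 1):
        result = star(result, f)
    return result

a, b = ({frozenset({name})} for name in "ab")
print(power(union(a, b), 2) == star(a, b))   # True: (a+b)^2 = a + b + a:b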
Anyway. Hopefully that's formal enough to give us a basis for discussion?
What I find interesting is that from what you've said, nipy Formulas
actually implement almost exactly the semantics I've just described.
That is, a Formula is a set of terms, where a term is a set of
factors; a Formula is an element of 2^(2^F). You've said that you
override __add__ to implement a + a = a, so from this I infer that
addition of Formulas is basically concatenation with duplicate removal
-- i.e., set union. And, well, here's the definition of
Formula.__sub__:
d = list(set(self.terms).difference(other.terms))
which I think makes the set difference pretty clear :-). And finally,
since * is evaluated symbolically, but respecting commutation, that's
basically the same as forming a tuple and then ignoring the order of
elements. (But, you don't eliminate duplicates in expressions like a*a
-- that's the only difference.) Do I have that all correct? I admit
I'm a little confused by the difference in methods as defined on Terms
vs. Formulas, etc., so I'm not sure that the nipy code actually
implements these rules consistently, but from what you've said I think
that's the intent?
So, that's the underlying semantics, but then there's the question of
evaluation. And the interesting thing here is that the only thing
that's really new in R vs. arithmetic is that we need to make sure
that "+" is evaluated left-associatively. But it turns out that in
computers (as opposed to math), "+" isn't associative in the first
place (well, it is for integers, but not for, say, floating point, or
objects defining an __add__ method generally). So Python already has
an answer built into its parser -- according to Python, "+" is always
left-associative.
So as far as I can tell, nipy.Formula is an almost perfect
implementation of R's formula syntax and semantics, modulo any bugs
and leaving out R's syntactic sugar operations, and without the a*a=a
rule. It's arguably *more* accurate than my code, in which I currently
propagate "-"'s symbolically according to (inappropriate) arithmetic
rules and then evaluate them at the end, which this discussion has
convinced me is a bug that I'll fix.
So unless you've changed your mind somewhere in here and decided that
a + a = a is a bad rule, or remain convinced that a*a=a is a terrible
idea that invalidates the whole system (I'd be curious why if so!), I
don't think we have much room for argument about algebra.
Oh, well, and there's the question of whether intercepts are added by
default, but I don't think it's terribly interesting -- obviously one
just sets the default to whichever way is preferred, and we probably
disagree about what the right default is, but we can argue about that
after resolving the other issues.
> Yes, I agree that this is R's philosophy and it is kind of unreliable. For
> instance,
> the number of columns in a design matrix for a given formula depends on the
> data frame used to
> create the design matrix.
Of course, even in R, you can easily say '~ factor(A, levels=c(1, 2,
3))*factor(B, levels=c("Low", "High"))', if you want; it's just that
it has sensible defaults if you don't want to bother.
> Without the ironclad ability to identify columns
> by name in python (though there
> has been some progress), I don't know which column of a fitted model
> corresponds to which term in the Formula.
Ohhh. Right, if you're writing down a Formula and then immediately
after writing down raw column indices that are supposed to match up
with terms in the formula, but have no safety net to detect if you got
those wrong, then it makes *total* sense to me that you want your
formulas to be very very explicit about how they produce the design
columns. I would do the same thing. Except... it'll take some work but
can't we build that safety net, and won't we have to for this to be
usable by mere mortals who aren't used to building design matrices in
their head? And save ourselves the trouble of doing so at the same
time?
>> I only claimed *more* foolproof :-). Like I said, I can see advantages
>> to the explicit entry of every term and explicit marking of how factor
>> is coded in each (assuming that is what nipy does), but I'm not sure
>> 'more foolproof' is one of them...
>
> After several years involved in nipy, I've come to appreciate (through other
> developers) the explicit, if somewhat more verbose
> solution in the interest of robustness.
I'm a *huge* fan of explicitness -- I quote the Zen of Python at
people all the time. (In case you aren't familiar, type "import this"
at a Python prompt.)
But redundancy is almost as much of a sin as implicitness -- if you
have to say things multiple times then it's make-work, and they get
out of sync, etc. The ideal is to say everything *exactly* once,
no more or less.
My bad. Let me try again.
In nipy, supposing that you're using Factor.fromcol (or
Formula.fromrec, or automatic knot selection, or centering), then
schematically we have:
(1) code describing parts of the formula + the data ---> produces a
Formula object
(2) Formula object + the data ---> produces a design matrix
where the Formula object fully describes the parametrization of the
model to be fit, but not the data.
In R, schematically we have:
(1) code describing the formula ---> produces a Formula object
(2) Formula object + the data ---> produces a FullFormula object
(3) FullFormula object + the data ---> produces a design matrix
where the FullFormula object fully describes the parametrization of
the model to be fit, but not the data.
Okay, in reality, R doesn't bother separating out steps (2) and (3)
exactly -- it just produces one object (the 'model.frame') which
contains *both* a high-level description of the model's
parametrization and the data needed to produce the design matrix
itself. But we could separate them easily if that's useful; all we
have to do is *not* save a copy of the data in step (2).
So my point is that *if* you're using Factor.fromcol or friends, then
there's really no meaningful difference between these approaches --
you have to be a bit careful because the analogue to a nipy.Formula
isn't an R.Formula, it's an R.FullFormula, but we could change the
names or whatever if it matters.
So then the only difference I see is that in nipy, you might *not* use
Factor.fromcol or whatever; you have the *option* of being totally
explicit in the Formula syntax itself. So that's a difference. We can
talk about how valuable it is. But if it's really valuable, then I'll
just suggest that we write a convenience function that takes an
R.Formula and a data structure describing the levels, column means,
etc., and returns an R.FullFormula from those, and then I think we'll
be even again? :-)
-- Nathaniel
Since this whole discussion is long and confusing, and to try and
clarify things in my own head, here's an attempt to distill the
high-level disagreements. If I understand them correctly.
------------
Where |'s highlight differences:
In the R model we have:
a small |terse/maximized-for-readability| special-purpose language for
expressing sets of and interactions between unevaluated "factors"
(represented as |arbitrary Python code|)
-> which feeds ->
a "design matrix builder" that passes through numerical factors
unchanged, but |has intelligent logic for automatically converting
categorical factors into numerical ones, in a way that takes care of
model identifiability, setting up conventional ANOVA hypothesis tests,
can be controlled by the user but otherwise has reasonable defaults,
etc.|
In the nipy Formula model we have:
a small special-purpose language |embedded in Python|, for describing
sets of and interactions between unevaluated "factors" (represented as
|sympy objects|)
-> which feeds ->
a "design matrix builder" that |requires all categorical variable
coding to have been expressed directly in the factors themselves,
possibly using various helpers|
-------------
As far as I can tell, most of these differences are somewhat
independent -- for instance, you could attach a smart design matrix
builder to a sympy-based Formula specifier, or vice-versa, or you
could replace sympy expressions with
Python-expressions-represented-as-strings without affecting the other
parts of the nipy Formula model. (Except for potential future use in
non-linear models, anyway.)
And I'm beginning to suspect that while we've mostly talked about the
"top half" of these models, a big part of the disagreement is actually
about the "bottom half", what we actually do with the list of terms we
get. Maybe making the distinction explicit will help...?
-- Nathaniel
On Wed, Jun 9, 2010 at 3:02 PM, jonathan.taylor <jetay...@gmail.com> wrote:
> I agree that one will want to have easily
> manipulable Term, Formula, etc.
>
> On another note, I gave the parser (or, rather R's formula rules) a
> little more thought.
> There __is__ a way to axiomatize R's three rules
>
> 1) x*x = x
> 2) x+x=x
> 3) x*x=x+x+x:x
>
> for __categorical__ variables, i.e. for classical ANOVA type models.
The rules work the same for categorical or numerical variables,
because these rules are evaluated symbolically anyway.
> The rules
> use just "*" and "+", no '-' or ':'. The interpretation of
> a formula like
>
> a*b + b*c + c*d*e + c*e
>
> is that the four terms above, are represented as simplices in a
> simplicial complex,
> so the above is the same as
>
> [set(['a','b']), set(['b','c']), set(['c','d','e']), set(['c','e'])]
>
> The list of maximal simplices of the simplicial complex generated by
> these simplices is
>
> [set(['a','b']), set(['b','c']), set(['c','d','e'])]
>
> with 'c*e' being redundant because
> set(['c','e']).issubset(set(['c','d','e']))==True.
>
> This redundancy takes care of rules 1) and 2) above.
You don't need to remove non-maximal elements at the level of the
Formula -- they indeed don't affect which final model is fit, but
they do affect its parametrization. They carry information into the
smart R-style "design matrix builder" (see my other email).
> The final rule,
> x*x=x is where
> categorical variables matter, this is always true for categoricals,
> but not for numerics.
>
> Anyways, I implemented a "parser" for this formula for ANOVA (using
> just "*".split() and "+".split()).
Using .split() is pretty different from a real parser, unfortunately,
because the characters "*" and "+" might appear embedded inside terms
-- like the R formula "log(a + b) + c", which .split() will make a
hash of. (I realize that you might not care about allowing such
arbitrary data transformations in this 'simplified' syntax, though,
but they should be possible, and a real parser is doable.)
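To illustrate, here's a rough sketch (not what I'd actually ship) of a
"top-level" split that only breaks on operators outside parentheses, next to
what plain .split() does:

def split_top_level(formula, sep="+"):
    parts, depth, current = [], 0, []
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts

print("log(a + b) + c".split("+"))        # ['log(a ', ' b) ', ' c']
print(split_top_level("log(a + b) + c"))  # ['log(a + b)', 'c']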
> The interactions or rows in the ANOVA table are generated by all
> elements of the simplicial
> complex. So, we would get the following rows in this example: ['a:b',
> 'b:c', 'c:d:e','c:d','c:e','d:e','a','b','c','d','e'].
> For a full factorial design with k terms, the corresponding simplicial
> complex is the k (or k+1 depending on how you name it) simplex,
> and this yields all interactions.
R intentionally only includes in the ANOVA tables those terms which
are mentioned, rather than generating the full set of elements. For
instance, if you have a nested variable that's not coded as such (like
you have schools 1-10, and at each school you have teachers 1-10, but
teacher 1 at school 1 is different from teacher 1 at school 2), then
something like "a + a:b" (or the equivalent shortcut designed for such
cases, "a/b") is meaningful, and turning it into "a + b + a:b" will
produce an incorrect ANOVA table.
Of course, in that example one could just tell the user to recode
their data properly so that different teachers have different codes
and then use "a + b". My ANOVA-fu isn't strong enough to come up with
other edge-cases where such apparently ill-formed models are
reasonable, but I'm reluctant to assume that they don't need to be
expressible as formulas just because they aren't the common case. But
you just said a few emails ago that you teach your students that
models like "a + a:b" are not wrong, just unconventional, so maybe you
agree already.
> One could allow '-' above, which would correspond to deleting one of
> the simplices that generated the simplicial complex.
> This raises the issue of what to do with "a*b*c+c*d+e*f-z"? Failing
> silently, i.e. ignoring 'z' is always an option but seems a little
> worrying.
Yeah, I also strongly considered making "- z" here an error, but then
ended up deciding to stick with R compatibility...
...I think in the end, though, we might be worrying too much about
such edge cases. "-" is very rarely used in the first place, and the
user isn't going to come up with crazy complicated expressions just
for fun.
> There are also issues with order of operations / commutativity if one
> allows '-' and silently ignores lingering '-' signs. Consider
>
> a*(b-c) + a*c
>
> There are at least two possible answers to this: a*b+a*c or a*b
> depending on when one tries to drop something based on the minus
> sign.
See my other email for an argument why a*b + a*c is the correct
answer. Applying rules from arithmetic that just aren't true for
sets-of-sets/simplicial complexes seems like a bad idea to me.
-- Nathaniel
To be clear: in R there are two different stages to turning something
like "a + b" into a formula: first the formula is evaluated
symbolically, applying the various operators defined over simplicial
complexes to produce a final simplicial complex (or set of them, I'm
not totally clear on the terminology). Then, we can forget all about
the syntax, the operators "+", "-", ":", "*", "^", "/", we just work
with that complex, treating it as a high-level description of the
desired design matrix and building that matrix.
On Wed, Jun 9, 2010 at 7:35 PM, Jonathan Taylor
<jonatha...@stanford.edu> wrote:
>> No... maybe you have a different interpretation of "a:b" than R? I
>> don't understand. In R's model of the world, "a:b", where "a" and "b"
>> are factors with levels "a1", "a2", etc., and "b1", "b2", etc., is a
>> factor with levels "a1.b1", "a1.b2", "a2.b1", etc. Surely there are
>> other possible definitions for ":", but I don't see anything
>> irrational about this one.
>
> It by itself is not irrational, but if you insist on the rule a*b = a+b+a:b
> you get algebraic
> inconsistencies.
If I understand your earlier examples correctly -- the ones where you
showed that, say, "a + b + a:b" and "a + a:b" produced equivalent
models, suggesting that b = 0 -- your algebraic inconsistencies depend
on your assumption that if two formulas produce equivalent design
matrices (i.e., ones that have the same column-span), then they must
be algebraically equivalent. But this just isn't true -- "algebra"
here refers to manipulations in the simplicial complex domain. There's
a separate function that converts the final simplicial complex into a
design matrix, and it turns out that in R's case, that function's
inverse does not map equivalent design matrices onto equivalent
simplicial complexes. There are certainly arguments to be had about
how that function should be designed, but algebraic consistency isn't
violated.
>> It has the consequence, though, that a model like "a + b + a:b" is
>> over-specified -- each entry in the vector "a:b" is basically a tuple
>> containing the corresponding entry of the vectors "a" and "b", so
>> obviously there's redundancy here. So R's interpretation of something
>> like "a + b + a:b" is that you want the model that allows every pair
>> (ax, by) to make a different prediction, *and* you want it *in a way
>> that extends the "a + b" model* to make interpretation easier.
>>
>
> That implementation I wrote about today handles this: the simplicial
> complex generated by the simplices [set(('a',)), set(('b',)),
> set(('a','b'))]
> has just one maximal simplex: set(('a','b')). Trying to fit this redundancy
> into the algebraic operations is where things sort of fail.
What do you mean by "trying to fit this redundancy into the algebraic
operations"? AFAICT, R does no such thing. Algebraically, "a:b" and "a
+ b + a:b" are different, all the algebraic operations respect that.
The redundancy only arises when generating the design matrix.
>> So when you write "a + b + a:b" in R, then R's algorithm says:
>> 1) Okay, I want an intercept, a, b, and a:b
>
> Yes, an intercept corresponds to set([]) and every simplicial complex
> contains set([]).
>
>> 2) I'll allocate 1 column to the intercept
>> 3) Uh-oh -- if I dummy-code "a", then I'll introduce collinearity with
>> the intercept. I'll contrast code it in k-1 columns.
>> 4) Uh-oh -- if I dummy-code "b", then I'll introduce collinearity with
>> the intercept. I'll contrast code it in k-1 columns.
>> 5) Uh-oh -- if I dummy-code "a" in this interaction, then that'll
>> introduce collinearity with my "a" columns. I better contrast code it.
>> 6) Uh-oh -- even then, if I dummy-code "b" in this interaction, then
>> that'll introduce collinearity with my "b" columns. I better contrast
>> code it too.
>> 7) Okay, I'll allocate (k-1) * (k-1) columns to the interaction.
>> Total: 1 + k - 1 + k - 1 + k ** 2 - 2k + 1 = k**2 columns
>
> Yes, I understand the reasoning behind all these decisions. I just object
> to calling these rules "algebraic rules" hence there is a need for another
> way to axiomatize them. I can follow the example for a*b, but what rules can
> I apply to the expression, say,
>
> (a*b*(c-z) + (d-f)*(c-a))*(b+a-1)
The above has nothing to do with algebraic rules -- it's a worked
example of the design matrix builder's algorithm, that applies after
the algebraic rules are long gone.
>> And let's take the other extreme... if you do "0 + a:b", then R says:
>> 1) Okay, I want *no* intercept, *just* the interaction of a & b
>> 2) Hey, "a" won't be collinear with anything, I'd better dummy code it
>> rather than throw out a dimension.
>> 3) Hey, "b" won't be collinear with anything either, I'd better dummy
>> code it rather than throw out a dimension.
>> 4) Okay, so I'll use k * k columns for the interaction.
>> Total: k**2 columns
>>
>
> In other words the simplicial complex generated by set(('a','b')) has
> set(('a',)) and set(('b',)) as members.
> It also has set([]).
Yes, the simplicial structure is definitely very useful for finding
non-redundant codings and interpreting relationships between the
corresponding models.
>> So I think my point is that this is all conceptually coherent, just
>> it's using a different set of consistent rules than the ones you are
>> thinking of. Or perhaps you knew all this already, but then if so I
>> don't understand why you say that R is doing something different than
>> I claim.
>
> It's your claim that this is algebraically consistent that I object to. I am
> now
> satisfied that I can come up with a consistent, mathematical definition of a
> formula with only "*" (or ":" if you like) and "+".
And the problem is R-style-"*"? If we have ":" and "+", then surely we
can define a function f(a, b) = a + b + a:b? How is that function not
consistent or mathematical?
>> The semantics, like I said before, are really simple: bare variables
>> (or Python expressions or whatever) name singleton sets. The operation
>> "+" is union, "-" is set difference, and ":" a kind of unordered
>> Cartesian product -- specifically it is the composition of Cartesian
>> product and a set union. What I mean, is, where an ordinary Cartesian
>> product on a set-of-sets gives:
>> {{a}, {b}} x {{c}, {d}} = {({a}, {c}), ({a}, {d}), ({b}, {c}), ({b},
>> {d})}
>> we have instead:
>> {{a}, {b}} : {{c}, {d}} = {{a, c}, {a, d}, {b, c}, {b, d}}.
>> where you can see that for each tuple (foo, bar) we took the union of
>> foo and bar. Given a set of factors F, the formula language describes
>> operations in the space 2^(2^F), where 2^ denotes the power set.
>>
>
> Yes, this is exactly a simplicial complex: a collection of subsets of F.
Okay, thanks for the terminology :-).
> The "*" is not a Cartesion product, a monomial denotes a subset of F, and
> "+" denotes
> a list of subsets of F. "-" can be viewed as set difference (or, if you want
> to enforce
> R's rules about not having an interaction without the corresponding main
> effects, then "-" is deletion
> of a maximal simplex of the simplicial complex).
R doesn't have any such rule -- it just defines an "interaction"
operation that takes two factors and produces a new factor whose
levels are pairs (factor1_level, factor2_level). The resulting factor
is therefore nested in the original factors, and therefore the "main
effect" for this new factor will necessarily include the predictive
power of the original factors.
But all of this is downstream; "-" is just set difference.
>> But I guess what we're really interested in is how well-behaved the
>> rules of evaluation are. Here are rules that are true for *both*
>> ordinary arithmetic and R formulas:
>> 1) Multiplication is commutative: a : b = b : a
>> 2) Multiplication is associative: (a : b) : c = a : (b : c)
>> 3) "+" is commutative: a + b = b + a
>> 4) "-" is not commutative.
>> 5) "-" is not associative; therefore, we need a convention to
>> determine what something like "a - b - c" means. Our convention is
>> that "-" is "left-associative", i.e., a - b - c = (a - b) - c, never a
>> - (b - c).
>> 6) Multiplication distributes over addition: a : (b + c) = a : b + a : c
>
> Things get worse when you add the rule a*b=a+b+a:b.
I'm afraid I still don't see how :-(
>> You can prove all of these yourself, if you like (well, except for the
>> conventions, of course) -- they just follow directly from the
>> definitions of "+", "-", ":" that I gave above.
>>
>> (Are there any expectations of ordinary arithmetic that I'm forgetting
>> about? Well, obviously the ones about identities and inverses don't
>> apply, but if there are others, let me know.)
>>
>> In addition, of course, R has "*", "/", and "^", but these are defined
>> in terms of the operations above (and are very well-behaved
>> algebraically as far as I can tell -- "*" is commutative and
>> associative and distributive, etc.) so I don't think they introduce
>> any new horribleness.
>
> I gave an example in an earlier email where "x*y=x+y+x:y" leads, along with
> "x*x=x" and "x+x=x" yields inconsistencies.
>
> a*a-a = a+a+a:a-a = (a+a+a:a)-a = a-a = 0
>
> or
>
> a*a-a = a + a + a:a-a =a + (a+ a:a-a) = a + 0 = a
Technically, actually, there is no ambiguity here -- precedence says
a*a - a = (a*a) - a
and now that order of evaluation is clear, we can rewrite the binary
operators as functions to clarify why this is so:
-(*(a, a), a)
where we have
def *(x, y):
    return +(+(x, y), :(x, y))
Calling *(a, a) will return a (or rather the simplicial complex
{{a}}), not an unevaluated symbolic expression a + a + a:a, so the
answer is 0.
But I think the problem you're thinking of arises even without "*":
a + a - a = (a + a) - a = a - a = 0
or
a + a - a = a + (a - a) = a
And that's what I listed in my previous email as rule (7), the one
rule that's different: since expressions with mixed + and - are
ambiguous (well... even more ambiguous than in ordinary arithmetic),
we have to specify that we not only use left-associative evaluation
for expressions involving a *mix* of "+" and "-" (as we do in ordinary
arithmetic), but we also use left-associative evaluation for
subexpressions involving only "+". So the answer is 0.
Note also that this rule means that we get the same answer in your
case even if *(a, a) did return an unevaluated, unparenthesized symbolic
expression.
It's also nice that, all talk of simplicial complexes aside, for the
user, applying this rule in your head is very simple: if the same term
appears multiple times in the expression, you just look at the last
time it appears: if it has a "-" in front, then that term does *not*
appear in the final model, otherwise it does. (Parentheses do make
this a little more complicated, though.)
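A self-contained toy check (same set-of-frozensets representation as my
earlier sketch, nothing to do with any real implementation) that
left-to-right evaluation gives 0 in both of the disputed cases:

a = {frozenset({"a"})}

def plus(f, g): return f | g
def minus(f, g): return f - g
def colon(f, g): return {s | t for s in f for t in g}
def star(f, g): return plus(plus(f, g), colon(f, g))

print(minus(plus(a, a), a))   # (a + a) - a  ->  set(), i.e. 0
print(minus(star(a, a), a))   # (a*a) - a    ->  set(), i.e. 0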
>> (But, you don't eliminate duplicates in expressions like a*a
>> -- that's the only difference.)
>
> But this is a big difference because it breaks the commutativity
> and associativity.
R's "*" is both commutative:
a*b = a + b + a:b # definition
= b + a + b:a # commutativity of ":" and "+"
= b*a
and associative:
(a*b)*c = (a + b + a:b)*c # definition
= (a + b + a:b) + c + (a + b + a:b):c # definition
= a + b + a:b + c + a:c + b:c + a:b:c # distributivity of :
= a + b + c + b:c + a:b + a:c + a:b:c # commutativity of +
= a + (b + c + b:c) + a:(b + c + b:c) # distributivity of :
= a*(b + c + b:c)
= a*(b*c)
It's also distributive:
a*(b + c) = a + (b + c) + a:(b + c)
= a + b + c + a:b + a:c
= (a + b + a:b) + (a + c + a:c)
= a*b + a*c
Or am I missing something?
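If anyone wants to poke at those identities, here's a toy check on the same
set representation as my earlier sketch -- just a sanity check, not a proof:

def plus(f, g): return f | g
def colon(f, g): return {s | t for s in f for t in g}
def star(f, g): return plus(plus(f, g), colon(f, g))

a, b, c = ({frozenset({n})} for n in "abc")
assert star(a, b) == star(b, a)                             # commutative
assert star(star(a, b), c) == star(a, star(b, c))           # associative
assert star(a, plus(b, c)) == plus(star(a, b), star(a, c))  # distributive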
-- Nathaniel
Well, of course :-). But the simplicial complex framework makes
perfect sense whether we're talking about categorical variables or
not.
>> You don't need to remove non-maximal elements at the level of the
>> Formula -- they indeed don't affect which final model that's fit, but
>> they do affect its parametrization.
>
> Agreed, but we might as well have a smaller design matrix so
> if they can be eliminated they might as well be.
Yes, but in R's model it's important that this elimination happen in
the design matrix builder, not in the algebra. Because the design
matrix builder is smart about which columns are eliminated -- remember
that in the "a + b + a:b" case, it chooses to eliminate columns from
the a:b part, not from the a+b part. (Well, or rather, not eliminate
columns but rather reduce the dimensionality by coding them
differently, if we want to be precise. Basically the same result,
except with nicer properties wrt orthogonality etc.)
>> > The final rule,
>> > x*x=x is where
>> > categorical variables matter, this is always true for categoricals,
>> > but not for numerics.
>> >
>> > Anyways, I implemented a "parser" for this formula for ANOVA (using
>> > just "*".split() and "+".split()).
>>
>> Using .split() is pretty different from a real parser, unfortunately,
>> because the characters "*" and "+" might appear embedded inside terms
>> -- like the R formula "log(a + b) + c", which .split() will make a
>> hash of. (I realize that you might not care about allowing such
>> arbitrary data transformations in this 'simplified' syntax, though,
>> but they should be possible, and a real parser is doable.)
>
> Right, I only proposed this syntax for ANOVA models where log is never an
> issue.
But the models I like to fit have a mix of numeric and factorial terms :-(
(Including interactions between them. a:b where a is factorial and b
is numeric is a very useful construct in R; there's no special case
for it, but the standard rules still do 'the right thing' regardless
of whether 'a' has appeared earlier as a main effect, etc.)
>> > The interactions or rows in the ANOVA table are generated by all
>> > elements of the simplicial
>> > complex. So, we would get the following rows in this example: ['a:b',
>> > 'b:c', 'c:d:e','c:d','c:e','d:e','a','b','c','d','e'].
>> > For a full factorial design with k terms, the corresponding simplicial
>> > complex is the k (or k+1 depending on how you name it) simplex,
>> > and this yields all interactions.
>>
>> R intentionally only includes in the ANOVA tables those terms which
>> are mentioned, rather than generating the full set of elements.
>
> But that's just reporting. The fit still included the lower-order
> interactions, or
> else it doesn't follow R's rule of including main effects from any
> interaction. Which
> does it use in the fit?
I'm not sure what to do except point you to my worked example, the one
where in "a + b + a:b", the "a:b" term generates (k-1)**2 columns, but
in "0 + a:b" it generates k**2 columns. The result is that the ANOVA
tables R prints for these two models both have a line labeled "a:b",
but that line corresponds to a different hypothesis (with a different
number of degrees of freedom!) in the two cases. In the first case
it's a traditional "interaction" test, in the second it's a test of
whether any combination of a and b has a non-zero mean.
>> For
>> instance, if you have a nested variable that's not coded as such (like
>> you have schools 1-10, and at each school you have teachers 1-10, but
>> teacher 1 at school 1 is different from teacher 1 at school 2), then
>> something like "a + a:b" (or the equivalent shortcut designed for such
>> cases, "a/b") is meaningful, and turning it into "a + b + a:b" will
>> produce an incorrect ANOVA table.
>
> A nested factor can be handled, at least conceptually by making Factors with
> (teacher, school) tuples
> as the levels. To handle the ANOVA rules, one would have to manipulate the
...?
>> > One could allow '-' above, which would correspond to deleting one of
>> > the simplices that generated the simplicial complex.
>> > This raises the issue of what to do with "a*b*c+c*d+e*f-z"? Failing
>> > silently, i.e. ignoring 'z' is always an option but seems a little
>> > worrying.
>>
>> Yeah, I also strongly considered making "- z" here an error, but then
>> ended up deciding to stick with R compatibility...
>
> So, 1-z=1 \forall z?
Sorry, I don't follow.
>> ...I think in the end, though, we might be worrying too much about
>> such edge cases. "-" is very rarely used in the first place, and the
>> user isn't going to come up with crazy complicated expressions just
>> for fun.
>
> OK, but then one should carefully point exactly which cases the parser
> gets correct, and which it doesn't.
Well, but for that we have to define "correct". Any parser will get
all expressions correct according to the formal algorithm expressed by
its own code :-). In this case, in addition, it's straightforward to
make it get all expressions correct according to the simplicial
complex model. That's valuable, but my point is that most users
probably don't care either way as long as it does what they expect in
the cases that arise in practice.
>> > There are also issues with order of operations / commutativity if one
>> > allows '-' and silently ignores lingering '-' signs. Consider
>> >
>> > a*(b-c) + a*c
>> >
>> > There are at least two possible answers to this: a*b+a*c or a*b
>> > depending on when one tries to drop something based on the minus
>> > sign.
>>
>> See my other email for an argument why a*b + a*c is the correct
>> answer.
>
> But I don't want to have to read an email to know which answer is correct.
> Will you answer all my questions for any formula I have?
The formalism I gave there -- the same AFAICT as the formalism you
came up with yourself -- will answer all your questions for any
formula you have.
>> Applying rules from arithmetic that just aren't true for
>> sets-of-sets/simplicial complexes seems like a bad idea to me.
>>
>
> But that's my point! You're the one that is not following rules that aren't
> arithmetic. I
> have a well-defined "*" and "+" for ANOVA problems. It's not arithmetic,
> they
> represent operations on 2^(2^F) that are perfectly well-defined. I know it's
> not a parser,
> I don't trust a parser if you allow anything else besides "*" and "+" and
> want
> to keep these redundancy rules.
I'm not sure where you get the idea that I insist on arithmetic rules,
except maybe my note in a previous email that my parser's evaluator
does that for "-" in some cases. But like I said there, I consider
that a bug. I too am talking in terms of operations on 2^(2^F) that
are perfectly well defined, just AFAICT that includes *all* of R's
formula operators, not just "+" and ":"/"*", and they work perfectly
well for non-ANOVA problems too.
-- Nathaniel
Quoting myself: "I meant it's "algebraic" in the sense that it's a
simple set of rules applied consistently; we certainly could
axiomatize it if we really felt the need." I don't think this
definition comes entirely from left field; certainly there's precedent
for using the word "algebra" as a generic term for structured
operations on some space of mathematical objects! But it's also true
that I hadn't thought those rules through as clearly as I have now,
so, oh well. Hopefully we're at least on the same page now?
>> except maybe my note in a previous email that my parser's evaluator
>> does that for "-" in some cases. But like I said there, I consider
>> that a bug. I too am talking in terms of operations on 2^(2^F) that
>> are perfectly well defined,
>
> Well, each operator is well-defined but the final answer depends on
> the order in which you apply these operators: think of a*(b-c)+a*c or
> z-(1+z).
> So, again, if you're going to use some symbolic system to evaluate the
> result, you have to be careful about the order of operations.
Yes, of course.
> I should also emphasize that I don't think R's formula is irrational;
> it is very useful and can be formalized in terms of operations on a
> simplicial complex. I just pointed out that it is not algebraically
> consistent, so trying to make it look algebraically consistent by
> applying order of operations, etc. is risky and filled with
> inconsistencies that are not easily dismissed as "bugs".
Well, my turn to quibble about terminology: I wouldn't use the word
"consistent" here; perhaps it's the mathematician in me talking, but it
just isn't the case that the R formula operations produce
contradictions; we have a perfectly reasonable axiomatization, after
all.
I *think* what you're saying is that it can produce results that you
find surprising -- perhaps partly because you are used to thinking of
formulas in terms of the vector arithmetic used to build the design
matrix, rather than as a simple language for describing simple set-like
objects? The only real operation anyone does on this representation is
look at something like "~ a + b + b:whatever + c/d" and work out in
their head what the corresponding model is; associativity is just not
that critical for such reasoning, in the way it would be if you were
doing complex algebraic manipulations. Like I said, in practice it's
trivial to work out the interaction of +'s and -'s, you just look to
see which comes last.
-- Nathaniel
I guess I thought my talking about "different algebras" would be a
clue that I was using "algebra" in a more generic sense than "those
identities which apply to ordinary arithmetic, i.e., the ring/field
axioms"? What can I say -- I'm sorry this caused so much confusion.
> This is what I originally objected to, and trying to follow order of
> operations with these rules will give us problems.
But I guess I still don't see what these problems are. You've
mentioned teaching with R, so I guess you must have some real
experience with both the formula language and people's confusions
regarding it -- can you, for instance, give an example of a realistic
R-style formula that people would have trouble understanding because
of the order of operations?
(Heck, would it help if I offered to just take "-" out of my parser?
I'm not sure I've ever used it in real life; probably the only really
important use is with R's 'update' function, where you can say
"update(my.model, y ~ . - x)" as a shorthand to say "give me a model
that uses the same formula as that one, except with 'x' removed", but
I don't even implement the magic placeholder "." syntax.)
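(For concreteness, the usual R idiom looks roughly like this, with a
made-up data frame d:)
fit <- lm(y ~ x1 + x2 + x3, data = d)
fit2 <- update(fit, . ~ . - x3)   # same model, refit with x3 dropped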
-- Nathaniel
Sure, anything can be done with a nipy Formula. Just like anything can
be done with an R-style formula. But my point was that I find great
value specifically in R's readability and smart design matrix builder
when I'm doing more general regression, and you were saying those are
things useful only for ANOVA. (Or at least, that's the impression I
got.)
> Use of the name "Term" is probably not ideal. Another name would be
> "Indeterminate". Then, a "Formula" is a set of real-valued functions
> of "Indeterminates" (and possibly some parameters, which is everything
> that is not an Indeterminate and is an atom of sympy).
> A "Factor" is a set of binary-valued functions of a categorical
> "Indeterminate", these binary-valued functions are called
> "FactorTerm"s (or maybe "FactorIndeterminate" ?).
>
> The product of two Formulae is a new set of real-valued functions
> determined by the pairwise products of these real-valued functions.
>
> So, to get an interaction between x=Term('x') and g=Factor('gender',
> ['male','female']), one forms g*x, which is a new Formula with two
> functions [x*gender_male, x*gender_female]. This is used in the
> fitting; to find the interactions, one looks at the expression
> Formula([x])*g.main_effect = Formula([x*(gender_male-gender_female)])
This is very cool, definitely. But it still strikes me as forcing me
to think hard about details at moments when those details are
irrelevant and distracting :-(.
>> I'm not sure what to do except point you to my worked example, the one
>> where in "a + b + a:b", the "a:b" term generates (k-1)**2 columns, but
>> in "0 + a:b" it generates k**2 columns. The result is that the ANOVA
>> tables R prints for these two models both have a line labeled "a:b",
>> but that line corresponds to a different hypothesis (with a different
>> number of degrees of freedom!) in the two cases.
>
> That's what I find worrying actually!
> Why should you expect users, who are told "a*b=a+b+a:b" to know that
> using "b+a:b", which looks like it should be equal to "a" gives the
> same fitted values as "a*b" but a wildly different ANOVA
> table?
Sorry? What user would look at "a*b=a + b + a:b" and conclude that "b
+ a:b = a"?
But even so, the ANOVA tables are not hard to predict, nor very surprising:
"a + b + a:b" -> produces an ANOVA table with lines "a", "b", "a:b"
"b + a:b" -> produces an ANOVA table with lines "b", "a:b"
And IIRC the rule for interpreting them is easy and consistent too:
each line tests whether the named factor has a significant effect
*given* the lines above. (For an unbalanced design, "a + b + a:b" also
gives different results than "b + a + a:b", for the same reason. So
you *need* to know this rule if you want to know what hypotheses
you're testing regardless!) Given that, it should be surprising if the
"a:b" line in "a + b + a:b" and "b + a:b" tables were *the same*!
>> > So, 1-z=1 \forall z?
>>
>> Sorry, I don't follow.
>
> If -z is ignored when it's not in what preceded it, then 1-z=1.
Oh, I see, I thought you were defining some new binary operator
\forall or something :-).
Yes. (Well, unless z = 1.)
>> Well, but for that we have to define "correct".
>
> OK. Define "correct" for me.
The rules I gave seem like a perfectly reasonable choice to me. If you
have another suggestion I'd be happy to hear it.
-- Nathaniel
I withdraw the word.
-- Nathaniel
Which operations are you thinking of? Only the final design matrix
builder -- which in my understanding takes over only after all 2^(2^F)
operations are complete -- even knows which variables are factors and
which numeric.
> In terms of fits (much less the rows of the ANOVA table), we have
>
> factorA * factorB = factorA + factorA:factorB
> numericA * numericB != numericA + numericA:numericB
> numericA * factorB != numericA + numericA:factorB
> factorA * numericB = factorA + factorA:numericB
>
> How about higher order rules, is there a predictable rule here?
Well, I can try. First, in all cases, a*b = a+b+a:b.
Then there's the question of whether b is nested in a:b. If it is,
then you can drop it from the formula without affecting the eventual
fitted values -- that's what gives rise to all the equalities above.
For numeric:numeric interactions, R interprets this as a simple
pointwise product, so neither a nor b is nested in a:b. This is a
pretty useless rule -- you can get the same thing more clearly by just
writing it as an arithmetic operation I(a*b) -- but I guess many
people do refer to this sort of quadratic term as "the interaction of
a and b". I'm not prepared to defend it very far, though; I'm pretty
sure R only has this rule for completeness, and Chambers and Hastie
call it "probably the least meaningful form of interaction". But
anyway, that's why numericA + numericB + numericA:numericB != numericA
+ numericA:numericB.
Factor:numeric interactions are interpreted as a "varying slopes"
model, i.e., one where each level of the factor gives a different
coefficient for the numeric. So in this case the numeric variable is
nested, but the factor variable is not. That's why factorA + numericB
+ factorA:numericB = factorA + factorA:numericB, but not numericB +
factorA:numericB.
Factor:factor interactions are interpreted as I've said in previous
emails. Or you can think of this as following the previous rule --
it's a quote-unquote "varying slopes" model where the quote-unquote
"slope" for each level of the one factor varies with each level of the
other factor. In any case, for this sort of interaction both
individual variables are nested. That's why factorA + factorB +
factorA:factorB = factorA + factorA:factorB.
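(A quick way to convince yourself of the factor:numeric claim above, with
made-up data -- g is a 3-level factor, x and y are numeric:)
set.seed(0)
d <- data.frame(g = gl(3, 10), x = rnorm(30), y = rnorm(30))
f1 <- lm(y ~ g + x + g:x, data = d)  # g*x written out in full
f2 <- lm(y ~ g + g:x, data = d)      # x is nested in g:x, so dropping it...
all.equal(fitted(f1), fitted(f2))    # ...doesn't change the fitted values
f3 <- lm(y ~ x + g:x, data = d)      # but g is not nested in g:x,
all.equal(fitted(f1), fitted(f3))    # so this one differs in general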
> I confess that I can never really tell until I fit the model
> and look at the names of the columns R has spit out, and this is what
> I tell my students, too.
>
> I must be a little slow or something.
My impression is that it's a really elegant implementation with really
lousy documentation. But if it's useful, Ch. 2 of Chambers and Hastie
(1992) does make some things clearer, at least... And the 4th edition
of Modern Applied Statistics with S has a discussion starting on page
145, that's in some ways more detailed and in some ways more confusing
(mostly because of their alternation between discussing R and S+, the
latter of which apparently has a number of quirky behaviors). But
their discussion is phrased explicitly in terms of operations on parts
of the model matrix, so it might be easier to see the connections to
your preferred approach.
-- Nathaniel
Maybe it's time for my example:
numericY ~ factorA:(1 + factorB + numericX1 + numericX2) + numericZ
(":" or "*", I'm not sure)
I want to test whether the intercept and the slopes (for B, X1, X2)
differ by groups defined by factorA.
Say factorB has 3 levels (>2).
I don't really care about the parameterization; all I want is to test
all possible null hypotheses on the coefficients/effects based on only
a single regression (using some default dummy encoding)
(1) H0: factorB has no effect, i.e. all slopes are identical
(2) H0: factorB only affects intercept, but all slopes are constant (
no interaction effects)
H0Model: numericY ~ factorA + factorB + numericX1 + numericX2 + numericZ
(3) H0s: slope of an x for all levels of factorB are the same, nulls
for each x in {B, X1, X2}
if first (or third) H0 is rejected
(4) H0s: all slopes of levels i and j of factorB are the same, for all pairs i,j
(5) H0s: slope of an x of levels i and j of factorB are the same,
nulls for all pairs i,j, for each x in {B, X1, X2}
right now, I don't have a Z, and just generate all contrast matrices
and f_tests for Null hypotheses (1) and (4) (I think), and print a
table of all results
Is there a formula notation to specify e.g. one of the pairs and
variables in (5) on demand, without just reestimating?
Also, I know how to do everything in this example, but how do you get
the generic answer for all the tests?
If I specify two models (representing base model and one of the null
models in the above)
model1: numericY ~ factorA:(1 + factorB + numericX1 + numericX2) + numericZ
model2: numericY ~ factorA + factorB + numericX1 + numericX2 +
numericZ (no interaction)
Can the contrast matrix be built without re-estimating, similar to
Jonathan's example for contrastfromcols?
And for the fun of it (because it's easy to do) the same as above but
with factorA-group-specific heteroscedasticity, i.e. weighted least
squares with error variances estimated for each group.
---
In one of the previous examples, I find it confusing if ":" changes
meaning, I think the labels for the interesting effects should not
depend on the parameterization, e.g.
y ~ factorA*factorB or y ~ factorA + factorB + factorA*factorB
H0 (total): factorA has no effect, model is y ~ factorB
H0 (interaction): no interaction effect, model is y ~ factorA + factorB
H0 (main effect): factorA has no main effect, model is y ~ factorB +
factorA*factorB (maybe illegal)
or did I miss the main point in this?
----
formula candy:
suppose I have 20 regressors, do I have to type the names of all of them,
y ~ X1 + X2 + ..... + X20
or is there a shorthand notation?
I guess I will have to read this thread a few more times when I start
to look at implementation details (most likely second half of July) to
understand enough of it.
Josef
>
> -- Nathaniel
>
(Assuming R here)
For simplicity, I'll look only at numericX1 for a moment. Basically
you have two options:
numericX1 + factorA:numericX1
or
factorA:numericX1
The difference is that in the first case, R will fit one "overall"
slope for numericX1 (since that's what you said you wanted when you
put numericX1 in the model as a term by itself!), and then ancillary
"contrast" slopes for the different levels of factorA. What exactly it
does depends on what sort of contrast coding you use, but the default
will be to use "treatment" contrasts. So what it'll give you as
parameters are (1) the absolute slope for numericX1 when factorA is at
its "baseline" (first) level, and (2) the *differences* between the
baseline numericX1 slope and the numericX1 slope for the other levels
of factorA.
On the other hand, if you enter it as a bare 'factorA:numericX1', then
R will give you the absolute slopes for numericX1 at each level of
factorA.
That's the general rule -- R builds the design matrix by going from
low-order to high-order and left to right, and for each term adds in
any *new* columns (= linearly independent of existing columns).
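(Concretely, with made-up data where factorA has 3 levels, the two options
give you:)
set.seed(0)
d <- data.frame(factorA = gl(3, 10), numericX1 = rnorm(30),
                numericY = rnorm(30))
coef(lm(numericY ~ numericX1 + factorA:numericX1, data = d))
# numericX1 = slope at factorA's baseline level; the interaction entries
# are the differences from that baseline slope
coef(lm(numericY ~ factorA:numericX1, data = d))
# one absolute numericX1 slope per level of factorA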
> say factorB has 3 levels (>2)
>
> I don't really care about the parameterization, all I want are to test
> all possible Null hypothesis on the coefficients/effects based on only
> a single regression (using some default dummy encoding)
>
> (1) H0: factorB has no effect, i.e. all slopes are identical
> (2) H0: factorB only affects intercept, but all slopes are constant (
> no interaction effects)
> H0Model: numericY ~ factorA + factorB + numericX1 + numericX2 + numericZ
> (3) H0s: slope of an x for all levels of factorB are the same, nulls
> for each x in {B, X1, X2}
>
> if first (or third) H0 is rejected
> (4) H0s: all slopes of levels i and j of factorB are the same, for all pairs i,j
> (5) H0s: slope of an x of levels i and j of factorB are the same,
> nulls for all pairs i,j, for each x in {B, X1, X2}
>
> right now, I don't have a Z, and just generate all contrast matrices
> and f_tests for Null hypotheses (1) and (4) (I think), and print a
> table of all results
>
> Is there a formula notation to specify e.g. one of the pairs and
> variables in (5) on demand, without just reestimating?
> Also, I know how to do everything in this example, but how do you get
> the generic answer for all the tests?
I'm not sure what you mean. Again, talking about R, formulas are just
a way to make a (labeled) design matrix. If you want a convenient
notation for picking out particular columns after the matrix is built,
then it's nice to have those labels and you can write some sort of API
that accepts them and does whatever you want. But formulas per se
don't necessarily have anything to do with it, at that point. Unless
you have a neat idea for an API that makes use of formulas and is an
elegant way to accomplish what you want, in which case we could
implement it, obviously.
> If I specify two models (representing base model and one of the null
> models in the above)
> model1: numericY ~ factorA:(1 + factorB + numericX1 + numericX2) + numericZ
> model2: numericY ~ factorA + factorB + numericX1 + numericX2 +
> numericZ (no interaction)
>
> Can the contrast matrix be build without re-estimating, similar to
> Jonathan's example for contrastfromcols?
So the question is: If I've fit a big model, can I write down the
formula specifying a smaller (i.e., nested) model and do a nested
hypothesis test directly without fitting the smaller model? (At least
in the OLS case?)
Interesting idea, I haven't thought about it. In simple cases like
when the difference between the two formulae is just that the "big"
formula has more terms tacked onto the end, then it's easy, but that
means it just has more columns tacked onto the end of its model matrix
and the hypothesis test is just whether the betas for those columns
are jointly zero. In more complicated cases then I *think* the formula
representation gives you enough information to work out the proper
contrasts without *too* much pain, but I'd have to think about it.
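(Here's a minimal sketch of the easy case -- just the arithmetic, not
anybody's settled API. Assume 'big' is an already-fitted lm and 'Rmat' is a
restriction matrix such that the smaller model corresponds to
Rmat %*% coef(big) == 0.)
wald_f_pvalue <- function(big, Rmat) {
  b <- coef(big)
  V <- vcov(big)                 # sigma^2 * (X'X)^{-1} from the big fit
  Rb <- Rmat %*% b
  stat <- t(Rb) %*% solve(Rmat %*% V %*% t(Rmat)) %*% Rb / nrow(Rmat)
  # for OLS this is exactly the usual nested-model F statistic
  pf(stat, nrow(Rmat), df.residual(big), lower.tail = FALSE)
}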
I haven't had a chance yet to sit down to re-read the various messages
about contrast_from_cols_or_rows, so I don't understand what its
purpose is yet. But the code itself is simple and AFAICT could be
implemented in any system.
> And for the fun of it (because it's easy to do) the same as above but
> with factorA-group-specific heteroscedasticity, i.e. weighted least
> squares with error variances estimated for each group.
I think this would just be something like:
fit(..., group_heteroscedasticity=factorA)
? A simple grouping term is just, like, a vector or something, it
isn't a design matrix and doesn't really care how you write down the
design matrix?
> ---
> In one of the previous examples, I find it confusing if ":" changes
> meaning, I think the labels for the interesting effects should not
> depend on the parameterization, e.g.
>
> y ~ factorA*factorB or y ~ factorA + factorB + factorA*factorB
>
> H0 (total): factorA has no effect, model is y ~ factorB
> H0 (interaction): no interaction effect, model is y ~ factorA + factorB
> H0 (main effect): factorA has no main effect, model is y ~ factorB +
> factorA*factorB (maybe illegal)
>
> or did I miss the main point in this?
I don't know, but I'm confused! When does ":" change meaning? And in
your R-style formulas above did you mean ":" when you wrote "*"?
Many hypothesis tests are written as tests of the parameters, so of
course when written in that form they depend on the parametrization.
I'm not sure there's any other way to write them in general. What else
would you write them in terms of?
(Nested hypothesis tests are described in terms of models, instead of
formulas, which is nice. But then to specify the restricted model, we
still often talk in terms of the parameters, like "the best model from
the subspace where beta1 is constrained to be twice beta2" or
whatever.)
> ----
> formula candy:
> suppose I have 20 regressors, do I have to type the names of all of them,
> y ~ X1 + X2 + ..... + X20
> or is there a shorthand notation?
In R, you can use the magic term "." to mean "the default set of
terms". In specifying a model from scratch, "." means "one term for
every column in my data frame (except the columns that are mentioned
on the left-hand-side)", like:
lm(y ~ ., my.data)
I haven't tried implementing it, but it's certainly doable.
HTH,
-- Nathaniel
As I understand it the parameterization and the resulting statistics
that are calculated depend on the formula. So the reported results
differ, even if the design matrix spans the same space.
This is useful for the interpretation of the regression coefficients,
but looks incomplete for testing other joint hypotheses on the
regression coefficients.
Since I never really used R's (or any other) formulas, I'm still
trying to figure out what's possible and how it fits together.
But if we have the information about the model and the design matrix
provided by the formula, then it would be very useful to combine it
with some additional helper functions that construct contrast matrices
for f_tests (and t_tests) especially for cases where they are not
trivial.
>
>> If I specify two models (representing base model and one of the null
>> models in the above)
>> model1: numericY ~ factorA:(1 + factorB + numericX1 + numericX2) + numericZ
>> model2: numericY ~ factorA + factorB + numericX1 + numericX2 +
>> numericZ (no interaction)
>>
>> Can the contrast matrix be build without re-estimating, similar to
>> Jonathan's example for contrastfromcols?
>
> So the question is: If I've fit a big model, can I write down the
> formula specifying a smaller (i.e., nested) model and do a nested
> hypothesis test directly without fitting the smaller model? (At least
> in the OLS case?)
>
> Interesting idea, I haven't thought about it. In simple cases like
> when the difference between the two formulae is just that the "big"
> formula has more terms tacked onto the end, then it's easy, but that
> means it just has more columns tacked onto the end of its model matrix
> and the hypothesis test is just whether the betas for those columns
> are jointly zero. In more complicated cases then I *think* the formula
> representation gives you enough information to work out the proper
> contrasts without *too* much pain, but I'd have to think about it.
The nested model could also just have some interaction terms of
several groups combined.
As an example of what I was thinking about:
Assume we have quarterly data for three decades (70s, 80s, 90s), where
all or some slope coefficients might vary by decade. So decade is a
categorical variable with 3 levels.
One hypothesis could be that all slope coefficients were the same in
the 70s and 80s, so in the reduced model decade would have only two
levels.
In the original big (unconstrained) model I can do an f_test that
specifies that the slope of each regressor is the same for levels 70s
and 80s of the decade interaction terms.
So, I guess the difference in the approach is that R and SAS use
formulas to get the desired effects encoded in the design matrix,
while I (and I think it's more common in econometrics) would use a
(boring) dummy variable encoding and spend more effort on getting the
linear restrictions matrices for the f_tests of various effects.
>
> I haven't had a chance yet to sit down to re-read the various messages
> about contrast_from_cols_or_rows, so I don't understand what its
> purpose is yet. But the code itself is simple and AFAICT could be
> implemented in any system.
I will need to play quite a bit with it before I understand how this
really works.
>
>> And for the fun of it (because it's easy to do) the same as above but
>> with factorA-group-specific heteroscedasticity, i.e. weighted least
>> squares with error variances estimated for each group.
>
> I think this would just be something like:
> fit(..., group_heteroscedasticity=factorA)
> ? A simple grouping term is just, like, a vector or something, it
> isn't a design matrix and doesn't really care how you write down the
> design matrix?
Yes, in statsmodels we haven't gotten that far yet. We don't yet have a
convenient specification of heteroscedasticity for WLS, or a more
general specification (random effects, correlations) for GLS.
Both WLS and GLS take the error covariance matrix as given.
>
>> ---
>> In one of the previous examples, I find it confusing if ":" changes
>> meaning, I think the labels for the interesting effects should not
>> depend on the parameterization, e.g.
>>
>> y ~ factorA*factorB or y ~ factorA + factorB + factorA*factorB
>>
>> H0 (total): factorA has no effect, model is y ~ factorB
>> H0 (interaction): no interaction effect, model is y ~ factorA + factorB
>> H0 (main effect): factorA has no main effect, model is y ~ factorB +
>> factorA*factorB (maybe illegal)
>>
>> or did I miss the main point in this?
>
> I don't know, but I'm confused! When does ":" change meaning? And in
> your R-style formulas above did you mean ":" when you wrote "*"?
Yes, in this part I incorrectly wrote * when I meant ":".
It showed up several times in the discussion, quoting you
"The result is that the ANOVA
tables R prints for these two models both have a line labeled "a:b",
but that line corresponds to a different hypothesis (with a different
number of degrees of freedom!) in the two cases"
>
> Many hypothesis tests are written as tests of the parameters, so of
> course when written in that form they depend on the parametrization.
> I'm not sure there's any other way to write them in general. What else
> would you write them in terms of?
I was thinking more in terms of using only a single default parameterization,
e.g. y ~ factorA + factorB + factorA*factorB,
and then constructing the corresponding contrast matrices for all
possible tests and corresponding coefficients, and giving them
identifiable (implementation-independent) names or labels.
>
> (Nested hypothesis tests are described in terms of models, instead of
> formulas, which is nice. But then to specify the restricted model, we
> still often talk in terms of the parameters, like "the best model from
> the subspace where beta1 is constrained to be twice beta2" or
> whatever.)
This would be easier if there were a way to get the parameter
restrictions out of the comparison of two nested models (as discussed
above).
>
>> ----
>> formula candy:
>> suppose I have 20 regressors, do I have to type the names of all of them,
>> y ~ X1 + X2 + ..... + X20
>> or is there a shorthand notation?
>
> In R, you can use the magic term "." to mean "the default set of
> terms". In specifying a model from scratch, "." means "one term for
> every column in my data frame (except the columns that are mentioned
> on the left-hand-side)", like:
> lm(y ~ ., my.data)
> I haven't tried implementing it, but it's certainly doable.
good to know, and I would think it's useful
Thanks,
Josef
>
> HTH,
> -- Nathaniel
>
Yes.
> This is useful for the interpretation of the regression coefficients,
> but looks incomplete for testing other joint hypotheses on the
> regression coefficients.
Sure, the ability to modify the parametrization in some limited ways
isn't a replacement for being able to perform arbitrary linear
hypothesis tests. You still need that ability, and you still have it.
>> I'm not sure what you mean. Again, talking about R, formulas are just
>> a way to make a (labeled) design matrix. If you want a convenient
>> notation for picking out particular columns after the matrix is built,
>> then it's nice to have those labels and you can write some sort of API
>> that accepts them and does whatever you want. But formulas per se
>> don't necessarily have anything to do with it, at that point. Unless
>> you have neat idea for an API that makes use of formulas and is an
>> elegant way to accomplish what you want, in which case we could
>> implement it, obviously.
>
> Since I never really used R's (or any other) formulas, I'm still
> trying to figure out what's possible and how it fits together.
>
> But if we have the information about the model and the design matrix
> provided by the formula, then it would be very useful to combine it
> with some additional helper functions that construct contrast matrices
> for f_tests (and t_tests) especially for cases where they are not
> trivial.
Yes, definitely.
Yes, there is a whole tradition in ANOVA-land of using nothing but
"these parameters are jointly zero" tests. It's sort of silly, but
there it is.
There are other advantages to being able to do coding beyond dummy
variables, though -- you can have better interpretability, and there
*might* be cases where you have to do contrast-coding to avoid
collinearity. I'm not totally sure.
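(For what it's worth, in R you don't have to build such codings by hand to
get something other than treatment dummies; with made-up names:)
lm(y ~ a + x, data = d, contrasts = list(a = contr.sum))
# same model, but the factor a is coded with sum-to-zero contrasts
# instead of treatment dummies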
>>> In one of the previous examples, I find it confusing if ":" changes
>>> meaning, I think the labels for the interesting effects should not
>>> depend on the parameterization, e.g.
>>>
>>> y ~ factorA*factorB or y ~ factorA + factorB + factorA*factorB
>>>
>>> H0 (total): factorA has no effect, model is y ~ factorB
>>> H0 (interaction): no interaction effect, model is y ~ factorA + factorB
>>> H0 (main effect): factorA has no main effect, model is y ~ factorB +
>>> factorA*factorB (maybe illegal)
>>>
>>> or did I miss the main point in this?
>>
>> I don't know, but I'm confused! When does ":" change meaning? And in
>> your R-style formulas above did you mean ":" when you wrote "*"?
>
> Yes, in this part I wrote incorrectly * when I meant ":"
>
> It showed up several times in the discussion, quoting you
> "The result is that the ANOVA
> tables R prints for these two models both have a line labeled "a:b",
> but that line corresponds to a different hypothesis (with a different
> number of degrees of freedom!) in the two cases"
Okay, right. What's going on there is that in an ANOVA table like:
a F(...) = ..., p = ...
b F(...) = ..., p = ...
a:b F(...) = ..., p = ...
we obviously are doing a bunch of hypothesis tests, look at all those
p-values, but what are the hypotheses being tested? The convention is
that each line gives the result of a test for whether the named term
*adds* predictive power over the *previous* lines. So it's a test of
"~ 1" vs. "~ 1 + a", then of "~ 1 + a" vs. "~ 1 + a + b", then of "~ 1
+ a + b" vs. "~ 1 + a + b + a:b".
So if we have another table like:
a F(...) = ..., p = ...
a:b F(...) = ..., p = ...
then "a:b" refers to the same thing, the ":" means the same thing as
always, but the test being performed on that line is different because
the *table* is different -- now we're testing "~ 1 + a" vs. "~ 1 + a +
a:b".
>> Many hypothesis tests are written as tests of the parameters, so of
>> course when written in that form they depend on the parametrization.
>> I'm not sure there's any other way to write them in general. What else
>> would you write them in terms of?
>
> I was thinking more in terms of using only a single default parameterization
> e.g. y ~ factorA + factorB + factorA*factorB
> and then construct the corresponding contrast matrices for all
> possible tests and corresponding coefficients, and give them
> identifiable (implementation independent) names or labels.
Uh, a linear hypothesis test is specified by an nxk matrix + an nx1
vector (where k is the number of parameters). Some of them are
redundant or contradictory, sure, but still "all possible tests" =
"infinitely many"...
-- Nathaniel