Statsmodels sprint: formula framework and pandas integration

Wes McKinney

Jul 16, 2011, 4:00:28 PM
to pystat...@googlegroups.com, jonatha...@stanford.edu, Fernando Perez
Skipper, Chris JS, and I have been sprinting on statsmodels topics
today and yesterday. A major initial goal I had was to prototype what
pandas integration with the results classes is going to look like. The
prototype is very simple and works nicely actually (see
pandas-integration branch on GitHub):

import scikits.statsmodels.api as sm
from pandas import DataFrame

data = sm.datasets.longley.load()
df = DataFrame(data.exog, columns=data.exog_name)
y = data.endog
df['intercept'] = 1.
olsresult = sm.OLS(y, df).fit()

In [10]: olsresult.params
Out[10]:
GNPDEFL 15.0618722714
GNP -0.0358191792926
UNEMP -2.02022980382
ARMED -1.03322686717
POP -0.0511041056538
YEAR 1829.15146461
intercept -3482258.6346

In [11]: print olsresult.summary()
<snip>
==============================================================================
| coefficient std. error t-statistic prob. |
------------------------------------------------------------------------------
| GNPDEFL 15.06 84.91 0.1774 0.8631 |
| GNP -0.03582 0.03349 -1.0695 0.3127 |
| UNEMP -2.020 0.4884 -4.1364 0.0025 |
| ARMED -1.033 0.2143 -4.8220 0.0009 |
| POP -0.05110 0.2261 -0.2261 0.8262 |
| YEAR 1829. 455.5 4.0159 0.0030 |
| intercept -3.482e+06 8.904e+05 -3.9108 0.0036 |
==============================================================================

What's nice about the setup is that if you pass a pandas DataFrame, you
get metadata attached to the results plus a nice summary, etc. If you
pass in ndarrays (the current API), you get ndarrays out (hence all the
unit tests pass). Completely transparent. There's a little
metaprogramming trickery inside, but it's not too bad (the code is very
brief). I'll write a blog article to go into a bit more detail sometime
in the next few days. pandas-ifying parts of statsmodels is really
going to be a big deal in terms of usability (e.g. most data munging
tasks are far easier with DataFrames, plus you have metadata).

OK, great, data structures. Now we need a formula framework. And soon.
I think the API should look like:

# formula_object you rolled yourself using whatever formula system you used
result = ols(formula_object, data=data)

or

# pass an R-like formula string which will be converted to a Formula
result = ols('y ~ a + b + c', data=data)

Josef, before you get worried: the *computational ndarray classes will
be completely untouched by this*. For those who know what they're
doing and don't mind munging their own design matrices, the
ndarray-based estimation classes will stay as is. But we'll have this
formula-based interface for users who want it and will likely make it
the advertised way of using statsmodels (while of course keeping docs
for the computational bits).

Now, I fear the religious debate over the right formula framework and
interface (see epic threads from earlier this year), but we need to
decide on something and start building out a nicer user interface. I'm
extremely excited about this and plan to invest quite a bit of time in
making statsmodels very easy to use, especially for R users.

So we need:

1) A formula framework. It would be possible to support multiple
formula frameworks but it would be best to have a single, official
formula package / subpackage for statsmodels.
2) A parser to convert R-like formula strings into a Formula object
3) User interface functions (e.g. like the above) in statsmodels
accepting formula and data

I've taken a look at charlton and at Jonathan's formula package:

https://github.com/charlton/charlton
https://github.com/jonathan-taylor/formula

We're leaning toward JT's sympy-based approach but I think we need to
make a careful evaluation. Point 2) above (the parser) has to happen
regardless...Nathaniel's already done it in charlton to emulate R's
formula framework but it could certainly be adapted to Jonathan's
formula package. I think it's worth providing a parse_r_formula
function (with all the warts of the R formula) to lessen the cognitive
dissonance for R users coming to statsmodels.

Anyway, I'd like to avoid another epic discussion and focus on the
practical goal of building a slick, usable set of formula-based
fitting functions in statsmodels. After pandas integration + formula
integration we will be looking at a very nice and much more compelling
package for statistical users. I also would like to see more people
working on statsmodels to really start closing the functionality and
usability gap between Python and R.

best,
Wes (and Skipper and ChrisJS)

Skipper Seabold

Jul 16, 2011, 4:40:59 PM
to pystat...@googlegroups.com, jonatha...@stanford.edu, Fernando Perez

FYI, we've forked formula to the statsmodels repo. Just a placeholder
while we figure out how this is going to work. I fixed a few things
with the docs build and pushed them to sourceforge, if you'd like to
have a look. I'm not aware that they were posted anywhere else. Are
there some docs missing? api/index.rst? pyplots/*?

http://statsmodels.sourceforge.net/formula/
https://github.com/statsmodels/formula

Matthew Brett

Jul 16, 2011, 6:41:47 PM
to pystat...@googlegroups.com
Hi,

Hahaha - you thought you'd got away without a 100 message discussion!

No, sorry, seriously, I just wanted to check that you'd come across
the work where Jonathan implemented an R-alike system using his
formula framework. That's the ANCOVA classes. I think it might be in
Jonathan's 'jonathan' branch in 'formula'. I believe he has tests in
there showing that he gets the same results as R, and he put quite a
bit of work into trying to uncover the various arcane rules that R
uses in going from the formula to the terms in the eventual model.

He also implemented the type I and II and III sum of squares terms
that R and others use in doing the F tests on the various parts of the
AN(C)OVA model - this is important because R users may be used to
seeing these outputs.

Thanks for sprinting, it's really good to see this stuff moving ahead.
If you can think of something for us isolated cases to do to help,
let us know,

Best,

Matthew

Skipper Seabold

Jul 16, 2011, 6:49:46 PM
to pystat...@googlegroups.com

Thanks, yeah, we are working with it now.

Matthew Brett

Jul 16, 2011, 6:55:19 PM
to pystat...@googlegroups.com
Hi,

Excellent - thanks. Good luck,

Matthew

Nathaniel Smith

Jul 17, 2011, 3:35:46 PM
to pystat...@googlegroups.com, jonatha...@stanford.edu, Fernando Perez
On Sat, Jul 16, 2011 at 1:00 PM, Wes McKinney <wesm...@gmail.com> wrote:
> We're leaning toward JT's sympy-based approach but I think we need to
> make a careful evaluation. Point 2) above (the parser) has to happen
> regardless...Nathaniel's already done it in charlton to emulate R's
> formula framework but it could certainly be adapted to Jonathan's
> formula package. I think it's worth providing a parse_r_formula
> function (with all the warts of the R formula) to lessen the cognitive
> dissonance for R users coming to statsmodels.

Parsing an R-like formula language isn't totally trivial, but... it's
close. It'd be a good weekly homework problem for an undergrad course.
I can't see how just dropping the parser on top of another formula
library would work, since 95% of the complexity in handling R-style
formulas comes after the parser.

AFAICT, at the high level, a formula library has the following interface:

A formula is basically a function that takes data and returns a design
matrix + metadata

In Charlton, this object is called a 'ModelSpec', which is a class
with a single method 'make_matrices', which takes a data dictionary
(or table, or whatever), and returns two 'ModelMatrix' objects (one
for the response, and one for the predictors). A ModelMatrix is just
an ndarray with an extra '.column_info' attribute, which gives names
to sets of columns, so if the user has a factor called 'a', then we
can say that it was coded into columns 1-3.

This interface, or some tweaked version of it, seems like it could
support any formula framework we want to use -- the only thing
statsmodels cares about is that there's some standard way to build
design matrices and to access the metadata, right?

(Though! One thing to watch out for is that you really do want the
metadata to be rich enough to support both a single name referring to
multiple columns, and multiple overlapping names, so that we know "a"
is columns 1-3, and simultaneously "a[level1]" is column 1,
"a[level2]" is column 2, etc. Is the stuff you guys are doing flexible
enough for this?)
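The "data in, matrices out" protocol described above is easy to picture in code. Here is a minimal sketch of it; the class and method names mirror the description of charlton's ModelSpec/ModelMatrix, but the implementation is purely illustrative, not charlton's actual code:

```python
import numpy as np

class ModelMatrix(np.ndarray):
    """An ndarray with an extra .column_info attribute mapping
    names to column slices (illustrative, not charlton's class)."""
    def __new__(cls, data, column_info):
        obj = np.asarray(data, dtype=float).view(cls)
        obj.column_info = column_info
        return obj

class ModelSpec:
    """Sketch of the one-method interface: make_matrices takes a
    data dictionary and returns (response, predictors)."""
    def __init__(self, build_lhs, build_rhs):
        self._build_lhs = build_lhs
        self._build_rhs = build_rhs

    def make_matrices(self, data):
        lhs, lhs_info = self._build_lhs(data)
        rhs, rhs_info = self._build_rhs(data)
        return ModelMatrix(lhs, lhs_info), ModelMatrix(rhs, rhs_info)

# Trivial hand-written builders standing in for a real formula engine:
data = {"y": [1.0, 2.0], "x": [3.0, 4.0]}
spec = ModelSpec(
    lambda d: (np.reshape(d["y"], (-1, 1)), {"y": slice(0, 1)}),
    lambda d: (np.column_stack([np.ones(len(d["x"])), d["x"]]),
               {"Intercept": slice(0, 1), "x": slice(1, 2)}),
)
lhs, rhs = spec.make_matrices(data)
```

The point of the sketch: the only thing statsmodels would have to depend on is `make_matrices` plus `.column_info`, and any formula framework could sit behind that.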

The interesting part of a formula framework, and where people might
disagree, is in the interface that the user uses to set up this
'ModelSpec' (or whatever) object.

-------------

Here's how that works in Charlton:

1) The user gives a string like "y ~ a + b + a:b"
2) This is parsed into a 'ModelDesc' object, which is a high-level
abstract description of a model. (As opposed to a ModelSpec object,
which knows all the nitty-gritty details.)
A ModelDesc is a set of left-hand-side Term objects, and a set of
right-hand-side Term objects.
In the above formula, the terms on the right are the intercept
(implicit), "a", "b", and "a:b". A Term object is represented as a set
of Factors -- the intercept is an empty set, the "a" and "b" terms are
singleton sets {a}, {b}, and "a:b" gives a set {a, b}. And a Factor is
any object that implements a certain protocol -- it can tell you its
name, it defines equality comparisons, and it can take a data
dictionary and give you some column(s) of data. The parser uses a
factor implementation called EvalFactor, which is constructed with a
string and uses eval(), but there could be others.

So we could also write the above model in Python as
desc = ModelDesc([Term([EvalFactor("y")])],
                 [Term([]),
                  Term([EvalFactor("a")]), Term([EvalFactor("b")]),
                  Term([EvalFactor("a"), EvalFactor("b")])])

Except for the Factors, these objects are very very simple, they're
literally just containers. So e.g., this expression:
desc.rhs_terms[2].factors[0]
will give us EvalFactor("b").

The formula language has a nice formal semantics in terms of
operations on these sets -- "+" is the union between two
sets-of-Terms, ":" is the union between two Terms themselves, etc.

If you want to construct a ModelDesc programmatically, that's
obviously very easy. Since these are the entire interface between the
parser and the rest of the framework, you never have to actually use
the parser if you don't want to -- the goal is to have just as nice an
interface for both people and programs.
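That formal semantics is small enough to sketch directly. Representing a Term as a frozenset of factor names (an illustrative simplification, not charlton's actual objects), "+" and ":" are just set unions at two different levels:

```python
# A Term is a frozenset of factor names; a formula side is a set of
# Terms. Function names here are illustrative only.
def plus(terms_a, terms_b):
    """'+' is the union of two sets-of-Terms."""
    return terms_a | terms_b

def colon(terms_a, terms_b):
    """':' unions the factors inside each pair of Terms."""
    return {ta | tb for ta in terms_a for tb in terms_b}

INTERCEPT = frozenset()          # the empty Term
a = {frozenset({"a"})}
b = {frozenset({"b"})}

# the right-hand side of "y ~ a + b + a:b":
rhs = plus({INTERCEPT}, plus(plus(a, b), colon(a, b)))
# rhs contains four Terms: the intercept, {a}, {b}, and {a, b}
```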

Also, if you're trying to fit a model that doesn't want to go via a
design matrix (I guess e.g. tree-based classifiers fall into this?),
then you can stop here and handle the ModelDesc using your own code,
without ever constructing a ModelSpec; that lets us use a standard
interface at the user level (formula strings and ModelDesc objects)
for such models, even though the implementation is different
underneath.

3) Finally, the ModelDesc object is combined with the data to produce
a ModelSpec object.

This is the tricky part.

Desired feature: If some Factor object returns categorical data (e.g.,
an array of strings), then we want to be able to handle that by
applying some coding scheme. (Of course, if you want to do your coding
by hand, then you can do that in your Factor object or whatever, which
will then return numerical rather than categorical data and this
doesn't apply.)

Desired feature: The user can apply transformations whose output
depends on all the data (e.g., centering, where you need to know the
mean for the whole data set). (R originally did not take this into
account; there are GIANT WARNINGS in all the old R books about it,
because it bit users on the butt. They eventually fixed it. Also, R
has an advantage here, since in R, each categorical factor is
annotated with the list of possible levels. In Python, we really need
to be able to handle categorical factors that are just coded as a list
of strings, which means that the *set of possible levels* is analogous
to the mean -- we need to calculate it by looking at all of the data
up front.)

Desired feature: The user can construct model matrices incrementally,
without ever loading the whole data set into memory. This makes
handling the above data-dependent transformations much trickier, since
you may need multiple passes through the data; one to calculate the
mean, and then one to transform the data. Or worse, maybe someone
wants to do spline(center(x), dof=5), and the spline() function
automatically picks knots at quantiles, so first we need a pass to
find the mean, then we need a second pass to find the quantiles after
subtracting off the mean, and then only on the third pass can we
actually build the design matrix. (This still isn't fixed in R. People
need it -- check out the 'biglm' package -- but it can bite you in
nasty ways -- again, check out the warnings in the biglm docs.)
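The multi-pass machinery can be sketched like this: a hypothetical stateful transform first "memorizes" what it needs (here, the mean) over one streaming pass through chunked data, and only then produces design columns on a second pass. The method names are made up for illustration, not charlton's API:

```python
import numpy as np

class Center:
    """Stateful centering transform: one memorize pass to learn the
    mean, then a transform pass to apply it, chunk by chunk."""
    def __init__(self):
        self._sum = 0.0
        self._count = 0
        self._mean = None

    def memorize_chunk(self, x):      # pass 1: accumulate statistics
        x = np.asarray(x, dtype=float)
        self._sum += x.sum()
        self._count += x.size

    def memorize_finish(self):
        self._mean = self._sum / self._count

    def transform(self, x):           # pass 2: build the column
        return np.asarray(x, dtype=float) - self._mean

# The whole data set never has to be in memory at once:
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0]]
c = Center()
for chunk in chunks:
    c.memorize_chunk(chunk)
c.memorize_finish()
col = np.concatenate([c.transform(chunk) for chunk in chunks])
```

A transform like the spline(center(x), dof=5) example would simply report that it needs more memorize passes, with each pass consuming the output of the previous stage.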

Anyway, Charlton handles all of the above issues.

Here's how we interpret a ModelDesc. I'll try to be formal enough to
let you reconstruct the real formal definitions, but not too formal to
follow... maybe this is a dumb idea, but let's see. Conceptually, each
Factor represents the linear subspace you get by evaluating that
factor and taking the span of the resulting column(s). Categorical
factors represent the linear subspace you get from just dummy-coding
all the different levels; for n levels, this is an n dimensional
subspace. Numerical factors are taken as-is. A Term represents the
subspace you get by taking the span of the pairwise products of the
basis vectors for each of the Factors in that term (there's probably a
fancier mathematical name for this -- it's just the standard
interaction-by-pairwise-products thing). (Also, this is why the empty
Term represents the intercept; an empty product is 1.) A set of Terms
(e.g., the RHS of a formula) represents the smallest linear
subspace that contains all of the subspaces for all of the Terms
(i.e., the one you get by taking the union of all the basis sets).

But, we try to pick a "nice" basis set for this space -- one where
individual columns of our design matrix (i.e., the basis vectors)
correspond in a meaningful way to individual terms. Since our terms
can be redundant (e.g., "a" defines a subspace of "a:b", and the
intercept term defines a subspace of every categorical term), we need
some cleverness to detect this redundancy and remove it by applying
contrast coding instead of dummy coding at appropriate places. R is
actually buggy in this regard -- if you set up weird enough model
formulas, you can trick it into producing an over-complete or
under-complete basis. The nasty cases are around numerical factors,
and how to handle cases of partial redundancy like "a + a:b" or "1 +
a:b" as opposed to "1 + a + b + a:b". In practice these buggy cases
are pretty rare, and R's source code even contains at least one
special-case hack to "fix up" one of the places where the underlying
algorithm would get it wrong.

So this is one of the annoying places where I couldn't just steal from
R; I had to make up a new algorithm that fixes all these problems.
AFAICT, charlton produces a correct coding for every possible
ModelDesc object. (AFAICT = "I have a proof, but I haven't looked at
it in a while and no-one else has checked it anyway".)

So, here's how the ModelDesc+data -> ModelSpec code works:
First, we go through all the different factors, and figure out how
many passes through the data they need to "memorize" it (i.e., compute
the mean, figure out what categorical levels exist, etc.). Then we
take that many passes through the data, and sort the factors into
numerical versus categorical based on their observed behavior. Next,
we apply my fancy algorithm to figure out what coding scheme should be
used for each term. And then we stash all this stuff inside the
ModelSpec object.

Oh, we make sure to compute nice human-readable names for each of the
final set of columns. That goes in the ModelSpec too.

----------------

AFAIK, the way it works in the sympy-formula approach is that you
basically just write the function that implements the ModelSpec by
hand, except instead of writing it as a Python function, you write it
as a sympy function. If you want a particular coding scheme, or
centering, etc., you write it by hand or call a library function? I
won't say too much, because I never knew the details and haven't even
looked at the new ANCOVA classes or anything, but hopefully it's clear
how the R approach is a little more complicated than just having a
nice parser on the front-end :-).

And it's a good excuse to get some of these details out of my head and
into a more public form :-)

-- Nathaniel

Skipper Seabold

Jul 18, 2011, 9:20:48 AM
to pystat...@googlegroups.com, jonatha...@stanford.edu, Fernando Perez
On Sun, Jul 17, 2011 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Sat, Jul 16, 2011 at 1:00 PM, Wes McKinney <wesm...@gmail.com> wrote:
>> We're leaning toward JT's sympy-based approach but I think we need to
>> make a careful evaluation. Point 2) above (the parser) has to happen
>> regardless...Nathaniel's already done it in charlton to emulate R's
>> formula framework but it could certainly be adapted to Jonathan's
>> formula package. I think it's worth providing a parse_r_formula
>> function (with all the warts of the R formula) to lessen the cognitive
>> dissonance for R users coming to statsmodels.
>
> Parsing an R-like formula language isn't totally trivial, but... it's
> close. It'd be a good weekly homework problem for an undergrad course.
> I can't see how just dropping the parser on top of another formula
> library would work, since 95% of the complexity in handling R-style
> formulas comes after the parser.
>

Why not? I'm actually asking. The way I see it, if another formula
framework is implemented, then it's just a matter of hooking the
formula into the other framework. I.e., get a formula string,
tokenize, and call out to the underlying formula framework / data
structure.
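To make the "tokenize and hand off" idea concrete, here is a toy sketch (not charlton's or formula's actual parser) that splits an R-like string into a response and a list of term strings, which some underlying formula framework would then interpret:

```python
import re

# Token pattern: identifiers (dots allowed, R-style) and the formula
# operators. Purely illustrative.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*|[~+:*()]")

def tokenize(formula):
    """Break an R-like formula string into tokens."""
    return TOKEN.findall(formula)

def split_formula(formula):
    """Naive split into (response, term strings); no precedence,
    no '*' expansion, no function calls -- just the hand-off shape."""
    lhs, rhs = formula.split("~")
    terms = [t.strip() for t in rhs.split("+")]
    return lhs.strip(), terms
```

So `split_formula("y ~ a + b + a:b")` yields `("y", ["a", "b", "a:b"])`, and the real work (coding, redundancy detection, etc.) happens in whatever framework receives those terms, which is exactly where the disagreement lies.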

> AFAICT, at the high level, a formula library has the following interface:
>
> A formula is basically a function that takes data and returns a design
> matrix + metadata
>
> In Charlton, this object is called a 'ModelSpec', which is a class
> with a single method 'make_matrices', which takes a data dictionary
> (or table, or whatever), and returns two 'ModelMatrix' objects (one
> for the response, and one for the predictors). A ModelMatrix is just
> an ndarray with an extra '.column_info' attribute, which gives names
> to sets of columns, so if the user has a factor called 'a', then we
> can say that it was coded into columns 1-3.
>
> This interface, or some tweaked version of it, seems like it could
> support any formula framework we want to use -- the only thing
> statsmodels cares about is that there's some standard way to build
> design matrices and to access the metadata, right?
>

To my mind yes, but I can be convinced otherwise. But we have started
integrating with the pandas data structures. So I think we are going
to want to go from formula -> pandas data structures -> ndarrays. Wes?
I also have some vague notion that a full blown symbolic math
implementation could come in handy down the road.

> (Though! One thing to watch out for is that you really do want the
> metadata to be rich enough to support both a single name referring to
> multiple columns, and multiple overlapping names, so that we know "a"
> is columns 1-3, and simultaneously "a[level1]" is column 1,
> "a[level2]" is column 2, etc. Is the stuff you guys are doing flexible
> enough for this?)

I think we can use Formula's Factor class + DataFrame for this. There
was also more talk this week at SciPy about your enumerated dtype.

>
> The interesting part of a formula framework, and where people might
> disagree, is in the interface that the user uses to set up this
> 'ModelSpec' (or whatever) object.
>

Okay, I really want to avoid this discussion going on at length in the
interest of finding some common ground so we can just do this already,
but I will make a few brief points.

Exactly!

Up to here sounds almost identical to Formula, though sub in Formula
for ModelDesc. The added bonus of Formula though is that people are
using sympy already. It's vetted, general, and transparent.

Couldn't this all be handled symbolically and possibly more efficiently?

That's always been clear I think, and I hope we're not oversimplifying
it. What's not clear yet is that your way is really all that different
from using sympy. What I'm hearing is that your code handles use
cases, which is great, but anything we end up with should handle use
cases. This was also the gist of the 'novel' thread, rereading it last
week: it was a lot of 'what would you do in this case' rather than 'we
should avoid/use sympy/formula/charlton because of...', which is the
real issue right now.

Given that we've already committed to having the pandas dependency, I
think what I would like to see is a bringing together of your work up
to the ModelDesc + pandas + Formula. Does this sound reasonable? (I
sincerely hope so) Wes does that sound right to you?

>
> And it's a good excuse to get some of these details out of my head and
> into a more public form :-)
>

Basically, here's where I am right now (and Wes might be further away
from this position in that he's already getting it done). This *needs*
to get done. The path of least resistance is the way that I'd be more
willing to entertain. I.e., well-documented and tested code that
_someone is willing to own_, inside or outside of our code and support
base, until it's clear that the kinks are worked out.

Skipper

Wes McKinney

Jul 18, 2011, 10:26:37 AM
to pystat...@googlegroups.com
On Mon, Jul 18, 2011 at 9:20 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Sun, Jul 17, 2011 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> On Sat, Jul 16, 2011 at 1:00 PM, Wes McKinney <wesm...@gmail.com> wrote:
>>> We're leaning toward JT's sympy-based approach but I think we need to
>>> make a careful evaluation. Point 2) above (the parser) has to happen
>>> regardless...Nathaniel's already done it in charlton to emulate R's
>>> formula framework but it could certainly be adapted to Jonathan's
>>> formula package. I think it's worth providing a parse_r_formula
>>> function (with all the warts of the R formula) to lessen the cognitive
>>> dissonance for R users coming to statsmodels.
>>
>> Parsing an R-like formula language isn't totally trivial, but... it's
>> close. It'd be a good weekly homework problem for an undergrad course.
>> I can't see how just dropping the parser on top of another formula
>> library would work, since 95% of the complexity in handling R-style
>> formulas comes after the parser.
>>
>
> Why not? I'm actually asking. The way I see it, if another formula
> framework is implemented, then it's just a matter of hooking the
> formula into the other framework. Ie., get a formula string, tokenize,
> call out to underlying formula framework / data structure.

I worry about inextricably tying ourselves to the way that R's
formulas work, warts and all. But that's a discussion for another
time.

I'm also not sure how difficult this will be (though I fully intend to
head down this route). The basic issue is that some of the wonkiness
of the R formulas will need to be emulated. For example, if you have:

a ~ b * c

and b and c are factors, this is really

a ~ b + c + b:c

using Jonathan's ANCOVA class, once you've created Factor terms for b
and c (requires the data of course), you do:

ANCOVA(b, c, (1, (b, c)))

In [14]: a
Out[14]: Factor('a', ['one', 'two', 'three'])

In [15]: b
Out[15]: Factor('b', ['foo', 'bar'])

In [16]: ANCOVA(a, b, (1, (a, b))).formula
Out[16]: Formula([1, b_foo, a_three, a_two, a_three*b_foo, a_two*b_foo])

I think there is a generic mechanism for doing stuff like:

log(y) ~ a + b + exp(c) + sqrt(d)

but I haven't gotten that far yet.
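The `b * c` desugaring itself is mechanical. A toy string-level sketch (names illustrative, not any package's real API) of expanding `*` terms into main effects plus interactions:

```python
import itertools

def expand_star(rhs):
    """Expand every 'x * y' term in an R-like RHS string into the
    main effects plus the interaction, i.e. 'b * c' -> b, c, b:c."""
    terms = []
    for term in (t.strip() for t in rhs.split("+")):
        if "*" in term:
            factors = [f.strip() for f in term.split("*")]
            # all non-empty subsets of the factors, smallest first
            for r in range(1, len(factors) + 1):
                for combo in itertools.combinations(factors, r):
                    terms.append(":".join(combo))
        else:
            terms.append(term)
    return terms
```

So `expand_star("b * c")` gives `["b", "c", "b:c"]`. The wonkiness Wes worries about is everything after this step: how those terms get coded given which factors are categorical.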

>> AFAICT, at the high level, a formula library has the following interface:
>>
>> A formula is basically a function that takes data and returns a design
>> matrix + metadata
>>
>> In Charlton, this object is called a 'ModelSpec', which is a class
>> with a single method 'make_matrices', which takes a data dictionary
>> (or table, or whatever), and returns two 'ModelMatrix' objects (one
>> for the response, and one for the predictors). A ModelMatrix is just
>> an ndarray with an extra '.column_info' attribute, which gives names
>> to sets of columns, so if the user has a factor called 'a', then we
>> can say that it was coded into columns 1-3.
>>
>> This interface, or some tweaked version of it, seems like it could
>> support any formula framework we want to use -- the only thing
>> statsmodels cares about is that there's some standard way to build
>> design matrices and to access the metadata, right?
>>
>
> To my mind yes, but I can be convinced otherwise. But we have started
> integrating with the pandas data structures. So I think we are going
> to want to go from formula -> pandas data structures -> ndarrays. Wes?
> I also have some vague notion that a full blown symbolic math
> implementation could come in handy down the road.

Yes, things will need to be modified so that

formula + DataFrame -> DataFrame

Right now jonathan-taylor/formula uses record arrays, but it's very
easy to adapt to DataFrame. An example I've been playing with looks
like:
terms = fromrec(recs)
race = terms['race']
edu = terms['edu']
smoke = terms['smoke']
ancova = ANCOVA(race, edu, (1, (race, edu)))
formula = ancova.formula + smoke
design = DataFrame.from_records(formula.design(recs))

In [60]: design.info()
<class 'pandas.core.frame.DataFrame'>
Index: 314 entries, 0 to 313
Data columns:
1 314 non-null values
edu_ltCollege*race_EA 314 non-null values
edu_ltCollege*race_Other 314 non-null values
edu_ltHS*race_EA 314 non-null values
edu_ltHS*race_Other 314 non-null values
edu_ltCollege 314 non-null values
edu_ltHS 314 non-null values
race_EA 314 non-null values
race_Other 314 non-null values
smoke 314 non-null values
dtypes: float64(10)

Of course, if you pass a record array to a formula you should probably
still get a record array out.

>> (Though! One thing to watch out for is that you really do want the
>> metadata to be rich enough to support both a single name referring to
>> multiple columns, and multiple overlapping names, so that we know "a"
>> is columns 1-3, and simultaneously "a[level1]" is column 1,
>> "a[level2]" is column 2, etc. Is the stuff you guys are doing flexible
>> enough for this?)
>
> I think we can use Formula's Factor class + DataFrame for this. There
> was also more talk this week at SciPy about your enumerated dtype.

so you're saying something like

y ~ x + a['foo'] ?

I wonder also about things like:

y ~ x + I(foo > 1)

but I() is nothing more than a pass-through function in R, so it just
enables the parser to treat the whole expression as a single predictor
(here, an indicator).
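Since I() is just an identity function whose real job is to shield the expression from the formula operators, an eval-based factor (in the spirit of charlton's EvalFactor; the setup here is a hypothetical sketch) can support it almost for free:

```python
import numpy as np

def I(x):
    """Identity: exists only so the parser treats the wrapped
    expression as one predictor instead of formula syntax."""
    return x

data = {"foo": np.array([0.5, 1.5, 2.0])}
# Evaluating the factor string "I(foo > 1)" against the data,
# the way an eval-based factor might:
col = eval("I(foo > 1)", {"I": I}, data)
# col is a boolean indicator column
```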

Yes, my goals are a) pandas integration and b) formula integration
with pandas and nice user interface functions in statsmodels.
Basically a hybridization of charlton + sympy-based Formula, with
emphasis on capturing all the use cases while avoiding wonkiness,
would be ideal. Obviously significant effort will need to be invested
in testing so we have a catalog of all the use cases we expect to
support (and ones we don't) and results from a trusted source (R). I
am rather liking the sympy approach, strikes me as less ad hoc so far.
Need to dig in more though.

>>
>> And it's a good excuse to get some of these details out of my head and
>> into a more public form :-)
>>
>
> Basically, here's where I am right now (and Wes might be further away
> from this position in that it's already getting it done). This *needs*
> to get done. The path of least resistance is the way that I'd be more
> willing to entertain. Ie., well-documented, and tested code that
> _someone is willing to own_ inside or outside of our code and support
> base until it's clear that the kinks are worked out.

I'd be comfortable having some degree of code ownership but having
involvement of you, Jonathan, and other more qualified people
(meaning: more experience working with the features of R that I've
done less with) is pretty important I think.

> Skipper
>

Nathaniel Smith

Jul 18, 2011, 12:52:06 PM
to pystat...@googlegroups.com, jonatha...@stanford.edu, Fernando Perez
On Mon, Jul 18, 2011 at 6:20 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Sun, Jul 17, 2011 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> Parsing an R-like formula language isn't totally trivial, but... it's
>> close. It'd be a good weekly homework problem for an undergrad course.
>> I can't see how just dropping the parser on top of another formula
>> library would work, since 95% of the complexity in handling R-style
>> formulas comes after the parser.
>>
>
> Why not? I'm actually asking. The way I see it, if another formula
> framework is implemented, then it's just a matter of hooking the
> formula into the other framework. Ie., get a formula string, tokenize,
> call out to underlying formula framework / data structure.

Well, apparently there are two totally different sympy Formula
frameworks out there :-). My impression from last time Jonathan and I
talked was, his *main* objection to an R-style formula library was
that it took this high-level description in terms of sets of
interactions, and then designed your coding system for you -- his
formula library didn't do that at all. So with that system, there'd be
nothing to call out *to*... literally the only overlap between the two
frameworks would be at the level I described, where you have some
opaque object that turns data into design matrices.

This new ANCOVA thing is perhaps more comparable; I don't know how it
works internally at all.

>> (Though! One thing to watch out for is that you really do want the
>> metadata to be rich enough to support both a single name referring to
>> multiple columns, and multiple overlapping names, so that we know "a"
>> is columns 1-3, and simultaneously "a[level1]" is column 1,
>> "a[level2]" is column 2, etc. Is the stuff you guys are doing flexible
>> enough for this?)
>
> I think we can use Formula's Factor class + DataFrame for this. There
> was also more talk this week at SciPy about your enumerated dtype.

I think I used a confusing example, sorry. "a[level2]" there is
supposed to be an arbitrary string naming a column, not a lookup
operation.

Here's a really nasty example to hopefully explain better. Say "a" is
a categorical factor with levels "foo", "bar", "quux", and "x" is
numeric. The formula "~ a + spline(x, 3) + a:spline(x, 3)" then gives
us a design matrix with 12 columns:
column 1 is the intercept
column 2 is treatment coding for a == "bar"
column 3 is treatment coding for a == "quux"
columns 4-6 are spline transformed versions of x
columns 7-9 are spline transformed versions of x, for a == "bar"
columns 10-12 are spline transformed versions of x, for a == "quux"

Here are some things we need to know:
-- a user-friendly name for each column. E.g., right now, charlton
generates names like:
column 2: "a[T.bar]"
column 6: "spline(x, 3)[2]"
column 10: "a[T.quux]:spline(x, 3)[0]"
(The T means treatment coding; I stole this particular naming
scheme from John Fox's 'car' package.)
-- which *sets* of columns correspond to which terms. Right now,
charlton *also* generates names like:
columns 2-3: "a"
columns 4-6: "spline(x, 3)"
columns 7-12: "a:spline(x, 3)"

This second set of information is important; you can't do an ANOVA
without it. (We might want even more metadata than this, but we need
this at a minimum.)

So that's my concern -- AFAIK DataFrame can't represent columns that
have multiple names simultaneously?
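To make that concrete, here's a toy sketch of the kind of two-level metadata I mean: per-column names plus term-to-column-span bookkeeping. (The names `column_names`, `term_slices`, and `columns_for_term` are invented for illustration; this is not charlton's actual API.)

```python
import numpy as np

# Per-column, user-friendly names for the 12-column example above
column_names = (
    ["Intercept", "a[T.bar]", "a[T.quux]"]
    + ["spline(x, 3)[%d]" % i for i in range(3)]
    + ["a[T.bar]:spline(x, 3)[%d]" % i for i in range(3)]
    + ["a[T.quux]:spline(x, 3)[%d]" % i for i in range(3)]
)

# Term-level metadata: which *sets* of columns belong to which term
# (0-based, half-open slices)
term_slices = {
    "Intercept": slice(0, 1),
    "a": slice(1, 3),
    "spline(x, 3)": slice(3, 6),
    "a:spline(x, 3)": slice(6, 12),
}

def columns_for_term(design, term):
    """Pull out the sub-matrix for one term, as an ANOVA would need."""
    return design[:, term_slices[term]]

design = np.arange(5 * 12).reshape(5, 12)  # stand-in design matrix
assert len(column_names) == 12
assert columns_for_term(design, "a").shape == (5, 2)
assert columns_for_term(design, "a:spline(x, 3)").shape == (5, 6)
```

The second mapping is the part a flat column index can't carry on its own.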

>> The interesting part of a formula framework, and where people might
>> disagree, is in the interface that the user uses to set up this
>> 'ModelSpec' (or whatever) object.
>>
>
> Okay, I really want to avoid this discussion going on at length in the
> interest of finding some common ground so we can just do this already,
> but I will make a few brief points.

Yes, me too.

>> Also, if you're trying to fit a model that doesn't want to go via a
>> design matrix (I guess e.g. tree-based classifiers fall into this?),
>> then you can stop here and handle the ModelDesc using your own code,
>> without ever constructing a ModelSpec; that lets us use a standard
>> interface at the user level (formula strings and ModelDesc objects)
>> for such models, even though the implementation is different
>> underneath.
>>
>
> Exactly!
>
> Up to here sounds almost identical to Formula, though sub in Formula
> for ModelDesc. The added bonus of Formula though is that people are
> using sympy already. It's vetted, general, and transparent.

That's... not my understanding of Formula at all. How do you read out
which factors are in some interaction, from a Formula object? Maybe
you're thinking of the ANCOVA-style Formulas?

>> Anyway, Charlton handles all of the above issues.
>
> Couldn't this all be handled symbolically and possibly more efficiently?

Well, it is handled symbolically, in the sense that charlton is just
working with Factor and Term objects. But that has nothing to do with
the kind of symbolic algebra problems that sympy solves; these are
digital computers, after all, most code is pretty symbolic :-).

I honestly have no idea how sympy could help with any of these issues
at all. This isn't a bash on sympy, sympy seems awesome for what it
does, I just... already wrote the code in the best way I could think
of, and I didn't see any places where sympy would be relevant. Could
you explain what you mean? I especially don't see how using sympy to
perform calculations could be more efficient than using native Python
to perform those same calculations...?

>> AFAIK, the way it works in the sympy-formula approach is, you
>> basically just write the function that implements the ModelSpec by
>> hand, except instead of writing it as a Python function, you write it
>> as a sympy function. If you want a particular coding scheme, or
>> centering, etc., you write it by hand or call a library function? I
won't say too much, because I never knew the details and haven't even
looked at the new ANCOVA classes or anything, but hopefully it's clear
>> how the R approach is a little more complicated than just having a
>> nice parser at the front-end :-).
>
> That's always been clear I think, and I hope we're not oversimplifying
> it. What's not clear yet is that your way is really all that different
> from using sympy. What I'm hearing is that your code handles use
> cases, which is great, but anything we end up with should handle use
> cases. This was also the gist of the 'novel' thread after rereading it
> last week. It was a lot of 'what would you do in this case' rather
> than, we should avoid/use sympy/formula/charlton because of...which is
> the real issue right now.

As far as I understand it, the rationale for using sympy was that
it gave a terse syntax for writing your data->design function, i.e.,
you could write
  x, z = sympy.symbols("x z")
  1 + x + sympy.log(z) + x**2
instead of
  def my_formula(d):
      return np.column_stack((np.ones(len(d["x"])), d["x"],
                              np.log(d["z"]), d["x"] ** 2))

However, this has inherent limitations on expressivity (you can only
use 'sympy-compatible' transformations like sympy.log), it's much
lower-level than R style formulas (no automatic coding, etc.), and
it's still more clumsy to write than a string that you hand to a
parser (you have to set up those vars). So if we're going to have a
parser, then that solves all of the problems that we were trying to
solve with sympy, with more expressivity and more convenience.
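Just so we're comparing the same thing, here's roughly what that symbolic route looks like end to end, as a sketch (`sympy.symbols` and `sympy.lambdify` are real API; the broadcasting glue at the end is something one would have to add):

```python
import numpy as np
import sympy

# The symbolic route: declare symbols and build the term expressions...
x, z = sympy.symbols("x z")
terms = [sympy.Integer(1), x, sympy.log(z), x**2]

# ...then compile them into a numpy-evaluable function.
f = sympy.lambdify((x, z), terms, modules="numpy")

xv = np.array([1.0, 2.0, 3.0])
zv = np.array([1.0, np.e, np.e ** 2])

# broadcast the constant term up to full length before stacking
design = np.column_stack(np.broadcast_arrays(*f(xv, zv)))
assert design.shape == (3, 4)
assert np.allclose(design[:, 2], [0.0, 1.0, 2.0])  # the log(z) column
```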

I'm not trying to re-start that argument -- I'm sure there are reasons
you guys think sympy is a better solution. I just don't know what
those reasons are, so I can't really comment on them.

Charlton does work and solve all of the problems I described in my
original email, right now. It needs docs, it needs some tests for the
last big hairy ModelSpec creation function (but overall test coverage
is 86%, will be more like 95ish once those tests are written), and it
needs some general poking at and tweaking to file off rough edges. But
it is a complete framework for formulas.

> Given that we've already committed to having the pandas dependency, I
> think what I would like to see is a bringing together of your work up
> to the ModelDesc + pandas + Formula. Does this sound reasonable? (I
sincerely hope so.) Wes, does that sound right to you?

You're certainly welcome to use any of my work you like. I just don't
understand this hybrid framework you have in mind well enough to
comment on its reasonableness.

>> And it's a good excuse to get some of these details out of my head and
>> into a more public form :-)
>
> Basically, here's where I am right now (and Wes might be further away
> from this position in that it's already getting it done). This *needs*
> to get done. The path of least resistance is the way that I'd be more
> willing to entertain. Ie., well-documented, and tested code that
> _someone is willing to own_ inside or outside of our code and support
> base until it's clear that the kinks are worked out.

Makes sense. I finished writing my thesis a few days ago, so now I'm a
little less pressed for time... but even so, I'm going to be in Boston
for a week starting tomorrow for a conference, and then probably
moving to Germany in a few months, so... I probably shouldn't make any
guarantees about how much support I can give. If I do get my talk done
today, though, then I can try writing some docs on the plane tomorrow
:-)

-- Nathaniel

Christopher Jordan-Squire

unread,
Jul 18, 2011, 2:36:24 PM7/18/11
to pystat...@googlegroups.com

I think you can get away with ANCOVA(b,c,(b,c)).

Yay! Testing!

To make sure I understand, with this hybridization we'd make it so the user didn't have to type in all these terms('x') and terms('y') sorts of things, right? Just enter a string a la 'y~x+I(x^2)' and things in the background take care of the rest?

>>
>> And it's a good excuse to get some of these details out of my head and
>> into a more public form :-)
>>
>
> Basically, here's where I am right now (and Wes might be further away
> from this position in that it's already getting it done). This *needs*
> to get done. The path of least resistance is the way that I'd be more
> willing to entertain. Ie., well-documented, and tested code that
> _someone is willing to own_ inside or outside of our code and support
> base until it's clear that the kinks are worked out.

I'd be comfortable having some degree of code ownership but having
involvement of you, Jonathan, and other more qualified people
(meaning: more experience working with the features of R that I've
done less with) is pretty important I think.

Are you thinking of any specific features of R?

Overall, I agree with Wes and Skipper. Getting pandas integration with statsmodels and getting some sort of formula framework both into statsmodels is really necessary. I'm not as sold on the sympy stuff as they are, but it looks solid and does leave the door open for better model fitting algorithms based on optimization down the road. Also better ways at getting at stuff like information matrices (or so Skipper has assured me).

-Chris JS


> Skipper
>

Wes McKinney

unread,
Jul 18, 2011, 2:40:28 PM7/18/11
to pystat...@googlegroups.com

I'm speaking about hybridization of the internals. The user will have
the option of doing the full blown symbolic business or using the
string formula parser.

>>
>> >>
>> >> And it's a good excuse to get some of these details out of my head and
>> >> into a more public form :-)
>> >>
>> >
>> > Basically, here's where I am right now (and Wes might be further away
>> > from this position in that it's already getting it done). This *needs*
>> > to get done. The path of least resistance is the way that I'd be more
>> > willing to entertain. Ie., well-documented, and tested code that
>> > _someone is willing to own_ inside or outside of our code and support
>> > base until it's clear that the kinks are worked out.
>>
>> I'd be comfortable having some degree of code ownership but having
>> involvement of you, Jonathan, and other more qualified people
>> (meaning: more experience working with the features of R that I've
>> done less with) is pretty important I think.
>>
> Are you thinking of any specific features of R?
>
> Overall, I agree with Wes and Skipper. Getting pandas integration with
> statsmodels and getting some sort of formula framework both into statsmodels
> is really necessary. I'm not as sold on the sympy stuff as they are, but it
> looks solid and does leave the door open for better model fitting algorithms
> based on optimization down the road. Also better ways at getting at stuff
> like information matrices (or so Skipper has assured me).

The symbolic framework will make things like R's step method for model
selection very easy to implement since you can just do:

new_formula = formula - term_to_omit
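That is, something like this toy sketch (a made-up `Formula` class, just to illustrate the set-difference idea, not the actual sympy or statsmodels one):

```python
# Minimal sketch: if a formula is essentially a set of terms, then
# "formula - term" is just set difference, and backward stepwise
# selection reduces to looping over candidate terms to drop.
class Formula:
    def __init__(self, terms):
        self.terms = frozenset(terms)

    def __sub__(self, term):
        return Formula(self.terms - {term})

    def __repr__(self):
        return " + ".join(sorted(self.terms))

full = Formula(["x1", "x2", "x1:x2"])
reduced = full - "x1:x2"
assert reduced.terms == frozenset(["x1", "x2"])
assert repr(reduced) == "x1 + x2"
```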

> -Chris JS
>
>
>
>>
>> > Skipper
>> >
>
>

Christopher Jordan-Squire

unread,
Jul 18, 2011, 3:04:07 PM7/18/11
to pystat...@googlegroups.com

What do you mean by spline(x,3)? That exact command throws errors in R.


Hmmmm.....that's a good point. How do we handle it if someone wants to use a function that isn't covered by anything in sympy? Will that affect anything? There are a number of functions used in statistics that aren't much used anywhere else. (Like logit, as mentioned in the previous epic thread.) I looked at the sympy codebase, and it appears there isn't a simple command to take an arbitrary function and make it into a sympy function. (Not that there could be one, anyway, since that would make it impossible to know the derivatives/anti-derivatives of the function.)

Does someone more familiar with the formula codebase have any idea how much of a problem that will be, if at all?


Wes McKinney

unread,
Jul 18, 2011, 3:15:52 PM7/18/11
to pystat...@googlegroups.com

I haven't resolved how taylor/formula deals with this yet but it seems
easy (in general) to support any function available in the enclosing
namespace (or in numpy.* for example)-- so if it can't do it now I'll
make sure that it can. There will be a few other special functions
like "factor" I guess.
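Something like this hypothetical lookup helper, say (`resolve` is invented for illustration, not an existing statsmodels/pandas API):

```python
import numpy as np

# Hypothetical helper: look up a function named in a formula string,
# first in the user's (enclosing) namespace, then falling back to numpy.
def resolve(name, user_namespace):
    if name in user_namespace:
        return user_namespace[name]
    if hasattr(np, name):
        return getattr(np, name)
    raise NameError("unknown formula function: %r" % name)

def logit(p):
    return np.log(p / (1 - p))

# a user-defined function wins; unknown names fall through to numpy
assert resolve("logit", {"logit": logit}) is logit
assert resolve("log", {}) is np.log
```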

Nathaniel Smith

unread,
Jul 18, 2011, 3:17:46 PM7/18/11
to pystat...@googlegroups.com
On Mon, Jul 18, 2011 at 11:36 AM, Christopher Jordan-Squire
<chris.jor...@gmail.com> wrote:
[snip]

> I'm not as sold on the sympy stuff as they are, but it
> looks solid and does leave the door open for better model fitting algorithms
> based on optimization down the road.

I think this is based on a misconception. In an R-style formula, "+"
does not mean addition, it means a kind of set union. Having better
tools to compute derivatives in general would be awesome, and seems
like a great use of sympy to me. But this is totally orthogonal to
R-style model specification. I mean, what's the derivative of a set
union? R-style formulas just aren't useful for describing arbitrary
non-linear models.
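In other words (a toy sketch, just to illustrate the union semantics):

```python
# In an R-style formula, "+" is set union over terms, not arithmetic,
# so repeated terms collapse instead of accumulating.
def parse_terms(formula):
    return frozenset(t.strip() for t in formula.split("+"))

# "a + b + a" and "a + b" describe the same model
assert parse_terms("a + b + a") == parse_terms("a + b")
# whereas numeric addition obviously does not behave this way
assert 1 + 2 + 1 != 1 + 2
```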

One could have a different sort of formula framework that used sympy
to describe arbitrary non-linear models and then differentiated them,
and such a thing might well be useful internally to statsmodels in
addition to an R-style formula system... but they'd be totally
separate pieces of code, AFAICT.

(R does have some symbolic differentiation code -- "D()", "deriv()" --
that takes arbitrary R code and tries to compute its derivative. And
one of the input formats they accept is an R formula object. But this
is just abusing R's "~" operator as a convenient way to quote
arbitrary R code and then get at its syntax tree -- the equivalent in
Python would be to use the 'ast' module to parse some Python code and
then try to differentiate it. The 'nls' function does a similar thing
-- it interprets formulas using different rules than every other R
model, with the basic operators like "+" and "*" redefined to have
totally different meanings. These usages are basically unrelated to
the use of formulas for specifying linear models, and are pretty
confusing IMO, not something I'd want to copy.)

-- Nathaniel

Nathaniel Smith

unread,
Jul 18, 2011, 3:23:57 PM7/18/11
to pystat...@googlegroups.com
On Mon, Jul 18, 2011 at 12:04 PM, Christopher Jordan-Squire
<chris.jor...@gmail.com> wrote:
> On Mon, Jul 18, 2011 at 11:52 AM, Nathaniel Smith <n...@pobox.com> wrote:
>> Here's a really nasty example to hopefully explain better. Say "a" is
>> a categorical factor with levels "foo", "bar", "quux", and "x" is
>> numeric. The formula "~ a + spline(x, 3) + a:spline(x, 3)" then gives
>> us a design matrix with 12 columns:
>>  column 1 is the intercept
>>  column 2 is treatment coding for a == "bar"
>>  column 3 is treatment coding for a == "quux"
>>  columns 4-6 are spline transformed versions of x
>>  columns 7-9 are spline transformed versions of x, for a == "bar"
>>  columns 10-12 are spline transformed versions of x, for a == "quux"
>>
>
> What do you mean by spline(x,3)? That exact command throws errors in R.

In R I mean:
library(splines)
lm(y ~ a + bs(x, 3) + a:bs(x, 3))

I just thought people would have an easier time following if I said
'spline' instead of the rather cryptic 'bs' :-)

Maybe I should have used poly() as my example. I just like splines
better than polynomials...

-- Nathaniel

Christopher Jordan-Squire

unread,
Jul 18, 2011, 3:39:12 PM7/18/11
to pystat...@googlegroups.com

You're right, it's not something that makes sense when thinking about R's syntax. I meant internally, after the string is parsed. Not something that's integral to the formula stuff being discussed now, just something that could be dealt with later in a slightly easier manner.

Nathaniel Smith

unread,
Jul 18, 2011, 4:27:24 PM7/18/11
to pystat...@googlegroups.com
On Mon, Jul 18, 2011 at 12:39 PM, Christopher Jordan-Squire
<chris.jor...@gmail.com> wrote:
> You're right, it's not something that makes sense when thinking about R's
> syntax. I meant internally, after the string is parsed. Not something that's
> integral to the formula stuff being discussed now, just something that could
> be dealt with later in a slightly easier manner.

Still not sure what you mean -- internally, after it's parsed, it's...
a bunch of sets. Then those sets are converted into a matrix according
to some complicated and domain-specific rules. There's no symbolic
algebra involved at any stage...

-- Nathaniel

Skipper Seabold

unread,
Jul 18, 2011, 5:08:02 PM7/18/11
to pystat...@googlegroups.com, jonatha...@stanford.edu, Fernando Perez
On Mon, Jul 18, 2011 at 12:52 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Mon, Jul 18, 2011 at 6:20 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Sun, Jul 17, 2011 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>> Parsing an R-like formula language isn't totally trivial, but... it's
>>> close. It'd be a good weekly homework problem for an undergrad course.
>>> I can't see how just dropping the parser on top of another formula
>>> library would work, since 95% of the complexity in handling R-style
>>> formulas comes after the parser.
>>>
>>
>> Why not? I'm actually asking. The way I see it, if another formula
>> framework is implemented, then it's just a matter of hooking the
>> formula into the other framework. Ie., get a formula string, tokenize,
>> call out to underlying formula framework / data structure.
>
> Well, apparently there are two totally different sympy Formula
> frameworks out there :-). My impression from last time Jonathan and I
> talked was, his *main* objection to an R-style formula library was
> that it took this high-level description in terms of sets of
> interactions, and then designed your coding system for you -- his
> formula library didn't do that at all. So with that system, there'd be
> nothing to call out *to*... literally the only overlap between the two
> frameworks would be at the level I described, where you have some
> opaque object that turns data into design matrices.
>

No I think you're right, but that's intended for internal use (or the
power user, though I don't know why you'd ever really want to do
this). What we're proposing is making it so that the user can
provide a string, which is parsed into the opaque object that turns
data into a design matrix.

> This new ANCOVA thing is perhaps more comparable; I don't know how it
> works internally at all.

Yes, I think this is right, though it's still not what I'd want, I
don't think, with the paired tuples of term objects. Still too much
work for interactive use.

Ah, ok. Does Wes' example answer this? Each interaction is a column
based on the factor levels IIUC.

>>> The interesting part of a formula framework, and where people might
>>> disagree, is in the interface that the user uses to set up this
>>> 'ModelSpec' (or whatever) object.
>>>
>>
>> Okay, I really want to avoid this discussion going on at length in the
>> interest of finding some common ground so we can just do this already,
>> but I will make a few brief points.
>
> Yes, me too.
>
>>> Also, if you're trying to fit a model that doesn't want to go via a
>>> design matrix (I guess e.g. tree-based classifiers fall into this?),
>>> then you can stop here and handle the ModelDesc using your own code,
>>> without ever constructing a ModelSpec; that lets us use a standard
>>> interface at the user level (formula strings and ModelDesc objects)
>>> for such models, even though the implementation is different
>>> underneath.
>>>
>>
>> Exactly!
>>
>> Up to here sounds almost identical to Formula, though sub in Formula
>> for ModelDesc. The added bonus of Formula though is that people are
>> using sympy already. It's vetted, general, and transparent.
>
> That's... not my understanding of Formula at all. How do you read out
> which factors are in some interaction, from a Formula object? Maybe
> you're thinking of the ANCOVA-style Formulas?

Yeah, I guess I am conflating the two because in the end I'd like them
to be conflated as much as possible. *Goes back to looking at your
code*

>
>>> Anyway, Charlton handles all of the above issues.
>>
>> Couldn't this all be handled symbolically and possibly more efficiently?
>
> Well, it is handled symbolically, in the sense that charlton is just
> working with Factor and Term objects. But that has nothing to do with
> the kind of symbolic algebra problems that sympy solves; these are
> digital computers, after all, most code is pretty symbolic :-).
>
> I honestly have no idea how sympy could help with any of these issues
> at all. This isn't a bash on sympy, sympy seems awesome for what it
> does, I just... already wrote the code in the best way I could think
> of, and I didn't see any places where sympy would be relevant. Could
> you explain what you mean? I especially don't see how using sympy to
> perform calculations could be more efficient than using native Python
> to perform those same calculations...?
>

I don't have a good idea about this honestly, and I'm more or less
playing devil's advocate though I'm glad to finally start talking
about merits of implementations. I was thinking mainly something like
computing marginal effects, symbolic derivatives, arbitrary
transformations using symbolic manipulations, and multiprecision
manipulations (though to what extent I don't know). Somewhat
orthogonal to the formula framework, but I was thinking to use the
same tool in both instances.

I agree here. I'm only thinking about under the hood after a parser,
which might be overkill. To what extent are sympy transformations
limiting though? We can define arbitrary functions with sympy.Function
IIUC.

> I'm not trying to re-start that argument -- I'm sure there are reasons
> you guys think sympy is a better solution. I just don't know what
> those reasons are, so I can't really comment on them.
>

I would be interested to hear some merits over your approach. I leaned
towards sympy partly for the documentation and partly from an
intuition that we could leverage symbolic manipulations elsewhere in
the code.

> Charlton does work and solve all of the problems I described in my
> original email, right now. It needs docs, it needs some tests for the
> last big hairy ModelSpec creation function (but overall test coverage
> is 86%, will be more like 95ish once those tests are written), and it
> needs some general poking at and tweaking to file off rough edges. But
> it is a complete framework for formulas.
>

This is good to know. Was going to ask where you thought it stood.

>> Given that we've already committed to having the pandas dependency, I
>> think what I would like to see is a bringing together of your work up
>> to the ModelDesc + pandas + Formula. Does this sound reasonable? (I
> sincerely hope so.) Wes, does that sound right to you?
>
> You're certainly welcome to use any of my work you like. I just don't
> understand this hybrid framework you have in mind well enough to
> comment on its reasonableness.
>
>>> And it's a good excuse to get some of these details out of my head and
>>> into a more public form :-)
>>
>> Basically, here's where I am right now (and Wes might be further away
>> from this position in that it's already getting it done). This *needs*
>> to get done. The path of least resistance is the way that I'd be more
>> willing to entertain. Ie., well-documented, and tested code that
>> _someone is willing to own_ inside or outside of our code and support
>> base until it's clear that the kinks are worked out.
>
> Makes sense. I finished writing my thesis a few days ago, so now I'm a
> little less pressed for time... but even so, I'm going to be in Boston
> for a week starting tomorrow for a conference, and then probably
> moving to Germany in a few months, so... I probably shouldn't make any
> guarantees about how much support I can give. If I do get my talk done
> today, though, then I can try writing some docs on the plane tomorrow
> :-)
>

Please do, and keep us updated if you find some time and on your
plans. Examples are very appreciated from my end.

Thanks,

Skipper

Skipper Seabold

unread,
Jul 18, 2011, 5:08:45 PM7/18/11
to pystat...@googlegroups.com

Let's say conjectured instead of assured ;)

> -Chris JS
>
>
>
>>
>> > Skipper
>> >
>
>

Skipper Seabold

unread,
Jul 18, 2011, 5:13:48 PM7/18/11
to pystat...@googlegroups.com

All fair points. I'm starting to come to some clarity of understanding
about your position.

Skipper

josef...@gmail.com

unread,
Jul 19, 2011, 8:37:16 AM7/19/11
to pystat...@googlegroups.com
I'm trying to catch up a bit after 4 days of internet outage.

I'm not quite sure where the positions are in this thread, but I would
like to emphasize two points that I think Nathaniel made earlier.

going formula -> pandas -> models will not work (or will do only
part of the job). If we have formulas, we need internally the meta
information about how the design matrix was built, what the columns mean and
which columns belong together. Otherwise, we cannot use it for
statistical tests, building or transforming contrast matrices and
calculating marginal effects and similar.

executing numpy code from sympy formulas gives too little control, and
hides too much. At least that was my impression until a year ago when
I looked into this the last time.

Personally, my opinion about formulas has gone down quite a bit,
especially since I saw all the ambiguities in the examples of "which
interaction effect" in AN(C)OVA. My preference is strictly, "explicit
is better than implicit", and I don't see much benefit in saving the
typing of 20 or 50 characters. (In terms of reproducibility, writing a
script with a few more lines is safer than typing shortcuts in an
interactive session.)
This means I care about getting the functions to create design
matrices, generate meta-information, generate contrast matrices and so
on, but I'm happy to leave any formulas to those that are familiar
with them.

Josef

Wes McKinney

unread,
Jul 19, 2011, 9:46:03 AM7/19/11
to pystat...@googlegroups.com
On Tue, Jul 19, 2011 at 8:37 AM, <josef...@gmail.com> wrote:
> I'm trying to catch up a bit after 4 days of internet outage.
>
> I'm not quite sure where the positions are in this thread, but I would
> like to emphasize two points that I think Nathaniel made earlier.
>
> going   formula -> pandas -> models   will not work (or will do only
> part of the job). If we have formulas, we need internally the meta
> information about how the design matrix was built, what the columns mean and
> which columns belong together. Otherwise, we cannot use it for
> statistical tests, building or transforming contrast matrices and
> calculating marginal effects and similar.

Yes, but this is not an all or nothing thing! Let's get part of the way
there (allow one to construct design matrices via formula) then step
by step integrate formula with other areas which will benefit (like
you mention).

> executing numpy code from sympy formulas gives too little control, and
> hides too much. At least that was my impression until a year ago when
> I looked into this the last time.
>
> Personally, my opinion about formulas has gone down quite a bit,
> especially since I saw all the ambiguities in the examples of "which
> interaction effect" in AN(C)OVA. My preference is strictly, "explicit
> is better than implicit", and I don't see much benefit in saving to
> type 20 or 50 characters. (In terms of reproducability writing a
> script with a few more lines is safer than typing shortcuts in an
> interactive session.)
> This means I care about getting the functions to create design
> matrices, generate meta-information, generate contrast matrices and so
> on, but I'm happy to leave any formulas to those that are familiar
> with them.

Well, I don't think you represent the plurality of users =P The
do-it-yourself approach with statsmodels thus far is not working for a lot
of people. I myself would love to have used Python to do some of my
statistics coursework but had to use R because the usability gap to do
fairly simple things (e.g. ANOVAs) was so extraordinary. For example,
here are some notes from Jonathan Taylor's applied stats course
(taught in R):

http://www.stanford.edu/class/stats191/interactions.html

It would be a useful exercise to work all of these examples
(regressions with interactions, anovas, etc.) in statsmodels--
currently quite difficult but with formula shouldn't be too bad.

> Josef
>

Christopher Jordan-Squire

unread,
Jul 19, 2011, 10:38:41 AM7/19/11
to pystat...@googlegroups.com
+1

> Josef
>





Matthew Brett

unread,
Jul 19, 2011, 1:35:38 PM7/19/11
to pystat...@googlegroups.com
Hi,

Of course, logit has to exist somewhere; it could be a little helper
function in the Formula namespace.

Arbitrary functions work with (Sympy 0.7.0):

from sympy.utilities.lambdify import implemented_function
f = implemented_function('f', lambda x: x + 1)
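And such a function then evaluates numerically through lambdify, e.g. (a sketch; `implemented_function` and `lambdify` are real sympy API, the logit wrapper here is just illustrative):

```python
import numpy as np
import sympy
from sympy.utilities.lambdify import implemented_function

# wrap a statistics function sympy doesn't know about symbolically,
# then evaluate it numerically through lambdify
logit = implemented_function('logit', lambda p: np.log(p / (1 - p)))

p = sympy.Symbol('p')
f = sympy.lambdify(p, logit(p), modules="numpy")
assert np.isclose(f(0.5), 0.0)  # logit(0.5) == 0
```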

See y'all,

Matthew

Christopher Jordan-Squire

unread,
Jul 19, 2011, 1:44:23 PM7/19/11
to pystat...@googlegroups.com

Ah. The lambdify function was what I was looking for. sympify wasn't doing what I wanted. Thanks.

-Chris JS

Wes McKinney

unread,
Jul 19, 2011, 1:44:29 PM7/19/11
to pystat...@googlegroups.com

It seems to me it might not be a bad idea to support both varieties of
formula frameworks. Charlton is, as far as I can tell, an R-formula
emulator. Using charlton to implement R formulas in statsmodels is
very low-hanging fruit (sans testing) from what I can tell. What would
be nice to write is a charlton-to-sympy-based-formula conversion
function. Of course I know the zen of python "there should be one, and
preferably only one, obvious way to do it" but here I think we're
providing tools that meet different needs / kinds of users.

As with most other things, I'm going to hack on it when I can find a
solid block of time and see what I come up with.

-W

Matthew Brett

unread,
Jul 20, 2011, 7:48:48 AM7/20/11
to pystat...@googlegroups.com
Hi,

On Tue, Jul 19, 2011 at 6:44 PM, Wes McKinney <wesm...@gmail.com> wrote:
> It seems to me it might not be a bad idea to support both varieties of
> formula frameworks. Charlton is, as far as I can tell, an R-formula
> emulator. Using charlton to implement R formulas in statsmodels is
> very low-hanging fruit (sans testing) from what I can tell. What would
> be nice to write is a charlton-to-sympy-based-formula conversion
> function. Of course I know the zen of python "there should be one, and
> preferably only one, obvious way to do it" but here I think we're
> providing tools that meet different needs / kinds of users.
>
> As with most other things, I'm going to hack on it when I can find a
> solid block of time and see what I come up with.

Oh dear oh dear, I had really wanted not to rejoin this discussion in earnest.

I believe that Jonathan is away somewhere without internet, so he can't join.

I want to make two main points.

1) I believe the best way will be to draw on the expertise and coding
time of all the _people_ involved here. In this case I mean
specifically Nathaniel and Jonathan. I believe that statsmodels will
be far better if both can feed into its design. To make that happen
will involve hard work, and long threads, as you see. I believe
strongly, based on past experience working with Jonathan in
particular, that this work will be repaid many times over in the
medium and long term. I would be very surprised if there is a
short-cut here (no formulas, just replicate R, etc) that will have
anything like the same benefit.

2) For me, the discussion continues to be hard to follow, in terms of
the main issues. I propose that, in the interests of clarity, we try
and define the issues to do with formulas on the statsmodels github
wiki (it does not appear to be enabled at the moment) or some other
place to do collaborative editing. Do y'all agree?

See you,

Matthew

Skipper Seabold

unread,
Jul 20, 2011, 8:54:46 AM7/20/11
to pystat...@googlegroups.com

Sure, if this would be helpful to hash out the issues and move us
forward. Wiki should be enabled on gh now.

Skipper
