[pystatsmodels] The Return of Formula (?)


josef...@gmail.com

May 16, 2010, 7:42:20 AM
to pystat...@googlegroups.com
After Bruce's comment that we should look at formulas for categorical
variables, I reenabled it, and will merge the changes to trunk today.

here is my *current* opinion and information on it, one example is below

- it looks useful when working with categorical data
(I don't think it's very useful for the non-categorical data we
were working on over the last year.)
- it's very difficult to understand because of all the indirection:
a class produces a class which returns another class, and how do I
get the data in and out? ...
- the only usage examples I have seen are in the test file
- in order to use it, we need to create examples and documentation for it
- I'm not sure everything is correct, in spite of passing the tests,
or maybe I don't get expected results because I don't know how to use it
- because of all the indirection, usage inside statistical models makes them
also difficult to understand
- it doesn't do the data handling, which has been more the focus of Skipper and
Vincent; the "namespace" in formulas needs to be filled with actual data


So, I think it will be useful to "resurrect" formulas for limited
usage, or to take the parts we understand.
But it will be quite a bit of work to figure out how to use it, below is an
example for Factor, I haven't figured out yet how to get results that I
would expect for contrast matrices.

For now, I don't have the time to struggle with it for several days, but maybe
during the summer, if I don't decide that distributions, time series analysis
and lots of other things are more fun.

Josef


>>> from scikits.statsmodels.sandbox import formula
>>> #import scikits.statsmodels.sandbox.contrast_old as contrast

define a categorical variable - factor

>>> f = ['a']*4 + ['b']*3 + ['c']*4
>>> fac = formula.Factor('ff', f)
>>> fac.namespace = {'ff':f}
>>> fac.values()
array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
dtype='|S1')
>>> [f for f in dir(fac) if f[0] != '_']
['func', 'get_columns', 'keys', 'main_effect', 'name', 'names',
'namespace', 'ordinal', 'termname', 'values', 'verify']


create dummy variable

>>> fac.get_columns().shape
(3, 11)
>>> fac.get_columns().T
array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 1., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 0., 0., 1.]])



this is a way of encoding effects from a categorical variable that
differs from using dummy variables.
I have never seen a reference for this. Does anyone know one?


>>> fac.main_effect(reference=1)
<scikits.statsmodels.sandbox.formula.Quantitative object at 0x03E22C70>
>>> fac.main_effect(reference=1)()
array([[ 1., 1., 1., 1., -1., -1., -1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., -1., -1., -1., 1., 1., 1., 1.]])
>>> fac.main_effect(reference=1).names()
['(ff==a)-(ff==b)', '(ff==c)-(ff==b)']
>>> fac.main_effect(reference=2).names()
['(ff==a)-(ff==c)', '(ff==b)-(ff==c)']
>>> fac.main_effect(reference=2)().shape
(2, 11)


columns for the design matrix

>>> fac.main_effect(reference=2)().T
array([[ 1., 0.],
[ 1., 0.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 0., 1.],
[ 0., 1.],
[-1., -1.],
[-1., -1.],
[-1., -1.],
[-1., -1.]])
>>> fac.names()
['(ff==a)', '(ff==b)', '(ff==c)']

Vincent Davis

May 16, 2010, 10:39:31 AM
to pystat...@googlegroups.com
On Sun, May 16, 2010 at 5:42 AM, <josef...@gmail.com> wrote:
- it's very difficult to understand because of all the indirections
 a class produces a class which returns another class, and how do I
get the data in and out?  ...

Glad to hear it's not just me :)

Should we have an end goal/plan/outline for categorical and dummy variable handling? For me, doing my own was as much an exercise in understanding the problem as having a solution. I am not stuck on what I did. I think we would benefit from a clear description, example uses, and an outline of whether we want a class, a function, or both as the type of solution.
I guess I am suggesting we decide what type of wheel we want and what it should fit. Maybe the answer is having several options and just having more examples of their use. But someone other than me should make this call.

Vincent










josef...@gmail.com

May 16, 2010, 4:46:03 PM
to pystat...@googlegroups.com
On Sun, May 16, 2010 at 10:39 AM, Vincent Davis <vin...@vincentdavis.net> wrote:
On Sun, May 16, 2010 at 5:42 AM, <josef...@gmail.com> wrote:
- it's very difficult to understand because of all the indirections
 a class produces a class which returns another class, and how do I
get the data in and out?  ...

Glad to here it's not just me :)

I updated trunk; here are some more examples, including some contrasts for factors:
http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/revision/1992#scikits/statsmodels/sandbox/examples/ex_formula.py

I wasn't successful in combining a factor with other terms in a formula, or, more precisely, in getting the design matrix and contrast matrices out of the combined formula.

The merge includes a lot of other things I was working on in my branch. I haven't really checked whether it is all clean, and I still have to run the test suite on the current branch.

 

Should we have an end goal/plan/outline for categorical and dummy variable handling? For me, doing my own was as much an exercise in understanding the problem as having a solution. I am not stuck on what I did. I think we would benefit from a clear description, example uses, and an outline of whether we want a class, a function, or both as the type of solution.
I guess I am suggesting we decide what type of wheel we want and what it should fit. Maybe the answer is having several options and just having more examples of their use. But someone other than me should make this call.

just a brief answer for now,

since I never used a formula framework myself and used only simple cases of dummy variables, I don't have a strong opinion on what the best way is.
Usually, I'm faster writing a 2-liner or 5-liner than figuring out how to use a "framework"

Josef

 

josef...@gmail.com

May 17, 2010, 1:03:38 AM
to pystat...@googlegroups.com
On Sun, May 16, 2010 at 4:46 PM, <josef...@gmail.com> wrote:


just a brief answer for now,

since I never used a formula framework myself and used only simple cases of dummy variables, I don't have a strong opinion what the best way is.
Usually, I'm faster writing a 2-liner or 5-liner, than to figure out how to use a "framework"

an aside:
Even functions that just wrap one-liners can be useful if they avoid having to remember something that is difficult to figure out, hard to remember, and where we want to have the usage tested. I have several of these in sandbox.tsa around correlate, convolve and lfilter because I never remember which argument is which. 2 lines of code, 15 lines of docstring and 30 lines of tests/examples (just a guess)

The basic usage case is "construct exog" - "fit" - "get test statistics (t_test, f_test for linear restrictions)"

This is pretty easy in the standard (?) econometrics case with no categorical variables and no interaction terms

If there are categorical variables (e.g. red, blue and green) with more than two groups, then we need to construct the dummy variables for the design matrix, and the (contrast) matrix for linear restrictions, for example:
Does color have a statistically significant effect? Is the difference between red-blue, red-green, blue-green statistically significant? Does color have an effect in combination with size?

Last year, I only looked at the construction of the design/exog matrix, which is easy to do by hand or with simple helper functions, like data2dummy.
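A minimal sketch of such a helper (hypothetical, not the actual sandbox data2dummy) might look like this, reproducing the dummy matrix from the Factor example above:

```python
import numpy as np

def data2dummy(x):
    """Return a 0/1 dummy column per level of a 1-d label sequence,
    plus the ordered level names (hypothetical sketch)."""
    x = np.asarray(x)
    levels = sorted(set(x.tolist()))
    dummies = np.column_stack([(x == lev).astype(float) for lev in levels])
    return dummies, levels

f = ['a'] * 4 + ['b'] * 3 + ['c'] * 4
d, names = data2dummy(f)
# d has shape (11, 3), matching fac.get_columns().T in the example above
```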

Now, I think constructing the contrast matrix is the less obvious part, e.g. the matrix of linear restrictions for the f-test for the null hypothesis that the coefficients of red, blue and green are the same. I don't have a nice set of helper functions for this.
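For the red/blue/green case, the restriction matrix is small enough to write down by hand; a sketch (the column layout of exog is an assumption here):

```python
import numpy as np

# Assume the exog columns are the dummies [red, blue, green] (no constant).
# H0: beta_red = beta_blue = beta_green is two homogeneous restrictions:
#   beta_red  - beta_blue  = 0
#   beta_blue - beta_green = 0
R = np.array([[1., -1.,  0.],
              [0.,  1., -1.]])
# Under the null, R @ beta == 0; R is what f_test would consume.
```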

This is where the classes in formula could be useful. But, I haven't figured it out yet, and I'm not sure it's really working for the full design matrix including some factors/dummy variables. I only managed to get some pieces to work and it still required work. If it's too complicated to use, everyone just writes their own.
The other nice part about formula is the construction of descriptive names for the dummy variables, e.g. '(ff==b)'

(I don't like the term namespace in formula, it sounds too abstract to me; I would just rename it to "datadict", which sounds less scary.)

That's roughly how I (currently) see the use case for categorical and formula.

We can either do it explicitly-case-by-case, or provide helper functions/classes or get the full formula framework to work.

Josef






 

Bruce Southey

May 17, 2010, 9:52:47 AM
to pystat...@googlegroups.com
Yeah,
I know what you mean as I have not been able to spend any time to explore what is available.

Not to repeat things again, but really most of this is rather moot without first determining how the user inputs a model (which also implies the data). So you need to know which are the X and Y variables (note: there could be multiple Y variables), which terms are fixed or random, and which are interactions (nested terms can be treated as interactions without fitting the corresponding main effect).

On top of that, you need to determine a syntax and whether you want shortcuts like transforming a term or using a special polynomial. For example, do you want the user to say 'x+x*x+x*x*x' or poly(x,3) or x.poly(3), and stand(x) or x.stand(0,1) to transform x to have mean zero and variance 1?
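Whatever the spelling ends up being, both shortcuts reduce to a few lines of numpy; hypothetical helpers (the names stand/poly are just the ones from the message, not an existing API):

```python
import numpy as np

def stand(x):
    """Standardize to mean zero, variance one (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def poly(x, degree):
    """Raw polynomial columns x, x**2, ..., x**degree
    (not orthogonal; hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.arange(10.0)
z = stand(x)      # mean ~ 0, std ~ 1
P = poly(x, 3)    # shape (10, 3): columns x, x**2, x**3
```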

Next you have to decide if you want to update the model like you can do in R. I usually don't find it useful but then I don't use R much anyhow.

Creating zero/one dummy variables is trivial but not the only option:
http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
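One of the schemes on that page, deviation (effects) coding, is exactly what fac.main_effect(reference=2)().T produced in the Factor example earlier in the thread, and it can be built directly; a sketch:

```python
import numpy as np

# Deviation (effects) coding for a 3-level factor: each level is
# compared to the grand mean, with the last level as the reference.
levels = ['a', 'b', 'c']
coding = np.array([[ 1.,  0.],
                   [ 0.,  1.],
                   [-1., -1.]])
f = ['a'] * 4 + ['b'] * 3 + ['c'] * 4
rows = [levels.index(v) for v in f]
design_cols = coding[rows]   # shape (11, 2), same as main_effect(reference=2)().T
```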

Creating the design matrix from a model is easy

I do not know enough about creating the contrast matrix and how to test individual terms. I took the long way by refitting the model.

Bruce

josef...@gmail.com

May 17, 2010, 10:49:25 AM
to pystat...@googlegroups.com

I still think it's premature for us to discuss the user interface before we even know what functionality we will actually provide.
formula currently uses a sympy-style definition of Term, Factor, Formula, ...;
additionally, we could directly use string parsing for formula string expressions.

However, first I would like to get the contrast matrix part sorted out, with whatever interface is convenient internally, i.e. what are all the classes, methods and functions that we need. Once we know/have this, we can discuss the details of a convenient user interface and terminology. Then I would also hide some of the internal complexity (make it unnecessary for simple use, but keep it available for advanced use).

 

Next you have to decide if you want to update the model like you can do in R. I usually don't find it useful but then I don't use R much anyhow.

I don't know what update does in R.
 

Creating zero/one dummy variables is trivial but not the only option:
http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm


This looks good, it will make a good test case. If nobody else is faster, I will try to replicate the results by the end of summer (also for a more general case that mixes categorical and continuous variables and their products).

From a quick skim, my impression is that they are recoding the design matrix, which looks like a redundant re-estimation of the same model to me.
I would work only with a default dummy encoding, and construct the (contrast?) matrices for use with f_test and t_test.
 
Creating the design matrix from a model is easy

I do not know sufficient about creating the contrast matrix and how to test individual terms. I took the long way by refitting the model.

Skipper's and my problem last year was that we never really figured out what statisticians do with contrast matrices, so we didn't touch this part much.
The econometrics analogy is essentially testing linear restrictions, r*beta=0 for t_test, or R*beta=0 for f_test. Both work very well; for t_test we had simultaneous t-tests for each row of R*beta=0, which would produce exactly the results in the UCLA reference. (I don't remember whether we removed this feature for greater simplicity.)

The first thing to do, I think, is to be able to add a formula.Factor to a formula.Formula, so we can have the joint design/exog matrix and create the corresponding (contrast) matrices R for linear restrictions on the full model.

Then we could add an option to create some of the contrast matrices for the cases in the UCLA reference
    5.1 Simple Coding
    5.2 Deviation Coding
    5.3 Orthogonal Polynomial Coding
    5.4 Helmert Coding
    5.5 Reverse Helmert Coding
    5.6 Forward Difference Coding
    5.7 Backward Difference Coding


(I still consider this as a user convenience, not part of the core statistical models, like lm instead of lm.fit in R. But given Vincent's work, I think it is also time to improve convenience and usability, and not just increase the coverage of various econometric/statistical models.)

Josef

 



Vincent Davis

May 17, 2010, 11:45:28 AM
to pystat...@googlegroups.com
Skipper and my problem last year was that we never really figured out what statisticians do with contrast matrices, so we didn't touch this part much.
The econometrics analogy is essentially testing linear restrictions, r*beta=0 for t_test, or R*beta=0 for f_test. Both work very well, for t_test we had simultaneous t_tests for each row of R*beta=0 which would produce exactly the results in the UCLA reference. (I don't remember whether we removed this feature for greater simplicity.)

Maybe I am starting to understand what you mean by a contrast matrix, and then I found ftestforWeb.pdf.
Thinking of this from the angle of making it simpler for the user: what you are saying is that we keep track of what's what when going from the raw data to categorical and then to dummy variables (I think of categorizing data as a step along the way to a dummy variable), so that something like the following is possible (totally made up):

cat_array = CategorizeArray(exog, ['gender', 'ethnicity'], dummy=True)   # takes the gender and ethnicity variables, converts them to categorical and then dummy variables; returns an array with dtype float
result = OLS(endog, cat_array).fit()   # fits the OLS
result.test.equal(cat_array.ethnicity)   # tests that the categories of ethnicity are all equal, and returns results with labels, i.e. Asian, Native American, rather than only column numbers from the array

In this example, result.test.equal() would need to be passed the array columns that are being tested and labels/names for them, so that the results are available with names.

Do I get it, or am I confused? (Strange how sometimes we have to ask someone else whether we are confused, because we are not sure ourselves.)

josef...@gmail.com

May 17, 2010, 12:04:45 PM
to pystat...@googlegroups.com
On Mon, May 17, 2010 at 11:45 AM, Vincent Davis <vin...@vincentdavis.net> wrote:


Yes, that's the idea.

result.test.equal(cat_array.ethnicity)
would then correspond to the standard t-value, result.t(), for continuous variables: a test of whether ethnicity has no statistical effect on the endog

The rest are implementation details, and the question of how much flexibility we want to allow in the tests for other null hypotheses on the categorical variable.

Your reference looks pretty complete; most of it is implemented in f_test, although, I think, until now only for the homogeneous linear restriction R*beta=0 and not yet for R*beta=r, and the same for t_test.
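Under the classical assumptions, the F statistic for R*beta=0 is just a few matrix products; a numpy-only sketch (f_test wraps essentially this computation; the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
# design: constant plus two continuous regressors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, 0.5]) + 0.1 * rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])          # residual variance estimate

# H0: beta_1 - beta_2 = 0, one homogeneous restriction R @ beta = 0
R = np.array([[0.0, 1.0, -1.0]])
q = R.shape[0]
Rb = R @ b
cov_Rb = s2 * (R @ np.linalg.inv(X.T @ X) @ R.T)
F = float(Rb @ np.linalg.solve(cov_Rb, Rb)) / q
# a p-value would come from the F(q, n - k) distribution, e.g. scipy.stats.f.sf
```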

Josef
 

josef...@gmail.com

May 17, 2010, 12:13:33 PM
to pystat...@googlegroups.com
On Mon, May 17, 2010 at 12:04 PM, <josef...@gmail.com> wrote:



Your reference looks pretty complete, most of it is implemented in f_test, although, I think, until now only for homogeneous linear restriction R*beta=0 and not yet for R*beta=r, and the same for t_test.

for f_test and t_test, I have a messy example file
scikits\statsmodels\examples\example_ols_tftest.py

I never cleaned it up much, but it has lots of cases that I used to check whether f_test and t_test work as expected.

Josef

 

Bruce Southey

May 17, 2010, 12:21:33 PM
to pystat...@googlegroups.com

Take a look at SAS, for example GLM:
The parameterization:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_glm_sect029.htm

The actual testing:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_glm_sect030.htm


Really, type 3 tests are what we are after, so you just need to create the appropriate contrast matrix L (interactions can make these hard to get correct). Then it is some matrix multiplications, which I am not even going to try to replicate here, to get the F value
(see the last section, "Type 3 Tests of Fixed Effects"):

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_mixed_sect025.htm


Bruce

Matthew Brett

May 17, 2010, 2:27:32 PM
to pystat...@googlegroups.com, NIPY Developer's List
Hi,

I was glad to see formula coming up over your side of the coding
divide; we've also been working on it a little in nipy.

I think we're beginning to get the idea, and it seems like a very
good idea, so it would be very good to share some work on increasing
our understanding.

First - Jonathan Taylor - who wrote this code - has commented on the
design a bit on the NIPY mailing list - we've put his comments into
our version of the code in:

http://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/formula.py
http://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/aliased.py

The original comments are here:

http://mail.scipy.org/pipermail/nipy-devel/2009-December/002387.html

> here is my *current* opinion and information on it, one example is below
>
> - it looks useful when working with categorical data
>  (I don't think it's very useful for non-categorical data, as we
> where working on in the last year.)
> - it's very difficult to understand because of all the indirections
>  a class produces a class which returns another class, and how do I
> get the data in and out?  ...

I think here you mean the aliased functions. That stuff is, as you
know, Sympy code, and after looking at it a while (with Ondrej Certik
last week, for example), I think we concluded that the indirection
could not be avoided because of the way Sympy makes new
functions. However, with a nice set of docstrings and examples, I
honestly think that won't matter, because the concept of the
AliasedFunction is actually very simple, and goes like this:

An AliasedFunction is a symbolic function that carries with it its own
implementation as a numerical function. When we call 'lambdify' for
expressions containing this function, the function's own
implementation is used.
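For concreteness, this pattern survives in later SymPy releases as implemented_function; a sketch of the idea, assuming a reasonably recent SymPy:

```python
import numpy as np
from sympy import symbols, lambdify
from sympy.utilities.lambdify import implemented_function

t = symbols('t')
# a symbolic function that carries its own numerical implementation
f = implemented_function('f', lambda x: np.sin(x) + 1)
expr = 2 * f(t)
# lambdify picks up the attached implementation when building the
# numerical function
func = lambdify(t, expr, modules=['numpy'])
# func(0.0) evaluates to 2 * (sin(0) + 1) == 2.0
```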

> - the only usage examples, I have seen are in the test file

We could certainly do with some more examples, but there are some you
may not have found in our tree:

http://github.com/nipy/nipy/tree/master/examples/formula/

> - in order to use it, we need to create examples and documentation for it
> - I'm not sure everything is correct, in spite of passing the tests,
>  or maybe I don't get expected results because I don't know how to use it
> - because of all the indirection, usage inside statistical models makes them
>  also difficult to understand
> - it doesn't do the data handling, which has been more the focus of Skipper and
>  Vincent, the "namespace" in formulas need to be filled with actual data

> For now, I don't have the time to struggle with it for several days, but maybe
> during summer, if I don't think distributions, time series analysis and lot's
> of other things are more fun.

Maybe more fun with some help? We're very keen to work on this (at
least, I know I am).

Thanks for thinking and working on this. I do think it could be very useful,

Matthew

Matthew Brett

May 17, 2010, 2:36:48 PM
to pystat...@googlegroups.com, NIPY Developer's List
Hi,

> An AliasedFunction is a symbolic function that carries with it its own
> implementation as a numerical function.  When we call 'lambdify' for
> expressions containing this function, the function's own
> implementation is used.

Incidentally, we talked to the Sympy folks when they came here to
visit last week, and I believe that they agreed the idea of the
AliasedFunction was good and useful; I hope it will be part of Sympy
soon.

See you,

Matthew

Bruce Southey

May 18, 2010, 9:19:45 AM
to pystat...@googlegroups.com
Hi,
Thanks for the input Matthew!

I have been trying to understand formula, but I have gotten frustrated
with trying to use it (and am really too busy to try harder), in
particular trying to use an alphanumeric variable. Part of this may be
the different versions of it floating around, but really it is me.

It would be really great to provide a set of useful examples of formula
associated with data. My thought would be to create a set of typical
usages, from regression to at least general linear models (such as
analysis of covariance) and maybe mixed-effects models.

One potential issue is the use of sympy, since it would add a dependency.
Of course, it may be possible to work around that so that it can be used
if installed.

Bruce


josef...@gmail.com

May 18, 2010, 11:47:48 AM
to pystat...@googlegroups.com
On Tue, May 18, 2010 at 9:19 AM, Bruce Southey <bsou...@gmail.com> wrote:
> On 05/17/2010 01:27 PM, Matthew Brett wrote:
>>
>> Hi,
>>
>> I was glad to see formula coming up over your side of the coding
>> divide;  we've also been working on it a little in nipy.
>>
>> I think we're beginning to get the idea, and it seems like it's a very
>> good idea, so it would be very good to share some work in increasing
>> understanding.
>>
>> First - Jonathan Taylor - who wrote this code - has commented on the
>> design a bit on the NIPY mailing list - we've put his comments into
>> our version of the code in:
>>
>> http://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/formula.py
>> http://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/aliased.py
>>
>> The original comments are here:
>>
>> http://mail.scipy.org/pipermail/nipy-devel/2009-December/002387.html
>>

I am still several steps behind the formula framework in nipy. I'm
still trying to figure out how Jonathan's old formula framework, which
we still have in statsmodels, works and how it might fit into the
"new" statsmodels.

Using sympy is an additional complication (and a dependency that we
don't want to require yet). I'm pretty skeptical about its usage in
this area; especially, I (still) find lambdify too much of a black box
with too little control over the actual numpy functions that are used.
(On the other hand, I was using sympy more for code generation and
getting analytical derivatives and Hessians, which for example for
Jacobians might require ex-post adjustments of the axes for vectorized
solutions.)

It looks like there are still enough similarities between the old
version and the nipy version of the formula framework, to benefit, for
example, from working through different use cases.
The nipy version also has a new random effects class that is similar
to some experimental functions I wrote for random effects in panel
data models.

There are still parts that I don't understand, for example I never
tried to work through the math to figure out what
"contrast_from_cols_or_rows" is supposed to be or do.

And, I don't see anywhere how the final test statistics, e.g. the
F-tests or t-tests, are actually calculated from these "contrasts".

Another comment to put formulas a bit in perspective:
For us the core of linear models is mostly finished, there might be
some use of formulas in panel data analysis, and there are still
usability improvements to do.
But most of our focus has shifted to other models, as can be seen for
example in Skipper's gsoc plans.
System of Equations regression (SUR, 2SLS, FIML, structural VAR) is
not covered by formulas, but we were discussing that some kind of
formula structure would be useful.
I don't see any use of formulas in standard time series analysis,
ARIMA, Kalman Filter, GARCH, except maybe as a thin layer for
shorthand notation.
I don't know whether formulas might be useful for General Method of
Moments, e.g. for instrumental variables estimation.
The formula framework could be useful in Generalized and Robust Linear
Models, but I haven't thought about that at all.
In the long term, sympy and automatic differentiation might be useful
to improve non-linear models and non-linear estimators.

>>
>>>
>>> here is my *current* opinion and information on it, one example is below
>>>
>>> - it looks useful when working with categorical data
>>>  (I don't think it's very useful for non-categorical data, as we
>>> where working on in the last year.)
>>> - it's very difficult to understand because of all the indirections
>>>  a class produces a class which returns another class, and how do I
>>> get the data in and out?  ...
>>>
>>
>> I think here you mean the aliased functions.   That stuff is - as you
>> know - Sympy code, and after looking at it a while (with Ondrej Certik
>> last week for example), I think we concluded that the indirection
>> stuff could not be avoided because of the way Sympy makes new
>> functions.   However, with a nice set of docstrings and examples, I
>> honestly think that won't matter because the concept of the
>> AliasedFunction is actually very simple, and goes like this:
>>
>> An AliasedFunction is a symbolic function that carries with it its own
>> implementation as a numerical function.  When we call 'lambidify' for
>> expressions containing this function, the function's own
>> implementation is used.
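A minimal sketch of that concept, using sympy's implemented_function helper rather than nipy's actual AliasedFunction class (the names here are illustrative):

```python
import numpy as np
import sympy
from sympy.utilities.lambdify import implemented_function, lambdify

# a symbolic function that carries its own numerical implementation
f = implemented_function('f', lambda v: v + 1)

x = sympy.Symbol('x')
expr = f(x) + 2          # manipulate symbolically like any sympy expression

# lambdify picks up the attached implementation when building the
# numerical function
num = lambdify(x, expr)
print(num(np.arange(3)))   # [3 4 5]
```

The point is exactly the one above: the expression stays symbolic until lambdify, and then the function's own implementation is used.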

I think I mean more the inverted version: instead of hiding numpy
calculations inside sympy functions, I would hide formulas inside
numpy functions and methods, so that the formula framework doesn't
take over our statistical models and those remain "understandable".


>>
>>
>>>
>>> - the only usage examples, I have seen are in the test file
>>>
>>
>> We could certainly do with some more examples, but there are some you
>> may not have found in our tree:
>>
>> http://github.com/nipy/nipy/tree/master/examples/formula/

I had looked at them last year, but they assume too much context to
understand, and I didn't try to figure out all the nipy specific code.

>>
>>
>>>
>>> - in order to use it, we need to create examples and documentation for it
>>> - I'm not sure everything is correct, in spite of passing the tests,
>>>  or maybe I don't get expected results because I don't know how to use it
>>> - because of all the indirection, usage inside statistical models makes
>>> them
>>>  also difficult to understand
>>> - it doesn't do the data handling, which has been more the focus of
>>> Skipper and
>>>  Vincent; the "namespace" in formulas needs to be filled with actual data
>>>
>>
>>
>>>
>>> For now, I don't have the time to struggle with it for several days, but
>>> maybe
>>> during summer, if I don't think distributions, time series analysis and
>>> lots
>>> of other things are more fun.
>>>
>>
>> Maybe more fun with some help?  We're very keen to work on this (at
>> least, I know I am).
>>
>> Thanks for thinking and working on this.   I do think it could be very
>> useful,

I have put it on my summer plans. Since Vincent already started to
work in this direction, I hope we get some nice front-to-back usage
examples and the corresponding code worked out.

Josef

>>
>> Matthew
>>
>
> Hi,
> Thanks for the input Matthew!
>
> I have been trying to understand formula, but I have gotten frustrated with
> trying to use it (and have really been too busy to try harder). In particular,
> trying to use an alphanumeric variable. Part of this may be the different
> versions of it floating around, but really it is me.

I think that was one of the main reasons we dropped the formula
framework last year. Skipper and I couldn't figure out well enough what
to do with it, and nothing looked like any textbook formulas in
statistics or econometrics.

josef...@gmail.com

May 18, 2010, 4:38:54 PM5/18/10
to pystat...@googlegroups.com

These are good references,

additionally for future reference:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_catmod_sect032.htm

has the alternative encoding of the main and product effects, e.g.
dummy variable encoding
y = a1 * d1 + a2 * d2 + a3 * d3 + const     d1,d2,d3 dummies for 3 levels
then imposing a1 + a2 + a3 = 0 leads to
y = b1 * (d1-d3) + b2 * (d2-d3) + const  with a1 = b1, a2 = b2, a3 = -(b1+b2)

(answer to my earlier question)
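The equivalence of the two encodings can be checked numerically; a small sketch with numpy least squares (the data and effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.repeat([0, 1, 2], 4)                         # 3-level factor
D = (f[:, None] == np.arange(3)).astype(float)      # dummies d1, d2, d3

# true effects chosen so that a1 + a2 + a3 = 0
y = D @ np.array([1.0, -2.0, 1.0]) + 0.5 + rng.normal(scale=0.1, size=12)

# fold the constraint into the design: columns (d1 - d3) and (d2 - d3)
X = np.column_stack([D[:, 0] - D[:, 2], D[:, 1] - D[:, 2], np.ones(12)])
b1, b2, const = np.linalg.lstsq(X, y, rcond=None)[0]

a = np.array([b1, b2, -(b1 + b2)])   # recovered a1, a2, a3; sums to zero
```

The recovered a should be close to the true effects (1, -2, 1), and sums to zero by construction, matching the substitution a1 = b1, a2 = b2, a3 = -(b1 + b2) above.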

and
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#/documentation/cdl/en/statug/63033/HTML/default/statug_mixed_sect022.htm

has the theory for mixed effect, fixed and random effects model

Josef

 
Bruce


Matthew Brett

May 18, 2010, 7:46:12 PM5/18/10
to pystat...@googlegroups.com, Jonathan E. Taylor
Hi,

> I am still several steps behind the formula framework in nipy. I'm
> still trying to figure out how Jonathan's old formula framework, that
> we still have in the statsmodels, works and how it might fit into the
> "new" statsmodels.

I hope that Jonathan will have time to join in the discussion - I'll
Cc him - but I know he's very busy. My impression is (sorry, I
should look) that Formula has advanced a little since your copy, but I
might well be wrong.

The thing that Jonathan is trying to get at, I believe, is the
conception of the R formula. As y'all know, this is very powerful,
allowing you to symbolically specify your statistical model. I think
Formula is designed to provide that, as well as being a mathematically
intuitive way to understand the idea of the formula and its solution.
I'm speaking without great statistical knowledge.

> Using sympy is an additional complication (and a dependency that we
> don't want to require yet).

Yes - we obviously do require it - but it's very light (pure python)
compared to numpy or - at the extreme end - scipy.

> I'm pretty skeptical about it's usage in
> this area, especially I (still) find lambdify too much of a black box
> with too little control over the actual numpy functions that are used.

I found that a little hard at first. I think the copy that we have
probably can be simplified. I'll try and do that. The idea is
simple though - going from a function-as-sympy-symbol to
function-as-working-on-numerical-data. The rule is, given a function
name 'func' - if the sympy function 'func' carries its own
implementation (in func.alias) then use that, otherwise search for a
function called 'func' in the namespaces given.
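A minimal sketch of the namespace-lookup half of that rule, using plain sympy (the dict passed to lambdify plays the role of the formula namespace; the names are illustrative):

```python
from sympy import Function, Symbol, lambdify

x = Symbol('x')
g = Function('g')            # purely symbolic, no implementation attached

# lambdify resolves the name 'g' in the namespace(s) it is given,
# falling back to 'math' for anything else
num = lambdify(x, g(x) + 1, modules=[{'g': lambda v: 2 * v}, 'math'])
print(num(3))                # 7
```

If 'g' carried its own implementation (as with implemented_function), that would take precedence; otherwise the search through the supplied namespaces applies, as described above.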

> It looks like there are still enough similarities between the old
> version and the nipy version of the formula framework, to benefit, for
> example, from working through different use cases.
> The nipy version also has a new random effects class that is similar
> to some experimental functions I wrote for random effects in panel
> data models.
>
> There are still parts that I don't understand, for example I never
> tried to work through the math to figure out what
> "contrast_from_cols_or_rows"  is supposed to be or do.

That sounds like good place to start.

> And, I don't see anywhere how the final test statistics, e.g. the
> F-tests or t-tests, are actually calculated from these "contrasts".

I can find some of that stuff - I hope I can find time...

> In the long term, sympy and automatic differentiation might be useful
> to improve non-linear models and non-linear estimators.

Right - yes - I remember Jonathan was talking about that the last time
he was here.

> I think, I mean more the inverted version, instead of hiding numpy
> calculations inside sympy functions, I would hide formulas inside
> numpy functions and methods, so that the usage of the formula
> framework doesn't take over our statistical models and those remain
> "understandable".

I can't comment very sensibly about that. I personally like the idea
of a formula being something symbolic, whereas, after you have derived
your design matrix from the model, the estimation proceeds in
numerical / numpy space. I hope I've understood it correctly.

> I have put it on my summer plans. Since Vincent already started to
> work in this direction, I hope we get some nice front-to-back usage
> examples and the corresponding code worked out.

That would be good - I hope I can help - I need to get to grips with
this too. I should say that my personal experience with Jonathan's
code is that it can be hard to get to grips with, but each time we've
spent the time to understand what he was trying to do, it turned out
that that was indeed the best way to do it. It's often taken us more
than a year to realize that, but - I guess that's python - it's so
easy to read and write that we sometimes expect difficult ideas to be
simpler than they are.

See you,

Matthew

Vincent Davis

May 18, 2010, 8:08:01 PM5/18/10
to pystat...@googlegroups.com, Jonathan E. Taylor
On Tue, May 18, 2010 at 5:46 PM, Matthew Brett <matthe...@gmail.com> wrote:

The thing that Jonathan is trying to get at, I believe, is the
conception of the R formula.  As y'all know, this is very powerful,
allowing you to symbolically specify your statistical model.   I think
Formula is designed to provide that, as well as being a mathematically
intuitive way to understand the idea of the formula and its solution.
 I'm speaking without great statistical knowledge.

I am glad to hear it is powerful, because I keep hearing how difficult it is to understand :)
I must admit I need to spend more time looking at it, but I am not really sure I know what it is supposed to do or what the purpose/goal is.

That would be good - I hope I can help - I need to get to grips with
this too.  I should say that my personal experience with Jonathan's
code is that it can be hard to get to grips with, but each time we've
spent the time to understand what he was trying to do, it turned out
that that was indeed the best way to do it.   It's often taken us more
than a year to realize that, but - I guess that's python - it's so
easy to read and write that we sometimes expect difficult ideas to be
simpler than they are.


Um, I will have to (consider) taking that on faith. :) I have to admit that sometimes I need to try to solve a problem myself just to better understand how difficult it really is.

Vincent


See you,

Matthew

Matthew Brett

May 18, 2010, 8:11:46 PM5/18/10
to pystat...@googlegroups.com, Jonathan E. Taylor
Hi,

>> That would be good - I hope I can help - I need to get to grips with
>> this too.  I should say that my personal experience with Jonathan's
>> code is that it can be hard to get to grips with, but each time we've
>> spent the time to understand what he was trying to do, it turned out
>> that that was indeed the best way to do it.   It's often taken us more
>> than a year to realize that, but - I guess that's python - it's so
>> easy to read and write that we sometimes expect difficult ideas to be
>> simpler than they are.
>
> Um, I will have to (consider) taking that on faith. :) I have to admit sometime I need to try to solve a problem myself just to better understand how difficult it really is.

Believe me - I've been there - it will probably take my lifetime to
escape the habit ;) In this case though, there's a little help for
your faith, because R has the formula object, albeit in less complete
form...

See you,

Matthew

Vincent Davis

May 18, 2010, 8:36:33 PM5/18/10
to pystat...@googlegroups.com, Jonathan E. Taylor
So what's the best way for me to understand this formula concept? I am not really talking about the formula.py file but about the concept of what is being accomplished. I am feeling left out.
Thanks
Vincent

josef...@gmail.com

May 18, 2010, 11:29:49 PM5/18/10
to pystat...@googlegroups.com
On Tue, May 18, 2010 at 7:46 PM, Matthew Brett <matthe...@gmail.com> wrote:
> Hi,
>
>> I am still several steps behind the formula framework in nipy. I'm
>> still trying to figure out how Jonathan's old formula framework, that
>> we still have in the statsmodels, works and how it might fit into the
>> "new" statsmodels.
>
> I hope that Jonathan will have time to join in the discussion - I'll
> Cc him - but I know he's very busy.   My impression is (sorry, I
> should look) that Formula has advanced a little since your copy, but I
> might well be wrong.
>
> The thing that Jonathan is trying to get at, I believe, is the
> conception of the R formula.  As y'all know, this is very powerful,
> allowing you to symbolically specify your statistical model.   I think
> Formula is designed to provide that, as well as being a mathematically
> intuitive way to understand the idea of the formula and its solution.
>  I'm speaking without great statistical knowledge.

You forget that I'm not coming from an R (or SAS or Stata) background.
My background is Gauss and Matlab, pure matrix languages. For me
econometrics is rather linear algebra, some functions and a call to
fmin or equivalent.
My Maple or Mathematica programming was usually when I was more
interested in analytical results, and numerical results were just for
illustration or as a sanity check.

So for any symbolic stuff, I have to fight my intuition all the way.
And my personal (potential) target audience will come from a similar
background.

>
>> Using sympy is an additional complication (and a dependency that we
>> don't want to require yet).
>
> Yes - we obviously do require it - but it's very light (pure python)
> compared to numpy or - at the extreme end - scipy.

Until now we have had a no-dependency rule, so that the path to
re-integration with scipy stays open.
I see that this is intellectually attractive, after all we are
programming in python and not in C or assembler (at least I am not),
but my impression of symbolic programming is that the "compilers" are
not good enough yet to replace handmade coding in an array/matrix
language.
(Or it requires spending time enhancing sympy instead of writing stats code)

>> I have put it on my summer plans. Since Vincent already started to
>> work in this direction, I hope we get some nice front-to-back usage
>> examples and the corresponding code worked out.
>
> That would be good - I hope I can help - I need to get to grips with
> this too.  I should say that my personal experience with Jonathan's
> code is that it can be hard to get to grips with, but each time we've
> spent the time to understand what he was trying to do, it turned out
> that that was indeed the best way to do it.   It's often taken us more
> than a year to realize that, but - I guess that's python - it's so
> easy to read and write that we sometimes expect difficult ideas to be
> simpler than they are.

I'm not sure what conclusions we are supposed to draw from the last part.

I've followed the discussion on web frameworks, between the
frameworks-make-the-usage-easier camps and the
framework-are-too-complicated-and-inflexible-I-rather-do-it-myself
camps.

Cheers,

Josef
(I won't be able to work on this until July)

>
> See you,
>
> Matthew
>

Matthew Brett

May 18, 2010, 11:43:52 PM5/18/10
to pystat...@googlegroups.com
Hi,

>> That would be good - I hope I can help - I need to get to grips with
>> this too.  I should say that my personal experience with Jonathan's
>> code is that it can be hard to get to grips with, but each time we've
>> spent the time to understand what he was trying to do, it turned out
>> that that was indeed the best way to do it.   It's often taken us more
>> than a year to realize that, but - I guess that's python - it's so
>> easy to read and write that we sometimes expect difficult ideas to be
>> simpler than they are.
>
> I'm not sure what conclusions we are supposed to draw from the last part.

Ah - I guess from your phrasing you were feeling it was rude - I'm
sorry if so. My background is the matrix-based statistics
instantiated in an imaging package called SPM. I have also used R.
I was trying to say that I think a stats package will _probably_ (I
can't know, because I'm not trying to write one) need a formula - to
encapsulate the same things that an R formula does. I think part of
the complexity of Jonathan's code _may_ be (I don't understand it well
enough) that he's trying to implement formula things that
statisticians commonly use - and that wouldn't be surprising because
Jonathan is a mathematical statistician.

> I've followed the discussion on web frameworks, between the
> frameworks-make-the-usage-easier camps and the
> framework-are-too-complicated-and-inflexible-I-rather-do-it-myself
> camps.

Yes, but I don't know whether 'framework' is the right word here. The
formula is trying to encapsulate some abstract rules about how
statistical models work - but they are abstract in the same way math
is abstract.

I'm sorry - I realize that I'm speaking out of ignorance - I'm largely
ignorant of statistics, and completely ignorant of econometrics.

> (I won't be able to work on this until July)

Excellent - that's probably when I'm going to have time.

I think my hope here is to convey my and our experience, that it can
be tempting to drop the more complicated stuff, thinking that it's not
doing anything important, but we've almost always found that some
patience (and a few days feeling like we are retards) has saved us a
lot of time in thinking things through...

See you,

Matthew

Matthew Brett

May 19, 2010, 12:02:04 AM5/19/10
to pystat...@googlegroups.com, Jonathan E. Taylor
Hi,

On Tue, May 18, 2010 at 5:36 PM, Vincent Davis <vin...@vincentdavis.net> wrote:
>
> So what's the best way for me to understand this formula concept? I am not really talking about the formula.py file but about the concept of what is being accomplished. I am feeling left out.

I'm afraid I don't know a very good reference. Do you know R?
There's some online docs on what kind of thing the R formula does:

http://cran.r-project.org/doc/manuals/R-intro.html#Statistical-models-in-R

There's a useful section on 'Model Formulae' around page 1146 of the R
reference guide: http://cran.r-project.org/doc/manuals/fullrefman.pdf
. This cites:

References
Chambers, J. M. and Hastie, T. J. (1992) Statistical models. Chapter 2
of Statistical Models in S eds J. M. Chambers and T. J. Hastie,
Wadsworth & Brooks/Cole.

There's an introduction also here:

http://wiener.math.csi.cuny.edu/st/stRmanual/ModelFormula.html

and some examples of very basic usage at the top here:

http://www.personality-project.org/R/r.lm.html

Is that helpful for the general idea?

See you,

Matthew

josef...@gmail.com

May 19, 2010, 12:13:54 AM5/19/10
to pystat...@googlegroups.com
On Tue, May 18, 2010 at 11:43 PM, Matthew Brett <matthe...@gmail.com> wrote:
> Hi,
>
>>> That would be good - I hope I can help - I need to get to grips with
>>> this too.  I should say that my personal experience with Jonathan's
>>> code is that it can be hard to get to grips with, but each time we've
>>> spent the time to understand what he was trying to do, it turned out
>>> that that was indeed the best way to do it.   It's often taken us more
>>> than a year to realize that, but - I guess that's python - it's so
>>> easy to read and write that we sometimes expect difficult ideas to be
>>> simpler than they are.
>>
>> I'm not sure what conclusions we are supposed to draw from the last part.
>
> Ah - I guess from your phrasing you were feeling it was rude - I'm
> sorry if so.

No, not at all. What I meant was that you confirmed our impression that
the formula structure/framework is tough, even though you have been
exposed to or worked with it for quite some time.

So, our conclusion was last year (and still is) to see how far we can
get with a more "standard" approach without formulas that has a much
lower learning curve and keeps Python simple.

My "Return of Formula" was prompted because now I see a use case where
it might be helpful.

> My background is the matrix-based statistics
> instantiated in an imaging package called SPM.   I have also used R.
> I was trying to say that I think a stats package will _probably_ (I
> can't know, because I'm not trying to write one) need a formula - to
> encapsulate the same things that an R formula does.  I think part of
> the complexity of Jonathan's code _may_ be (I don't understand it well
> enough) that he's trying to implement formula things that
> statisticians commonly use - and that wouldn't be surprising because
> Jonathan is a mathematical statistician.

I just saw your next message.

What I know of R is more from reading docs than usage, and because of
the license I don't look at the actual implementation.

My impression of R formula was that it is mainly a shorthand for user
input and the actual implementation for example in lm.fit is pure
matrix algebra. My impression of Jonathan's formulas is that it is
used as symbolic classes throughout the statistical code, making that
harder to follow or more time consuming to understand.

Having formulas as a shorthand for user input wasn't high on our
priorities, compared to expanding core coverage of
statistical/econometric methods in the way we learn them from books or
read them in papers.

>
>> I've followed the discussion on web frameworks, between the
>> frameworks-make-the-usage-easier camps and the
>> framework-are-too-complicated-and-inflexible-I-rather-do-it-myself
>> camps.
>
> Yes, but I don't know whether 'framework' is the right word here.  The
> formula is trying to encapsulate some abstract rules about how
> statistical models work - but they are abstract in the same way math
> is abstract.
>
> I'm sorry - I realize that I'm speaking out of ignorance - I'm largely
> ignorant of statistics, and completely ignorant of econometrics.
>
>> (I won't be able to work on this until July)
>
> Excellent - that's probably when I'm going to have time.
>
> I think my hope here is to convey my and our experience, that it can
> be tempting to drop the more complicated stuff, thinking that it's not
> doing anything important, but we've almost always found that some
> patience (and a few days feeling like we are retards) has saved us a
> lot of time in thinking things through...

I'm hopeful that with some fully worked out examples this summer, we
can get a better feeling for the advantages and boundaries of its
usage.

Josef

>
> See you,
>
> Matthew
>

Vincent Davis

May 19, 2010, 12:15:16 AM5/19/10
to pystat...@googlegroups.com, Jonathan E. Taylor
On Tue, May 18, 2010 at 10:02 PM, Matthew Brett <matthe...@gmail.com> wrote:

Is that helpful for the general idea?

Yes, but let me paraphrase: it parses an input like (a+b+c)^2 or y = x1+x2+x3 into a form more typical of what a Python function/class might take as input.

Is that the basic idea of it?
Thanks for the references
Vincent


See you,

Matthew

josef...@gmail.com

May 19, 2010, 12:20:12 AM5/19/10
to pystat...@googlegroups.com
On Tue, May 18, 2010 at 8:36 PM, Vincent Davis <vin...@vincentdavis.net> wrote:
So what's the best way for for me to understand this formula concept. I am not really talking about the formula.py file but the concept of what is being accomplished? I am feeling left out.
Thanks
Vincent

I think the main point is that with a formula framework, the model is defined and manipulated symbolically, as in Maple, Mathematica, or Sympy, and only at the end do you stick some data into it to do the actual numerical calculations.

So, instead of working with actual data and arrays, it adds an additional layer of abstraction: we work initially with Symbols and symbolic Functions, or Terms and Formulas, do all the necessary transformations on the symbols (symbolic variables), and at the end evaluate with actual numbers. sympy.lambdify, or assigning a namespace in formulas, makes it possible to attach the numerical data and do the numerical calculations.

It potentially saves time because not all transformations are done with actual data, and it can be more flexible because we don't have to hardcode equations and formulas.

In your previous example, you would have the construction of the equations for the design/exog matrix and the contrast matrices first in symbolic form and then evaluate them on one or more datasets.
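As a rough sketch of that workflow in plain sympy/numpy (this is not the actual Formula API; the term list and data are illustrative):

```python
import numpy as np
import sympy

x1, x2 = sympy.symbols('x1 x2')
# symbolic "formula": constant, main effects, interaction
terms = [sympy.Integer(1), x1, x2, x1 * x2]

# symbolic manipulation happens first; data is attached only at the end
data = {'x1': np.array([0., 1., 2., 3.]),
        'x2': np.array([1., 1., 0., 0.])}

cols = [np.broadcast_to(
            sympy.lambdify((x1, x2), t, 'numpy')(data['x1'], data['x2']),
            data['x1'].shape).astype(float)
        for t in terms]
X = np.column_stack(cols)   # the design/exog matrix, one column per term
print(X.shape)              # (4, 4)
```

The term list can be manipulated symbolically (e.g. dropping the interaction, or differentiating a term) before any dataset is plugged in, which is where the potential saving comes from.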

Josef

 

Matthew Brett

May 19, 2010, 12:22:20 AM5/19/10
to pystat...@googlegroups.com
> I think the main point is that with a formula framework, the model is defined and manipulated symbolically, as in Maple, Mathematica, Sympy and only at the end you stick some data into it to do the actual numerical calculations.
>
> So, instead of with actual data and arrays, it adds an additional layer of abstraction, we work initially with Symbols and symbolic Functions, or Terms and Formulas, do all the necessary transformation on the symbols (symbolic variables) and at the end evaluate with actual numbers. sympy.lambdify or assigning a namespace in formulas makes it possible to attach the numerical data and do the numerical calculations.

Very nice summary...

Matthew

Alan G Isaac

May 19, 2010, 7:39:01 AM5/19/10
to pystat...@googlegroups.com
On 5/18/2010 11:29 PM, josef...@gmail.com wrote:
> My background is Gauss and Matlab, pure matrix languages. For me
> econometrics is rather linear algebra, some functions and a call to
> fmin or equivalent.

Same.
This formula stuff looks like more trouble than it's worth.
I'd like to see an example where it has a clear advantage
over more explicit algebra. The examples sure do not make
that obvious:
http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models

In any case, it has zero effect on the core functionality
of pystatsmodels, and I would hate to see nice data-friendly
objects twisted about to handle and/or produce formulae.

Alan

Alan G Isaac

May 19, 2010, 7:48:51 AM5/19/10
to pystat...@googlegroups.com
On 5/19/2010 12:20 AM, josef...@gmail.com wrote:
> It potentially saves time because not all transformations are done with
> actual data

numexpr?
http://www.scipy.org/SciPyPackages/NumExpr

josef...@gmail.com

May 19, 2010, 8:59:14 AM5/19/10
to pystat...@googlegroups.com
On Wed, May 19, 2010 at 7:39 AM, Alan G Isaac <ais...@american.edu> wrote:
> On 5/18/2010 11:29 PM, josef...@gmail.com wrote:
>>
>> My background is Gauss and Matlab, pure matrix languages. For me
>> econometrics is rather linear algebra, some functions and a call to
>> fmin or equivalent.
>
> Same.
> This formula stuff looks like more trouble than its worth.
> I'd like to see an example where it has a clear advantage
> over more explicit algebra.  The examples sure do not make
> that obvious:
> http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models

the main case I can (potentially) see, is going from there to
http://cran.r-project.org/doc/manuals/R-intro.html#Contrasts

keeping track of what all the columns of the design matrix refer to,
which is only really relevant/non-trivial for categorical and dummy
variables (and their interactions).

details are in the SAS references by Bruce.

Josef

josef...@gmail.com

May 19, 2010, 9:16:10 AM5/19/10
to pystat...@googlegroups.com
It's another way of improving speed, but as far as I understand, it
only works for element-wise array operations.

In either case, gaining a few milliseconds is not on Skipper's gsoc
todo list, and our todo lists are already too long.

cythonizing, numexpr, symbolic calculations, automatic
differentiation, theanoization (?) and so on will have to wait, unless
they find a new champion.

Josef

Bruce Southey

May 19, 2010, 9:26:57 AM5/19/10
to pystat...@googlegroups.com
On 05/19/2010 07:59 AM, josef...@gmail.com wrote:
> On Wed, May 19, 2010 at 7:39 AM, Alan G Isaac<ais...@american.edu> wrote:
>
>> On 5/18/2010 11:29 PM, josef...@gmail.com wrote:
>>
>>> My background is Gauss and Matlab, pure matrix languages. For me
>>> econometrics is rather linear algebra, some functions and a call to
>>> fmin or equivalent.
>>>
>> Same.
>> This formula stuff looks like more trouble than its worth.
>>

Please propose another approach that works and spares the user from
having to create each column by hand.
At the very least, we need an approach that enables a user to fit some
model Y = mu + x1*b1 or Y = mu + x1*b1 + x2*b2 + x1*x2*b3,
where x1 and x2 may be regressors or factors.

>> I'd like to see an example where it has a clear advantage
>> over more explicit algebra. The examples sure do not make
>> that obvious:
>> http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models
>>
> the main case I can (potentially) see, is going from there to
> http://cran.r-project.org/doc/manuals/R-intro.html#Contrasts
>
> keeping track of what all the columns of the design matrix refer to,
> which is only really relevant/non-trivial for categorical and dummy
> variables (and their interactions).
>
No, it is relevant for any column in the design matrix.
Why should the first column be the column of ones? It is not always
needed, and under sparse matrix theory you probably want it elsewhere
to minimize fill-ins.
You need to know which columns are which to extract contrasts and
standard errors of specific terms.
> details are in the SAS references by Bruce.
>
> Josef
>
>
>> In any case, it has zero effect on the core functionality
>> of pystatsmodels, and I would hate to see nice data-friendly
>> objects twisted about to handle and/or produce formulae.
>>
>> Alan
>>
>>
Actually I very strongly disagree here. To me, statsmodels is all about
creating the systems of equations determined by the user; the rest is
either numpy/scipy or 'record keeping'. So ignoring formula-type
approaches fails the 'core functionality' and 'data-friendly objects'
goals, because I must know what the objects contain and what is
happening to them (for example, how to create and solve the desired system).

Bruce


josef...@gmail.com

May 19, 2010, 9:51:38 AM5/19/10
to pystat...@googlegroups.com
On Wed, May 19, 2010 at 9:26 AM, Bruce Southey <bsou...@gmail.com> wrote:
> On 05/19/2010 07:59 AM, josef...@gmail.com wrote:
>>
>> On Wed, May 19, 2010 at 7:39 AM, Alan G Isaac<ais...@american.edu>  wrote:
>>
>>>
>>> On 5/18/2010 11:29 PM, josef...@gmail.com wrote:
>>>
>>>>
>>>> My background is Gauss and Matlab, pure matrix languages. For me
>>>> econometrics is rather linear algebra, some functions and a call to
>>>> fmin or equivalent.
>>>>
>>>
>>> Same.
>>> This formula stuff looks like more trouble than its worth.
>>>
>
> Please propose another approach that works and spares the user from having to
> create each column by hand.
> At the very least, we need an approach that enables a user to fit some model
> Y = mu + x1*b1 or Y = mu + x1*b1 + x2*b2 + x1*x2*b3,
> where x1 and x2 may be regressors or factors.

statsmodels is still mainly a library and the user needs to do some
work, and we assume basic knowledge of python and numpy.

Until now, users have to keep track of the columns of the design
matrix and their meaning themselves

x2[:,None] == np.unique(x2) got us started with this discussion in the
first place

R = np.zeros((nlevels - 1, nvar))
R[:, 2:nlevels-1+2] = np.eye(nlevels-1)
print modresult.f_test(R)

or something like this
that's about 90% of our code and work including the corresponding tests and docs
(90% is just a guess, maybe it's 85% or 95%)
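Filled out as a runnable sketch; the design matrix layout assumed here (constant, one regressor, then the dummies) is only illustrative:

```python
import numpy as np

x2 = np.array(['a', 'a', 'b', 'b', 'c', 'c'])    # categorical variable
levels = np.unique(x2)
dummies = (x2[:, None] == levels).astype(float)  # one column per level
nlevels = len(levels)

# assumed design matrix layout: [const, one regressor, nlevels-1 dummies]
nvar = 2 + (nlevels - 1)
R = np.zeros((nlevels - 1, nvar))
R[:, 2:2 + (nlevels - 1)] = np.eye(nlevels - 1)  # joint restriction on dummies
print(R)
```

R would then be handed to something like modresult.f_test(R) to test that all the dummy coefficients are jointly zero; the bookkeeping of which columns the restriction picks out is exactly what the user currently has to do by hand.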

Cheers,

Josef

Bruce Southey

May 19, 2010, 10:29:58 AM5/19/10
to pystat...@googlegroups.com
On 05/19/2010 08:51 AM, josef...@gmail.com wrote:
> On Wed, May 19, 2010 at 9:26 AM, Bruce Southey<bsou...@gmail.com> wrote:
>
>> On 05/19/2010 07:59 AM, josef...@gmail.com wrote:
>>
>>> On Wed, May 19, 2010 at 7:39 AM, Alan G Isaac<ais...@american.edu> wrote:
>>>
>>>
>>>> On 5/18/2010 11:29 PM, josef...@gmail.com wrote:
>>>>
>>>>
>>>>> My background is Gauss and Matlab, pure matrix languages. For me
>>>>> econometrics is rather linear algebra, some functions and a call to
>>>>> fmin or equivalent.
>>>>>
>>>>>
>>>> Same.
>>>> This formula stuff looks like more trouble than its worth.
>>>>
>>>>
>> Please propose another approach that works and spares the user from having to
>> create each column by hand.
>> At the very least, we need an approach that enables a user to fit some model
>> Y = mu + x1*b1 or Y = mu + x1*b1 + x2*b2 + x1*x2*b3,
>> where x1 and x2 may be regressors or factors.
>>
> statsmodels is still mainly a library and the user needs to do some
> work, and we assume basic knowledge of python and numpy.
>
> Until now, users have to keep track of the columns of the design
> matrix and their meaning themselves
>
I know, and that is the main reason why I sit on the sidelines. If I
have to do all that work then there is no point in statsmodels. I really
do see the superiority of what Jonathan provided over what I have done
(and sent).
:-)

>
>> So ignoring formula-type of approaches
>> fails the 'core functionality' and 'data-friendly objects' because I must
>> know what objects contain, what is happening to them (for example how to
>> create and solve the desired system).
>>
>> Bruce
>>
>>
>>
>>
Bruce

josef...@gmail.com

May 19, 2010, 10:50:21 AM5/19/10
to pystat...@googlegroups.com
One of the points of statsmodels is that it took me just a few hours
to program structural change tests, which include testing homogeneity
of coefficients across groups; that testing Granger causality in a
bivariate time series is a half-hour example; and that Wes just needed
to write a loop over OLS to get basic vector autoregression (plus work
on additional functionality).

And in none of these cases do we have to worry that the results
produced by sm.OLS or the other models, e.g. the messy formula for the
f_test, are incorrect. (Although there might still be bugs in corner
cases, so I always check the final result.)

I can understand if none of this is relevant to you.

Josef

josef...@gmail.com

May 20, 2010, 11:40:14 AM
to nipy-...@neuroimaging.scipy.org, Jonathan Taylor, pystat...@googlegroups.com
On Thu, May 20, 2010 at 3:32 AM, Jonathan Taylor
<jonatha...@stanford.edu> wrote:
> Meant to send this to the list, but mistakenly only sent it to Matthew.
>
> -- Jonathan

Thanks Jonathan, the example and explanation make it a lot clearer and
I think I will be able to work my way through it.

I still have to play with "contrast_from_cols_or_rows" to figure out
what the math does; your examples illustrate the outcome well enough.
What's not clear to me yet is whether this is also supposed to work
with continuous variables, e.g. when I have the interaction of a
continuous and a categorical variable.
I don't think the continuous part can be in the design matrix used to
get a contrast matrix. Is the design matrix in the call to
contrast_from_cols_or_rows, in this case, only supposed to contain the
categorical (dummy variable/main effect) part? Or is that case not
covered by this?

the rest are just minor comments

Josef
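
The continuous-by-categorical interaction asked about above can at least be
built by hand. Here is a hedged sketch in plain NumPy (not the Formula
classes): interaction columns between a continuous regressor x and a 2-level
factor, obtained by multiplying x elementwise into the factor's dummy columns.

```python
import numpy as np

# Hedged sketch (plain NumPy, not the Formula classes): interaction columns
# between a continuous regressor x and a 2-level factor.
x = np.array([0.5, 1.0, 1.5, 2.0])   # continuous regressor
f = np.array([0, 1, 0, 1])           # factor coded as level indices
dummies = np.eye(2)[f]               # full dummy (indicator) coding
inter = dummies * x[:, None]         # continuous-by-factor interaction columns
print(inter)
```

Each row of `inter` carries the value of x in the column of that observation's
level and zero elsewhere, which is what a formula term like x:f would produce.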


>
> ---------- Forwarded message ----------
> From: Jonathan Taylor <jonatha...@stanford.edu>
> Date: Wed, May 19, 2010 at 5:08 PM
> Subject: Re: [pystatsmodels] The Return of Formula (?)
> To: Matthew Brett <matthe...@gmail.com>
>
>
>
>>
>> > It looks like there are still enough similarities between the old
>> > version and the nipy version of the formula framework, to benefit, for
>> > example, from working through different use cases.
>> > The nipy version also has a new random effects class that is similar
>> > to some experimental functions I wrote for random effects in panel
>> > data models.
>> >
>
> The random effects stuff is really just experimental. The last time I
> touched it, I was
> trying to make functions to handle models of the covariance matrix that
> appear in random / mixed effects models.

Ok, we will see if this is easier with or without formulas in the
panel data case.

>
>>
>> > There are still parts that I don't understand, for example I never
>> > tried to work through the math to figure out what
>> > "contrast_from_cols_or_rows"  is supposed to be or do.
>>
>
> This function is meant to generate a contrast matrix to be used later
> as input to the f_test method of  a RegressionResults instance.  The
> "cols_or_rows"
> means that one can specify either an (n,*) matrix or a (*,p) matrix for a
> design matrix D of shape (n,p).
>
> In the symbolic formula, it is easier (? or more natural) to specify the
> matrix with shape (n,*) rather than (*,p).
> The output is always of shape (*,p) and can later be used in the Wald-type
> F-test of the f_test method.
>
>
> In [19]: f = Factor('category', [1,2,3])
>
> In [20]: data_frame =
> np.array([1,2,3,1,2,3,3,2,1,1,1,1,2,2,2]).view(np.dtype([('category',
> np.int)]))
>
> In [21]: D = f.design(data_frame, return_float=True)
>
> In [22]: L = f.main_effect.design(data_frame, return_float=True)
>
> In [23]: contrast_from_cols_or_rows(L, D)
> Out[23]:
> array([[ 1.,  0., -1.],
>        [ 0.,  1., -1.]])
>
>
>
>> That sounds like good place to start.
>>
>> > And, I don't see anywhere how the final test statistics, e.g. the
>> > F-tests or t-tests, are actually calculated from these "contrasts".
>>
>
> They are later used in the f_test method.
>
>>
>> I can't comment very sensibly about that.  I personally like the idea
>> of a formula being something symbolic, whereas, after you have derived
>> your design matrix from the model, the estimation proceeds in
>> numerical / numpy space.   I hope I've understood it correctly.
>>
>
> I agree with Matthew (and I think others) that the statistical models
> should not take Formula instances as arguments (as they do in R) but should
> take
> numpy arrays instead. The Formula is just meant to be a separate tool
> used to create a design matrix and, possibly, contrast matrices of interest.

Here, I think I misunderstood Matthew and how formula is used now in nipy.
This is more like what I also have in mind for formula.


> Here's a two way example:
>
>
> In [1]: run formula.py
>
> In [2]: data_frame = np.array([[1,0],[2,0],[3,0],[1,1],
> [2,0],[3,1],[3,1],[2,0],[1,1]]).view(np.dtype([('category', np.int),
> ('gender', np.int)]))
>
> In [3]: f1 = Factor('category', [1,2,3])
>
> In [4]: f2 = Factor('gender', [0,1])
>
> In [5]: Drec = (f1*f2).design(data_frame)
>
> In [6]: D = (f1*f2).design(data_frame, return_float=True)
>
> In [7]: interaction = f1.main_effect*f2.main_effect
>
> In [8]: contrast_from_cols_or_rows(interaction.design(data_frame,
> return_float=True), D)
> Out[8]:
> array([[  1.00000000e+00,  -1.00000000e+00,   1.28197512e-16,
>           0.00000000e+00,  -1.00000000e+00,   1.00000000e+00],
>        [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
>           0.00000000e+00,  -1.00000000e+00,   1.00000000e+00]])
>
> In [9]: main1 = f1.main_effect
>
> In [10]: contrast_from_cols_or_rows(main1.design(data_frame,
> return_float=True), D)
> Out[10]:
> array([[  1.00000000e+00,   1.00000000e+00,  -1.28197512e-16,
>           0.00000000e+00,  -1.00000000e+00,  -1.00000000e+00],
>        [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
>           0.00000000e+00,  -1.00000000e+00,  -1.00000000e+00]])
>
> In [11]: main2 = f2.main_effect
>
> In [12]: contrast_from_cols_or_rows(main2.design(data_frame,
> return_float=True), D)
> Out[12]: array([ 1., -1.,  1.,  0.,  1., -1.])
>
> In [13]: Drec
> Out[13]:
> array([(1.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
>        (0.0, 0.0, 0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
>        (0.0, 0.0, 1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 1.0),
>        (0.0, 0.0, 0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
>        (0.0, 1.0, 0.0, 0.0, 0.0, 0.0)],
>       dtype=[('category_1*gender_0', '<f8'), ('category_1*gender_1', '<f8'),
> ('category_2*gender_0', '<f8'), ('category_2*gender_1', '<f8'),
> ('category_3*gender_0', '<f8'), ('category_3*gender_1', '<f8')])
>
> In [14]:
>
> This looks a little funny because the 'category_2*gender_1' entry of all the
> contrast matrices are 0 but that's because this is an unbalanced design and
> no instances of 'category=2,gender=1' were in the data set.
>
> For the simple ANOVA models these contrasts are fairly straightforward to
> specify by hand, but
> I think it's easier with the Formula, especially if you start mixing
> categorical with continuous, etc.
>
> R also allows different specification of the design matrix for categorical
> variables (at least somewhat) through the "contrasts" argument to "lm". The
> one we've called "main_effect" above corresponds to throwing
> away the column corresponding to the last entry in levels.

Once we have the basic structure, I will look again at the different
encoding schemes used in SAS and R, and how many or which options we
can make available.
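
The "main_effect" encoding described above (full dummy coding with the column
of the last level thrown away) can be sketched in plain NumPy. This is a
hedged sketch; main_effect_design is a hypothetical helper, not the
Formula/Factor API.

```python
import numpy as np

# Hedged sketch of the "main_effect" encoding: full dummy coding of a
# factor with the column of the last level dropped.
# main_effect_design is a hypothetical helper, not the Formula/Factor API.
def main_effect_design(values, levels):
    full = np.array([[v == lev for lev in levels] for v in values],
                    dtype=float)   # full indicator matrix, one column per level
    return full[:, :-1]            # drop the last level's column

D = main_effect_design([1, 2, 3, 1], levels=[1, 2, 3])
print(D)
```

Observations at the dropped (reference) level get an all-zero row, so its mean
is absorbed into the remaining columns' interpretation.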

>
>>
>> That would be good - I hope I can help - I need to get to grips with
>> this too.  I should say that my personal experience with Jonathan's
>> code is that it can be hard to get to grips with, but each time we've
>> spent the time to understand what he was trying to do, it turned out
>> that that was indeed the best way to do it.
>
> That's flattering and surely untrue :)
>
> I had to fix a few things in my copy of formula.py to get this to work, but
> I really don't know how old this branch I have is. I must confess that I
> haven't tried git yet and can't remember the last time I merged in this
> particular branch. Here is a patch if these examples don't run on the
> current version...
>
> -- Jonathan
>
>
> --
> Jonathan Taylor
> Dept. of Statistics
> Sequoia Hall, 137
> 390 Serra Mall
> Stanford, CA 94305
> Tel:   650.723.9230
> Fax:   650.725.8977
> Web: http://www-stat.stanford.edu/~jtaylo
>
>
> _______________________________________________
> Nipy-devel mailing list
> Nipy-...@neuroimaging.scipy.org
> http://mail.scipy.org/mailman/listinfo/nipy-devel
>
>