- it's very difficult to understand because of all the indirections
a class produces a class which returns another class, and how do I
get the data in and out? ...
After Bruce's comment that we should look at formulas for categorical
variables, I reenabled it, and will merge the changes to trunk today.
Here is my *current* opinion and information on it; one example is below.
- it looks useful when working with categorical data
(I don't think it's very useful for non-categorical data, such as we
were working on in the last year.)
- the only usage examples I have seen are in the test file
Vincent Davis
On Sun, May 16, 2010 at 5:42 AM, <josef...@gmail.com> wrote:
- it's very difficult to understand because of all the indirections
a class produces a class which returns another class, and how do I
get the data in and out? ...
Glad to hear it's not just me :)
Should we have an end goal/plan/outline for categorical and dummy variable handling? For me, doing my own was as much an exercise in understanding the problem as having a solution. I am not stuck on what I did. I think we would benefit from a clear description, example uses, and an outline of whether we want a class, a function, or both. I guess I am suggesting we decide what type of wheel we want and what it should fit. Maybe the answer is having several options and just having more examples of their use. But someone other than me should make this call.
On Sun, May 16, 2010 at 10:39 AM, Vincent Davis <vin...@vincentdavis.net> wrote:
Glad to hear it's not just me :)
I updated trunk, here are some more examples, including some contrasts for factors
http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/revision/1992#scikits/statsmodels/sandbox/examples/ex_formula.py
I wasn't successful in combining a factor with other terms in a formula, or, more precisely, in getting the design matrix and contrast matrices out of the combined formula.
The merge includes a lot of other things that I was working on in my branch. I haven't really checked whether it is clean, and I still have to run the test suite on the current branch.
Just a brief answer for now:
Since I have never used a formula framework myself and have used only simple cases of dummy variables, I don't have a strong opinion on what the best way is.
Usually, I'm faster writing a 2-liner or 5-liner than figuring out how to use a "framework".
Next you have to decide if you want to update the model like you can do in R. I usually don't find it useful but then I don't use R much anyhow.
Creating zero/one dummy variables is trivial but not the only option:
http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
Creating the design matrix from a model is easy.
I do not know enough about creating the contrast matrix and how to test individual terms. I took the long way by refitting the model.
Bruce
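To make the "2-liner" and the design matrix step concrete, here is a minimal numpy sketch. The helper name `dummy_columns`, the data, and the choice of the first level as the reference category are all assumptions made for illustration; none of this is code from statsmodels or from formula.py. The last lines show deviation coding as one alternative to zero/one dummies.

```python
import numpy as np

def dummy_columns(labels, drop_first=True):
    """Return (matrix, names) of zero/one indicator columns for `labels`.

    Hypothetical helper for illustration only.
    """
    labels = np.asarray(labels)
    levels = np.unique(labels)            # sorted distinct categories
    # Compare every observation with every level: shape (nobs, nlevels).
    dummies = (labels[:, None] == levels[None, :]).astype(float)
    names = [str(lev) for lev in levels]
    if drop_first:
        # Drop the first level so it becomes the reference category and
        # the columns are not collinear with a constant.
        dummies, names = dummies[:, 1:], names[1:]
    return dummies, names

group = ["a", "b", "c", "a", "c", "b"]    # made-up categorical data
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])

d, d_names = dummy_columns(group)

# Design matrix: constant, dummies for "b" and "c", continuous regressor.
design = np.column_stack([np.ones(len(x)), d, x])
design_names = ["const"] + d_names + ["x"]

# Deviation (sum-to-zero) coding: rows that belong to the reference level
# are set to -1 in every column instead of 0.
dev = d.copy()
dev[np.asarray(group) == "a"] = -1.0
```

Keeping `design_names` next to the array is the cheap way to remember which column is which once the categorical variable has been expanded.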
The problem Skipper and I had last year was that we never really figured out what statisticians do with contrast matrices, so we didn't touch this part much.
The econometrics analogy is essentially testing linear restrictions, r*beta=0 for t_test, or R*beta=0 for f_test. Both work very well; for t_test we had simultaneous t-tests for each row of R*beta=0, which would produce exactly the results in the UCLA reference. (I don't remember whether we removed this feature for greater simplicity.)
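As a rough illustration of "one t-test per row of R", the following numpy sketch fits OLS on made-up data with a three-level factor and computes a t statistic for each contrast row under the null r*beta = 0. The function names `ols_fit` and `t_values` are hypothetical and this is not the t_test implementation itself, only the textbook formula written out.

```python
import numpy as np

def ols_fit(y, X):
    """Least-squares fit; returns (beta, cov_beta, df_resid)."""
    nobs, k = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df_resid = nobs - k
    sigma2 = resid @ resid / df_resid          # residual variance estimate
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov_beta, df_resid

def t_values(R, beta, cov_beta):
    """t statistic for each row r of R under the null r*beta = 0."""
    R = np.atleast_2d(R)
    estimate = R @ beta
    std_err = np.sqrt(np.diag(R @ cov_beta @ R.T))
    return estimate / std_err

# Made-up data: one factor with three levels, level 0 is the reference.
rng = np.random.default_rng(0)
group = np.repeat([0, 1, 2], 10)
X = np.column_stack([np.ones(30), group == 1, group == 2]).astype(float)
y = 1.0 + 0.5 * (group == 1) + 2.0 * (group == 2) + rng.normal(size=30)

beta, cov_beta, df_resid = ols_fit(y, X)

# Each row is one restriction: level 1 vs reference, level 2 vs reference,
# and level 1 vs level 2.
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])
tvals = t_values(R, beta, cov_beta)
```

The first two rows reproduce the usual per-coefficient t-values; the third row is a contrast that is not directly a coefficient of the fitted model.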
Maybe I am starting to understand what you mean by a contrast matrix, and then I found ftestforWeb.pdf.
Thinking of this in terms of making it simpler for a user, what you are saying is that we need to keep track of what's what when going from the raw data to categorical and then to dummy variables (I think of categorizing data as a step along the way to dummy variables), so that something like the following is possible (totally made up):
    # Takes the gender and ethnicity variables and converts them to
    # categorical and then dummy variables. Returns an array with dtype float.
    cat_array = CategorizeArray(exog, ['gender', 'ethnicity'], dummy=True)
    # Fits the OLS.
    result = OLS(endog, cat_array).fit()
    # Tests that the categories of ethnicity are all equal and returns
    # results with labels, i.e. Asian, Native American, rather than only
    # column numbers from the array.
    result.test.equal(cat_array.ethnicity)
In this example, result.test.equal() would need to be passed the array columns that are being tested and labels/names for them, so that the results are available with names.
Do I get it or am I confused? (Strange how sometimes we have to ask someone else if we are confused because we are not sure ourselves.)
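One possible way to do the bookkeeping described above, sketched with plain numpy: the helper returns the dummy array together with a map from each variable to its column indices and category labels, so that later test results can be reported by name. The function `categorize`, its return format, and the data are all made up for illustration, in the same spirit as the made-up `CategorizeArray`.

```python
import numpy as np

def categorize(data, variables):
    """Hypothetical helper: dummy-encode `variables` from a dict of columns.

    Returns the float dummy array and a dict mapping each variable name to
    a list of (column index, category label) pairs.
    """
    blocks, column_map, start = [], {}, 0
    for var in variables:
        values = np.asarray(data[var])
        levels = np.unique(values)[1:]     # first sorted level is the reference
        blocks.append((values[:, None] == levels[None, :]).astype(float))
        column_map[var] = [(start + i, str(lev)) for i, lev in enumerate(levels)]
        start += len(levels)
    return np.column_stack(blocks), column_map

# Made-up raw data.
exog = {
    "gender": ["f", "m", "f", "m", "f"],
    "ethnicity": ["asian", "native american", "white", "asian", "white"],
}

cat_array, column_map = categorize(exog, ["gender", "ethnicity"])

# Columns and labels that a labeled test on ethnicity would need.
ethnicity_columns = [idx for idx, _ in column_map["ethnicity"]]
ethnicity_labels = [label for _, label in column_map["ethnicity"]]
```

With `ethnicity_columns` in hand, a restriction matrix for "all ethnicity coefficients are zero" is just the corresponding rows of an identity matrix, and `ethnicity_labels` supplies the names for the output.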
On Mon, May 17, 2010 at 11:45 AM, Vincent Davis <vin...@vincentdavis.net> wrote:
Do I get it or am I confused? (Strange how sometimes we have to ask someone else if we are confused because we are not sure ourselves.)
Yes, that's the idea.
result.test.equal(cat_array.ethnicity)
would then correspond to the standard t-value, result.t(), for continuous variables, to test whether ethnicity has no statistical effect on the endog.
The rest are implementation details and the question of how much flexibility we want to allow in the tests for other null hypotheses on the categorical variable.
Your reference looks pretty complete; most of it is implemented in f_test, although, I think, until now only for the homogeneous linear restriction R*beta=0 and not yet for R*beta=r, and the same for t_test.
Josef
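For reference, the non-homogeneous case R*beta=r only changes the numerator of the usual F statistic. The sketch below writes out the standard Wald form F = (R b - r)' [R cov(b) R']^{-1} (R b - r) / J in plain numpy on made-up data. The name `f_statistic` is hypothetical and this is not the f_test code; the p-value step is left out to keep the example numpy-only.

```python
import numpy as np

def f_statistic(R, r, beta, cov_beta):
    """Wald F statistic for the null hypothesis R*beta = r."""
    R = np.atleast_2d(R)
    diff = R @ beta - np.asarray(r, dtype=float)
    n_restrictions = R.shape[0]
    middle = R @ cov_beta @ R.T
    return float(diff @ np.linalg.solve(middle, diff) / n_restrictions)

# Made-up data: y = 1 + 2*x1 + 0.5*x2 + noise.
rng = np.random.default_rng(1)
nobs = 50
X = np.column_stack([np.ones(nobs), rng.normal(size=nobs), rng.normal(size=nobs)])
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=nobs)

# OLS estimate and its covariance.
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma2 = resid @ resid / (nobs - X.shape[1])
cov_beta = sigma2 * np.linalg.inv(X.T @ X)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

f_homogeneous = f_statistic(R, [0.0, 0.0], beta, cov_beta)   # R*beta = 0
f_general = f_statistic(R, [2.0, 0.5], beta, cov_beta)       # R*beta = r
```

Under the null, the statistic is compared with an F distribution with J and nobs - k degrees of freedom, where J is the number of rows of R and k the number of columns of X.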
Vincent
[ 1., 0.],
Type 3 Tests of Fixed Effects
Take a look at SAS, for example GLM:
Bruce
The thing that Jonathan is trying to get at, I believe, is the
conception of the R formula. As y'all know, this is very powerful,
allowing you to symbolically specify your statistical model. I think
Formula is designed to provide that, as well as being a mathematically
intuitive way to understand the idea of the formula and its solution.
I'm speaking without great statistical knowledge.
That would be good - I hope I can help - I need to get to grips with
this too. I should say that my personal experience with Jonathan's
code is that it can be hard to get to grips with, but each time we've
spent the time to understand what he was trying to do, it turned out
that that was indeed the best way to do it. It's often taken us more
than a year to realize that, but - I guess that's python - it's so
easy to read and write that we sometimes expect difficult ideas to be
simpler than they are.
See you,
Matthew
Is that helpful for the general idea?
See you,
Matthew
So what's the best way for me to understand this formula concept? I am not really talking about the formula.py file but the concept of what is being accomplished. I am feeling left out.
Thanks
Vincent