Since singular matrices should, at least in the social sciences, raise
alarm bells, I would like to give the user at least a warning if
the design matrix is singular or close to singular.
To see how different versions of solving the normal equations work,
especially np.linalg.lstsq, np.linalg.solve and np.matrix inversion, I
wrote an example script and functions to calculate the OLS estimator
with linear restrictions. Eventually, I would like to include the
estimator in statsmodels, especially since it will allow different
parameterizations for dummy variables and factors. In this case the
underlying OLS exog/design matrix is singular, but together with some
identifying restrictions, the estimator has a unique solution.
Currently it is just a quickly done script, which works for its
initial purpose.
What are the opinions about what to do with singular or highly
multicollinear design matrices? Raise an overridable exception, print
a warning, or ignore them?
Skipper might already have made some adjustments since I last looked
into this.
Josef
I will have to take a closer look when I finish up some stuff, but ...
> One question that has been bugging me a while ago is the treatment of
> singular exog/design matrices. Although we haven't tried this, I think
> the current implementation of the regression works either completely
> or mostly for singular exog/design matrices because of the use of
> pinv, the Moore-Penrose inverse.
>
Yes, the pseudoinverse works for rank-deficient matrices, whether this
is a feature or a bug for our purposes is up for debate/study.
> Since singular matrices should, at least in the social sciences, raise
> alarm bells, I would like to give the user at least a warning if
> the design matrix is singular or close to singular.
>
> To see how different versions of solving the normal equations work,
> especially np.linalg.lstsq, np.linalg.solve and np.matrix inversion, I
> wrote an example script and functions to calculate the OLS estimator
> with linear restrictions. Eventually, I would like to include the
> estimator in statsmodels, especially since it will allow different
> parameterizations for dummy variables and factors. In this case the
> underlying OLS exog/design matrix is singular, but together with some
> identifying restrictions, the estimator has a unique solution.
> Currently it is just a quickly done script, which works for its
> initial purpose.
>
> What are the opinions about what to do with singular or highly
> multicollinear design matrices? Raise an overridable exception, print
> a warning, or ignore them?
>
I need to take a closer look at the implications and your example, but
off the top of my head I would prefer a warning. I remember doing
something in Stata this summer with, I believe, the Longley dataset,
which has high multicollinearity; it issued a warning, dropped the
offending variable (which I found annoying), and went ahead to
estimate the equation. I'd have to go back and check this though.
But I think a warning is appropriate (like stats does when the sample
is too small for, say, a normal distribution); then users should do
some multicollinearity diagnostics, e.g., check the condition number
of the design (I added a rough method for this to stattools that still
needs a test) or the Variance Inflation Factor (which I haven't gotten
around to doing but is really easy too).
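Something like the following minimal sketch of both diagnostics in
plain numpy (just an illustration under my own assumptions, not the
stattools method):

import numpy as np

def condition_number(design):
    # ratio of largest to smallest singular value of the design;
    # very large values flag near-singularity
    s = np.linalg.svd(design, compute_uv=False)
    return s[0] / s[-1]

def vif(design, col):
    # Variance Inflation Factor: regress column `col` on the remaining
    # columns and return 1 / (1 - R^2)
    endog = design[:, col]
    exog = np.delete(design, col, axis=1)
    params = np.linalg.lstsq(exog, endog, rcond=None)[0]
    resid = endog - exog.dot(params)
    r2 = 1.0 - (resid**2).sum() / ((endog - endog.mean())**2).sum()
    return 1.0 / (1.0 - r2)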
> Skipper might already have made some adjustments since I last looked
> into this.
No, I haven't done much. There is an isestimable function in tools,
which might be of use, though the way it was written I thought it
was only meant as a helper for contrasts. There is also another
version of rank there, which calculates the rank based on the
generalized inverse and not the full rank of the matrix, so np.rank !=
tools.rank in general.
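Roughly, such a tolerance-based rank looks like this (an assumed
sketch, not the tools implementation; note that np.rank returns the
number of array dimensions, not the matrix rank):

import numpy as np

def matrix_rank(design, tol=None):
    # count singular values above a tolerance; this is the numerical
    # rank that a pinv-based computation effectively uses
    s = np.linalg.svd(design, compute_uv=False)
    if tol is None:
        tol = s.max() * max(design.shape) * np.finfo(s.dtype).eps
    return int((s > tol).sum())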
Skipper
Nice succinct discussion of using the pseudoinverse in the least
squares solution, though not econometrics related.
http://robotics.caltech.edu/~jwb/courses/ME115/handouts/pseudo.pdf
I worry that the standard user will not pay enough attention to
realize that the design matrix is singular. For example, in
econometrics, when I make a mistake with the dummy
variables (not normalizing or dropping them correctly), standard
matrix inversion or linalg.solve will break down. That's what I'm
used to. The results with Moore-Penrose look nice, if you just
look at the parameter estimates and their reported standard
errors. I don't think that, as a user, I ever checked whether the
program calculated the degrees of freedom correctly or in the way
I would expect. So I would prefer to get a warning or at least
some visible indication that something might be wrong.
For the case of ANOVA or dummy variable regression, I think
an explicit normalization is much better. Whether the
user does it, e.g. by dropping one of the dummy variables, as
you also describe, or the program does it, e.g. with restricted
least squares, will depend on our implementation. But with
Moore-Penrose, I don't know how to interpret the estimated
parameters. Since generalized inverses are not unique, which
solution is obtained depends on the implementation. For pinv,
I don't have an intuition about what it means for the parameters.
Also, I'm not sure all F and t tests produce the correct results,
but this can be checked.
In contrast, with an explicit normalization, e.g. dropping one
dummy variable or constraining the sum of the coefficients,
there is a clear interpretation of the parameters, as a comparison
to the benchmark dropped label or as an average. From your
description, this also seems to be what you are doing.
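(A tiny assumed illustration of the drop-one normalization:)

import numpy as np

# hypothetical three-level factor entered as dummies; dropping the
# first column makes the remaining coefficients comparisons against
# that benchmark level
levels = np.array([0, 0, 1, 1, 2, 2])
dummies = (levels[:, None] == np.unique(levels)).astype(float)
design = np.column_stack((np.ones(len(levels)), dummies[:, 1:]))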
For some results e.g. with ANOVA, or if we only care about
the best fit (maybe first stage of a 2-stage least squares),
it might be possible to get away with the singularity or
near singularity, but for a generic linear model regression,
it looks too much like "use at your own risk".
Implementing your description of what SAS does, i.e. actually
finding out which coefficients receive a "B" and are not identified,
might require some work. For this I found it useful to print out
the covariance matrix of the parameter estimates converted to a
correlation matrix.
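(A minimal sketch of that conversion, assuming `cov` is the parameter
covariance matrix:)

import numpy as np

def cov2corr(cov):
    # normalize the covariance matrix by the outer product of the
    # standard deviations; off-diagonal entries near +/-1 flag
    # parameters that are not separately identified
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)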
Until now, we haven't tried an example to reproduce ANOVA
with regression.
Cheers
Josef
>
> Bruce
>
>
Shouldn't the standard errors look bad? Collinearity means
the parameter estimates cannot be precise. Perfect collinearity
should imply a singular parameter covariance matrix.
Alan
PS My view is that a warning should be logged, at the least.
Bruce,
Since you have a lot more experience with this than Skipper and I, do
you have an example or SAS script (or similar) that we could use for a
test case with singular design matrix?
Given that we allow for singular design matrices in statsmodels, it
would be good to test whether the results are really correct.
The current tests only cover non-singular, at most highly multicollinear cases.
Thanks,
Josef
No, not with Moore-Penrose (pinv) or with linalg.lstsq, since they
pick a numerical solution that looks "nice" for underdetermined
systems of equations. See below.
Josef
>
> Alan
>
> PS My view is that a warning should be logged, at the least.
>
simulated data from example 3 in try_rols.py
>>> xss.shape
(1000, 3)
>>> np.linalg.inv(np.dot(xss.T,xss))
Traceback (most recent call last):
...
LinAlgError: Singular matrix
>>> from scikits.statsmodels import OLS
>>> res = OLS(yss, xss).fit()
>>> res.params
array([[ 1.0426365 ],
       [ 0.67998605],
       [ 1.35997209]])
>>> res.bse
array([ 6.28708789e-02, 2.17954586e-05, 4.35909172e-05])
above looks nice
same parameter estimate with lstsq:
>>> np.linalg.lstsq(xss, yss)  # rank is two
(array([[ 1.0426365 ],
       [ 0.67998605],
       [ 1.35997209]]),
 array([], dtype=float64), 2,
 array([  4.07942183e+04,   1.58232448e+01,   1.17760908e-12]))
zero eigenvalue
>>> np.linalg.eig(np.dot(xss.T, xss))
(array([  1.66416825e+09,   2.50375075e+02,  -4.09259615e-23]),
 array([[  6.71155921e-04,   9.99999775e-01,  -2.86780784e-13],
        [  4.47213495e-01,  -3.00150053e-04,  -8.94427191e-01],
        [  8.94426990e-01,  -6.00300105e-04,   4.47213595e-01]]))
direct calculation of the covariance matrix of the estimates with pinv
(I didn't check the missing normalization, the estimated error variance):
>>> np.sqrt(np.diag(np.linalg.pinv(np.dot(xss.T,xss))))
array([ 6.31981487e-02, 2.19089133e-05, 4.38178265e-05])
>>> res.bse/np.sqrt(np.diag(np.linalg.pinv(np.dot(xss.T,xss))))
array([ 0.99482153, 0.99482153, 0.99482153])
Took the words out of my mouth. Either a SAS script or just some
example data would be helpful. I tried to look at ANOVA cases this
summer, but it's fairly far afield from the stats I study.
Skipper
I do not work with such systems, so maybe I'm not seeing into this...
What sense do standard errors make when the data are perfectly collinear?
E.g., if two series are perfectly collinear, in the absence of restrictions
there is absolutely no way to decide how much weight to put on each series
if both are included in a regression. So there is no way to compute standard
errors. If the series are nearly collinear, then the se's should similarly
be very large.
So we may have a way to force "nice" looking results out (and what is it?), but
the se's must be meaningless.
Naturally if you are allowed to add arbitrary restrictions,
you can get anything you want. But doing so should provoke
a warning.
Alan
FWIW, I'm pretty sure using the Moore-Penrose pseudoinverse gives a
unique solution that minimizes the L2-norm of the error vector.
I haven't done any proofs or looked it up, but I would guess that the
biases introduced (in our use case) are then (theoretically) known,
though there are surely some numerical problems.
I think it minimizes the norm of the underdetermined parameter vector
(among all least-squares solutions, pinv picks the one with the
smallest parameter norm).
>
> I haven't done any proofs or looked it up, but I would guess that the
> biases introduced (in our use case) are then (theoretically) known,
> though there are surely some numerical problems.
I played with an example where I introduced the constant twice
into the exog/design matrix. In this case, the sum of the two
coefficients is identified and correctly estimated, and could
be tested with a standard t test. However, the split-up is
chosen by the numerical procedure; in the example it was just
an equal split. I don't think you can talk about bias in
this case, because the parameter is not identified and could
be anything, as long as the sum, or whichever linear restriction
is identified, stays the same.
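(A quick assumed sketch of this case:)

import numpy as np

# with the constant entered twice, pinv returns the minimum-norm
# solution, i.e. an equal split of the identified sum
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
exog = np.column_stack((np.ones(n), np.ones(n), x))  # constant twice
endog = 1.0 + 0.5 * x + 0.1 * rng.standard_normal(n)

params = np.linalg.pinv(exog).dot(endog)
# params[0] == params[1], and params[0] + params[1] is close to 1.0
print(params, params[0] + params[1])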
I think an F test on all coefficients should still be correct.
Similarly, it should be possible to extract the linear subspace
that is identified from the covariance matrix of the estimates,
e.g. in the example the sum of the two constants. But I never tried this.
In Bayesian analysis you could just put on a prior to force identification;
using pinv, we let the computer choose the identification.
Allowing this can be useful in statistical and econometrics
applications, and using pinv or lstsq is more robust and more flexible,
but it looks like econometricians need a warning message.
Josef
xss is declared as:
xss = np.column_stack((np.ones(n_sample), z, z*2))
So xss has 3 columns but xss is not full column rank as it has rank 2
(because the third column is two times the second column). Thus, what
you got is what you should have expected!
Perhaps you meant one of these instead:
xss = np.column_stack((np.ones(n_sample), z, z*z))
xss = np.column_stack((np.ones(n_sample), z, z**2))
There is absolutely no requirement that the design matrix be full rank
under general linear model theory, and in fact, rank deficiency should
be expected. So if a user expects the design to be full rank and it is
not, then the user's expectations are wrong, even when the design matrix
is only numerically singular (e.g. due to polynomials).
If you want to identify cases where there is an expectation that the
design matrix has full column rank, then you must check that first
before continuing. In my experience, I find that a completely unnecessary
task, mainly because virtually all my models are not full column rank.
Also, I usually know when a result is suspect from the output, like
perfect fits, very low residual variances or unusual degrees of freedom.
Bruce
The less-than-full column rank was by design, since I wanted to check
the behavior with a singular design matrix.
I changed the examples around a bit when I was doing this,
but the next case looks explicitly at the case with an identifying constraint
R beta = r with R = np.array([0, -1, 1]) and r = 0,
which imposes that the two coefficients are the same. If I
remember correctly, that is the same as what pinv does in this case,
except that I impose the constraint explicitly.
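One way to impose such a restriction (a sketch under my own
assumptions, not necessarily what try_rols.py does) is to solve the
stacked first-order conditions, which works even when X'X is singular,
as long as the restrictions identify the parameters:

import numpy as np

def restricted_ols(exog, endog, R, r):
    # minimize ||endog - exog.dot(b)||^2 subject to R b = r by solving
    # the stacked (Lagrangian) system; endog is assumed 1d, and the
    # Lagrange multipliers are dropped from the returned solution
    k = exog.shape[1]
    R = np.atleast_2d(np.asarray(R, dtype=float))
    q = R.shape[0]
    kkt = np.block([[exog.T.dot(exog), R.T],
                    [R, np.zeros((q, q))]])
    rhs = np.concatenate([exog.T.dot(endog), np.atleast_1d(r)])
    return np.linalg.solve(kkt, rhs)[:k]

# e.g. params = restricted_ols(xss, yss.ravel(), [0, -1, 1], 0.)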
>
>
> There is absolutely no requirement that the design matrix be full rank under
> general linear model theory, and in fact, it should be expected. So if a
> user expects the design to be full-rank and it is not, then the user
> expectations are wrong even when the design matrix is numerically
> singular (like due to polynomials).
I'm still not sure how you treat design matrices that originally don't have
full rank, e.g. a fixed effects model with a full set of dummy variables
and a constant.
If you want an interpretation of the individual/group fixed effects, that is
the coefficients on the dummy variables and the constant, then, at some
stage, a choice about a unique solution has to be taken, either the
user, or the statistical package, or the numerical procedure.
From your initial statement, I thought you restrict the dummy variables
yourself; SAS and other packages also offer a choice of which normalization
to use. If the design matrix is still singular at the computation stage, then
the numerical procedure chooses among the multiplicity of solutions.
I'm currently arguing in favor of the user's choice; eventually statsmodels
might also offer different normalizations to the user (one reason to do rols).
But in the third case, when neither the user nor the statistical program
has (consciously) chosen a normalization, the numerical routine
will. In this case, I want to emit the warning, because only users who
have more experience with singular or near-singular matrices will
detect that something strange might be going on.
Just an illustrative example:
Skipper a while ago wanted to see whether we can add a
Maximum Likelihood estimator to statsmodels. For a
synthetic data example, everything worked pretty well and MLE
was very close to OLS. Then he tried the Longley dataset as
a test case, and spent a considerable amount of time going
through the scipy optimizers to see which one might produce
a correct result. As it turned out, the Longley dataset has
very high multicollinearity, and the non-linear function
optimizers only produce a correct result when all the optional
stopping criteria for required precision and maximum number of
iterations are turned up. If statsmodels had complained
early on about what's going on, it would have been much less
time consuming to find out what the problem with the data or
the estimator was.
Josef
> Just an illustrative example:
> Skipper a while ago wanted to see whether we can add a
> Maximum Likelihood estimator to statsmodels. For a
> synthetic data example, everything worked pretty well and MLE
> was very close to OLS.
I thought that was a rather well known result, at least under 'typical'
situations (reasonable data size and fixed effects models). As I recall,
if you set up the log likelihood equations under a fixed effects model, you
end up with the normal equations after doing the differentiation. So the
difference between MLE and least squares is the error in estimating the
residual variance, which in turn leads to REML.
I am not familiar with the longley data to comment about it but I see it
is part of the scikit.
Bruce
I don't know much about REML; I only looked at it once, very briefly.
In many cases we had a good idea what to expect theoretically (not so
much in some of the details of glm and rlm, since neither of us (in the gsoc)
is a statistician).
The problem was that we were debugging and writing the functions, so we
were not sure about anything: was it a bug, a numerical failure in some
routine, a difference in definition, a misapplication of the theory,
or the data?
Usually the data was the last place we looked.
Even if we knew the expected theoretical results, the computer sometimes
refused to agree with us, which of course creates some nice, time-consuming
hunting.
And Skipper had to do a lot of this during the summer.
>
> I am not familiar with the longley data to comment about it but I see it is
> part of the scikit.
It's one of the test cases and examples that we use, but I actually
never looked at the meaning of it.
Josef
>
> Bruce
>
>
>
Is REML mixed effects models (proc MIXED in SAS)? That might be what
mixed is in the sandbox. It is still dependent on the nipy formula
framework, and I didn't have time to give myself the introduction to
the theory and then refactor the code. I encountered most of these
models for the first time this summer.
> I am not familiar with the longley data to comment about it but I see it is
> part of the scikit.
>
The Longley data is US macro data from a 1967 article testing the
accuracy of least squares software (come to think of it, it makes
sense now that it gave MLE a rough time, as we discovered).
http://www.itl.nist.gov/div898/strd/lls/data/Longley.shtml
Skipper
Hmm, what would the response variable then be in this design matrix?
To be honest, I don't fully understand this use case or the ANOVA NIST
example. The Longley data has a variable, total employment, that is
thought to be "explained" by several observed variables, so it makes
sense to me to specify an endogenous/response variable and a design
matrix of (presumably) exogenous variables (with full column rank). I
started to look at some examples of ANOVA (and was introduced to it in
a class several years ago), but I just haven't had the time to get my
head around it.
> Anyhow, I send another email with some code modified from what John Cole and
> I did some time ago.
>
Will have a look.
Skipper
Sorry, I just found some notes I made when looking at the Contrast
class (and saw your attachment that I missed) and it clicked for me.
I think the main thing with this now is that we don't have a good data
type object to parse a design like this (and I'm not sure I want a
general one that seems to imply using a formula framework), so the
model actually needs to be specified. With the existing code (note
that this is for my branch, since I just fixed the 'xi' function for
this case within the last few days and 'xi' needs to be renamed...)
and your example you have
import numpy as np
import scikits.statsmodels as sm

instr = np.floor(np.arange(10, 60, step=2)/10)  # makes the instruments
instr = sm.tools.xi(instr, drop=True)           # makes the dummy matrix
instr = sm.add_constant(instr)                  # *appends* a constant
y = [196.3052, 196.1240, 196.1890, 196.2569, 196.3403,
     196.3042, 196.3825, 196.1669, 196.3257, 196.0422,
     196.1303, 196.2005, 196.2889, 196.0343, 196.1811,
     196.2795, 196.1748, 196.1494, 196.1485, 195.9885,
     196.2119, 196.1051, 196.1850, 196.0052, 196.2090]
mod_results = sm.OLS(y, instr).fit()
Then you have your mean parameters and ANOVA stats.
I could make an interface for that kind of input to work (sm.ANOVA()
with no warning about the rank?), but it would seem inconsistent with
the design right now to have OLS handle it.
Skipper
I will also try to look at this coming weekend, and collect my ANOVA scripts,
I'm out of time this week.
I have a rewritten version of one-way, and maybe n-way, ANOVA that, for the
one-way case, is fully tested against all NIST benchmark cases. When there
are no numerical problems (badly scaled data), the results of
scipy.stats.f_oneway are correct (against the "easy" NIST cases).
For general n-way ANOVA, statsmodels.OLS looks more flexible, but we
would need some helper functions (like xi, or the formula framework) to
make it easy to set up the design matrix. Using OLS, we would still need
to decide which parameterization to offer:
results = sm.OLS(y, x[:, 1:]).fit()  # drop constant, not enough in n-way
results = sm.OLS(y, x[:, :5]).fit()  # drop last category; interpretation:
                                     # relative to the (last) dropped category
or with the restriction params[1:] = 0  # interpretation: deviation from mean

This matters for the interpretation of params, but the normalization, or
just using pinv on the singular matrix, is irrelevant for

diff = results.params[2:] - results.params[1]  # use first category as benchmark

which would be the same as dropping the first category in the regression.
I haven't looked recently at xi (or whatever name it will have), but the
main difficulty for n-way ANOVA or regression is in handling multiple
variables with additive or multiplicative factors when setting up the
design matrix. (?)
I thought of a similar use/test case, in which the normalization doesn't
matter and pinv just works without problems. In a fixed effects panel
data model with a flat panel (T constant, N -> inf), the fixed effects
cannot be estimated consistently and are usually just ignored.
In this case, the reported parameters for the dummy variables
are irrelevant, and any choice of normalization, or using a singular
exog/design matrix with pinv, should (as a theoretical expectation)
not have any effect on the estimated coefficients of the
explanatory variables; see the sketch below.
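(An assumed quick check of that expectation: the slope from the
singular dummy regression solved with pinv equals the within (demeaned)
estimate.)

import numpy as np

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(10), 5)   # flat panel: N=10 groups, T=5
x = rng.standard_normal(groups.size)
y = 2.0 * x + 0.3 * groups + 0.1 * rng.standard_normal(groups.size)

dummies = (groups[:, None] == np.arange(10)).astype(float)
exog = np.column_stack((np.ones(groups.size), dummies, x))  # singular design
beta_dummy = np.linalg.pinv(exog).dot(y)[-1]

def demean(a):
    # subtract group means
    means = np.bincount(groups, weights=a) / np.bincount(groups)
    return a - means[groups]

beta_within = demean(x).dot(demean(y)) / demean(x).dot(demean(x))
print(beta_dummy, beta_within)   # agree up to numerical noise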
With OLS and the F and t tests (contrast results), I think we have all
the pieces to do the estimation and testing; we just haven't gotten
around to looking at it systematically for an ANOVA-type question.
Josef
I do not understand how xi works.
> instr = sm.add_constant(instr) # *appends* a constant
> y= [196.3052, 196.1240, 196.1890, 196.2569, 196.3403,
> 196.3042, 196.3825, 196.1669, 196.3257, 196.0422,
> 196.1303, 196.2005, 196.2889, 196.0343, 196.1811,
> 196.2795, 196.1748, 196.1494, 196.1485, 195.9885,
> 196.2119, 196.1051, 196.1850, 196.0052, 196.2090]
> mod_results = sm.OLS(y, instr).fit()
>
> Then you have your mean parameters and ANOVA stats.
>
Actually the ANOVA stats are incomplete because you only have the
solutions and the output is insufficient for more advanced models than a
one-way ANOVA.
> I could make an interface for that kind of input to work (sm.ANOVA()
> with no warning about the rank?),
Why?
If you are implementing a general linear models approach, then there is
no difference between ANOVA and regression. The only quirks with ANOVA are
that you have to create the dummy variables, keep track of what each
dummy variable is, and remember to 'merge' the correct dummy variables
when getting the overall test for that effect.
> but it would seem inconsistent with
> the design right now to have OLS handle it.
>
> Skipper
>
Currently the design is only for multiple linear regression, not OLS.
That is, provided I put the data into the correct format, I can solve
the system using ordinary least squares, but the output is still incomplete.
Bruce
Results for hypothesis tests are available for any (set of) linear
restrictions on the regression parameters, as either F or t tests.
Some (messy) examples are in examples/example_ols_tftest.py.
This way, it is possible to check whether a variable that is included in the
form of dummy variables has no statistically significant explanatory effect,
which, as I understand it, is the purpose of ANOVA.
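A rough sketch of such a joint test, reusing y and instr from Skipper's
example above (and assuming the f_test method that
example_ols_tftest.py exercises):

import numpy as np
import scikits.statsmodels as sm

res = sm.OLS(y, instr).fit()
k = instr.shape[1]     # dummy columns plus the appended constant
R = np.eye(k)[:-1]     # joint restriction: all dummy coefficients zero
print(res.f_test(R))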
Josef
> Bruce
>
It should work using my branch.
https://code.launchpad.net/~jsseabold/statsmodels/statsmodels-skipper
> I do not understand how xi works.
xi just parses a column of categorical data for unique elements
(strings or numerical) and makes a dummy matrix out of it. In the
beginning of this summer I was most familiar with Stata, so that is
where I'm coming from. http://www.stata.com/help.cgi?xi
This function isn't as nice as Stata's yet (and would have to be more
than a function), and isn't robust at all. I keep changing it when I
find another use case that I need.
>
>> instr = sm.add_constant(instr) # *appends* a constant
>> y= [196.3052, 196.1240, 196.1890, 196.2569, 196.3403,
>> 196.3042, 196.3825, 196.1669, 196.3257, 196.0422,
>> 196.1303, 196.2005, 196.2889, 196.0343, 196.1811,
>> 196.2795, 196.1748, 196.1494, 196.1485, 195.9885,
>> 196.2119, 196.1051, 196.1850, 196.0052, 196.2090]
>> mod_results = sm.OLS(y, instr).fit()
>>
>> Then you have your mean parameters and ANOVA stats.
>>
>
> Actually the ANOVA stats are incomplete because you only have the solutions
> and the output is insufficient for more advanced models than a one-way
> ANOVA.
>
>> I could make an interface for that kind of input to work (sm.ANOVA()
>> with no warning about the rank?),
>
> Why?
> If you are implementing a general linear models approach then there is no
> difference between ANOVA or regression. The only quirks with ANOVA is that
> you have to create the dummy variables, keep track of what each dummy
> variable is and remember to 'merge' the correct dummy variables in getting
> the overall test for that effect.
>
I will have to read up on your proc GLM. This is far afield from what
I am doing in my work, which is what is driving the development of
statsmodels from my end.
My answer to why would, I guess, be so that the use of xi is hidden
from the user and so your sm.ANOVA(d).fit() works. Right now, I am
mostly cleaning all of my data myself and making my own "designs". I
would like to keep statsmodels, as Josef has mentioned, just as a
library because it seems that the data handling is so discipline
specific. That way if someone just wants to use statsmodels, they
don't have to read an introduction to the formula framework to just
try it out. This is the problem that I have with R and the Nipy
formula framework. I just didn't need it, and, well, I could use R if
I wanted to... I don't think we broke compatibility with the formula
framework in removing the dependency on it (as Wes has also shown with
the pandas package, which again covers his use cases at work). The
only thing we changed is explicitly having to specify what is the
response variable and what is the design.
xi will create variable names for the fields of a recarray, if it is
given a recarray, that are either the unique numbers or the unique
strings. One of the quick and dirty solutions that I am thinking
about for my work is just subclassing recarrays and adding some
convenience functions and extra metainformation.
I do not use bzr and definitely do not want to download the branch if I
can help it.
>
>> I do not understand how xi works.
>>
> xi just parses a column of categorical data for unique elements
> (strings or numerical) and makes a dummy matrix out of it. In the
> beginning of this summer I was most familiar with Stata, so that is
> where I'm coming from. http://www.stata.com/help.cgi?xi
>
What I meant is that the code I saw did not work, even going by the
docstring and reading the code. I have never used Stata.
> This function isn't as nice as Stata's yet (and would have to be more
> than a function), and isn't robust at all. I keep changing it when I
> find another use case that I need.
>
Yeah, but I do not find Stata's xi very useful. Yes, I am spoiled by
SAS's glmmod if I need to go that way. Really I prefer that SAS and R
take care of that aspect for me, because I do not care to know what the
dummy variables are. The solutions from a general linear model are
generally worthless because they are not necessarily estimable functions.
>>> instr = sm.add_constant(instr) # *appends* a constant
>>> y= [196.3052, 196.1240, 196.1890, 196.2569, 196.3403,
>>> 196.3042, 196.3825, 196.1669, 196.3257, 196.0422,
>>> 196.1303, 196.2005, 196.2889, 196.0343, 196.1811,
>>> 196.2795, 196.1748, 196.1494, 196.1485, 195.9885,
>>> 196.2119, 196.1051, 196.1850, 196.0052, 196.2090]
>>> mod_results = sm.OLS(y, instr).fit()
>>>
>>> Then you have your mean parameters and ANOVA stats.
>>>
>>>
>> Actually the ANOVA stats are incomplete because you only have the solutions
>> and the output is insufficient for more advanced models than a one-way
>> ANOVA.
>>
>>
>>> I could make an interface for that kind of input to work (sm.ANOVA()
>>> with no warning about the rank?),
>>>
>> Why?
>> If you are implementing a general linear models approach then there is no
>> difference between ANOVA or regression. The only quirks with ANOVA is that
>> you have to create the dummy variables, keep track of what each dummy
>> variable is and remember to 'merge' the correct dummy variables in getting
>> the overall test for that effect.
>>
>>
> I will have to read up on your proc GLM. This is far afield from what
> I am doing in my work, which is what is driving the development of
> statsmodels from my end.
>
Actually I am not as surprised as I was when I realized that you were
not familiar with general linear models. A lot of the comments make
sense in the narrow view provided by regression.
> My answer to why would, I guess, be so that the use of xi is hidden
> from the user and so your sm.ANOVA(d).fit() works.
My point is that it works because of general linear models, so yes, the
creation of dummy variables is hidden from the user because these are
just an intermediate step. See for example:
http://www.uvm.edu/~dhowell/gradstat/psych341/lectures/GeneralLinearModel/glm.html
> Right now, I am
> mostly cleaning all of my data myself and making my own "designs". I
> would like to keep statsmodels, as Josef has mentioned, just as a
> library because it seems that the data handling is so discipline
> specific.
I disagree that data handling is discipline specific; statistics works
because the actual discipline is irrelevant. Ultimately all the data is
a mixture of discrete and continuous variables that may be alphanumeric
and parts of which may be missing. The trick is being able to handle
these things correctly.
> That way if someone just wants to use statsmodels, they
> don't have to read an introduction to the formula framework to just
> try it out. This is the problem that I have with R and the Nipy
> formula framework. I just didn't need it, and, well, I could use R if
> I wanted to...
With R everything tends to be very explicit, so it appears inflexible. But
R's dataframe is designed to make other things much easier.
But you do use this framework, just not all of it in Python. The goal is
to move as much of that framework into Python as possible and not rely
on you to remember to do certain intermediate steps in the correct order.
Further, if you want other people to use your work, then you have to
think about them, not you.
> I don't think we broke compatibility with the formula
> framework in removing the dependency on it (as Wes has also shown with
> the pandas package, which again covers his use cases at work). The
> only thing we changed is explicitly having to specify what is the
> response variable and what is the design.
>
No, you just changed the problem in that you solve a user-supplied
system of equations. From what I can see of Wes's work, he moved some of
the creation of the design matrix back into Python rather than relying
on the user to get the correct design matrix. However, it still appears
to be a regression framework.
I really do not want to be negative, but I have given a lot of thought to
this. One of my key issues with what you have done is that it is very
much focused on regression and not general linear models. This would not
be a problem, except that you have taken shortcuts that prevent any
extension without major rewrites of your code. So it makes it very
hard for me even to try to contribute back. My other key issue is that I
do not see how you handle missing data, except by forcing the user to
handle it correctly.
Bruce
Just a clarification: Skipper, Alan and I know how to do econometrics.
Skipper's current application is panel data econometrics; my main
applications are in time series regression. We "suffered" all summer
doing "statistics" with generalized linear models and robust linear models,
which we barely knew before. There are a lot of other things in the sandbox,
like generalized additive models, that need a lot of attention, but that's
not a part that is relevant for our own work.
I never used an ANOVA explicitly (for real) in my life, but I know
panel data with
fixed effects and random effects, and I know how to do multinomial logit with
panel data. There are a lot of unaddressed questions on instrumental
variables and Generalized Method of Moments that are not relevant for
statistical analysis with experimental data, but bread and butter for any
econometrician (well, maybe not every one).
So of course our focus will be on regression, but we are very open to any
"statistics" usage. It was one of the reasons to start this mailing list:
to have a discussion about the design of statsmodels.
Josef
>
> Bruce
>
I think Skipper's idea of writing an ANOVA class is a good one. It can hand
off the actual calculation to OLS but do all the ANOVA-specific calculations
in the new class.
It would be somewhat similar to the glm and rlm estimators, where the
core estimation is reweighted least squares and much of the basic
calculation is done by calling on WLS. This is an example where
we use one estimation class directly as a library function for another
estimator.
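A rough assumed sketch of such a delegating class, reusing the xi
helper from Skipper's branch:

import numpy as np
import scikits.statsmodels as sm

class Anova(object):
    # build the dummy design, hand estimation to OLS, and keep the
    # ANOVA-specific bookkeeping in one place
    def __init__(self, endog, factor):
        design = sm.tools.xi(factor, drop=True)   # dummy matrix
        self.design = sm.add_constant(design)
        self.endog = np.asarray(endog)

    def fit(self):
        self.results = sm.OLS(self.endog, self.design).fit()
        # ANOVA-specific output (e.g. the overall F test for the
        # factor) would be derived from self.results here
        return self.results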
Josef
>
>
>>
>> Bruce
>>
>
I'm a bit slow sometimes. In statsmodels, we don't have any
"general linear model" in the sense of statistical packages
like SAS's glm.
We have statsmodels.regression.ols, which is just a special case
(subclass) of generalized least squares, statsmodels.regression.gls.
It's confusing because the underlying model is similar but the
purpose is different. On the other hand, in the SAS manual, I have
a difficult time finding generalized least squares, and I didn't
see anything on instrumental variables. I guess the focus of
statsmodels.regression is pretty different from the linear
models in SAS.
So, a "general linear model" would be a good addition to statsmodels.
Josef
>
>>
>>
>>>
>>> Bruce
>>>
>>
>
Then you do not have R's lm either.
>
> We have statsmodels.regression.ols which is just a special case
> (subclass) of generalized least squares, statsmodels.regression.gls.
> It's confusing because the underlying model is similar but the
> purpose is different.
It is only confusing if you do not know that general linear models are
a unified approach to handling both regression and ANOVA (including
analysis of covariance) under the assumption of normality (a Google
search indicates this is a very old term, but Cohen in 1968 showed the
relationship). Generalized linear models (rather new, as published in
1972) further extend general linear models to include all
distributions belonging to the exponential family (and a few others as
well). This is not to be confused with generalized additive models
(about 1990), which extend the linear part of generalized linear models
to include things like splines.
The downside of each of these is that each step dramatically increases
the complexity and computational burden.
This sort of sidesteps the extension of linear models into nonlinear
models, although generalized linear models are nonlinear. It also sidesteps
that all of the above have mixed effects versions with fixed and random
effects.
> On the other hand, in the SAS manual, I have
> a difficult time finding generalized least squares and I didn't
> see anything for instrumental variables. I guess the focus in
> statsmodels.regression is pretty different from the linear
> models in SAS.
Depending on what you want, generalized linear models are available as
genmod (SAS 9.2 documentation):
http://support.sas.com/documentation/cdl/en/statug/59654/HTML/default/genmod_toc.htm
Also you have the probit, logistic, lifereg and, to a lesser extent,
freq procedures that can also be used.
>
> So, a "general linear model" would be a good addition to statsmodels.
>
For the most part it does require modification of the steps before and
after fitting the model.
Bruce
These are extensions of the basic linear model/OLS in one direction;
what I have more in mind, which is closer to Skipper's and my experience,
is OLS and GLS as the foundation for extensions in this area:
http://www.stata.com/bookstore/cspd.html#contents
http://www.stata.com/bookstore/etm.html#contents
http://www.stata.com/bookstore/imeus.html#contents
(just some representative links that a Google search found)
these are two of the books on my desk:
http://ca.wiley.com/WileyCDA/WileyTitle/productCd-047189530X,descCd-tableOfContents.html
http://www.stata.com/bookstore/ea.html#contents
and this I misplaced:
http://press.princeton.edu/TOCs/c5386.html
>
>
>> On the other hand, in the SAS manual, I have
>> a difficult time finding generalized least squares and I didn't
>> see anything for instrumental variables. I guess the focus in
>> statsmodels.regression is pretty different from the linear
>> models in SAS.
>
> Depending on how you want, generalized linear models is available as
> genmod (SAS 9.2 documentation):
> http://support.sas.com/documentation/cdl/en/statug/59654/HTML/default/genmod_toc.htm
>
> Also you have probit, logistic, lifereg and, to a lesser extend, freq
> procedures that can also be used.
I meant generalized *least squares* as an extension to OLS with
non-spherical errors, e.g. autocorrelation (GLSAR) or heteroscedasticity (WLS).
The closest I found in SAS is MIXED
http://support.sas.com/documentation/cdl/en/statug/59654/HTML/default/statug_mixed_sect001.htm
which from a quick reading allows some patterns of covariance structures.
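(A minimal whitening sketch of that idea, assuming a known error
covariance sigma: GLS is just OLS on data premultiplied by the inverse
Cholesky factor.)

import numpy as np

def gls_params(exog, endog, sigma):
    # whiten with the Cholesky factor of the error covariance,
    # then run ordinary least squares on the transformed data
    chol = np.linalg.cholesky(sigma)
    wexog = np.linalg.solve(chol, exog)
    wendog = np.linalg.solve(chol, endog)
    return np.linalg.lstsq(wexog, wendog, rcond=None)[0]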
>
>
>>
>> So, a "general linear model" would be a good addition to statsmodels.
>>
> For the most part it does require modification of the steps before and
> after fitting the model.
I will try this next weekend. I think we have almost all the results
already available in OLS to do the same statistical analysis as your glm,
but I have tried this only very briefly.
Josef
>
> Bruce
>
I remembered a good description that justifies another panel data package
in R, focused on the perspective of econometricians. Parts of it
are a good read on the differences in the requirements for a statistical
model depending on its application, econometrics versus statistics:
http://cran.r-project.org/web/packages/plm/vignettes/plm.pdf
We don't have the "manpower" to provide specialized routines across
all fields, so I hope we can get at least general versions of some
of these models into statsmodels.
Josef
PS:
(When I was looking for this reference, I also found how we can
benchmark the missing tests for GLSAR, in case Skipper hasn't
found it already:
https://stat.ethz.ch/pipermail/r-help/2005-February/065826.html
And this looks like a fun example, but I don't know enough R to
understand it http://jackman.stanford.edu/classes//350C/07/fatality.r
and I don't have Stock, Watson.)
Thanks, that should come in handy.
> We don't have the "manpower" to provide specialized routines across
> all fields, so I hope we can get at least one general versions of some
> of these models into statsmodels.
>
> Josef
>
> PS:
> (When I was looking for this reference, I also found how we can
> benchmark the missing tests for GLSAR, in case Skipper hasn't
> found it already:
> https://stat.ethz.ch/pipermail/r-help/2005-February/065826.html
>
I did find this actually. An example of it is in the longley dataset
folder, in the R_gls.s script. Off the top of my head I can't remember
whether I wrote an explicit test for it, though I know I
"eyeballed" the results and they were close.
> And this looks like a fun example, but I don't know enough R to
> understand it http://jackman.stanford.edu/classes//350C/07/fatality.r
> and I don't have Stock, Watson.)
>
That would be a nice comparison too. We use Stock and Watson for our
Master's students in the first sequence (cross-section, panel, IVE,
not so much for time series).
Skipper
> The closest, I found in SAS is MIXED
> http://support.sas.com/documentation/cdl/en/statug/59654/HTML/default/statug_mixed_sect001.htm
>
> which from a quick reading allows some patterns of covariance structures.
>
Mixed is far more flexible than glm, as it is more oriented toward
likelihood theory. So you can do really complex models that have random
effects with different variance/covariance structures and residuals with
different variance/covariance structures. Of course the data usually
limits what you can fit, but there are around 30 structures available,
even excluding the user-defined option.
See the tables for the repeated option, but these structures also apply
to the random statement:
http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/mixed_sect19.htm#stat_mixed_mixedrepeat
Unlike R's lme, mixed can also fit fixed-effects-only models.
I would suggest looking at the ETS software for specific
econometrics and time series analysis:
http://support.sas.com/documentation/onlinedoc/ets/index.html
Bruce
I only had time for a quick look, but the list of procedures is much
closer to what I'm familiar with. I will look more closely when I have time.
Thanks for the reference, I didn't know about ETS.
Josef
>
> Bruce
>
forwarding from Skipper
On Tue, Sep 15, 2009 at 11:16 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> FYI, I just found this while sitting in a SAS lab earlier...
>
> GINV=G2 | G4
>
> specifies the type of generalized inverse to be used when
> computing the covariance matrix. G4 selects the Moore-Penrose
> generalized inverse. The default is GINV=G2.
>
> Rather than deleting linearly related rows and columns of the
> covariance matrix, the Moore-Penrose generalized inverse averages the
> variance effects between collinear rows. When the option GINV=G4 is
> used, the Moore-Penrose generalized inverse is used to calculate
> standard errors and the covariance matrix of the parameters as well as
> the change vector for the optimization problem. For singular systems,
> a normal G2 inverse is used to determine the singular rows so that the
> parameters can be marked in the parameter estimates table. A G2
> inverse is calculated by satisfying the first two properties of the
> Moore-Penrose generalized inverse; that is, AA^+A = A and A^+AA^+ = A^+. Whether or not you
> use a G4 inverse, if the covariance matrix is singular, the parameter
> estimates are not unique. Refer to Noble and Daniel (1977, pp.
> 337–340) for more details about generalized inverses.
>
> Taken from:
> http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/etsug_model_sect018.htm
>
this http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/etsug_model_sect035.htm
also contains a nice "TODO" list, if someone, or we, ever find the time.
I worked a bit on the ANOVA interpretation of statsmodels.OLS, but
didn't have enough time to get more than a very rough version yet.
Josef