
Kyle Butts
Mar 11, 2021, 12:13:16 AM

to pystatsmodels

Hi there,

I was sent to the group by @jseabold: https://twitter.com/jseabold/status/1369842577114038278

I'm trying to understand why statsmodels produces a very different estimate than R and Stata (which produce the same estimate as each other). I'm fairly certain the cause is how statsmodels handles rank-deficient matrices vs. R's lm_robust function, but it raises the question of which estimates are correct. I appreciate any insights folks can offer!

The output graph is in the fifth block of the Jupyter notebook:
https://github.com/scunning1975/mixtape_learnr/blob/main/python/Differences_in_Differences.ipynb

The output graph for R is in the code chunk labelled "abortion_dd":
https://github.com/scunning1975/mixtape_learnr/blob/main/R/Difference_in_Differences.Rmd

Josef
Mar 11, 2021, 12:23:44 AM

to pystatsmodels

I have not looked at the details of your examples, but this is the known difference between statsmodels and R and Stata for collinear variables.

Essentially:

R and Stata drop some variables among the collinear variables so that the design matrix has full rank.

R uses a rank-revealing, pivoting QR and drops the most collinear variables. I never saw an explanation for which variables are dropped by Stata.

statsmodels uses a Moore-Penrose generalized inverse, pinv, which can be interpreted as a regularized or penalized estimate.

The question shows up every once in a while, e.g.
https://stackoverflow.com/questions/40935624/differences-in-linear-regression-in-r-and-python/40937228#40937228
and maybe in some of our github issues.

If you drop the variables manually that R drops, then statsmodels and R results should agree.

Josef

--

You received this message because you are subscribed to the Google Groups "pystatsmodels" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/2959df5b-0483-4272-a3fa-c331dd43e529n%40googlegroups.com.

Skipper
Mar 13, 2021, 3:17:49 PM

to pystat...@googlegroups.com

On Thu, Mar 11, 2021 at 5:23 AM <josef...@gmail.com> wrote:

> [snip]
>
> If you drop the variables manually that R drops, then statsmodels and R results should agree.

FWIW, I replicated the R results [1] using PQR from scipy. I've added a working version of a "pqr" method, in addition to "pinv" and "qr", in a branch, if there's interest in a PR. I need to finish the tests and the details. Keeping the NaNs around for the dropped columns, especially in the covariance, may present some more issues in the generalizations we have.

I'm seeing some differences from R in which of the columns are dropped, in some very simple contrived examples in the case of a perfect linear combination. The R matrix and the permutations are slightly different, so sometimes it just drops a different column. I need to track down what Eigen is using for PQR. I'm not surprised, but I'm not sure whether there's a precision issue, a slightly different algorithm being used, or a simple mistake in my R vs. Python code.

[1] https://gist.github.com/jseabold/42b39c4f3c5b7d6b1f73ac7749db2f3b
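For reference, a minimal sketch of the pivoted-QR approach (my own toy example, not the linked gist; the relative 1e-7 cutoff is an assumption here, mirroring the R threshold discussed elsewhere in the thread):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 3] = X[:, 0] + X[:, 1]   # column 3 is a perfect linear combination

# Rank-revealing (column-pivoted) QR: the magnitude of R's diagonal
# decays, and the pivot order says which columns span the column space.
Q, R, piv = linalg.qr(X, pivoting=True)

tol = np.abs(R[0, 0]) * 1e-7              # R-style relative cutoff (assumed)
rank = int((np.abs(np.diag(R)) > tol).sum())
keep = np.sort(piv[:rank])                # columns to keep; the rest get dropped
```

Which of the mutually collinear columns survives depends on the pivot order, which is exactly where implementations can disagree.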

Josef
Mar 13, 2021, 6:05:53 PM

to pystatsmodels

We also have a PR: https://github.com/statsmodels/statsmodels/pull/6935
Kevin is against adding it.

I'm in general in favor of having an option to get something similar to R or Stata, but I have not looked at the details yet. The question is how to add the option; for sure not as the default.

AFAIR, R also has a very low threshold of only 1e-7 for collinearity, which is quite different from our numpy rcond threshold.
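To illustrate the gap between the two cutoffs, a made-up near-collinear example (the 1e-7 relative cutoff stands in for R's; the other is numpy's matrix_rank default):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 1))
# second column differs from the first only by noise of size 1e-7
X = np.hstack([a, a + 1e-7 * rng.normal(size=(100, 1))])

s = np.linalg.svd(X, compute_uv=False)

# numpy default: tol = s.max() * max(X.shape) * eps -> treats X as full rank
rank_numpy = np.linalg.matrix_rank(X)

# an R-style relative cutoff of 1e-7 declares the matrix rank deficient
rank_r_style = int((s > s.max() * 1e-7).sum())
```

The same matrix is full rank under one convention and rank deficient under the other, which is why the packages can disagree about what to drop.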

Josef


Skipper
Mar 13, 2021, 11:01:05 PM

to pystat...@googlegroups.com

On Sat, Mar 13, 2021 at 11:05 PM <josef...@gmail.com> wrote:

> we also have a PR https://github.com/statsmodels/statsmodels/pull/6935

Oh, nice. I should have searched. I'm sure there's a lot of interesting stuff in there. I just looked when pivoting was added to scipy. I (somewhat) recall it being worked on around the same time as I was adding QR for ANOVA.

> Kevin is against adding it.

I'll add it to my to-investigate list, at least.

> I'm in general in favor of having an option to get something similar to R or Stata, but have not looked at the details yet.

https://johnwlambert.github.io/least-squares/

> Question is how to add the option, for sure not as default.

Agreed on not making it the default. I'd vote for keeping it in the linear models with NaNs if the added derived-results code complexity isn't too gross.

> AFAR, R also has a very low threshold of only 1e-7 for collinearity, which is quite different from our numpy rcond threshold.

I spent some time trying to understand the differences. As far as I could tell it relies on Eigen's automagic, which I assume is similar to using the default in np.linalg.matrix_rank. I stopped short of seeing what Eigen was doing that wasn't dgeqp3. I'll check this again though.

https://github.com/cran/estimatr/blob/1d642a4c1fb42730d663d519261bf38372542b7f/src/lm_robust_helper.cpp#L112
https://eigen.tuxfamily.org/dox/classEigen_1_1ColPivHouseholderQR.html#ae712cdc9f0e521cfc8061bee58ff55ee

Anyway, the ranks were the same. Just the values of R for the collinear columns and the subsequent ordering and permutations were different, so it dropped one of the other collinear columns. I can only guess why one column may be dropped rather than another without thinking about it a bit more than I have.
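A tiny made-up check of that last point: when columns are perfectly collinear, which one gets dropped doesn't change the fit, since either choice spans the same column space:

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.normal(size=60)
x1 = rng.normal(size=60)
X = np.column_stack([x0, x1, x0 + x1])   # column 2 = column 0 + column 1
y = 1.0 + x0 - x1 + 0.1 * rng.normal(size=60)

def fitted(cols):
    """Least-squares fit on an intercept plus the selected columns."""
    A = np.column_stack([np.ones_like(y), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

# Drop column 2 (like R here) or drop column 0 (a different pivot choice):
yhat_a = fitted([0, 1])
yhat_b = fitted([1, 2])   # same span, so identical fitted values
```

So the permutation differences only show up in which coefficients are reported, not in the predictions.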

Skipper

Chad
Mar 14, 2021, 12:09:08 AM

to Statsmodels Mailing List

This kind of thing comes up quite a bit (people find that something in Statsmodels isn't exactly the same as in R), and I thought I would add a perspective that is really not meant to be discouraging, although it might sound like that.

I'd like to make the case that those who are interested in things like this should consider creating a new package that adds the feature as an extension of Statsmodels. If the idea works out and becomes well-tested and mature, then it could be merged into Statsmodels, if that's what the authors want at that point.

The issue is that I really think that there are not enough maintainers for Statsmodels, and yet most of the burden associated with software development is the maintenance. In my opinion, it has become difficult to know when to accept even a PR that implements a cool new feature (like pivoting QR, for example), because we don't have a way to know if the person submitting that PR is willing to continue to maintain the feature (and all of the annoying corner cases that will inevitably come up as time goes on).

So, for the PRs that I look at, accepting a PR means not only weighing the pros and cons of the proposed implementation, but also assessing whether I am personally willing to maintain the code in the likely event that the original implementer becomes unavailable.

But obviously I have no objections if Josef / Kevin / etc. want to eventually merge pivoting QR (or anything else!).

Chad
