Differences between statsmodels.OLS and R's lm_robust

Kyle Butts

Mar 11, 2021, 12:13:16 AM
to pystatsmodels

Hi there,

I was sent to the group by @jseabold: https://twitter.com/jseabold/status/1369842577114038278

I'm trying to understand why statsmodels produces a very different estimate from R and Stata (which produce the same estimate). I'm fairly certain it is due to how statsmodels handles rank-deficient matrices vs. R's lm_robust function, but it raises the question of which estimates are correct. I appreciate any insights folks can offer!

The output graph is in the fifth block of the Jupyter notebook:
https://github.com/scunning1975/mixtape_learnr/blob/main/python/Differences_in_Differences.ipynb

The output graph for R is the code chunk labelled "abortion_dd":
https://github.com/scunning1975/mixtape_learnr/blob/main/R/Difference_in_Differences.Rmd

josef...@gmail.com

Mar 11, 2021, 12:23:44 AM
to pystatsmodels
I have not looked at the details of your examples

difference between statsmodels and R and Stata for collinear variables

essentially:
R and Stata drop some of the collinear variables so that the design matrix has full rank.
R uses a rank-revealing, pivoting QR and drops the most collinear variables.
I have never seen an explanation of which variables Stata drops.

statsmodels uses a Moore-Penrose generalized inverse, pinv, which can be interpreted as a regularized or penalized estimate.

The question shows up every once in a while, e.g.
https://stackoverflow.com/questions/40935624/differences-in-linear-regression-in-r-and-python/40937228#40937228
and maybe in some of our GitHub issues.

If you drop the variables manually that R drops, then statsmodels and R results should agree.
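A minimal sketch of that on contrived data (not taken from the notebooks linked above; the variables here are made up):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = x1 + x2                    # perfectly collinear with x1 and x2
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

    X_full = sm.add_constant(np.column_stack([x1, x2, x3]))

    # statsmodels default: the pinv-based fit keeps all columns and spreads
    # the effect across the collinear ones (minimum-norm solution)
    res_pinv = sm.OLS(y, X_full).fit()
    print(res_pinv.params)          # 4 coefficients, none dropped

    # R/Stata-style: drop the redundant column by hand (keep const, x1, x2);
    # with a full-rank design the estimates should match R
    res_drop = sm.OLS(y, X_full[:, :3]).fit()
    print(res_drop.params)          # 3 coefficients

    # the fitted values agree either way; only the split of the coefficients
    # across the collinear columns differs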

Josef
 

Skipper Seabold

Mar 13, 2021, 3:17:49 PM
to pystat...@googlegroups.com
On Thu, Mar 11, 2021 at 5:23 AM <josef...@gmail.com> wrote:
> [...]
> If you drop the variables manually that R drops, then statsmodels and R results should agree.

FWIW, I replicated the R results [1] using pivoted QR (PQR) from scipy. I've
added a working version of a "pqr" method, in addition to "pinv" and "qr", in
a branch, in case there's interest in a PR. I still need to finish the tests and
the details. Keeping the NAs around for the dropped columns, especially in
the covariance, may present some more issues in the generalizations we have.
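For reference, a rough sketch of what a pivoted-QR solve that keeps NaNs for the dropped columns can look like with scipy (this is just an illustration, not the branch mentioned above):

    import numpy as np
    from scipy import linalg

    def pqr_lstsq(X, y, tol=None):
        """Least squares via column-pivoted QR; NaN for dropped (collinear) columns.

        Illustrative sketch only, not the statsmodels "pqr" method discussed here.
        """
        Q, R, piv = linalg.qr(X, mode='economic', pivoting=True)
        d = np.abs(np.diag(R))                 # non-increasing by construction
        if tol is None:
            tol = d[0] * max(X.shape) * np.finfo(float).eps
        rank = int((d > tol).sum())

        # solve the full-rank leading block; the trailing pivoted columns are dropped
        beta_kept = linalg.solve_triangular(R[:rank, :rank], Q[:, :rank].T @ y)
        beta = np.full(X.shape[1], np.nan)
        beta[piv[:rank]] = beta_kept
        return beta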

I'm seeing some differences from R in which columns are dropped, in some
very simple contrived examples with a perfect linear combination. The R matrix
of the decomposition and the permutations are slightly different, so sometimes
it just drops a different column. I need to track down what eigen is using for
PQR. I'm not surprised, but I'm not sure whether there's a precision issue, a
slightly different algorithm being used, or a simple mistake in my R vs. Python
code.

[1] https://gist.github.com/jseabold/42b39c4f3c5b7d6b1f73ac7749db2f3b

josef...@gmail.com

Mar 13, 2021, 6:05:53 PM
to pystatsmodels

we also have a PR https://github.com/statsmodels/statsmodels/pull/6935

Kevin is against adding it.
I'm in general in favor of having an option to get something similar to R or Stata, but have not looked at the details yet.
Question is how to add the option, for sure not as default.

AFAIR, R also has a very low threshold of only 1e-7 for collinearity, which is quite different from our numpy rcond threshold.
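To give a sense of how much that tolerance matters for what gets flagged as collinear (the 1e-7 figure is the one mentioned above, applied here relative to the largest singular value just for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=500)
    x2 = x1 + 1e-8 * rng.normal(size=500)    # nearly collinear with x1
    X = np.column_stack([x1, x2])

    s = np.linalg.svd(X, compute_uv=False)
    print(s[1] / s[0])                        # tiny relative singular value, ~1e-8

    # numpy default tolerance: max(M, N) * eps * largest singular value
    print(np.linalg.matrix_rank(X))                      # 2 -> not flagged
    # a fixed 1e-7 relative cutoff does flag it
    print(np.linalg.matrix_rank(X, tol=1e-7 * s[0]))     # 1 -> treated as rank deficient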

Josef

 


Skipper Seabold

Mar 13, 2021, 11:01:05 PM
to pystat...@googlegroups.com
On Sat, Mar 13, 2021 at 11:05 PM <josef...@gmail.com> wrote:
> we also have a PR https://github.com/statsmodels/statsmodels/pull/6935

Oh, nice. I should have searched. I'm sure there's a lot of
interesting stuff in there.

I just looked up when pivoting was added to scipy. I (somewhat) recall it
being worked on around the same time I was adding QR for ANOVA.

> Kevin is against adding it.

Bummer. I see the argument both ways, I guess. It covers everything on my
list of things to investigate, at least.

> I'm in general in favor of having an option to get something similar to R or Stata, but have not looked at the details yet.

I found the math here to be a succinct, helpful refresher, if anyone is interested.

https://johnwlambert.github.io/least-squares/
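Roughly, the two approaches being compared in this thread come down to the following (a compact summary in LaTeX notation, not a transcription of that page):

    % rank-deficient least squares: rank(X) = r < p
    \min_{\beta} \lVert y - X\beta \rVert_2^2

    % pinv (statsmodels default): the minimum-norm solution via the SVD X = U \Sigma V^T
    \hat{\beta}_{\mathrm{pinv}} = X^{+} y = V \Sigma^{+} U^{T} y

    % column-pivoted QR (R / Stata style): X P = Q R with |r_{11}| \ge \dots \ge |r_{pp}|;
    % keep the first r pivoted columns, solve the triangular system, drop the rest (NA)
    R_{11} \hat{\beta}_{1} = Q_{1}^{T} y

    % either way the fitted values X\hat{\beta} are identical; only the allocation of
    % coefficients across the collinear columns differs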

> Question is how to add the option, for sure not as default.

Yeah, I also added a "pqr" option to the linear models but it wasn't
the default.

I'd vote for keeping it in the linear models, with NaNs for the dropped
columns, if the added complexity in the derived results code isn't too gross.

> AFAIR, R also has a very low threshold of only 1e-7 for collinearity, which is quite different from our numpy rcond threshold.

Ah, that's a good point. It didn't appear to be the case in lm_robust, which is
what I was trying to understand the differences against. As far as I could tell
it relies on eigen's automagic, which I assume is similar to using the default
in np.linalg.matrix_rank. I stopped short of digging into what eigen does
differently from dgeqp3. I'll check this again though.

https://github.com/cran/estimatr/blob/1d642a4c1fb42730d663d519261bf38372542b7f/src/lm_robust_helper.cpp#L112
https://eigen.tuxfamily.org/dox/classEigen_1_1ColPivHouseholderQR.html#ae712cdc9f0e521cfc8061bee58ff55ee

Anyway, the ranks were the same. Just the values in the R matrix for the
collinear columns, and the subsequent ordering and permutations, were
different, so it dropped one of the other collinear columns. I can only guess
why one column gets dropped rather than another without thinking about it a
bit more than I have.
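A tiny contrived example of that kind (scipy's pivoting here, not eigen's):

    import numpy as np
    from scipy import linalg

    x1 = np.array([1., 2., 3., 4., 5.])
    x2 = np.array([5., 3., 1., 2., 4.])
    X = np.column_stack([x1, x2, x1 + x2])   # third column is an exact linear combination

    Q, R, piv = linalg.qr(X, mode='economic', pivoting=True)
    print(piv)          # pivot order: dgeqp3 picks the largest remaining column norm first,
                        # so x1 + x2 is kept and one of x1 / x2 ends up pivoted last
    print(np.diag(R))   # the last diagonal entry is ~0: that column is the one "dropped"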

Skipper

Chad Fulton

Mar 14, 2021, 12:09:08 AM
to Statsmodels Mailing List
This kind of thing comes up quite a bit (people find that something in Statsmodels isn't exactly the same as it is in R), and I thought I would just add a perspective (that is really not meant to be discouraging, although it might sound like that).

I'd like to make the case that those who are interested in things like this should consider creating a new package, adding the feature as an extension of Statsmodels. If the idea works out and is well tested and mature, then it could be merged into Statsmodels, if that's what the authors want at that point.

The issue is that I really think that there are not enough maintainers for Statsmodels, and yet most of the burden associated with software development is the maintenance. In my opinion, it has become difficult to know when to accept even a PR that implements a cool new feature (like pivoting QR, for example), because we don't have a way to know if the person submitting that PR is willing to continue to maintain the feature (and all of the annoying corner cases that will inevitably come up as time goes on).

So, for the PRs that I look at, accepting a PR means not only weighing the pros and cons of the proposed implementation, but also assessing whether I am personally willing to maintain the code in the likely event that the original implementer becomes unavailable.

But obviously I have no objections if Josef / Kevin / etc. want to eventually merge pivoting QR (or anything else!).

Chad