Studentized Range Distribution

Skip to first unread message

Eric Farng

Sep 16, 2021, 8:25:14 AMSep 16
to pystatsmodels
Hello all,

I was looking to run Tukey HSD with statsmodels, and when I compared the results to R, I was quite startled to see very different numbers. Digging a little further, I noticed that the difference is from the statsmodels.stats.libqsturng approximation to the distribution.

I found some pascal code and ported it to python

Any thoughts if/how to proceed?


Sorry but newbie contributor here. I put the code in my fork, which I didn't realize would suggest creating the PR in the main repo.

Other differences with statsmodels.stats.libqsturng:
* Pascal code authors validated the correctness of their code in the paper, and is not an approximation
* libqsturng only supports p-values from 0.1 to 0.999
* libqsturng is same as R code _only_ up to 1 or 2 digits
I believe the difference is because libqsturng uses a table of pre-calculated values,
using the same algorithm as R, but the table has different values than R.
Since I don't know the code or the paper for libqsturng, I hesitate to update that table
which might have unexpected results.

Oct 2, 2021, 8:15:18 PMOct 2
to pystatsmodels

I've am working to add the Games-Howell test to stats models, which also uses the studentized range distribution that Tukey HSD in both R and statsmodels uses.  So I might have a few insights to help. 

  1. The file `stats/libsturng/` has a complete set of references for the code including the paper used.  Also the file `stats/libsturng/` references the same paper and provides the tables and associated script. I admit, I've not dug into this.  The paper you reference is much more recent, so without a huge amount of digging into it, seems arguable for us to use that rather than what is currently in statsmodels. The reference is listed as the 5th reference in the paper you linked.  Here it is: Gleason, J. R. (1999). An accurate, non-iterative approximation for studentized range quantiles. Computational Statistics & Data Analysis, (31), 147-158. . 
  2. As you note, changing the algorithm (and possibly moving from a table) is a major change, so will need a PR, and some careful checking. Given that the paper you reference is much more recent, you can put up an argument for making the switch. 
  3. Note that both R (at least the last time I checked in February) and Statsmodels have limits on what it handles, not just in terms of p-values, but also in terms of degrees of freedom (and more importantly, pairings of thes).  I have 2 PR's up to fix existing bugs around this in Statsmodels.  (Issue 6541 and Issue 7324).  Also, note that the p-values as listed in statsmodels, you want 1-p for testing significance, so the range is reasonable for what one actually needs, from my perspective. 
  4. At least my code using libsturng and this R code (which uses ptukey) return the same results within the ranges supported by both programs, on the examples I've checked.  That said, I think I have seen that the values are not exactly the same and I admit I haven't pushed the boundaries in a while.
I hope some of this helps.  Sorry for the delay in anyone responding, I've not been watching the group lately.  Trying to get back in it. And I'm curious to hear your thoughts. 

- Amelia
Reply all
Reply to author
0 new messages