[SciPy-User] Fisher exact test, anyone?


josef...@gmail.com

Nov 13, 2010, 9:50:44 PM
to SciPy Users List
http://projects.scipy.org/scipy/ticket/956 and
http://pypi.python.org/pypi/fisher/ have Fisher's exact
test implementations.

It would be nice to get a version in for 0.9. I spent a few
unsuccessful days on it earlier this year. But since there are two new
or corrected versions available, it looks like it just needs testing
and a performance comparison.

I won't have time for this, so if anyone volunteers for this, scipy
0.9 should be able to get Fisher's exact.
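For whoever volunteers: a candidate implementation can be cross-checked against a direct computation from the hypergeometric distribution. A minimal pure-Python sketch of the two-sided 2x2 test (the function name and the tea-tasting example are mine, not from either ticket):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    n = a + b + c + d
    r1, c1 = a + b, a + c          # first row and first column margins
    denom = comb(n, c1)

    def pmf(x):
        # P(first cell == x) under fixed margins
        return comb(r1, x) * comb(n - r1, c1 - x) / denom

    p_obs = pmf(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    # small relative tolerance so ties with the observed table are included
    return sum(px for x in range(lo, hi + 1)
               if (px := pmf(x)) <= p_obs * (1 + 1e-12))

# Fisher's tea-tasting table: the exact two-sided p-value is 34/70
p = fisher_exact_2x2(3, 1, 1, 3)
```

This brute-force summation is fine as a test oracle for small tables, though a real implementation would want something smarter for large counts.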

As an aside:
There is a related ticket for numerical precision problems in the sf
calculations of discrete distributions,
http://projects.scipy.org/scipy/ticket/1218 . However, this requires a
rewrite of the current generic cdf calculation and adding a matching
sf version. The current implementation only works under special
assumptions, which luckily are satisfied by all current discrete
distributions. I think I will find time for this before 0.9.

Josef
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Bruce Southey

Nov 14, 2010, 11:40:28 AM
to SciPy Users List
On Sat, Nov 13, 2010 at 8:50 PM, <josef...@gmail.com> wrote:
> http://projects.scipy.org/scipy/ticket/956 and
> http://pypi.python.org/pypi/fisher/ have Fisher's exact
> test implementations.
>
> It would be nice to get a version in for 0.9. I spent a few
> unsuccessful days on it earlier this year. But since there are two new
> or corrected versions available, it looks like it just needs testing
> and a performance comparison.
>
> I won't have time for this, so if anyone volunteers for this, scipy
> 0.9 should be able to get Fisher's exact.

I also do not have time this month.

I briefly looked at the code at the PyPI link, but I do not think it is
good enough for scipy. Also, I do not like when people license code as
'BSD' and there is a comment in cfisher.pyx: '# some of this code is
originally from the internet. (thanks)'. Consequently we cannot use
that code.

The code with ticket 956 still needs work, especially in terms of the
input types and probably the API (like having a function that allows
the user to select either one- or two-tailed tests).

Bruce

Ralf Gommers

Nov 16, 2010, 8:04:52 AM
to SciPy Users List
On Mon, Nov 15, 2010 at 12:40 AM, Bruce Southey <bsou...@gmail.com> wrote:
> On Sat, Nov 13, 2010 at 8:50 PM, <josef...@gmail.com> wrote:
> http://projects.scipy.org/scipy/ticket/956 and
> http://pypi.python.org/pypi/fisher/ have Fisher's exact
> test implementations.
>
> It would be nice to get a version in for 0.9. I spent a few
> unsuccessful days on it earlier this year. But since there are two new
> or corrected versions available, it looks like it just needs testing
> and a performance comparison.
>
> I won't have time for this, so if anyone volunteers for this, scipy
> 0.9 should be able to get Fisher's exact.

https://github.com/rgommers/scipy/tree/fisher-exact
All tests pass. There's only one usable version (see below), so I didn't do a performance comparison. I'll leave a note on #956 as well, saying we're discussing on-list.

> I briefly looked at the code at pypi link but I do not think it is
> good enough for scipy. Also, I do not like when people license code as
> 'BSD' and there is a comment in cfisher.pyx '# some of this code is
> originally from the internet. (thanks)'. Consequently we can not use
> that code.

I agree, that's not usable. The plain Python algorithm is also fast enough that there's no need to bother with Cython.

> The code with ticket 956 still needs work especially in terms of the
> input types and probably the API (like having a function that allows
> the user to select either 1 or 2 tailed tests).

Can you explain what you mean by work on the input types? I used np.asarray and forced the dtype to int64. As for the one-tailed test, is it necessary? I note that pearsonr and spearmanr also only do two-tailed.
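For reference, the conversion step described here amounts to something like the following sketch (the helper name is illustrative, not the actual code in the branch):

```python
import numpy as np

def _as_int64_table(table):
    # Convert arbitrary array_like input to an int64 array so that the
    # hypergeometric intermediates cannot overflow 32-bit integers.
    return np.asarray(table, dtype=np.int64)

c = _as_int64_table([[18000, 80000], [20000, 90000]])
```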

Cheers,
Ralf

josef...@gmail.com

Nov 16, 2010, 10:01:53 AM
to SciPy Users List

adding one-tailed tests would be a nice bonus.

I think we should add them wherever possible. Currently one-sided
versus two-sided is still somewhat inconsistent across functions. I
added one-sided tests to some functions.
Tests based on symmetric distributions (t or normal), like the t-tests,
don't necessarily need both, because the one-sided test can essentially
be recovered from the two-sided test by halving or doubling the p-value.
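To illustrate the symmetric-distribution point with a standard normal z-test (a self-contained sketch using only the math module):

```python
import math

def norm_sf(z):
    # Survival function of the standard normal, P(Z > z)
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8                      # some observed test statistic
p_two = 2 * norm_sf(abs(z))  # two-sided p-value
p_one = norm_sf(z)           # one-sided (upper-tail) p-value
# For z > 0 the one-sided p-value is exactly half the two-sided one,
# because the null distribution is symmetric around zero.
```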

I added a comment to the ticket; fisher3 looks good except for the
Python 2.4 incompatibility.

Thanks Ralf for taking care of this,

Josef


Bruce Southey

Nov 16, 2010, 10:45:53 AM
to scipy...@scipy.org
I have no problem including this if we can agree on the API, because everything else is internal and can be fixed by the release date. So I would accept a placeholder API that enables a user in the future to select which tail(s) is tested.

1) It just cannot use np.asarray() without checking the input first. This is particularly bad for masked arrays.

2) There is no dimension checking because, as I understand it, this can only handle a '2 by 2' table. I do not know enough for general 'r by c' tables or the 1-d case either.

3) The odds ratio should be removed because it is not part of the test. It is actually more general than this test.

4) Variable names such as min and max should not shadow Python functions.

5) Is there a reference to the algorithm implemented? For example, SPSS provides a simple 2 by 2 algorithm:
http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf

6) Why exactly does the dtype need to be int64? That is, is there something wrong with the hypergeom function? I just want to understand why the precision change is required, because the input should enter with sufficient precision.


Bruce

Ralf Gommers

Nov 16, 2010, 7:10:19 PM
to SciPy Users List

It is always possible to add a keyword "tail" later that defaults to 2-tailed. As long as the behavior doesn't change this is perfectly fine, and better than having a placeholder.

> 1) It just can not use np.asarray() without checking the input first. This is particularly bad for masked arrays.

Don't understand this. The input array is not returned, only used internally. And I can't think of anything reasonable to do with a 2x2 table with masked values. If that's possible at all, it should probably just go into mstats.
 
> 2) There are no dimension checking because, as I understand it, this can only handle a '2 by 2' table. I do not know enough for general 'r by c' tables or the 1-d case either.

Don't know how easy it would be to add larger tables. I can add dimension checking with an informative error message.
 
> 3) The odds-ratio should be removed because it is not part of the test. It is actually more general than this test.

Don't feel strongly about this either way. It comes almost for free, and R seems to do the same.

> 4) Variable names such as min and max should not shadow Python functions.

Yes, Josef noted this already, will change.

> 5) Is there a reference to the algorithm implemented? For example, SPSS provides a simple 2 by 2 algorithm:
> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf

Not supplied, will ask on the ticket and include it.

> 6) Why exactly does the dtype need to int64? That is, is there something wrong with hypergeom function? I just want to understand why the precision change is required because the input should enter with sufficient precision.

This test:
fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
becomes much slower and gives an overflow warning with int32. int32 is just not enough. This is just an implementation detail and does not in any way limit the accepted inputs, so I don't see a problem here.

Don't know what the behavior should be if a user passes in floats, though. Just convert to int like now, or raise a warning?

Cheers,
Ralf

josef...@gmail.com

Nov 16, 2010, 7:38:09 PM
to SciPy Users List

There is some discussion in the ticket about more than 2x2;
additions would be nice (and there are some examples on the MATLAB
File Exchange), but 2x2 is the most common case and has an unambiguous
definition.


>
>>
>> 3) The odds-ratio should be removed because it is not part of the test. It
>> is actually more general than this test.
>>
> Don't feel strongly about this either way. It comes almost for free, and R
> seems to do the same.

Same here. It's kind of traditional to return two things; in this
case the odds ratio is not the test statistic, but I don't see that it
hurts either.

>
>> 4) Variable names such as min and max should not shadow Python functions.
>
> Yes, Josef noted this already, will change.
>>
>> 5) Is there a reference to the algorithm implemented? For example, SPSS
>> provides a simple 2 by 2 algorithm:
>>
>> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
>
> Not supplied, will ask on the ticket and include it.

I thought I saw it somewhere, but I can't find the reference anymore;
some kind of bisection algorithm. Having a reference would be
good.
Whatever the algorithm is, it's fast, even for larger values.

>>
>> 6) Why exactly does the dtype need to int64? That is, is there something
>> wrong with hypergeom function? I just want to understand why the precision
>> change is required because the input should enter with sufficient precision.
>>
> This test:
> fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
> becomes much slower and gives an overflow warning with int32. int32 is just
> not enough. This is just an implementation detail and does not in any way
> limit the accepted inputs, so I don't see a problem here.

For large numbers like this the chisquare test should give almost the
same results; it looks pretty "asymptotic" to me. (The usual
recommendation for the chisquare test is more than 5 expected observations
in each cell.)
I think the precision is required for some edge cases when
probabilities get very small. The main failing case, which I fought
with for several days last winter and didn't manage to fix, had a zero
at the first position. I didn't think about increasing the precision.
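For the table above, the chi-square approximation is easy to check by hand: for a 2x2 table (1 degree of freedom), the p-value is erfc(sqrt(stat/2)). A standalone sketch (not scipy code; Python's arbitrary-precision ints also sidestep the int32 overflow mentioned earlier):

```python
import math

def chi2_2x2(a, b, c, d):
    # Pearson chi-square statistic for a 2x2 table, no continuity correction.
    # Python ints are arbitrary precision, so the large products cannot overflow.
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

stat = chi2_2x2(18000, 80000, 20000, 90000)
# Survival function of chi-square with 1 degree of freedom
p = math.erfc(math.sqrt(stat / 2))
```

With counts this large the exact test and the chi-square approximation should agree to several decimals, which is the point being made above.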

>
> Don't know what the behavior should be if a user passes in floats though?
> Just convert to int like now, or raise a warning?

I wouldn't do any type checking, and checking that floats are almost
integers doesn't sound really necessary either, unless or until users
complain. The standard usage should be pretty clear for contingency
tables with count data.

Josef


Ralf Gommers

Nov 17, 2010, 8:24:24 AM
to SciPy Users List

Thanks for checking. https://github.com/rgommers/scipy/commit/b968ba17 should fix remaining things. Will wait for a few days to see if we get a reference to the algorithm. Then will commit.

Cheers,
Ralf


josef...@gmail.com

Nov 17, 2010, 8:56:50 AM
to SciPy Users List
On Wed, Nov 17, 2010 at 8:24 AM, Ralf Gommers

It looks good to me. I think you can commit whenever you want. We can
add the reference later through the doc editor.

Thanks for finding the hypergeom imprecision example. We can use it
for the sf ticket.

I think, from a statistical viewpoint, this imprecision is pretty
irrelevant: whether a p-value is 1e-9 or 1e-8 wouldn't change the
conclusion about strongly rejecting the null hypothesis.
I usually use almost_equal instead of approx_equal, which often
wouldn't even fail on differences like this. Many statistical packages
report only 3 decimals by default, because everything else only gives a
false sense of precision given the randomness of the problem.

However, independently of this, the discrete cdf and sf need an overhaul,
but maybe not so urgently if these are the only problems.

Bruce Southey

Nov 19, 2010, 12:35:58 PM
to SciPy Users List
On Wed, Nov 17, 2010 at 7:24 AM, Ralf Gommers

Sorry but I don't agree. But I said I do not have time to address this
and I really do not like adding the code as it is.

Bruce

Ralf Gommers

Nov 21, 2010, 3:23:15 AM
to SciPy Users List
> Sorry but I don't agree. But I said I do not have time to address this
> and I really do not like adding the code as it is.

Bruce, I replied in detail to your previous email, so I'm not sure what you want me to do here. If you don't have time for more discussion, and Josef (as stats maintainer) is happy with the addition, I think it can go in. Actually, it did go in right before your email, but that doesn't mean it's too late for some changes.

Cheers,
Ralf

Bruce Southey

Nov 21, 2010, 2:03:51 PM
to SciPy Users List
On Sun, Nov 21, 2010 at 2:23 AM, Ralf Gommers

I know, and the reason for my negativity is that this commit goes
against what I had proposed: to provide single stats functions that
handle the various ndarray types, not just the 'standard' array. Also it
lacks the flexibility to handle general R by C cases, which are very
common. But that requires time to find out how to do those cases. The
error is that there is no dimensionality check.

I find it shocking that a statistical test returns an 'odds ratio' that
has nothing to do with the actual test nor with any of the other
related statistical tests like chisquare. If you accept that, then we
must immediately add that odds ratio to ALL statistical tests.

josef...@gmail.com

Nov 21, 2010, 3:18:16 PM
to SciPy Users List

In this case, I see no reason to support any other array types. The
function asks for 4 numbers and returns 2 numbers; no nans or missing
values are allowed. (There is nothing to calculate if one of the 4
values is missing.) I haven't checked carefully this time, but I
think any array_like type that allows indexing should be handled
correctly.


> Also it
> lacks the flexibility to handle general R by C cases which are very
> common. But that requires time to find how to do those cases. The
> error is that there is no dimensionality check .

Handling more than 2x2 would be a nice extension, but is no reason to
not commit the current version.

It might be good to raise a ValueError if the input is not 2x2. I need
to check whether, for example, a 2x3 table wouldn't silently produce an
unintended result.

>
> I find it shocking that a statistical test returns a 'odds ratio' that
> has nothing to do with the actual test nor with any of the other
> related statistical tests like chisquare. If you accept that then we
> must immediately add that odds ratio to ALL statistical tests.

R returns the posterior odds ratio; we return the prior odds ratio. I
don't care much either way, but the test is (implicitly) on the odds
ratio.
Other tests provide their own statistic, which is not always the
actual statistic used in the calculation of the p-value. If it were
not for backward compatibility, I would add lots of things to ALL
statistical tests, as you can see in some tests in statsmodels.

And did I mention: It's really fast.

Josef

>
> Bruce

Ralf Gommers

Nov 23, 2010, 9:31:19 AM
to SciPy Users List
On Mon, Nov 22, 2010 at 4:18 AM, <josef...@gmail.com> wrote:
On Sun, Nov 21, 2010 at 2:03 PM, Bruce Southey <bsou...@gmail.com> wrote:

> Also it
> lacks the flexibility to handle general R by C cases which are very
> common. But that requires time to find how to do those cases. The
> error is that there is no dimensionality check .

Handling more than 2x2 would be a nice extension, but is no reason to
not commit the current version.

It might be good to raise a ValueError if the input is not 2x2. I need
to check whether for example 2x3 wouldn't silently produce an
unintended result.

I added this check in r6939.

Cheers,
Ralf