Bug in statsmodels.stats.proportion.binom_test? (or brain fart)


Stuart Reynolds

Sep 28, 2016, 7:59:06 PM
to pystatsmodels
I'm trying to test the likelihood that probability of success for a binomial variable is >x  (actually, I'd prefer to know Pr( x_lower > p && p < x_upper ) ).

I think that statsmodels.stats.proportion.binom_test calculates this, but here's its output:

k = 3 (observed successes)
n = 3 (nobs)
x: 0.0 Pr(p<x) 0.0 Pr(p>x), 1.0
x: 0.1 Pr(p<x) 0.001 Pr(p>x), 1.0
x: 0.2 Pr(p<x) 0.008 Pr(p>x), 1.0
x: 0.3 Pr(p<x) 0.027 Pr(p>x), 1.0
x: 0.4 Pr(p<x) 0.064 Pr(p>x), 1.0
x: 0.5 Pr(p<x) 0.125 Pr(p>x), 1.0
x: 0.6 Pr(p<x) 0.216 Pr(p>x), 1.0
x: 0.7 Pr(p<x) 0.343 Pr(p>x), 1.0
x: 0.8 Pr(p<x) 0.512 Pr(p>x), 1.0
x: 0.9 Pr(p<x) 0.729 Pr(p>x), 1.0
x: 1.0 Pr(p<x) 1.0 Pr(p>x), 1.0

That is, it says that: given the evidence of 3/3 successes, 
   - There's a 1/8 chance that p<0.5
   - There's a 100% chance that p>=x for all x in [0,1]

.... this doesn't seem right (since many probabilities of success, p, generate the sequence 1,1,1).
Shouldn't it be the case that Pr(p>x) == 1-Pr(p>x)?

Bug or brain fart?


Here's the code that generated the above:

import numpy as np
import statsmodels.stats.proportion

step = 0.1
for k, n in [(3, 3), (3, 4), (4, 4)]:
    print()
    print("k =", k, "observed successes")
    print("n =", n, "nobs")
    for x in np.arange(0.0, 1.0 + step, step):
        print("x:", x,
              "Pr(p<x)", statsmodels.stats.proportion.binom_test(k, n, prop=x, alternative='larger'),
              "Pr(p>x),", statsmodels.stats.proportion.binom_test(k, n, prop=x, alternative='smaller'))


josef...@gmail.com

Sep 28, 2016, 9:46:56 PM
to pystatsmodels
I think the above is too Bayesian.

x is the probability under the Null Hypothesis
The p-value is the probability of seeing a value at least as extreme
as the observed value (k) under the Null Hypothesis (proportion=x),
under the sampling distribution.

I haven't done this in a while, so it's not so easy to think about the
one-sided case.

if H0: x=0.5 (or any other value < 1), and the alternative is that the
probability is smaller than x, then observing only successes "is much
more likely under the null than under the alternative"

>>> from scipy import stats
>>> from statsmodels.stats import proportion as prop
>>> prop.proportion_confint(3, 3, method='beta') # still nan ??? I thought we merged the fix
(0.29240177382128663, nan)
>>> prop.proportion_confint(3, 3, method='binom_test')
(0.36840314986403866, 1)
>>> prop.binom_test(3, 3, prop=0.36840314986403866, alternative='larger')
0.050000000000000003

>>> prop.binom_test(3, 3, prop=0.36840314986403866, alternative='2s')
0.050000000000000003
>>> prop.binom_test(3, 3, prop=0.36840314986403866)
0.050000000000000003
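The endpoint in that last example can be checked by hand: with k = n = 3 the one-sided "larger" p-value is just Pr(K >= 3 | p) = p**3, so the lower confidence bound solves p**3 = alpha. A quick sketch using only scipy (not statsmodels internals):

```python
from scipy import stats

# With k = n = 3, the one-sided ("larger") p-value is Pr(K >= 3 | p) = p**3,
# so inverting the test means solving p**3 = alpha for the lower bound.
alpha = 0.05
lower = alpha ** (1 / 3)            # ~0.368403..., matching proportion_confint above
print(lower)

# cross-check against the exact binomial upper tail
print(stats.binom.sf(2, 3, lower))  # Pr(K >= 3 | p=lower) = 0.05
```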


I'm not sure what's going on. From the standard argument, we need to
look at the probability for the rejection region, which in your
example is one-sided.

One problem in this case is that the "exact" tests like binom or exact
confidence intervals like "beta" are conservative and underreject in
general because of the discreteness of the sample space and the small
sample in this case.

As in my example, proportion_confint would give you the "classical"
confidence interval for the proportion.

AFAIK, this is the Bayesian confidence interval for some (?) prior
>>> prop.proportion_confint(3, 3, method='jeffreys')
(0.4644167569570492, 0.99984936397456348)


Josef

josef...@gmail.com

Sep 28, 2016, 10:17:28 PM
to pystatsmodels
I had lost track of this one
https://github.com/statsmodels/statsmodels/issues/2742

Stuart Reynolds

Sep 28, 2016, 10:59:54 PM
to pystatsmodels
Thanks for the reply. There's a typo in my original post. 

"Shouldn't it be the case that Pr(p>x) == 1-Pr(p>x)?"
==>
"Shouldn't it be the case that Pr(p>x) == 1-Pr(p<x)?"

I think you understood, but just to be super clear, the answers aren't symmetric, though they must be.
If:

x: 0.1 Pr(p<x)=0.001, Pr(p>x)=zzz

then

x'=1-x: 0.9 Pr(p<x')=zzz, Pr(p>x')=0.001

must follow.

josef...@gmail.com

Sep 28, 2016, 11:39:54 PM
to pystatsmodels
On Wed, Sep 28, 2016 at 10:59 PM, Stuart Reynolds
<stuart....@gmail.com> wrote:
> Thanks for the reply. There's a typo in my original post.
>
> "Shouldn't it be the case that Pr(p>x) == 1-Pr(p>x)?"
> ==>
> "Shouldn't it be the case that Pr(p>x) == 1-Pr(p<x)?"
>
> I think you understood, but just to be super clear, the answers aren't
> symmetric, though they must be.
> If:
>
> x: 0.1 Pr(p<x)=0.001, Pr(p>x)=zzz
>
> then
>
> x'=1-x: 0.9 Pr(p<x')=zzz, Pr(p>x')=0.001
>
> must follow.

The problem is that as a "classical" person, I'm not sure what Pr(p <
x) and similar mean.

The probability space is sampling from binomial counts. p-values are
statements about the probabilities for observing a specific sample.
I think the p-values don't necessarily have to add up to 1 because of
the discreteness, i.e. something like
Pr(p>=x) != 1 - Pr(p<=x), because Pr(p=x) might be a mass point (and is
double counted).

It is not a problem in the continuous case, but in the discrete case
Pr(k >= y) = Pr(k > y) + Pr(k = y) is (in general) different from Pr(k > y), and
one recommendation is that something like Pr(k > y) + 0.5 * Pr(k = y)
would have better coverage on average, although it won't maintain size,
i.e. be "exact", in all cases.
(k is the binomial random variable, y is the count at the rejection boundary.)

in words: the p-value is the probability that we see something more
(>) extreme, versus at least as (>=) extreme.
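The double counting at the mass point is easy to see numerically; a small scipy sketch for n = 3, null proportion 0.5, observed k = 3:

```python
from scipy import stats

n, p0, k = 3, 0.5, 3

p_upper = stats.binom.sf(k - 1, n, p0)  # Pr(K >= k): one-sided "larger" p-value
p_lower = stats.binom.cdf(k, n, p0)     # Pr(K <= k): one-sided "smaller" p-value
point = stats.binom.pmf(k, n, p0)       # Pr(K == k): the boundary mass point

print(p_upper + p_lower)                # 1.125: the mass point is counted twice
print(p_upper + p_lower - point)        # 1.0 once it is removed

# mid-p variant: count only half of the boundary mass
mid_p = stats.binom.sf(k, n, p0) + 0.5 * point
print(mid_p)                            # 0.0625
```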

(classical: A specific confidence interval either includes the true
value or not, we don't know. However, if we use the right confidence
interval calculation, then if we have many different samples, then in
1-alpha fraction of the cases the confidence interval includes the
true value. With discrete sample spaces the fraction cannot be exactly
1-alpha, and "exact" methods require *at least* 1-alpha coverage, but
the coverage is on average much larger.)

I don't know if that helps.

Josef

Stuart Reynolds

Sep 30, 2016, 8:25:12 PM
to pystat...@googlegroups.com
Here's a good (Bayesian) way of thinking about it:

http://stats.stackexchange.com/questions/13225/what-is-the-distribution-of-the-binomial-distribution-parameter-p-given-a-samp
http://www.sumsar.net/blog/2014/01/bayesian-first-aid-binomial-test/ (which references an R package)

It (I think reasonably) requires specifying a prior distribution for prob(p).
If we assume, with no observations of k, that all p's are equally likely, then after seeing k=3, n=3, you'll get prob(p)>0 for all p, not p=k/n (which seems very unlikely in the world).
From this point of view, we should see prob(p)=1 or prob(p)=0, or confidence intervals with width 0, only with very unreasonable priors.

It would be nice to have functions to measure:
 prob(p>x | n,k,prior)
 prob(p<x | n,k,prior)
 prob(p in [lower, upper] | n,k,prior)
 ci(n,k,prior)

... for binomial.

It's a little surprising that libraries for this are so hard to find. It seems essential for generating confidence bounds on classification algorithm performance with low n.
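For what it's worth, that wishlist reduces to a few lines via the beta-binomial conjugacy. A sketch with scipy (the function names here are made up for illustration, not an existing API):

```python
from scipy import stats

def posterior(n, k, prior=(1.0, 1.0)):
    # Beta posterior for a binomial proportion; prior=(a, b) are the
    # Beta prior parameters, (1, 1) is the uniform prior.
    a, b = prior
    return stats.beta(a + k, b + (n - k))

def prob_greater(x, n, k, prior=(1.0, 1.0)):
    # prob(p > x | n, k, prior)
    return posterior(n, k, prior).sf(x)

def prob_less(x, n, k, prior=(1.0, 1.0)):
    # prob(p < x | n, k, prior)
    return posterior(n, k, prior).cdf(x)

def prob_between(lower, upper, n, k, prior=(1.0, 1.0)):
    # prob(p in [lower, upper] | n, k, prior)
    d = posterior(n, k, prior)
    return d.cdf(upper) - d.cdf(lower)

def ci(n, k, alpha=0.05, prior=(1.0, 1.0)):
    # equal-tail credible interval
    return posterior(n, k, prior).interval(1 - alpha)

# after seeing 3/3 with a uniform prior, the posterior is Beta(4, 1)
print(prob_greater(0.5, 3, 3))  # 1 - 0.5**4 = 0.9375
```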


josef...@gmail.com

Sep 30, 2016, 9:02:09 PM
to pystatsmodels
On Fri, Sep 30, 2016 at 8:25 PM, Stuart Reynolds
<stu...@stuartreynolds.net> wrote:
> Here's a good (Bayesian) way of thinking about it:
>
> http://stats.stackexchange.com/questions/13225/what-is-the-distribution-of-the-binomial-distribution-parameter-p-given-a-samp
> http://www.sumsar.net/blog/2014/01/bayesian-first-aid-binomial-test/
> (which references an R package)
>
> It (I think reasonably) requires specifying a prior distribution for
> prob(p).
> If we assume, with no observations of k, that all p's are equally likely,
> then after seeing k=3, n=3, you'll get prob(p)>0 for all p, not p=k/n
> (which seems very unlikely in the world).
> From this point of view, we should see prob(p)=1 or prob(p)=0, or
> confidence intervals with width 0, only with very unreasonable priors.
>
> It would be nice to have functions to measure:
> prob(p>x | n,k,prior)
> prob(p<x | n,k,prior)
> prob(p in [lower, upper] | n,k,prior)
> ci(n,k,prior)
>
> ... for binomial.
>
> It's a little surprising that libraries for this are so hard to find. It
> seems essential for generating confidence bounds on classification algorithm
> performance with low n.

There are many Bayesian packages in python, so there is no real need
for statsmodels to cover it, with some possible exceptions. (We are
also introducing "priors" through penalization, but we still go for
inference based on sampling distribution.) (*)

beta is the natural conjugate for binomial, so it is relatively easy
to work with the beta-binomial pair.
However, if your prior is uniform on [0, 1], then you should be a bit
surprised to see (3, 3) since half of your prior probability is below
0.5.

(IIRC: using scipy.stats
prior
>>> stats.beta.pdf(np.linspace(0,1,11), 1, 1)
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

posterior sf after seeing 3,3
>>> stats.beta.sf(np.linspace(0,1,11), 4, 1)
array([ 1. , 0.9999, 0.9984, 0.9919, 0.9744, 0.9375, 0.8704,
0.7599, 0.5904, 0.3439, 0. ])

>>> stats.beta.pdf(np.linspace(0,1,11), 4, 1)
array([ 0. , 0.004, 0.032, 0.108, 0.256, 0.5 , 0.864, 1.372,
2.048, 2.916, 4. ])

90% credibility (posterior) interval (because of monotonicity)
>>> stats.beta.isf(0.9, 4, 1)
0.56234132519034907

or something like this
)

(*) I thought it would be useful to have conjugate prior and sample
distributions somewhere given that they don't need the heavy MCMC
machinery, but it never looked exciting enough to me to get it higher
on my own priorities. And for everything except the simple cases
the heavy Bayesian computational machinery is necessary.


Also, there are 37 open issues for classical proportions
https://github.com/statsmodels/statsmodels/search?q=proportion&state=open&type=Issues&utf8=%E2%9C%93


Josef

Stuart Reynolds

Oct 1, 2016, 1:52:51 AM
to pystat...@googlegroups.com
Perfect. Thanks so much

josef...@gmail.com

Oct 5, 2016, 10:34:15 PM
to pystatsmodels
just to add because I was staring at proportion_confint today:

"beta" (exact) is beta distribution with a prior of beta(0, 0) (not
a proper distribution, 0 prior observations)

"jeffreys" looks like beta distribution with a prior of beta(0.5, 0.5)
(a prior of a half observation for success and a half observation for
fail)

both use equal-tail confidence intervals, not highest density region
intervals (as specified above)
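If that reading is right, the "jeffreys" interval should be reproducible directly from the Beta(k + 0.5, n - k + 0.5) posterior. A sketch (equal-tail, not HPD; this mirrors my reading of the method, not statsmodels internals):

```python
from scipy import stats

k, n, alpha = 3, 3, 0.05

# equal-tail interval from the Beta(k + 0.5, n - k + 0.5) posterior,
# which is what the 'jeffreys' method appears to compute
lower = stats.beta.ppf(alpha / 2, k + 0.5, n - k + 0.5)
upper = stats.beta.ppf(1 - alpha / 2, k + 0.5, n - k + 0.5)
print(lower, upper)
```

This should come out close to the (0.4644..., 0.9998...) pair shown earlier in the thread, up to any boundary handling statsmodels adds.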

Josef