
Q: K-S test on binomial dist in R?


jo...@tman.dnsalias.com

Jun 28, 2001, 8:47:25 PM
Can someone help? In R, I am generating a vector of 1000 samples from
Bin (1000, 0.25). I then do a Kolmogorov-Smirnov test to check whether the
vector has been drawn from a population of Bin (1000, 0.25). I would
expect a reasonably high p-value.....

Either I am doing something wrong in R, or I am misunderstanding how this
test should work (both quite possible)...


Thanks,
JT..

> #### 1000 random samples from binomial dist with mean =.25, n=100...
> o<-rbinom (1000, 100, .25)
> mean (o);
[1] 25.178
> var (o);
[1] 19.61193
> ks.test (o, "pbinom", 100, .25);

One-sample Kolmogorov-Smirnov test

data: o
D = 0.0967, p-value = 1.487e-08
alternative hypothesis: two.sided

p-value is mighty small, leading me to reject the null hypothesis that
the sample has been drawn from the Bin(100, 0.25) distribution!!!

Glen Barnett

Jun 29, 2001, 3:09:26 AM

<jo...@tman.dnsalias.com> wrote in message
news:xeQ_6.1949$xd.3...@typhoon.snet.net...

> Can someone help? In R, I am generating a vector of 1000 samples from

Observations from Bin(100, 0.25), not samples from Bin(1000, 0.25).

> Bin (1000, 0.25). I then do a Kolmogorov Smirnov test to test if the
> vector has been drawn from a population of Bin (1000, 0.25). I would

(again, Bin(100, 0.25) )

> expect a reasonably high p-value.....

You should expect a p-value drawn from a uniform distribution on (0,1);
a value below .05 is just as likely as one between .45 and .5.
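Glen's point is easy to check by simulation. A minimal sketch (in Python with
NumPy/SciPy rather than the thread's R, and using a continuous null where the
KS test's assumptions hold):

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(42)

# Repeatedly test N(0,1) samples against the true N(0,1) null.
# Under H0 with a continuous null, the p-values are Uniform(0,1).
pvals = np.array([kstest(rng.normal(size=100), norm.cdf).pvalue
                  for _ in range(200)])

print(pvals.mean())            # should be near 0.5
print((pvals < 0.05).mean())   # should be near 0.05: a 5% test rejects 5% of the time
```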

> Either I am doing something wrong in R, or I am misunderstanding how this
> test should work (both quite possible)...

One point (that may not be relevant to how small your p-value is) - the
KS test is for *continuous* distributions, and the binomial is discrete.
This will affect the distribution of the test statistic, but I wouldn't have
thought to this extent. I expect the explanation will be something else.

> > #### 1000 random samples from binomial dist with mean =.25, n=100...

1000 random observations from the binomial distribution with parameters
p = .25, n = 100 (NB the mean is 25, not .25). You are muddling up the
binomial you're drawing from with the Bernoulli trials it's based on.
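To make the correction concrete: Bin(n, p) has mean np and variance np(1-p),
so Bin(100, 0.25) should give a sample mean near 25 and a sample variance near
18.75 - which is exactly what the session above shows. A minimal check
(sketched in Python with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 0.25

o = rng.binomial(n, p, size=1000)   # 1000 observations from Bin(100, 0.25)

# Theoretical moments of Bin(n, p): mean = n*p, variance = n*p*(1-p)
print(n * p, n * p * (1 - p))       # 25.0 and 18.75
print(o.mean(), o.var(ddof=1))      # sample values land close to those
```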

Glen

Herman Rubin

Jun 29, 2001, 9:03:23 AM
In article <9hh92n$7p7$1...@merki.connect.com.au>,
Glen Barnett <glenb...@geocities.com> wrote:

><jo...@tman.dnsalias.com> wrote in message
>news:xeQ_6.1949$xd.3...@typhoon.snet.net...
>> Can someone help? In R, I am generating a vector of 1000 samples from

>Observations from Bin(100, 0.25), not samples from Bin(1000, 0.25).

>> Bin (1000, 0.25). I then do a Kolmogorov Smirnov test to test if the
>> vector has been drawn from a population of Bin (1000, 0.25). I would

>(again, Bin(100, 0.25) )

>> expect a reasonably high p-value.....

>You should expect a p-value drawn from a uniform distribution on (0,1);
>a value below .05 is just as likely as one between .45 and .5.

>> Either I am doing something wrong in R, or I am misunderstanding how this
>> test should work (both quite possible)...

>One point (that may not be relevant to how small your p-value is) - the
>KS test is for *continuous* distributions, and the binomial is discrete.
>This will affect the distribution of the test statistic, but I wouldn't have
>thought to this extent. I expect the explanation will be something else.

The KS test is conservative for discrete distributions. This
means that the p-value is OVERestimated.

However, I do not believe that it is that much overestimated
here. Either something is wrong with an algorithm used, or
the pseudo-random numbers are not good. This latter situation
is far more likely than usually believed; one should at least
mix physical random and pseudo-random. The problem with
pseudo-random is the necessary lack of independence.

>> > #### 1000 random samples from binomial dist with mean =.25, n=100...

>1000 random observations from the binomial distribution with parameters
>p =.25 , n=100 (NB the mean is 25, not .25). You are muddling up the binomial
>you're drawing from with the bernoulli trials it's based on.

>Glen

--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

john v verkuilen

Jun 30, 2001, 5:09:02 PM
"Glen Barnett" <glenb...@geocities.com> writes:

>One point (that may not be relevant to how small your p-value is) - the
>KS test is for *continuous* distributions, and the binomial is discrete.
>This will affect the distribution of the test statistic, but I wouldn't have
>thought to this extent. I expect the explanation will be something else.

Biases the true alpha a little low but it should be pretty close to right.
Jean Dickinson Gibbons' book on nonparametrics has an entire chapter on
the KS and related tests and goes through this in extensive detail.

Jay
--
J. Verkuilen ja...@uiuc.edu
"Depend upon it, sir, when a man knows he is to be hanged in a fortnight, it
concentrates his mind wonderfully." --Dr. Samuel Johnson

A. G. McDowell

Jun 30, 2001, 5:34:46 PM
In article <xeQ_6.1949$xd.3...@typhoon.snet.net>,
jo...@tman.dnsalias.com writes

Some more oddities:

> o<-rbinom(10000, 1, 0.25)
> ks.test(o, "pbinom", 1, 0.25)

One-sample Kolmogorov-Smirnov test

data: o
D = 0.75, p-value = < 2.2e-16
alternative hypothesis: two.sided

> length(o[o==0])
[1] 7491
> length(o[o==1])
[1] 2509
> o<-rep(0,10000)
> ks.test(o, "pbinom", 1, 0.25)

One-sample Kolmogorov-Smirnov test

data: o
D = 0.75, p-value = < 2.2e-16
alternative hypothesis: two.sided

> length(o[o==0])
[1] 10000
> length(o[o==1])
[1] 0

Here zeroing out the data does not change the reported D value

After playing about with
ks.test(c(rep(0, x), rep(1, 1000-x)), "pbinom", 1, p)
for a bit I conjecture that ks.test() takes no account
whatsoever of ties, but merely sorts the input values
and looks for max (position/N - pbinom(value, 1, p)).
Anybody got the source handy?
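That conjecture is easy to reproduce outside R. The sketch below (Python with
NumPy/SciPy; the function name `naive_ks_D` is mine, not anything in R's
source) implements exactly that recipe - sort, compare F(x_(i)) against
(i-1)/n and i/n, take no account of ties - and, like ks.test, reports D = 0.75
whether or not the sample actually looks like Bin(1, 0.25):

```python
import numpy as np
from scipy.stats import binom

def naive_ks_D(x, cdf):
    # Sort the data and compare F(x_(i)) with (i-1)/n and i/n,
    # taking no account of ties -- the conjectured ks.test recipe.
    x = np.sort(np.asarray(x))
    n = len(x)
    d = cdf(x) - np.arange(n) / n          # F(x_(i)) - (i-1)/n
    return max(np.abs(d).max(), np.abs(d - 1 / n).max())

cdf = lambda k: binom.cdf(k, 1, 0.25)      # Bin(1, 0.25): F(0) = 0.75
print(naive_ks_D(np.zeros(10000), cdf))                       # 0.75
print(naive_ks_D(np.r_[np.zeros(7491), np.ones(2509)], cdf))  # also 0.75
```

The first zero in the sorted sample always contributes |F(0) - 0| = 0.75, so
the mix of 0s and 1s never matters.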
--
A. G. McDowell

A. G. McDowell

Jul 1, 2001, 1:57:57 AM
I have put in an R bug report on this. That should at least tell us
whether this is a bug, a limitation, or an occasion for me to curl up
and die. Its URL at the moment is

http://r-bugs.biostat.ku.dk/R/incoming?id=1007;expression=ks.test;user=guest

--
A. G. McDowell

Robert Dodier

Jul 5, 2001, 9:31:18 AM
Concerning apparent strangeness in R's K-S test,
A. G. McDowell <mcdo...@mcdowella.demon.co.uk> wrote, in part:

> After playing about with
> ks.test(c(rep(0, x), rep(1, 1000-x)), "pbinom", 1, p)
> for a bit I conjecture that ks.test() takes no account
> whatsoever of ties, but merely sorts the input values
> and looks for max (position/N - pbinom(value, 1, p)).
> Anybody got the source handy?

I'm running R v1.1.1 (August 15, 2000).

library(ctest)
ks.test

yields the following. (I've omitted most of the function.)

function (x, y, ..., alternative = c("two.sided", "less", "greater"))
{
    alternative <- match.arg(alternative)

    [SNIP]

    METHOD <- "One-sample Kolmogorov-Smirnov test"
    n <- length(x)
    x <- y(sort(x), ...) - (0:(n - 1))/n
    STATISTIC <- switch(alternative,
        two.sided = max(abs(c(x, x - 1/n))),
        greater = max(c(x, x - 1/n)),
        less = -min(c(x, x - 1/n)))

    [SNIP]
}

It appears that the computation doesn't take ties into
account, as A.G. McDowell speculates above.
In the presence of ties, I don't know what is the correct
computation. Suggestions, anyone?
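One possible suggestion, assuming an integer-valued null distribution: compare
the ECDF and the hypothesised CDF at each distinct observed value and at its
left limit, so that tied observations form a single jump of the ECDF rather
than n spurious steps. A hypothetical sketch in Python (the function name and
the integer-support assumption are mine, not anything in R):

```python
import numpy as np
from scipy.stats import binom

def ks_D_with_ties(x, cdf):
    # One-sample KS statistic against a discrete, integer-valued null.
    # Compare S and F at each distinct value and at its left limit,
    # so ties produce one ECDF jump instead of many spurious ones.
    x = np.sort(np.asarray(x))
    n = len(x)
    vals, counts = np.unique(x, return_counts=True)
    S = np.cumsum(counts) / n        # S(v_j), right-continuous ECDF
    S_left = S - counts / n          # S(v_j-)
    F = cdf(vals)                    # F(v_j)
    F_left = cdf(vals - 1)           # F(v_j-), using integer support
    return max(np.abs(S - F).max(), np.abs(S_left - F_left).max())

cdf = lambda k: binom.cdf(k, 1, 0.1)
print(ks_D_with_ties([0] * 900 + [1] * 100, cdf))   # ~0: sample matches the null
print(ks_D_with_ties([0] * 800 + [1] * 200, cdf))   # 0.1
```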

Incidentally, in the branch for the two-sample test,
there is a warning message if there are ties.

I hope this sheds some light on the problem.

Regards,
Robert Dodier
--
``Socrates used to meditate all day in the snow, but Descartes'
mind worked only when he was warm.'' -- Bertrand Russell

A. G. McDowell

Jul 5, 2001, 4:14:48 PM
>
>It appears that the computation doesn't take ties into
>account, as A.G. McDowell speculates above.
>In the presence of ties, I don't know what is the correct
>computation. Suggestions, anyone?
>
>Incidentally, in the branch for the two-sample test,
>there is a warning message if there are ties.
>
>I hope this sheds some light on the problem.
>
>Regards,
>Robert Dodier
You can find detailed answers from Prof Ripley by going to the website
and looking at
http://r-bugs.biostat.ku.dk/R/incoming?id=1007;expression=ks.test;user=guest
- or thereabouts: I guess the status of the report may have changed, or
may change in the future. I believe myself that the statistic required
for the conservative version of the Kolmogorov-Smirnov test for the
entirely discrete case could be produced by the following code changes:

replace

x <- y(sort(x), ...) - (0 : (n-1)) / n

with

x <- sort(x)
untied <- c(x[1:(n-1)] != x[2:n], TRUE)
x <- y(x, ...) - (0 : (n-1)) / n
x <- x[untied]

While I was testing this I computed the statistic a large number of
times for a particular random case and observed that the statistic was
indeed conservative: a plot of sorted p-values showed not a straight
line from near 0.0 at the small end to near 1.0 at the large end, but a
curve bowed above this line in the middle. So the conservatism of
the approximation is losing you power. There are references in
"Practical Nonparametric Statistics" (Conover) and "Kendall's Advanced
Theory of Statistics 2A: Classical Inference and the Linear Model"
(Stuart, Ord, & Arnold) to exact evaluation of P-values.

Note also the advice from S-plus to use chi-squared. I do not have
anything like the knowledge or experience to tell if using the K-S test
for a discrete distribution is a good idea in practice, even if you can
produce a conservative significance for it.

There is a fairly basic underlying problem. All that either version of
the code asks for is the value of the cumulative distribution at the points
encountered in the data. It doesn't really know where it is supposed to
be continuous and where it isn't. The existing code returns a strong
signal in the case when a supposedly continuous distribution is
returning duplicate values. The amended code would always assume that
duplicate values are perfectly reasonable, which may not be the case. On
the face of it, there isn't enough information available for any single
routine to make the right decision all the time.
--
A. G. McDowell

Cyril Goutte

Jul 6, 2001, 8:36:38 AM
"A. G. McDowell" wrote:

> may have changed, or may change in the future. I believe myself that the
> statistic required for the conservative version of the Kolmogorov-
> Smirnov test for the entirely discrete case could be produced by the
> following code changes:
>
> replace
>
> x <- y(sort(x), ...) - (0 : (n-1)) / n
>
> with
>
> x <- sort(x)
> untied <- c(x[1:(n-1)] != x[2:n], TRUE)
> x <- y(x, ...) - (0 : (n-1)) / n
> x <- x[untied]

Could you explain why you think the way R calculates the KS statistic D
is incorrect? Seems to me that R (contrary to earlier versions of S-plus)
is precisely calculating the correct value for D.
Whether this statistic has a known distribution, or whether the
distribution obtained in the continuous case gives useful results for
discrete variables, is a different matter altogether.

Wouldn't the "right" thing to do be to derive the distribution of the
discrete-case statistic instead of tweaking it to fit the continuous-case
distribution?

On the other hand, it seems strange that ks.test accepts discrete
distributions as input without at least complaining.

Cyril.

--
Cyril Goutte cyril....@inrialpes.fr
INRIA Rhone-Alpes Tel: (+33) 4 76 61 55 13
Zirst - 655 avenue de l'Europe - Montbonnot Fax: (+33) 4 76 61 54 77
38334 Saint Ismier Cedex - France www.inrialpes.fr/is2

A. G. McDowell

Jul 7, 2001, 3:02:08 AM
In article <9i4bc9$bdm$1...@abricotier.inrialpes.fr>, Cyril Goutte
<"cyril_dot_goutte "@inrialpes.fr> writes

>
>Could you explain why you think the way R calculates the KS statistic D
>is incorrect? Seems to me that R (contrary to earlier versions of S-plus)
>is precisely calculating the correct value for D.
>Whether this statistic has a known distribution or whether the distribution
>obtained in the continuous case gives useful results for discrete variables
>is a different matter altogether.
This may be very sensitive to the exact definition you choose and how
you interpret it. The definition of the Kolmogorov-Smirnov statistic is
based on the maximum difference found between the theoretical and
empirical cumulative distributions: Dn = sup_x |S(x) - F(x)|

If you follow Conover (Practical Nonparametric Statistics) then you will
evaluate the empirical distribution function S(x) as the proportion of
samples <= x. So if you have the sample c(rep(0, 900), rep(1, 100)) then
you will work out that S(0) = 0.9 and S(1) = 1.0. Then you have that
D = max(abs(0.9 - F(0)), abs(1.0 - F(1))).
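That calculation can be checked numerically (a quick Python/SciPy sketch,
taking F to be the Bin(1, 0.1) cdf used in the examples later in this post):

```python
from scipy.stats import binom

# Conover's S(x): proportion of the sample <= x, for 900 zeros and 100 ones.
S0, S1 = 900 / 1000, 1000 / 1000
F0, F1 = binom.cdf(0, 1, 0.1), binom.cdf(1, 1, 0.1)   # 0.9 and 1.0

D = max(abs(S0 - F0), abs(S1 - F1))
print(D)   # effectively 0: this sample fits Bin(1, 0.1) exactly
```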

If you follow Stuart, Ord, and Arnold (Kendall's etc., Classical
Inference and the Linear Model) you calculate Sn(x) as

0 if x < x(1)
r/n if x(r) <= x < x(r+1)
1 if x(n) <= x

If there are ties in the x(r), so that x(r) = x(r + 1), then some values
of r/n become unattainable, because there is no x such that
x(r) <= x < x(r+1) = x(r)
If x(1) = x(2) = ... = x(900) < x(901) = x(902) = ... = x(1000)
it would seem that S(0) = 900/1000 and S(1) = 1000/1000

R's calculation pretty much agrees with both of these only when there
are no ties - which should be all the time in the continuous case.

>
>Wouldn't the "right" thing to do be to derive the distribution of the
>discrete-case statistic instead of tweaking it to fit the continuous-case
>distribution?

I guess the question of which statistic you want to compute and test is
really a question about what your null and alternative hypotheses are,
but I have an interesting example. Here ks.testUntied is ks.test after
my suggested modifications.

> ks.test(c(rep(0, 900), rep(1, 100)), "pbinom", 1, 0.1)

One-sample Kolmogorov-Smirnov test

data: c(rep(0, 900), rep(1, 100))
D = 0.9, p-value = < 2.2e-16
alternative hypothesis: two.sided

> ks.test(c(rep(0, 800), rep(1, 200)), "pbinom", 1, 0.1)

One-sample Kolmogorov-Smirnov test

data: c(rep(0, 800), rep(1, 200))
D = 0.9, p-value = < 2.2e-16
alternative hypothesis: two.sided

> ks.testUntied(c(rep(0, 900), rep(1, 100)), "pbinom", 1, 0.1)

One-sample Kolmogorov-Smirnov test

data: c(rep(0, 900), rep(1, 100))
D = 0.001, p-value = 1
alternative hypothesis: two.sided

> ks.testUntied(c(rep(0, 800), rep(1, 200)), "pbinom", 1, 0.1)

One-sample Kolmogorov-Smirnov test

data: c(rep(0, 800), rep(1, 200))
D = 0.101, p-value = 2.758e-09
alternative hypothesis: two.sided

>

Notice two things:
1/ ks.test() returns the same D value in both cases, whether the
proportion of 0s and 1s is in accord with the theoretical distribution
or not. (Of course, in the continuous case, both samples are bizarre,
but here we are considering what to do in the discrete case). No
possible calculation of its distribution can modify this behaviour. On
the other hand, ks.testUntied() does distinguish between the two
samples.
2/ ks.testUntied() returns a D value > 0 for the 'perfect/too-good'
sample. It is calculating S(0) as 899/1000 instead of 900/1000, because
it has inherited the line

x <- y(x, ...) - (0 : (n-1)) / n

from ks.test(). This is clearly a bug in ks.testUntied(). In ks.test() I
don't know whether it represents a bug, a conflict in definitions, or a
difference that becomes important with probability 0.
--
A. G. McDowell

Cyril Goutte

Jul 10, 2001, 5:33:50 AM

I think there are two separate points in this discussion:
- How to calculate the empirical cdf from a sample
- Evaluating the fit of the sample to a distribution (potentially
  using the empirical cdf)

I still fail to see that the empirical cdf may be sensitive to
the "exact definition" you choose. Indeed:

"A. G. McDowell" wrote:
> If you follow Conover (Practical Nonparametric Statistics) ...


> if you have the sample c(rep(0, 900), rep(1, 100)) then you will work out
> that S(0) = 0.9 and S(1) = 1.0.

> If you follow Stuart, Ord, and Arnold (Kendall's etc., Classical
> Inference and the Linear Model) ...


> it would seem that S(0) = 900/1000 and S(1) = 1000/1000

which is the same, and what seems (to me) the correct answer.

Of course based on the example you provide, I'll have to take back
my optimistic statement that R is calculating the correct value of D.


> I guess the question of which statistic you want to compute and test is
> really a question about what your null and alternative hypotheses are,

Well, for a given hypothesis you could come up with several statistics
(which is the case for goodness-of-fit tests, if I am not mistaken), so
there isn't really a one-to-one relationship between hypotheses and
statistics such that when you take one given hypothesis you _have to_
use a given statistic.

Nice examples anyway. BTW I have Matlab code that does the correct
calculation of D, if that would help.

Cyril.

--
Cyril Goutte
