
Re: Kolmogorov-Smirnov 2-sample test


Andy Ross

Jul 22, 2010, 5:42:45 AM
Bill Rowe wrote:
> On 7/20/10 at 3:41 AM, dar...@wolfram.com (Darren Glosemeyer) wrote:
>
>> Here is some code written by Andy Ross at Wolfram for the two
>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>> the test statistic, and KSBootstrapPValue provides a bootstrap
>> estimate of the p-value given the two data sets, the number of
>> simulations for the estimate and the test statistic.
>
>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
>> &]]/Length[data]
>
>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>> n2 = Length[data2], T},
>> e1 = empiricalCDF[sd1, #] & /@ udat;
>> e2 = empiricalCDF[sd2, #] & /@ udat;
>> T = Max[Abs[e1 - e2]];
>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>> ]
>
> After looking at your code above I realized I posted a very bad
> solution to this problem. But, it looks to me like there is a
> problem with this code. The returned result
>
> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>
> seems to have an extra factor in it. Specifically 1/Sqrt[n1].
> Since n1 is the number of samples in the first data set,
> including this factor means you will get a different result by
> interchanging the order of the arguments to the function when
> the number of samples in each data set is different. Since the
> KS statistic is based on the maximum difference between the
> empirical CDFs, the order in which the data sets are used in the
> function should not matter.
>
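To make the asymmetry concrete (prefactor below is just a throwaway name for the coefficient in question, not anything from the code above):

prefactor[n1_, n2_] := (1/Sqrt[n1]) Sqrt[(n1*n2)/(n1 + n2)];
{prefactor[10, 20], prefactor[20, 10]} // N
(* {0.816497, 0.57735} -- not symmetric in the two sample sizes,
   unlike Sqrt[(n1*n2)/(n1 + n2)] T, which is unchanged when the
   arguments are swapped. *)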

You are absolutely correct. The factor should be removed. I believe it
is a remnant of an incomplete copy and paste.

-Andy

Andy Ross

Jul 23, 2010, 7:09:03 AM

I've corrected the error in my code from before. The p-value
computation was giving low estimates because I was using RandomChoice
rather than RandomSample. I believe this should do the job (though
rather slowly).
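The distinction, on a toy list: RandomSample returns a random permutation of its argument (sampling without replacement), which is what splitting the pooled data into two pseudo-samples requires, while RandomChoice samples with replacement and can repeat elements.

RandomSample[Range[5]]       (* e.g. {3, 1, 5, 2, 4} -- a permutation *)
RandomChoice[Range[5], 5]    (* e.g. {2, 2, 5, 1, 2} -- duplicates possible *)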

empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]

splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}

KolmogorovSmirnov2Sample[data1_, data2_] :=
Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
n2 = Length[data2], T},
e1 = empiricalCDF[sd1, #] & /@ udat;
e2 = empiricalCDF[sd2, #] & /@ udat;
T = Max[Abs[e1 - e2]];

(Sqrt[(n1*n2)/(n1 + n2)]) T // N
]

KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
dfts = ConstantArray[0, MCSamp];
Do[
dfts[[i]] =
KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1]
, {i, MCSamp}
];
Length[Select[dfts, # >= T &]]/MCSamp // N
]

Example:

data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
-0.68592, -0.0461418, -0.29906, 0.209381};

data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
-0.0812646, -1.1592, 0.138245, -0.314907};

In[41]:= KolmogorovSmirnov2Sample[data1, data2]

Out[41]= 0.67082

Using 1000 bootstrap samples...

In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]

Out[42]= 0.791

-Andy

Ray Koopman

Jul 24, 2010, 5:07:07 AM

The ugly procedural code in ks2 returns the exact p-value very
quickly:

AbsoluteTiming@ks2[data1,data2]

{0.005842 Second,{30,0.78693}}

Aaron Bramson

Jul 26, 2010, 6:47:34 AM
Hello everybody and thank you,

This has been very helpful, and now the two-sided K-S test for Mathematica
is online for everybody to enjoy.

I have implemented the new code from Andy and from Ray on my data set, and the
code from Ray works out better for me. Though I don't have the skill to
decipher what that "ugly" code is doing, I've verified several results, so I'm
using those exact p-values. I'm going to build a table of the p-values from
these tests, which is then made into a plot over time, with the test performed
on the individual-trial data streams of two cohorts at each time step (roughly
as in the sketch below).
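
A rough sketch of that table-and-plot step using the functions defined above
(cohortA and cohortB are hypothetical lists holding each cohort's samples at
every time step; Ray's exact ks2 p-value would slot in the same way):

pvals = Table[
   KS2BootStrapPValue[cohortA[[t]], cohortB[[t]],
    KolmogorovSmirnov2Sample[cohortA[[t]], cohortB[[t]]], 1000],
   {t, Length[cohortA]}];
ListLinePlot[pvals, AxesLabel -> {"time step", "p-value"}]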

I have one last question, or maybe it's a request. In Ray's code, if I put in
two data sets in which all the points have the same value (e.g. all zero), the
result is not a K-S statistic of 0 and a p-value of 1, but rather
{-\[Infinity], 0.}. That doesn't seem like the right answer (and in any case
not an answer I expect or can use), so this input combination doesn't work
with how the technique calculates the statistics. I'd like to request a small
change to the code Ray provided so that if the inputs are all identical the
output is {0, 1} instead of {-\[Infinity], 0.}. I could do this post facto
with a replacement rule (sketched below), but it would probably be better and
faster to handle it in the original calculation, and with THAT code I don't
know where to make the appropriate changes.
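
The post-facto fix could be as simple as a replacement rule on the returned
pair (a sketch only; d1 and d2 stand for the two data sets, and it assumes ks2
returns the literal {-\[Infinity], 0.} in the degenerate case):

ks2[d1, d2] /. {-\[Infinity], 0.} -> {0, 1}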

Again, thanks everybody for your help.

Best,
Aaron


p.s. I may end up using the Kuiper test and I might therefore have a similar
question about implementing that in Mathematica very soon.

Ray Koopman

Jul 26, 2010, 7:06:22 AM
(* Two-sample KS test with an exact p-value; returns {statistic, p-value},
   the statistic being on the n1*n2*D scale. The Do loop counts, by dynamic
   programming over the n1 x n2 lattice, the orderings of the pooled sample
   whose paths never have |n2*i - n1*j| >= x at a point where the pooled
   order statistics strictly increase (this handles ties); the p-value is
   the complementary fraction of the Multinomial[n1, n2] orderings. *)
ks2a[y1_, y2_] :=
 Block[{n1 = Length@y1, n2 = Length@y2, pool = Sort@Join[y1, y2], x, n, u},
  If[Equal@@pool, {0, 1.},   (* all pooled values identical *)
   {x = Max@Abs[n2*Tr@UnitStep[y1 - #] - n1*Tr@UnitStep[y2 - #] & /@
       Rest@Union@pool],
    n = n1 + n2; u = Table[0, {n2 + 1}];
    Do[Which[
      i + j == 0, u[[j + 1]] = 1,
      i + j < n && pool[[i + j]] < pool[[i + j + 1]] &&
       Abs[n2*i - n1*j] >= x, u[[j + 1]] = 0,
      i == 0, u[[j + 1]] = u[[j]],
      j > 0, u[[j + 1]] += u[[j]]],
     {i, 0, n1}, {j, 0, n2}];
    N[1 - Last@u/Multinomial[n1, n2]]}]]

ks2a[{1,1,1},{1,1,1,1}]

{0,1.}
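
As a cross-check (assuming ks2a differs from the earlier ks2 only in the guard
for identical inputs), ks2a[data1, data2] should reproduce the {30, 0.78693}
reported above: 30 is the statistic on the n1*n2*D scale, and 30/(10*10) = 0.3
matches Andy's 0.67082 = Sqrt[100/20]*0.3.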
