Tests for sameness of peak of two continuous distributions

Christoph Ruehlemann

Feb 26, 2013, 11:54:54 AM

to corplin...@googlegroups.com, Franca Kirchberg

Dear all,

Suppose you have two continuous distributions X and Y (for positions of certain items in texts, expressed as values ranging between 0 and 1), and you assume for certain reasons that they will have their *peaks* in the same positional segment (say, between 0.8 and 0.9). Suppose further that your histograms and estimated density curves for the two distributions suggest that you are right, i.e. that the distributions have their peaks in the same segment (see below). How can you formulate appropriate hypotheses, and how can you test them?

Best

Chris

Stefan Th. Gries

Feb 26, 2013, 12:03:26 PM

to corplin...@googlegroups.com

Any reason you don't simply do ks.test?
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Christoph Ruehlemann

Feb 26, 2013, 12:28:30 PM

to corplin...@googlegroups.com

Because ks.test seems to test whether the two distributions seen as a whole are different. What I am interested in is more specific: I'd like to test whether the distributions have their highest frequencies in the same value segment ...
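
One way to look at the peak location directly, rather than at the distributions as a whole, is to estimate each peak from a kernel density and bootstrap the difference. The following is only a rough sketch and not something proposed in the thread: the beta-distributed vectors x and y are made-up stand-ins for the real position data, and peak() is a hypothetical helper name.

```r
# Hypothetical stand-in data: positions in [0, 1] simulated from beta distributions
set.seed(42)
x <- rbeta(200, 8, 2)  # assumed first distribution (theoretical mode 0.875)
y <- rbeta(200, 7, 2)  # assumed second distribution (theoretical mode approx. 0.857)

# Hypothetical helper: location of the highest point of the kernel density estimate
peak <- function(v) {
  d <- density(v, from = 0, to = 1)
  d$x[which.max(d$y)]
}

# Bootstrap the difference between the two estimated peak locations
boot.diff <- replicate(1000,
  peak(sample(x, replace = TRUE)) - peak(sample(y, replace = TRUE)))

peak(x); peak(y)                       # estimated peak positions
quantile(boot.diff, c(0.025, 0.975))  # percentile interval for the difference
```

A narrow interval around 0 would be compatible with both peaks lying in the same segment; how narrow is narrow enough has to be decided beforehand (for instance, the width of one segment).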


Stefan Th. Gries

Feb 26, 2013, 12:39:03 PM

to corplin...@googlegroups.com

> Because ks.test seems to test whether the two distributions seen as a whole are different. What I am interested in is more specific: I'd like to test whether the distributions have their highest frequencies in the same value segment ...

But the ks.test DOES react to that because if the peak of the
cumulative distribution function / density is in different places then
ks.test will be significant even if both vectors have the normal
distribution. The following code shows this clearly by comparing two
normal distributions with the same n (100) and the same sd (2) but
with different means (5 and 3). As you can see, ks.test is ***.

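##################
set.seed(1) # make the simulated samples reproducible
qwe <- rnorm(100, 5, 2) # n=100, mean=5, sd=2
asd <- rnorm(100, 3, 2) # n=100, mean=3, sd=2
par(mfrow=c(3,2)) # 3x2 grid of panels
# raw values in sampling order
plot(qwe, col="blue") # panel 1
plot(asd, col="red") # panel 2
# histograms on the density scale, with both density curves overlaid
hist(qwe, xlim=c(0,10), ylim=c(0, 0.4), freq=FALSE, col="blue") # panel 3
lines(density(qwe), col="blue"); lines(density(asd), col="red")
hist(asd, xlim=c(0,10), ylim=c(0, 0.4), freq=FALSE, col="red") # panel 4
lines(density(asd), col="red"); lines(density(qwe), col="blue")
# empirical cumulative distribution functions, each with the other overlaid
plot(ecdf(qwe), xlim=c(0,10), col="blue") # panel 5
lines(ecdf(asd), col="red", lwd=0.5, pch="")
plot(ecdf(asd), xlim=c(0,10), col="red") # panel 6
lines(ecdf(qwe), col="blue", lwd=0.5, pch="")
par(mfrow=c(1,1)) # reset the plotting layout

# two-sample Kolmogorov-Smirnov test
ks.test(qwe, asd)
##################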

STG

Christoph Ruehlemann

Feb 26, 2013, 12:55:12 PM

to corplin...@googlegroups.com

Cool stuff!! Thank you.

So, if the test is not significant, that is conclusive evidence that the two distributions must have their peak in the same segment?

Stefan Th. Gries

Feb 26, 2013, 12:58:43 PM

to corplin...@googlegroups.com

Well, I am not the foremost expert on distribution fitting, but I
think it's one strategy to test it, yes. Another one, but more
involved, would be to do chi-squared tests for goodness of fit.
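
A minimal sketch of what the chi-squared route could look like, under assumptions not stated in the thread: the positions are binned into ten equal segments and the two vectors of counts are compared with chisq.test (strictly a test of homogeneity rather than goodness of fit). The vectors x and y are simulated stand-ins and count() is a hypothetical helper.

```r
# Hypothetical stand-in data: positions in [0, 1]
set.seed(1)
x <- rbeta(300, 8, 2)
y <- rbeta(300, 8, 2)

# Bin both samples into ten segments of width 0.1
breaks <- seq(0, 1, by = 0.1)
count <- function(v) as.vector(table(cut(v, breaks, include.lowest = TRUE)))
tab <- rbind(x = count(x), y = count(y))
colnames(tab) <- levels(cut(x, breaks, include.lowest = TRUE))

# Drop segments that are empty in both samples
tab <- tab[, colSums(tab) > 0, drop = FALSE]

# Simulated p-value, because some segments have very low expected counts
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
tab
res
```

Like ks.test, this compares the distributions across all segments and not only the segment containing the peak.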

Christoph Ruehlemann

Feb 27, 2013, 11:28:38 AM

to corplin...@googlegroups.com, Franca Kirchberg

Still grappling with this issue ...

If theoretical premises and visual inspection of the data suggest that two continuous distributions are the same, how can I formulate the hypotheses to test for that assumed sameness?

For the ks.test, the H0 will have to state that the distributions are the same, while the H1 will have to state that they are different. If the test result is NOT significant, this is, as far as I know, NOT evidence that H0 is true and H1 is false - a non-significant result does not support either hypothesis.

Any idea how to fix that problem?

Best
Chris

Stefan Th. Gries

Feb 27, 2013, 11:33:43 AM

to corplin...@googlegroups.com

> If the test result is NOT significant, this is, as far as I know, NOT evidence that H0 is true and H1 is false - an insignificant result does not support either hypothesis. Any idea how to fix that problem?

But that's true of any result of a significance test:

- If you do a ks.test and the result IS significant (p=0.008), then
this is still not proof that the H1 is correct, because the p-value
means that a D-value at least as large as the observed one would
happen by chance 0.8% of the time if H0 were true.
- If you do any test and the result is not significant, then this is
not proof that the H1 is not correct, because any number of reasons
can be responsible for the ns: lack of the expected effect is one,
yes, but so is sample size, large variability, ...

If your H1 is sameness (D=0), then, when the result is ns, for the
time being you'd have to admit that you cannot accept H1.
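
The point about sample size can be illustrated with a small simulation. This is only a sketch under assumed settings (two normal samples whose means differ by 1 at sd 2; power.ks is a hypothetical helper name): it estimates how often ks.test comes out significant at different n, although the two populations really do differ.

```r
set.seed(1)

# Hypothetical helper: share of significant ks.test results (p < 0.05)
# for two normal samples of size n whose true means differ by 1
power.ks <- function(n, reps = 200) {
  mean(replicate(reps, ks.test(rnorm(n, 5, 2), rnorm(n, 4, 2))$p.value < 0.05))
}

power <- sapply(c(10, 50, 200), power.ks)
names(power) <- c("n=10", "n=50", "n=200")
power
```

With small samples most runs are ns even though the distributions are different, which is why an ns result on its own says little about sameness.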

Christoph Ruehlemann

Feb 27, 2013, 1:04:16 PM

to corplin...@googlegroups.com

> If your H1 is sameness (D=0), then, when the result is ns, for the time being you'd have to admit that you cannot accept H1.

But can the H1 state sameness (which is normally what the H0 states)? And, as said before, if the result is ns, you cannot accept the H0 (in this case, differentness) either - a pretty bad fix...

Stefan Th. Gries

Feb 27, 2013, 1:08:40 PM

to corplin...@googlegroups.com

> But can the H1 state sameness (which is normally what the H0 states)?

Well, think about it by considering the logic of shapiro.test, whose H0
is "the current data do not differ from a normal distribution" and
whose H1 is "the current data differ from a normal distribution".
Thus, if your ks.test is ns, you have no data to believe that your one
curve is diff from the other.
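
For illustration, a short sketch of that shapiro.test logic on made-up data (the vectors norm.data and skew.data are assumptions, not data from the thread): one sample is drawn from a normal distribution, the other from a clearly skewed one.

```r
set.seed(1)
norm.data <- rnorm(100, 5, 2)  # drawn from a normal distribution
skew.data <- rexp(100, 1)      # drawn from a strongly right-skewed distribution

# H0: the data do not differ from a normal distribution
p.norm <- shapiro.test(norm.data)$p.value
p.skew <- shapiro.test(skew.data)$p.value
p.norm  # a large p-value means there is no reason to reject normality
p.skew  # a small p-value means normality is rejected
```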

Christoph Ruehlemann

Feb 27, 2013, 1:26:43 PM

to corplin...@googlegroups.com

To make sure I get this right:

As regards ns results, you recommend applying the same logic to ks.test as to shapiro.test: if shapiro.test is ns, this is evidence that the distribution is normal. By the same logic, if ks.test is ns, this is evidence that the distributions are the same. So, rejecting H1 (non-normality for shapiro.test, differentness for ks.test) allows you to accept H0 (normality and sameness, respectively) - correct?

Stefan Th. Gries

Feb 27, 2013, 1:27:48 PM

to corplin...@googlegroups.com

> By the same logic, if ks.test is ns this is evidence that the distributions are the same.

I would think so, yes.

ludovic de cuypere

Feb 27, 2013, 1:28:40 PM

to corplin...@googlegroups.com

Dear Christoph

Perhaps you could use a Mann-Whitney U / Wilcoxon RS test? The location-shift assumption doesn't need to hold.
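
A minimal sketch of the call, reusing the simulated vectors from the earlier ks.test example as stand-ins for the real data (an assumption; the thread does not show the actual positions):

```r
set.seed(1)
qwe <- rnorm(100, 5, 2)  # simulated sample, mean 5
asd <- rnorm(100, 3, 2)  # simulated sample, mean 3

# Two-sample Wilcoxon rank-sum (Mann-Whitney U) test
res.w <- wilcox.test(qwe, asd)
res.w
```

Without the location-shift assumption, the test is usually read as asking whether a value from one sample tends to be larger than a value from the other, so it does not address the position of the peak as such.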

Kind regards

Ludovic De Cuypere


Christoph Ruehlemann

Feb 27, 2013, 1:32:35 PM

to corplin...@googlegroups.com

With wilcox.test you run into the same kind of problem as with ks.test: the result is still ns ;)

Chris
