Q: what conclusions could be drawn from the data, p-value, and CI?


Cosine

Apr 29, 2021, 3:08:08 AM
We conducted a test on two groups (A and B). We used a
15-item scale to measure the results. A cut-off score of
6 (scores ranging from 0 to 15, with a higher score
indicating a stronger reaction) was set to differentiate
individuals with a clinical reaction from normal individuals.

The null hypothesis is that there is no difference between the two groups.
The alternative hypothesis is that the reaction of the members
of Group A is greater than that of Group B.

We defined the difference = score of A - score of B.
We chose alpha = 0.05.

We got the following data summarized in the table below.

Case  Mean Difference  P-value  95% CI          N
1     0.15             0.001    (0.05, 0.25)    2000
2     2.10             0.005    (1.25, 2.95)    1200
3     1.30             0.089    (-2.10, 3.70)   400

In addition to the following analysis, what else could we
draw from the data?

Case-1:
P-value < alpha -> significant
95% CI entirely > 0 -> A > B

Case-2:
the same conclusions as Case 1

Case-3:
P-value > alpha -> not significant
95% CI contains 0 -> not sure whether A > B or A < B

duncan smith

Apr 29, 2021, 11:07:53 AM
It looks like some kind of class exercise / assignment. I'd say look at
the results carefully, and think what other information you'd like to
have in order to make sense of it. I'd have several questions to ask,
starting with exactly what these 3 cases are (I'm pretty sure what
they're not).

Duncan


Cosine

Apr 29, 2021, 12:29:24 PM
Cosine wrote on Thursday, April 29, 2021 at 3:08:08 PM [UTC+8]:
We could intuitively connect the P-value inference with the CI inference by: P-value < alpha <=> reject H0 <=> the (1-alpha) CI does not contain 0.
But is there a formal way to prove the latter part, i.e., making the inference by CI?

We could also draw a conclusion of clinical significance if we have additional information on a clinically meaningful value. Then we could
say that the result is clinically significant if 1) the CI contains that clinically meaningful value, and 2) the width of the CI is narrow enough. Nevertheless,
are there ways to determine objectively whether the width of the CI is too wide?

David Jones

Apr 29, 2021, 1:21:05 PM
A standard, and best, way of constructing a confidence interval in any
general situation is to define the confidence interval to contain
exactly those values for which the significance test of the hypothesis
that the true value is that particular value is not rejected. This is
standard stuff in any reliable text-book or statistics course.
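That duality can be checked numerically. A minimal sketch, assuming a one-sample z-test with known SD (hypothetical numbers, not from the thread): a value mu0 is rejected at level alpha exactly when it falls outside the (1 - alpha) CI.

```python
import math

def z_test_p(xbar, mu0, sd, n):
    """Two-sided p-value for H0: mu = mu0, known-SD z-test."""
    z = (xbar - mu0) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def z_ci(xbar, sd, n):
    """95% CI for mu; 1.959964 is the two-sided 5% normal quantile."""
    half = 1.959964 * sd / math.sqrt(n)
    return (xbar - half, xbar + half)

# The duality: mu0 is rejected at level 0.05  <=>  mu0 lies outside the 95% CI.
xbar, sd, n = 0.9, 2.0, 100
lo, hi = z_ci(xbar, sd, n)
for mu0 in (-0.5, 0.0, 0.5, 0.89, 1.5, 2.0):
    rejected = z_test_p(xbar, mu0, sd, n) < 0.05
    outside = not (lo <= mu0 <= hi)
    assert rejected == outside
```

The same inversion works for the two-sample difference: the 95% CI excludes 0 exactly when the two-sided test of "no difference" rejects at 5%.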


> We could also draw the conclusion of clinical significance if we
> have additional information on a clinically meaningful value. Then we
> could say that the result is clinically significant if 1) the CI
> consists of that clinical measure, and 2) the width of the CI is
> narrow enough. Nevertheless, are there ways to determine if the width
> of the CI is too wide objectively?

You need to revise this to say that you have a result of clinical
importance if the confidence interval contains only values that are
large enough to be medically useful, and NO OTHERS. That last
stipulation replaces your concern about the confidence interval being
too wide.

Rich Ulrich

Apr 29, 2021, 1:58:02 PM
On Thu, 29 Apr 2021 00:08:06 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>We conducted a test on two groups (A and B). We used a
>15-item scale to measure the results. A cut-off score of
> 6 (scores ranging from 0 to 15, with the higher score being
> indicative for stronger reaction) was set to differentiate
>the individuals with a clinical reaction from normal individuals.
>
> The null hypothesis is that the two groups have no difference.
>The alternative hypothesis is that the reaction of the members
>of Group A is greater than that of Group B.
>
> We defined the difference = score of A - score of B.
> We chose the alpha = 0.05
>
> We got the following data summarized in the table below.
>
>Case  Mean Difference  P-value  95% CI          N
>  1   0.15             0.001    (0.05, 0.25)    2000
>  2   2.10             0.005    (1.25, 2.95)    1200
>  3   1.30             0.089    (-2.10, 3.70)   400
>
> In addition to the following analysis, what else could we
>draw from the data?

Bad reporting. Is the N a total, for equal group sizes?

Whatever the "cases" are, they are vastly different in SD.
Perhaps Case 1 has scores near zero for all. Or: it would make
more sense if Case 1 happened to report "Average item score"
whereas the others reported "Scale total". That would make
the adjusted line for Case 1 read

1  2.25  0.001  (0.75, 3.75)  2000

I haven't done the calculations to be sure, but that does
seem like a large SE (on all three) for the reported Ns and
a 15-point scale.


Then too, some numbers have to be wrong. For Case
3, the mean difference is the midpoint of (-1.1, 3.7), not
of the reported (-2.1, 3.7). I assume -1.1 is correct.

But, more seriously, the test results (CI) are inconsistent with
the reported p-values. The SE for each comparison, the
denominator of the t-tests, is about 1/4th the range of the
CI. Using that as a close approximation gives me t-tests of
3.0, 4.94, and 1.08, respectively. The t for Case 2 is clearly
the largest, yet its implied p-value is far smaller than the
reported "p-value = 0.005".
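That approximation (SE ~ CI range / 4, the "+/- 2" rule) takes only a few lines to reproduce; a sketch, using the assumed corrected lower bound of -1.10 for Case 3:

```python
# Recover approximate t-statistics from the reported mean differences and
# 95% CI widths, using SE ~= (CI width) / 4 (the +/- 2 approximation).
cases = [
    # (mean difference, CI low, CI high); Case 3 uses the assumed -1.10
    (0.15, 0.05, 0.25),
    (2.10, 1.25, 2.95),
    (1.30, -1.10, 3.70),
]
for i, (diff, lo, hi) in enumerate(cases, start=1):
    se = (hi - lo) / 4          # approximate standard error of the difference
    t = diff / se               # approximate t-statistic
    print(f"Case {i}: SE ~ {se:.3f}, t ~ {t:.2f}")
# prints t ~ 3.00, 4.94, 1.08 for Cases 1-3
```

A t of 4.94 with N = 1200 corresponds to a p-value far below 0.005, and a t of 1.08 to a p-value well above the reported 0.089, which is the inconsistency being pointed out.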


>
> Case-1:
> P-value < alpha -> significant
> 95%CI all > 0 -> A > B
>
> Case-2:
> the same as those of A
>
> Case-3:
> P-value > alpha -> insignificant
> 95%CI consists of 0 -> not sure if A > B or A < B

If this is a homework assignment, as Duncan suggests,
you should give credit where credit is due.

--
Rich Ulrich

Cosine

Apr 29, 2021, 2:45:01 PM
Rich Ulrich wrote on Friday, April 30, 2021 at 1:58:02 AM [UTC+8]:
This has nothing to do with homework whatsoever.

The table came from Table I of this following paper.

Aarts, S., B. Winkens and M. van den Akker (2012). "The insignificance of statistical significance." European Journal of General Practice 18(1): 50-52.

But the 95% CI of case 3 was printed as: 21.10-3.70.

Rich Ulrich

Apr 29, 2021, 9:43:03 PM
On Thu, 29 Apr 2021 11:44:58 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>Rich Ulrich wrote on Friday, April 30, 2021 at 1:58:02 AM [UTC+8]:
Without looking, I would guess that I correctly nailed the
distinction of Case 1 vs. 2 and 3. And they were trying to make
a point which turns out to be a point about incompetent readers.

I'm reminded of an article I read, maybe 1985, that documented
the surprisingly high error rate for footnotes to scientific studies.
(That is, where references cited gave the wrong page, named the
journal wrong, or whatever.) The next issue of the journal
included a note that apologized for three errors in the footnotes
of that article.

Or, to the point: I don't have much respect for people who talk
about "the insignificance of statistical significance".
It doesn't surprise me a bit that they carelessly screwed up a table
both logically and typographically, because such people are not
careful people.

>
>But the 95% CI of case 3 was printed as: 21.10-3.70.

Okay. You guessed wrong on the correction. It was -1.10,
not 21.10 or -2.10.

--
Rich Ulrich

David Duffy

Apr 30, 2021, 1:49:53 AM
Cosine <ase...@gmail.com> wrote:
> In addition to the following analysis, what else could we
> draw from the data?

A different way of thinking about what a P-value is telling you is via the
literature on estimation or calibration of posterior P-values, e.g. Sellke
et al (2001). The argument is easiest to see for a result with P = 0.05
when you have set alpha = 0.05: if what you saw is the true effect size,
then you only have a 50% chance of getting a significant result if you
repeated exactly the same study (same N etc.).

For simple states of affairs,

"...here is the basic and surprising conclusion for normal testing, first
established (theoretically) by Berger and Sellke (1987). Suppose it is
known, a priori, that about 50% of the drugs tested have a negligible
effect. (We shortly consider the more general case.) Then:

"1. Of the Di for which the p value ~ .05, at least 23% (and typically
close to 50%) will have negligible effect.

"2. Of the Di for which the p value ~ .01, at least 7% (and typically
close to 15%) will have negligible effect.

If H0 and H1 have equal prior probabilities of 1/2, Sellke et al give

alpha(p) = 1/(1 + 1/(-e p log(p)))

as the posterior probability of H0, and as a frequentist calibration of
p. This is only simple for "precise" alternative hypotheses, obviously.
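As a quick numerical sketch of that calibration (plain Python; the formula is the Sellke et al lower-bound calibration, valid for p < 1/e, so the numbers sit between the "at least" and "typically close to" figures quoted above):

```python
import math

def posterior_h0(p):
    """Sellke et al. (2001) calibration: lower bound on P(H0 | data)
    when H0 and H1 have equal prior probability 1/2 (valid for p < 1/e)."""
    b = -math.e * p * math.log(p)   # the -e p log(p) bound
    return 1.0 / (1.0 + 1.0 / b)

print(posterior_h0(0.05))  # ~0.289: a "significant" p of 0.05 still leaves
                           # roughly a 29% chance that H0 is true
print(posterior_h0(0.01))  # ~0.111
```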

Relatedly, in genetic linkage analysis, where we set the critical
alpha to 0.0003 (chosen because there are 22 (pairs of) chromosomes),
the power to replicate a *true* finding using the same size and type
dataset (with P close to 0.0003) is ~20% (obtained via simulations).

You can think about the three results in your example and the
"replication crisis" through this lens.

Cosine

Apr 30, 2021, 8:42:20 AM
How do we determine if the width of the CI is adequate or too wide?

The corrected data of Table I is given below:

Case  Mean Difference  P-value  95% CI          N
1     0.15             0.001    (0.05, 0.25)    2000
2     2.10             0.005    (1.25, 2.95)    1200
3     1.30             0.089    (-1.10, 3.70)   400

For the data provided by the above paper, the author wrote:

Let us reconsider the above-mentioned hypothetical study. The null hypothesis states that the mean difference between females and males on the GDS-15 (scale ranging from 0 to 15) is zero. Hence, if zero is detected in the 95% CI, the null hypothesis is not rejected. Examples of possible study results, using an α of 5%, are displayed in Table I. ...
Example 2 is not only statistically significant but also clinically relevant; the difference between females and males on the GDS-15 is approximately two whole points. Moreover, the confidence interval is quite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
narrow, which indicates that the sample size is large enough to make a proper judgement.
^^^^^^^^^
What is the basis for the author to make this judgment?

The author also wrote:
Example 3 is not statistically significant. The confidence interval in this example is very large (almost six
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
points), which makes it difficult to draw any firm conclusions. Since the confidence interval in this
^^^^^^^
example includes both negative and positive values, it is not yet clear if there is a difference between these two groups (if females report more depressive symptoms than males or vice versa). Consequently, this study should be repeated using a larger sample size, which will decrease the width of the confidence interval.

Again, why could the author make this statement? What did it mean by "almost six points"?


Rich Ulrich

May 1, 2021, 7:36:06 PM
On Fri, 30 Apr 2021 05:42:17 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>How do we determine if the width of the CI is adequate or too wide?
>
> The corrected data of Table I is given below:
>
>Case  Mean Difference  P-value  95% CI          N
>  1   0.15             0.001    (0.05, 0.25)    2000
>  2   2.10             0.005    (1.25, 2.95)    1200
>  3   1.30             0.089    (-1.10, 3.70)   400
>


Here's some computation showing Cohen's d for each Case.
Cohen's d is the usual recommendation for two-group
comparisons of effect size. That seems very relevant to the
reported title of that paper.

Cohen's d = (m1-m2) / s_w for the Means and Within SD.

The s_w can be recovered from the t-test: note, the t is
incorporated in the computation of the CI, approximately
+/- 2 (easier than 1.96) for the 95% CI.

t-test t= (m1-m2)/ s_diff where I compute the standard error of
the difference, using the common s_w for Case 3, N= 400 as 200+200:

The variance of a difference is equal to the sum of the variances,
thus,

s_diff= sqrt( s_w**2 /200 + s_w**2 /200)
= sqrt( 2* s_w**2 /200)
= s_w /10

Or, s_w= 10* s_diff .

For Case 3, the range for +/- 1.96 is about 4* s_diff.
For Case 3, the range is 4.8, so that s_diff is 1.2.
Thus s_w is computed as 10 times that, or 12.

Cohen's d would be a "small" effect, 0.11 (from 1.3/12); but
that is less relevant than the fact that "12" is impossible as the
SD for scores between (0, 15). If all scores are at 0 and
15, equally distributed, the maximum SD of 7.5 is achieved,
as you get by re-scaling a 0-1 variable to 0-15.

Computations for Cases 1 and 2 get s_w's of 1.12 and 7.36
(nearly the max of 7.5); and Cohen's d's, respectively, of 0.13
and 0.29. Case 2 has a moderate difference.
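The recovery of s_w and Cohen's d above can be reproduced in a few lines (a sketch assuming equal group sizes of N/2, as in the computation, and SE ~= CI width / 4):

```python
import math

# (mean difference, CI low, CI high, total N), using the corrected Case 3 CI
cases = [
    (0.15, 0.05, 0.25, 2000),
    (2.10, 1.25, 2.95, 1200),
    (1.30, -1.10, 3.70, 400),
]
for i, (diff, lo, hi, n) in enumerate(cases, start=1):
    se_diff = (hi - lo) / 4            # SE of the difference ~= CI width / 4
    # se_diff = s_w * sqrt(2 / (N/2)) = 2 * s_w / sqrt(N), so:
    s_w = se_diff * math.sqrt(n) / 2   # recovered within-group SD
    d = diff / s_w                     # Cohen's d
    print(f"Case {i}: s_w ~ {s_w:.2f}, d ~ {d:.2f}")
# prints s_w ~ 1.12, 7.36, 12.00 and d ~ 0.13, 0.29, 0.11
```

The Case 3 within-group SD of 12 exceeds the maximum possible SD of 7.5 for a scale bounded by 0 and 15, which is the impossibility being pointed out.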

I don't like to criticize a paper from a distance, that is, without
actually reading it. I'm using the numbers and description,
as given.

Am I all confused, and screwing up? or is this example, as
it has been presented, totally bad?


>For the data provided by the above paper, the author wrote:
>
>Let us reconsider the above-mentioned hypothetical study. The null hypothesis states that the mean difference between females and males on the GDS-15 (scale ranging from 0 to 15) is zero. Hence, if zero is detected in the 95% CI, the null hypothesis is not rejected. Examples of possible study results, using an α of 5%, are displayed in Table I. ...
>Example 2 is not only statistically significant but also clinically relevant; the difference between females and males on the GDS-15 is approximately two whole points. Moreover, the confidence interval is quite
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>narrow, which indicates that the sample size is large enough to make a proper judgement.
>^^^^^^^^^
> What is the basis for the author to make this judgment?

Knowing the subject matter (almost) always matters.

Females rate higher on typical depression scales (U.S.)
because of non-depressive artifacts, like TALKING more
with people about everything, including mood. Women
also see doctors more often, which is not entirely accounted
for by pregnancy or menstruation. Thus, such results as
these should be followed by showing that there are items that
/matter/ that are relevant and differ.



>
> The author also wrote:
>Example 3 is not statistically significant. The confidence interval in this example is very large (almost six
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>points), which makes it difficult to draw any firm conclusions. Since the confidence interval in this
>^^^^^^^
> Again, why could the author make this statement? What did it mean by almost 6 points?

That's what he calls 4.8. "Clumsy" makes many mistakes.
"Careless" fails to catch them.

>
>example includes both negative and positive values, it is not yet clear if there is a difference between these two groups (if females report more depressive symptoms than males or vice versa). Consequently, this study should be repeated using a larger sample size, which will decrease the width of the confidence interval.
>

They should have started with real data.

--
Rich Ulrich