even if we are able to clear up misconceptions of the technical meaning of
p, so we are all on the same page ... i have yet to see any compelling
explanations or compelling arguments about how p can help in the above ...
could someone give an example or two ... of how p values have really
advanced our knowledge and understanding of some particular phenomenon?
_________________________________________________________
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:d...@psu.edu
http://roberts.ed.psu.edu/users/droberts/drober~1.htm
Pick up any issue of JAMA or NEJM.
I admit I use P values. They *help* me separate the wheat from the
chaff. Are you saying you don't use them? (The question is earnest,
not rhetorical.)
i would be interested in knowing how p separates wheat from chaff
>Are you saying you don't use them? (The question is earnest,
>not rhetorical.)
sure, i see them but, i am much more interested in HOW the study was
conducted ... because, it is not the stats that make for success ... it is
how the investigation was conducted
no ... that is not sufficient ... just to look at a journal ... where p
values are used ... does not answer the question above ... that's circular
and just shows HOW they are used ... not what benefit is derived FROM their use
i would like an example or two where ... one can make a cogent argument
that p ... in its own right ... helps us understand the SIZE of an effect
... the IMPORTANCE of an effect ... the PRACTICAL benefit of an effect
maybe you could select one or two instances from an issue of the journal
... and lay them out in a post?
On 21 Mar 2003, Dennis Roberts wrote:
> At 09:55 PM 3/21/03 +0000, Jerry Dallal wrote:
> >dennis roberts wrote:
> >
> > > could someone give an example or two ... of how p values have really
> > > advanced our knowledge and understanding of some particular phenomenon?
> >
> >Pick up any issue of JAMA or NEJM.
>
> no ... that is not sufficient ... just to look at a journal ... where p
> values are used ... does not answer the question above ... that's circular
> and just shows HOW they are used ... not what benefit is derived FROM their use
>
> i would like an example or two where ... one can make a cogent argument
> that p ... in its own right ... helps us understand the SIZE of an effect
> ... the IMPORTANCE of an effect ... the PRACTICAL benefit of an effect
>
> maybe you could select one or two instances from an issue of the journal
> ... and lay them out in a post?
Am I missing something ... isn't it important to determine
whether an effect has a low probability of occurring by chance?
If an effect could have too readily occurred by chance, then its
size would not seem to matter much and there is no reason to
think that it has practical benefit in general. No one is saying
that p values are the be all and end all, but neither does that
mean they have no value for their intended purpose (i.e.,
identifying outcomes that are readily explained by random
factors).
Best wishes
Jim
============================================================================
James M. Clark (204) 786-9757
Department of Psychology (204) 774-4134 Fax
University of Winnipeg 4L05D
Winnipeg, Manitoba R3B 2E9 cl...@uwinnipeg.ca
CANADA http://www.uwinnipeg.ca/~clark
============================================================================
> i would like an example or two where ... one can make a cogent argument
> that p ... in its own right ... helps us understand the SIZE of an effect
> ... the IMPORTANCE of an effect ... the PRACTICAL benefit of an effect
>
No one ever claimed it did. Statistical significance and practical
importance are two different things. But P values can help determine
whether an apparently important difference is within sampling
variability of no effect.
Take a look at the journal Epidemiology to see what happens when their
use is outlawed.
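
For concreteness, a rough Python sketch of that kind of check, with
counts invented purely for illustration (a two-sample z test for
proportions):

from math import sqrt
from scipy.stats import norm

# Hypothetical counts: 30/100 improved on the new treatment, 20/100 on the old.
x1, n1 = 30, 100
x2, n2 = 20, 100
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                         # two-sided
print(f"difference = {p1 - p2:.2f}, z = {z:.2f}, p = {p_value:.3f}")
# roughly: difference = 0.10, z = 1.63, p = 0.10 -- a 10-point gap that
# looks "important" but is still within sampling variability of no effect.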
> >Are you saying you don't use them? (The question is earnest,
> >not rhetorical.)
>
> sure, i see them but, i am much more interested in HOW the study was
> conducted ... because, it is not the stats that make for success ... it is
> how the investigation was conducted
I should clarify my question. Are you saying you don't generate or
report P values yourself? Again, the question is in earnest, it is not
rhetorical.
This is the type of brainwashing which is accomplished by
the classical approach. The practical benefit only depends
on the size of the effect, and has nothing to do with the
chance that something that extreme would have occurred if
there was no effect at all.
Here is an extreme version of a bad example; there is a
disease which is 50% lethal. The old treatment has been
given to 1,000,000 people and 510,000 have survived.
There is a new treatment which has been given to 3 people,
and all have survived. You find you have the disease;
which treatment will you take?
The first has a very small p-value; it is about 20
sigma out. The second has a probability of 1/8 of
occurring by chance if the treatment does nothing.
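
A quick Python check of the arithmetic in this example, using the
counts given above:

from math import sqrt

# Old treatment: 510,000 survivors out of 1,000,000, against a 50% baseline.
n, x, p0 = 1_000_000, 510_000, 0.5
z = (x - n * p0) / sqrt(n * p0 * (1 - p0))
print(f"old treatment: z = {z:.1f} sigma above a 50% survival rate")   # 20.0

# New treatment: 3 survivors out of 3, under the same 50% baseline.
print(f"new treatment: P(3 of 3 survive | no effect) = {0.5 ** 3}")    # 0.125
# The "highly significant" result is a 1-point gain in survival; the
# "non-significant" one is 3 for 3.  The p-value alone does not settle
# which treatment to take.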
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558
I think the issue is one of terminology and philosophy, between
"probability" and "likelihood".
Once an effect has occurred, I am very uncomfortable with assigning
any probability other than 1 to its occurrence, and any probability
at all to its causes. I wrote something about this a couple of weeks
ago in the thread "Laplace and the Monty Hall paradox?", which may
be seen at <http://tinyurl.com/7yhk>, which is an alias for
http://groups.google.com/groups?threadm=MPG.18c2c43a1f54d69b98a7ae%40news.odyssey.net
--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
"Walrus meat as a diet is less repulsive than seal."
-- Harry de Windt, /From Paris to New York by Land/ (1904)
> Here is an extreme version of a bad example; there is a
> disease which is 50% lethal. The old treatment has been
> given to 1,000,000 people and 510,000 have survived.
> There is a new treatment which has been given to 3 people,
> and all have survived. You find you have the disease;
> which treatment will you take?
>
> The first has a very small p-value; it is about 20
> sigma out. The second has a probability of 1/8 of
> occurring by chance if the treatment does nothing.
Wouldn't it depend on the survival rate without treatment? With
advanced pancreatic cancer, for example, the second has a probability of
essentially 0 if the treatment does nothing.
I specifically stated that the survival rate without
treatment is .5. The old highly significant treatment
raises it to .51, within sampling error.
You're right. I see I missed it. Sorry.
A fascinating question that many IRBs and participants in clinical
trials face daily--forgoing standard treatment to avail oneself of a
"promising" new treatment.
On 22 Mar 2003, Herman Rubin wrote:
> In article <Pine.GSO.4.21.030321...@io.uwinnipeg.ca>,
> jim clark <cl...@uwinnipeg.ca> wrote:
> >Hi
>
> >On 21 Mar 2003, Dennis Roberts wrote:
>
> >> i would like an example or two where ... one can make a cogent argument
> >> that p ... in its own right ... helps us understand the SIZE of an effect
> >> ... the IMPORTANCE of an effect ... the PRACTICAL benefit of an effect
>
> >> maybe you could select one or two instances from an issue of the journal
> >> ... and lay them out in a post?
>
> >Am I missing something ... isn't it important to determine
> >whether an effect has a low probability of occurring by chance?
> >If an effect could have too readily occurred by chance, then its
> >size would not seem to matter much and there is no reason to
> >think that it has practical benefit in general. No one is saying
> >that p values are the be all and end all, but neither does that
> >mean they have no value for their intended purpose (i.e.,
> >identifying outcomes that are readily explained by random
> >factors).
>
> This is the type of brainwashing which is accomplished by
> the classical approach. The practical benefit only depends
> on the size of the effect, and has nothing to do with the
> chance that something that extreme would have occurred if
> there was no effect at all.
I would be very surprised if the "practical benefit" (say as
indicated by effect size) were completely independent of p value,
at least among a population of studies that included studies
with 0 benefit. Practical and statistical significance are not
identical, but that does not mean that they are independent.
Nor does a single hypothetical example, as below, address this
question.
> Here is an extreme version of a bad example; there is a
> disease which is 50% lethal. The old treatment has been
> given to 1,000,000 people and 510,000 have survived.
> There is a new treatment which has been given to 3 people,
> and all have survived. You find you have the disease;
> which treatment will you take?
>
> The first has a very small p-value; it is about 20
> sigma out. The second has a probability of 1/8 of
> occurring by chance if the treatment does nothing.
Note that I said "practical benefit in general." So how much
money should the health care system put into this new treatment
based on this study of 3 people? A second question is what you
would recommend or do yourself if 2 of 3 people had survived?
That is still 67% vs. 51%, a large difference if all you are
interested in is effect size.
Your example (especially for 2 out of 3 successes, since 1/8
approaches significance) nicely illustrates that one can obtain
large effect sizes without achieving anything like acceptable
levels of significance, presumably because of inadequate sample
sizes. But we should not put much confidence in conclusions from
such studies because of the lack of significance, although we
might be willing to gamble on the treatment being effective in
general, given sufficiently unfavourable circumstances.
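
A small Python sketch of the binomial arithmetic for those two cases
(assuming, as in the example, a 50% survival rate if the new treatment
does nothing):

from scipy.stats import binom

n, p0 = 3, 0.5                                  # 3 patients, 50% survival under H0
for k in (2, 3):
    p_one_sided = binom.sf(k - 1, n, p0)        # P(X >= k) under the null
    print(f"{k}/3 survive: rate {k/n:.2f}, one-sided p = {p_one_sided:.3f}")
# 2/3: rate 0.67, p = 0.500;  3/3: rate 1.00, p = 0.125.
# Huge apparent effects, yet neither comes close to conventional significance.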
> At 09:55 PM 3/21/03 +0000, Jerry Dallal wrote:
> >dennis roberts wrote:
> >
> > > could someone give an example or two ... of how p values have really
> > > advanced our knowledge and understanding of some particular phenomenon?
> >
> >Pick up any issue of JAMA or NEJM.
>
> no ... that is not sufficient ... just to look at a journal ... where p
> values are used ... does not answer the question above ... that's circular
> and just shows HOW they are used ... not what benefit is derived FROM their use
>
> i would like an example or two where ... one can make a cogent argument
> that p ... in its own right ... helps us understand the SIZE of an effect
> ... the IMPORTANCE of an effect ... the PRACTICAL benefit of an effect
>
> maybe you could select one or two instances from an issue of the journal
> ... and lay them out in a post?
Here's my selection of instances.
Whenever I read of striking epidemiology results
(usually, first in the newspaper), I *always* calculate
the p-level for whatever the gain was.
It *very* often is barely beyond 5%.
Also, it is *very* often based on testing, say, 20
or 50 dietary items; or on some intervention with
little experimental history.
So I say to myself, "As far as being a notable 'finding'
goes, this does not survive adjustment for multiple-testing.
I will doubt it."
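
A back-of-the-envelope version of that screen in Python; the reported
p of .04 and the 20 dietary items are invented numbers, just to show
the shape of the calculation:

# Hypothetical published result: p = 0.04, from a study that examined 20 items.
p_reported = 0.04
n_tests = 20

p_bonferroni = min(1.0, p_reported * n_tests)       # crude Bonferroni adjustment
per_test_threshold = 0.05 / n_tests                 # what each test would need
print(f"adjusted p ~= {p_bonferroni:.2f}")          # 0.80 -- not a notable finding
print(f"per-test threshold ~= {per_test_threshold:.4f}")   # 0.0025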
Then, I am not at all surprised when the followup studies in
five or ten years show that eggs are not so bad after all,
real butter is not so bad, or Hormone Replacement Therapy
is not necessarily so good. Serious students of the fields
were not at all surprised, basically, for the same reasons
that I'm not surprised.
Quite often, as with the HRT, there is the additional problem
of non-randomization. For that, I also consider the effect size,
too, and say, "Can I imagine particular biases that could
cause this result?" My baseline here is sometimes the
'healthy-worker effect' -- Compared with the non-working
population, workers have age-adjusted SMR (Standardized
Mortality Ratio) for heart diseases of about 0.60. That's
mostly blamed on selection, since sick people quit work.
Something that is highly "significant" can still be unbelievable,
but I hardly waste any time before I disbelieve something
that is 5% "significant" and probably biased.
If you pay no attention to the p-value, you toss out
what is arguably the single most useful screen we have.
Of course, Herman gives the example where the
purpose is different. We may want to explore potential
new leads, instead of setting our standards to screen
out trivial results. It seems to me that the "decision theorist"
is doing a poor job of discriminating tasks here.
I certainly never suggested that there is only one way
to regard results -- I have periodically told posters that
their new studies with small N and many variables
seemed to set them up with "exploratory" paradigms,
rather than "testable" hypotheses.
--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html
> Quite often, as with the HRT, there is the additional problem
> of non-randomization. For that, I also consider the effect size,
> too, and say, "Can I imagine particular biases that could
> cause this result?"
With HRT, there is a further problem. The early epidemiological studies
showing benefit all involved estrogen alone. The recent studies involve
estrogen plus progestin.
how do you know IF an effect has occurred?
A. what is the operational definition of AN effect?
B. when in the statistical analysis would you know that it has now been
observed?
>Am I missing something ... isn't it important to determine
>whether an effect has a low probability of occurring by chance?
you mean the data you got has a low probability of having been observed IF
the null IS true?
if so, this is not the effect you are talking about ... you would like the
"p" value to tell you that there is a HIGH probability of the effect being
real given your observed data but, unfortunately, the p value we compute
does not tell us that
and, also, WHAT effect? an effect of .03 ... .1 ... 3.4?
seems like this is putting the cart before the horse ... "apparently
important"
by the argument that null hypothesis testing is useful, you have to first
determine how likely our results were IF the null HAD been true ... AND
then possibly speak to "importance"
problem is ... the dominance in the literature is the 'significance =
importance' ... that is precisely the massive problem there is with null
hypothesis testing
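
A small numerical sketch of the distinction above, P(data at least
this extreme | null true) versus P(the effect is real | data), with
made-up inputs (a 10% base rate of real effects, alpha = .05,
power = .80):

prior_real = 0.10                 # assumed fraction of tested effects that are real
alpha, power = 0.05, 0.80         # assumed false-positive rate and power

p_sig = prior_real * power + (1 - prior_real) * alpha      # P(p < .05)
p_real_given_sig = prior_real * power / p_sig              # P(real | p < .05)
print(f"P(real effect | p < .05) = {p_real_given_sig:.2f}")  # 0.64
# Even a "significant" result here leaves roughly a 1-in-3 chance that
# the null is true -- a number the p value itself never supplies.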
_________________________________________________________
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:d...@psu.edu
http://roberts.ed.psu.edu/users/droberts/drober~1.htm
---------------------- >8 -----------------------
> i would like an example or two where ... one can make a cogent argument
> that p ... in its own right ... helps us understand the SIZE of an effect
> ... the IMPORTANCE of an effect ... the PRACTICAL benefit of an effect
I think you'll be waiting a long time for that, Dennis. We all know that
p-values are dependent on sample size, and that with a large enough sample
an effect that is trivial in terms of practical significance may be
statistically significant.
Here are the two cents I wanted to throw into the discussion: Measures of
effect size are not particularly useful in certain areas of research where
one cannot measure directly the things one is interested in. Take
psychology experiments concerned with selective attention, for example.
The experimenter may be interested in some process (such as "inhibition")
which either cannot be measured directly, or is very difficult to measure
directly. A common approach in experiments on selective attention is to
measure response time (RT) to the nearest millisecond in two (or more)
experimental conditions. The basic analysis is often a paired t-test or
repeated measures ANOVA (often on median RTs). In such studies, a
difference of 15-20 ms in mean median RT between conditions (with overall
mean median RT in the neighbourhood of 400 ms) is often large enough to be
statistically significant with relatively small samples (n = 15 to 20).
If you compute a measure of effect size in this situation, it will almost
certainly be negligible. This might suggest to fans of effect size
measures that the result is not worthy of any further consideration. I
think this would be dead wrong though. The measure of effect size might
give the proportion of variation in overall RT that is accounted for by
the two experimental conditions, for example. But remember, the
experimenter was NOT trying to account for variation in overall RT.
He/she was simply using the difference in RT between two conditions to
make an inference about some underlying process, and to test predictions
from some theory, or theories. (In the ideal case, one theory would
predict a difference in one direction, and another theory in the other
direction.) Therefore, I think the usual kind of effect size measure is
pretty much irrelevant in this situation.
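A simulation sketch of this sort of design, with invented numbers (a
true 15 ms condition effect, n = 16 subjects, large between-subject
spread in baseline RT); it is only meant to show how a paired test can
reach significance while the effect is tiny relative to overall RT
variability:

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng()
n = 16
subject_mean = rng.normal(400, 75, n)               # stable individual differences
cond_a = subject_mean + rng.normal(0, 15, n)        # residual noise per condition
cond_b = subject_mean + 15 + rng.normal(0, 15, n)   # true 15 ms "inhibition" cost

res = ttest_rel(cond_b, cond_a)
diff = cond_b - cond_a
d_paired = diff.mean() / diff.std(ddof=1)           # relative to paired differences
d_overall = diff.mean() / np.concatenate([cond_a, cond_b]).std(ddof=1)
print(f"t({n - 1}) = {res.statistic:.2f}, p = {res.pvalue:.4f}")
print(f"d (paired differences) = {d_paired:.2f}; d (overall RT spread) = {d_overall:.2f}")
# The paired test will often be significant, while the same 15 ms effect
# is a small fraction of overall RT variability -- the sense in which
# conventional effect-size measures look "negligible" here.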
I think I may have ended up with 3 or 4 cents worth there...
--
Bruce Weaver
E-mail: wea...@mcmaster.ca
Homepage: http://www.angelfire.com/wv/bwhomedir/
It is true that one cannot put much confidence in
conclusions without "acceptable levels of significance",
but when there is not much information, there is not much
information. One cannot get statistical blood out of a
statistical turnip.
However, it is not the case that one can be confident of
the null in such cases, either, and this is the usual
attitude of those who believe in significance. From
any type of decision approach, the significance level
should decrease with increasing sample size. The other
form of this is that it should increase with decreasing
sample size! It is very easy to give models in which
one should not consider accepting the hypothesis without
a fair amount of data.
The problem with this is that we have at least 50 major
factors involved, and if the adjustment for multiple
testing is made, nothing can be statistically significant.
In fact, 50 may even be an underestimate; I believe that
there are more than 50 known antioxidants.
>Then, I am not at all surprised when the followup studies in
>five or ten years show that eggs are not so bad after all,
>real butter is not so bad, or Hormone Replacement Therapy
>is not necessarily so good. Serious students of the fields
>were not at all surprised, basically, for the same reasons
>that I'm not surprised.
It is even worse than this; in many of the cases, there
were not even studies. It was stated that eggs were bad
because eggs have cholesterol, and cholesterol levels IN
THE BLOOD were bad. The rest was conclusion jumping.
At this time, we have no good studies of any dietary
effects other than weight.
>Quite often, as with the HRT, there is the additional problem
>of non-randomization. For that, I also consider the effect size,
>too, and say, "Can I imagine particular biases that could
>cause this result?" My baseline here is sometimes the
>'healthy-worker effect' -- Compared with the non-working
>population, workers have age-adjusted SMR (Standardized
>Mortality Ratio) for heart diseases of about 0.60. That's
>mostly blamed on selection, since sick people quit work.
>Something that is highly "significant" can still be unbelievable,
>but I hardly waste any time before I disbelieve something
>that is 5% "significant" and probably biased.
One of the problems with medical studies is subjects
dropping out, causing the distribution of the subjects
in the experimental and control groups to become
different. The standard procedure is to ignore this
if the proportions are "not statistically significant."
This effect, however, might be greater than what is
being investigated.
>If you pay no attention to the p-value, you toss out
>what is arguably the single most useful screen we have.
The question is, what is being screened? Any decision
approach indicates that the critical p-value should
decrease with increasing sample size; this means that
it should increase with decreasing sample size. It is
often the case that evidence is needed to accept a hypothesis,
not just to reject it.
>Of course, Herman gives the example where the
>purpose is different. We may want to explore potential
>new leads, instead of setting our standards to screen
>out trivial results. It seems to me that the "decision theorist"
>is doing a poor job of discriminating tasks here.
The loss-prior combination must be given by the
investigator, NOT the statistician. To do this, they
need to think probabilistically, not in terms of
statistical methods.
>I certainly never suggested that there is only one way
>to regard results -- I have periodically told posters that
>their new studies with small N and many variables
>seemed to set them up with "exploratory" paradigms,
>rather than "testable" hypotheses.
If one has the "clean" situations of the early physical
sciences, one can get away with some of this. However, the
"Kepler problem" is noted; if there was one less decimal
place in the available data, it would have been impossible
to distinguish between a circle and an ellipse, and if
there was one more place, as happened with telescopic
data, orbits as Kepler considered them did not even exist.
And what would have been the development of gas laws if
cases which did not behave like ideal gases had to be
considered? No quantitative fit, other than ad hoc, was
obtained from the data; theory was needed. The idea that
theory is generated from data is wrong; data can only help
confirm or deny.
I'd add further, that tiny effects (in terms of proportion of variance)
might be behaviourally important. I remember reading an article that
showed that people were sensitive to tiny (millisecond) costs in their
allocation of resources on a computer-based task. So such tiny effects
might also have surprising practical importance. I'm unconvinced that
standardized (and probably unstandardized) effect sizes are always (or
even often) useful measures of practical significance/importance. In
fact I'd argue that requiring, say, a fixed effect size to publish a
study would be far more damaging than relying on fixed significance levels.
The main thing is to recognize that there are other things to
pay attention to than just p values, standardized effect sizes or whatever.
Thom