Dear all:
Recently, a medical doctor colleague sent me the following query:
“Do you have any stats or epidemiology papers that discuss justification of alpha and beta (p values and their interpretation), especially conditions that merit altering them from what is done conventionally? My recollection is that interventions with important primary outcomes (e.g. mortality) and minimal associated adverse effects may mean that one can change p<0.05 to p<0.1 so as not to conclude there are no differences between the arms”.
I wonder if you could provide me with some information about this.
Thanks in advance,
Regards,
Keramat Nourijelyani, PhD Biostat
Current advice is to give the exact p-values and let readers make up their own minds about the importance of the results. In addition, it is essential to give a measure of the effect size. For correlations, r itself is a recommended measure. For the difference between two means, Cohen’s d = difference/SD. Other effect size measures are usually some form of % variance accounted for.
Effect size has the advantage over p-values that it cannot be inflated by large N.
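As a rough illustration of that last point, here is a minimal Python sketch. The mean difference of 2, the SD of 10 and the group sizes are made-up numbers, and the p-value uses a simple Normal (z-test) approximation rather than a t-test. Cohen's d stays the same while the p-value shrinks as N grows.

```python
import math
from statistics import NormalDist

def cohens_d(mean_diff, sd):
    """Cohen's d: difference between two means divided by the SD."""
    return mean_diff / sd

def two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a two-group comparison (z-test)."""
    se = sd * math.sqrt(2.0 / n_per_group)  # SE of a difference of two means
    z = abs(mean_diff) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

# Hypothetical numbers: same difference and SD, increasing sample size.
for n in (20, 200, 2000):
    print(n, cohens_d(2.0, 10.0), two_sided_p(2.0, 10.0, n))
```

The effect size column does not move with n; only the p-value does.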
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Second ed.). Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Rosnow, R. L., & Rosenthal, R. (1996). Beginning behavioral research: A conceptual primer (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall, Inc.
Rosnow, R. L., Rosenthal, R., Huberty, C. J., Morris, J. D., Morris, R. J., Bergan, J. R., et al. (1992). Selected tests and analyses. In A. E. Kazdin (Ed.), Methodological issues & strategies in clinical research. Washington, DC: American Psychological Association.
Wilkinson, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations from Task Force on Statistical Inference , APA Board of Scientific Affairs. American Psychologist, 54(8), 594–604.
http://en.wikipedia.org/wiki/Effect_size
Amazing how good Wikipedia is on the issue
Best
Diana
On 10/12/2009 03:22, "knouri" <nou...@yahoo.com> wrote:
Professor Diana Kornbrot
email: d.e.ko...@herts.ac.uk
web: http://web.mac.com/kornbrot/iweb/KornbrotHome.html
Work
School of Psychology
University of Hertfordshire
College Lane, Hatfield, Hertfordshire AL10 9AB, UK
voice: +44 (0) 170 728 4626
mobile: +44 (0) 796 890 2102
fax +44 (0) 170 728 5073
Home
19 Elmhurst Avenue
London N2 0LT, UK
landline: +44 (0) 208 883 3657
mobile: +44 (0) 796 890 2102
fax: +44 (0) 870 706 4997
_____________________________________________________
Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
University of Oxford
Wolfson College Annexe
Linton Road
Oxford OX2 6UD
email: doug....@csm.ox.ac.uk
Tel: 01865 284400 (direct line 01865 284401)
Fax: 01865 284424
www: http://www.csm-oxford.org.uk/
EQUATOR Network - resources for reporting research
www: http://www.equator-network.org/
Much less emphasized, but equally important, is to give the POWER of the design to detect an effect of specified magnitude. That magnitude may be expressed in direct terms, e.g. power to detect a shift from 10% mortality to 8% mortality, or in statistical terms, using some well-defined measure of effect size to avoid ambiguity. The direct form is more meaningful, but obviously requires some prior estimate of the chosen measure (e.g. mortality) in the target population.
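To make the "direct terms" version concrete, here is a hedged Python sketch of the usual Normal-approximation sample size formula for comparing two proportions. The 80% power, the two-sided test and the function name n_per_arm are my assumptions, not something stated in the thread. It also shows what relaxing alpha from 0.05 to 0.10 does to the required numbers.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided comparison of two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)               # quantile matching power = 1 - beta
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    delta = p_control - p_treated
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Shift from 10% to 8% mortality, at the conventional and the relaxed alpha.
n_05 = n_per_arm(0.10, 0.08, alpha=0.05)
n_10 = n_per_arm(0.10, 0.08, alpha=0.10)
print(n_05, n_10)
```

Under these assumptions a trial of this kind needs a few thousand patients per arm, and moving to alpha = 0.10 buys a noticeably smaller trial at the price of a higher false positive rate.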
Current advice is to give the exact p-values and let readers make up their own minds about the importance of the results.
Personally, I see no justification for simply reporting a
"standardised effect size" unless there is a specific and
objective reason for doing so. As Doug says, "effect size
is simply - of course - the size of effect." -- either an
absolute or relative difference of outcome between, say,
a treatment and an absence of treatment.
If you standardise relative to the population SD, then you
can interpret the result relative to the distribution of the
outcome measure in the untreated population. So (if this were
Normally distributed), with an additive treatment effect with
standardised effect size of 1.65, a person at the population
median (= mean here) would be (on average) shifted up to the
95th percentile; a person at the 60th percentile would be
shifted up to about the 97th percentile. And so on.
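For anyone wanting to check such percentile shifts, a small Python sketch follows. It assumes a Normal outcome and a purely additive effect, as in the paragraph above; the function name is just for illustration.

```python
from statistics import NormalDist

def shifted_percentile(start_percentile, standardised_effect):
    """Percentile reached after adding the effect (in SD units) to a Normal score."""
    z = NormalDist()
    start_z = z.inv_cdf(start_percentile / 100.0)
    return 100.0 * z.cdf(start_z + standardised_effect)

# Standardised effect size of 1.65, starting at the median and at the 60th percentile.
print(round(shifted_percentile(50, 1.65), 1))
print(round(shifted_percentile(60, 1.65), 1))
```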
There may be good external reasons for wanting to express things
in that kind of way and, if so, then the effect size standardised
relative to population SD is a useful tool. But if there is
no real point in expressing things in that way, then why do it?
Standardising with respect to the SE of the estimate of the
effect is a quite different matter, and indeed I can see little
good reason for doing it (though I have often seen it done).
It can be interpreted (albeit potentially loosely) as a surrogate
for a P-value or a confidence interval (it is, in fact, simply
the "t-value"). But in my view it is better to give the full
nitty-gritty: N, estimate of effect (non-standardised), SE,
P-value (and possibly t-value if for some reason it is immediately
relevant -- the reader can otherwise easily find it by dividing
the estimate by its SE), and confidence interval.
The real danger -- the "non-invariance" -- of the SE-standardised
effect size is that it depends on the sample size. The same
trial, on the same population but with 4 times the sample size,
would double the effect size.
One could even imagine that pharma firm A have published a trial
of their Drug A in which they simply quote an SE-standardised
effect size of, say, 4.7 obtained from a trial with 400 patients.
Then pharma firm B, who are developing a rival drug, can simply
say "OK, let's trial ours using 1000 patients and publish the
SE-standardised effect size of our Drug B. We have reason to
suspect that ours is only 70% as effective as theirs, on an
absolute scale, but with this plan we can expect to show that
ours has 10-11% greater effect size than theirs."
(0.7*sqrt(1000/400) = 1.107)
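The arithmetic can be checked with a few lines of Python. This is only a sketch: it takes SE = SD/sqrt(n), which is a simplification for a two-arm trial, and the SD of 10 and effect of 1 are arbitrary since they cancel in the ratios.

```python
import math

def se_standardised_effect(effect, sd, n):
    """Effect divided by its standard error, taking SE = sd / sqrt(n)."""
    return effect / (sd / math.sqrt(n))

# Same effect, same SD, four times the sample size: the value doubles.
doubling = se_standardised_effect(1.0, 10.0, 1600) / se_standardised_effect(1.0, 10.0, 400)
print(doubling)

# Drug B with 70% of Drug A's absolute effect, but 1000 patients instead of 400.
ratio = se_standardised_effect(0.7, 10.0, 1000) / se_standardised_effect(1.0, 10.0, 400)
print(round(ratio, 3))
```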
As to "how to choose alpha and beta", the prevalence of the
practice of (often apparently blindly) adopting conventional
values really does obscure the essence of the matter: that it
is essentially arbitrary and amounts to a "political decision"
unless explicit grounds are given to show that the choice is
in some sense good.
The use of conventional P-values (alpha) like 0.10, 0.05, 0.01
etc. was the result of R.A. Fisher deciding to produce tables
in a more compact form, and requiring less computational effort,
than the then existing tables. The original Biometrika Tables For
Statisticians were published by Karl Pearson from his Biometric
Laboratory in 1914, and presented (for the basic distributions
of statistics like Normal, Chi-squared, T, etc.) the results of
numerical computation over the full range of the variate,
so you could (with interpolation) look up the P-value for any
possible value of the variate, or read the table in reverse to
obtain the variate corresponding to any P-value. Fisher's
tables (and later the Fisher & Yates Statistical Tables for
Biological, Agricultural and Medical Research Workers) did it
in reverse: for a selection of P-values (.99, .98, .90, .80,
.70, .50, .30, .20, .10, .05, .02, .01) the corresponding
values of the variate were tabulated, so the "P-bracket" for
any given value could be readily found. These were much smaller
than Pearson's enormous tomes. See, for instance, pp. 245-247
of "R.A. Fisher: The Life of a Scientist" by Joan Fisher Box.
Along with this went Fisher's own expressed views about the
importance or interpretation of P-values. For instance, the
following can be read in Section 20 (Chapter IV "Goodness
of Fit, Etc.") of his "Statistical Methods for Research Workers":
"In preparing this table we have borne in mind that
in practice we do not want to know the exact value
of P for any observed chi-squared, but, in the first
place, whether or not the observed value is open to
suspicion." [in the context of goodness of fit, here]
"If P is between 0.1 and 0.9 there is certainly no
reason to suspect the hypothesis tested. If it is
below 0.02 it is strongly indicated that the
hypothesis fails to account for the whole of the facts."
[...]
"The actual value of P obtainable from the table by
interpolation indicates the strength of the evidence
against the hypothesis. A value of chi-squared exceeding
the 5 per cent. point is seldom to be disregarded."
Thus Fisher himself was cautious and qualified about the
interpretation of P-values, explicitly describing them
as measures of strength of evidence and justification for
degrees of suspicion, and *not* definitively defining
them as cut-offs for rejection or acceptance.
Nevertheless, the existence of the tables with their selected
"cut-points", and Fisher's enormous methodological influence
on research methods, led to the widespread adoption of his
methods and their associated tables. As a result, workers
unwilling (or unable) to appreciate the subtleties and nuances
which Fisher had hinted at began to routinely adopt such P-values
as if they were real cut-offs between the existence of an effect
and the non-existence of an effect. In terms of scientific
inference, this is simply invalid.
It is a different matter when a decision has to be made, based
on the outcome of an uncertain result. "Decision", by etymology,
means "cut-off". It may be a decision as to whether or not to
adopt a different treatment for a disease, or whether to invest
resources in further trials, or even whether or not to convict
for a crime.
For example, the UK statutory limit for driving with alcohol
in the blood is 80mg/100ml. If measured by a laboratory test
on a blood sample, it is (or was) the case that the SE of a
determination was certified to be not greater than 2%. Then
3 times the SE (or 2 units if result < 100) was subtracted
from the result, giving the level quoted in evidence. If this
was more than 80, conviction was certain unless a failure
in technical process could be proved. This gives a numerical
measure for the level of "proof beyond reasonable doubt"
required in criminal law: the corresponding value of alpha
is less than 1/1000.
This gives rise to the corresponding Power Function (1-beta).
If the driver's true value is exactly 80 (so on the limit of
"Not Guilty"), then he is protected at level 999/1000 or more
against false conviction (Power = alpha <= 1/1000 at this level).
The Power rises to 50% at 86mg/100ml, and to 999/1000 or more
at 92mg/100ml. Hence there is a "grey area" from 80-92 over
which the conviction probability rises from negligible to
almost certain.
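The Power Function described here can be sketched in Python. I am assuming, for illustration only, a Normally distributed lab result with a constant SE of 2 mg/100ml (the text only says the SE is certified not greater than 2%) and a deduction of 3 SEs before comparison with the limit of 80.

```python
from statistics import NormalDist

def conviction_probability(true_level, limit=80.0, se=2.0, k=3.0):
    """P(lab result - k*se > limit) when the lab result ~ Normal(true_level, se)."""
    threshold = limit + k * se  # the raw result must exceed this to convict
    return 1.0 - NormalDist(mu=true_level, sigma=se).cdf(threshold)

# Walk through the "grey area" from the legal limit upwards.
for level in (80, 83, 86, 89, 92):
    print(level, round(conviction_probability(level), 4))
```

With these assumed values the probability is of the order of 1/1000 at 80, one half at 86, and close to certainty at 92; a smaller SE makes the value at 80 smaller still.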
In part, the "alpha = 1/1000" arises from the "proof beyond
reasonable doubt" principle -- we need a small risk of false
conviction. In part, it arises from the relatively precise
laboratory determination (not worse than 2%) and the common
sense fact that at 92mg/100ml you are not much more drunk
than at 80mg/100ml, so the "grey area" is acceptable. But
both of these are chosen on "political" grounds. However,
if the laboratory precision, for instance, had been much
worse (say 10%), then with the limit at 80 and the same alpha
you would have 50% chance of acquitting people driving with
104mg/100ml, and would only be almost sure of conviction
when it got up to 128mg/100ml -- which would not be acceptable.
One simple change in that case would be to lower the legal
limit to, say, 57.5mg/100ml. Then you would again have the
same probability (>= 999/1000) of convicting anyone at 92mg/100ml
or more (57.5 + 6*0.1*57.5 = 92).
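The same reasoning can be turned into a small solver. This Python sketch assumes, as in the arithmetic above, that the SE is a fixed fraction of the legal limit and that 3 SEs are deducted from the result; the function names are mine.

```python
from statistics import NormalDist

def limit_for_target(target_level, se_fraction, k=3.0):
    """Legal limit at which target_level sits 2*k SEs above the limit.

    Assumes SE = se_fraction * limit and that k SEs are deducted from the result.
    """
    return target_level / (1.0 + 2.0 * k * se_fraction)

def conviction_probability(true_level, limit, se_fraction, k=3.0):
    """P(result - k*SE > limit) for a Normal lab result centred on true_level."""
    se = se_fraction * limit
    return 1.0 - NormalDist(mu=true_level, sigma=se).cdf(limit + k * se)

# With 10% precision and the limit left at 80, a driver at 104 is convicted half the time.
print(conviction_probability(104.0, 80.0, 0.10))

# Lowering the limit so that 92 is again almost certain to convict.
new_limit = limit_for_target(92.0, 0.10)
print(new_limit, conviction_probability(92.0, new_limit, 0.10))
```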
Back at the 2% SE: the state could alternatively decide that
if the true level is 80 or more then conviction must be almost
certain. This would amount to *adding* 3 SEs to the lab result
and quoting this as evidence. This amounts to adopting
alpha = 0.999, as opposed to the good old alpha = 0.001 -- a
reversal of the "innocent until proved guilty" principle.
Or, after that Dictator is overthrown and a more moderate one
takes over, a compromise might be struck at which anyone exactly
at 80mg/100ml has a 50% chance of conviction, so just quote
the lab result as found. This amounts to choosing alpha = 0.5.
This could be seen as politically acceptable, since anyone
close enough (> 74) to the limit to have > 1/1000 chance of
conviction was on thin ice in the first place, and deserves
what they get!
So, in such a context, alpha is whatever you want it to be,
along with the corresponding beta (or Power). And it all turns
on trade-off between alpha and beta.
The larger alpha (chance of false "rejection") the smaller is
beta (chance of failure to truly "reject"). There will be a
loss associated with either failure. So, from an administrative
point of view, one can adopt an alpha which will minimise the
expected loss, taking account of the prevalence of the state
being tested for in the population. (In the drink-driving
example, one could envisage writing into the Law that tests
carried out at 11pm on a Saturday night are evaluated with
a larger value of alpha than at 11am on a Wednesday ... ).
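A minimal Python sketch of that expected-loss idea follows. Everything numerical in it is hypothetical: the losses, the prevalences, the one-sided z-test, and a true effect of 2.5 SEs when the state is present. It simply searches a grid of alpha values for the smallest expected loss.

```python
from statistics import NormalDist

Z = NormalDist()

def beta_for(alpha, effect_in_se_units):
    """Type II error of a one-sided z-test at level alpha for a given true effect."""
    return Z.cdf(Z.inv_cdf(1.0 - alpha) - effect_in_se_units)

def expected_loss(alpha, prevalence, loss_false_pos, loss_false_neg, effect_in_se_units):
    """Prevalence-weighted loss from false rejections and from missed effects."""
    beta = beta_for(alpha, effect_in_se_units)
    return ((1.0 - prevalence) * alpha * loss_false_pos
            + prevalence * beta * loss_false_neg)

def best_alpha(prevalence, loss_false_pos=1.0, loss_false_neg=1.0, effect_in_se_units=2.5):
    """Grid search for the alpha (0.001 to 0.499) with the smallest expected loss."""
    grid = [i / 1000.0 for i in range(1, 500)]
    return min(grid, key=lambda a: expected_loss(
        a, prevalence, loss_false_pos, loss_false_neg, effect_in_se_units))

print(best_alpha(prevalence=0.1))                       # rare state, equal losses
print(best_alpha(prevalence=0.5))                       # common state (Saturday night?)
print(best_alpha(prevalence=0.1, loss_false_neg=10.0))  # missing an effect is costly
```

In this toy setting the preferred alpha grows both when the state is more prevalent and when a missed effect is more costly than a false alarm, which is the situation the original query describes.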
So it's all arbitrary, depending on what you want to achieve
and on how you evaluate the consequences of the decisions which
will be made as a result.
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Dec-09 Time: 13:08:41
------------------------------ XFMail ------------------------------
From: meds...@googlegroups.com [mailto:meds...@googlegroups.com] On Behalf Of John Whittington
Sent: 10 December 2009 12:46
To: meds...@googlegroups.com
Subject: {MEDSTATS} Re: p-value < 0.1 for important outcomes!
Being a good Bayesian (tautology of course) I have long thought that post hoc (eg in results) knowing the pre-hoc power calculations adds nothing to properly presented estimates with CIs (or likelihoods), except perhaps gives some idea of the competence of the research team ...
True, but you can also get large effect sizes that are untrustworthy
with small samples and/or large variances. An even better practice
would be to present confidence intervals around your effect sizes.
Of course, confidence intervals incorporate your "significance" or
alpha level as well, so it's really tough to get away from specifying
your acceptable error rate in advance. Though, in my experience, folks
tend to do this somewhat more thoughtfully when using CIs: rather
than automagically choosing .05, some tend to consider what an
acceptable error rate is. That might, though, be more a matter of
context than of the nature of CIs vis-a-vis p-values.
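As a sketch of what a confidence interval around an effect size looks like, the Python below uses a common large-sample approximation to the standard error of Cohen's d. The approximation, the d of 0.8 and the group sizes are my assumptions for illustration, and the interval is only rough for small groups.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, confidence=0.95):
    """Approximate large-sample CI for Cohen's d (Normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return d - z * se, d + z * se

# The same "large" d from a small study and from a bigger one.
print(d_confidence_interval(0.8, 10, 10))
print(d_confidence_interval(0.8, 100, 100))

# The chosen confidence level (i.e. alpha) changes the width as well.
print(d_confidence_interval(0.8, 10, 10, confidence=0.90))
```

With 10 per group the 95% interval is wide enough to include zero, which is the sense in which a large effect size from a small sample is untrustworthy.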
I too think this is a very good paper.
A simple way of characterising the point that post-hoc power tests are
pretty pointless would be the scenario of trying to tell someone who's
just won the main prize in a lottery that they shouldn't bother
playing the lottery because they have a negligible chance of winning.
They won't care because they've already got their result!
Although unfortunately I doubt saying the same thing to someone who
hasn't won would stop a habitual player from playing again!
Neil
--
"The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data." ~ John Tukey (1986), "Sunset salvo". The American
Statistician 40(1).
Email - nshe...@gmail.com
Website - http://slack.ser.man.ac.uk/
Photos - http://www.flickr.com/photos/slackline/
--~--~---------~--~----~------------~-------~--~----~
To post a new thread to MedStats, send email to MedS...@googlegroups.com .
MedStats' home page is http://groups.google.com/group/MedStats .
Rules: http://groups.google.com/group/MedStats/web/medstats-rules
-~----------~----~----~----~------~----~------~--~---
There are good grounds for not reporting post hoc power calculations, whether they evaluate the power to detect what has been seen or, more likely, what was not seen but might have been.
This issue was discussed 2 years ago on this list. The classic paper is
Goodman SN, Berlin JA.
The use of predicted confidence intervals when planning experiments and the
misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200-6.
(Erratum in: Ann Intern Med 1995;122(6):478.)
"Although there is a growing understanding of the importance of statistical power considerations when designing studies and of the value of confidence intervals when interpreting data, confusion exists about the reverse arrangement: the role of confidence intervals in study design and of power in interpretation. Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected, but exactly the opposite procedure is widely practiced. In this commentary, we present the reasons why the calculation of power after a study is over is inappropriate and how confidence intervals can be used during both study design and study interpretation."
There are many other papers on this theme including
Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001;55:19-24.
Levine M, Ensom MH. Post hoc power analysis: an idea whose time has passed? Pharmacotherapy 2001;21:405-409.
It is true up to a point that there is no value in knowing the prior sample size calculation when the results are presented. However, the CONSORT group argues that we wish to see this information presented. First, it gives some evidence of the care with which a study was designed (and clinical factors, including clinically important treatment effect), and should help demonstrate that the study sample was not determined for example by stopping as convenient after multiple examinations of accumulating data. Second, if the ultimate sample size reported is notably different from the planned N we would wish to know why. Discrepancies are common. Charles et al (BMJ 2009) recently reported a study in which they examined trial reports published in high impact general medical journals; they found that "the difference between the sample size reported in the article and the replicated sample size calculation was greater than 10% in 47 (30%) of the 157 reports that gave enough data to recalculate the sample size."
However, Chan et al (BMJ 2008) showed that for many trials the details of the sample size calculation in the article do not match those included in the trial protocol, suggesting perhaps attempts to hide data-driven changes.
(BTW I don't understand why so many people refer to the initial calculation as a "power calculation" when power is generally fixed and the sample size is calculated. The retrospective calculation is indeed a power calculation.)
BW
Doug
Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P.
Reporting of sample size calculation in randomised controlled trials: review.
BMJ. 2009 May 12;338:b1732.
Chan A-W, Hrobjartsson A, Jorgensen KJ, Gotzsche PC, Altman DG. Discrepancies in sample size
calculations and data analyses reported in randomised trials: comparison of publications with protocols.
BMJ 2008;337:a2299.
At 13:20 10/12/2009, John Whittington wrote:
At 13:10 10/12/2009 +0000, Braunholtz, David A. wrote:
Being a good Bayesian (tautology of course) I have long thought that post hoc (eg in results) knowing the pre-hoc power calculations adds nothing to properly presented estimates with CIs (or likelihoods), except perhaps gives some idea of the competence of the research team ...
Whilst I certainly agree with those conceptual sentiments, I suppose that one does not have to present (and Diana may not have been assuming this, either) the pre-hoc power calculations in the post-hoc situation - one can present post-hoc calculations, based on the observed (rather than guessed) variance of the data. However, I'm not convinced that to do so is materially different in mathematical terms from presenting a CI, which most people would find easier to interpret.
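One way to see why a post-hoc calculation adds little is that, for a simple test, the "observed power" is just a transformation of the p-value. The Python sketch below assumes a two-sided z-test and takes the observed effect as if it were the true one; the function name is mine.

```python
from statistics import NormalDist

Z = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, taking the observed effect as the true one."""
    z_obs = Z.inv_cdf(1.0 - p_value / 2.0)   # |z| that gives this two-sided p-value
    z_crit = Z.inv_cdf(1.0 - alpha / 2.0)    # critical value at the chosen alpha
    return (1.0 - Z.cdf(z_crit - z_obs)) + Z.cdf(-z_crit - z_obs)

# Observed power is fully determined by the p-value.
for p in (0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))
```

Under these assumptions a result sitting exactly at p = 0.05 has an observed power of about one half, and any non-significant result has less, so the calculation tells the reader nothing that the p-value or the CI has not already said.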
Kind Regards,
John
----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK
----------------------------------------------------------------