p-value < 0.1 for important outcomes!


knouri

Dec 9, 2009, 10:22:27 PM
to meds...@googlegroups.com

Dear all:

 

Recently, a medical doctor colleague sent me the following query:

 

“Do you have any stats or epidemiology papers that discuss justification of alpha and beta (p values and their interpretation), especially conditions that merit altering them from what is done conventionally? My recollection is that interventions with important primary outcomes (e.g. mortality) and minimal associated adverse effects may mean that one can change p<0.05 to p<0.1 so as not to conclude there are no differences between the arms”.

 

I wonder if you could provide me with some information about this.

 

Thanks in advance,

Regards,

Keramat Nourijelyani, PhD Biostat


mcap

Dec 9, 2009, 11:22:01 PM
to MedStats
On Dec 9, 10:22 pm, knouri <nou...@yahoo.com> wrote:
> Dear all:
>  
> Recently, a medical doctor colleague sent me the following query:
>  
> “Do you have any stats or epidemiology papers that discuss justification of alpha and beta (p values and their interpretation), especially conditions that merit altering them from what is done conventionally? My recollection is that interventions with important primary outcomes (e.g. mortality) and minimal associated adverse effects may mean that one can change p<0.05 to p<0.1 so as not to conclude there are no differences between the arms”.

The choice of .05 is somewhat arbitrary to begin with. There are
plenty of well known references that discuss the arbitrary nature of
significance thresholds. You can choose a different alpha level.
But despite the fact that it's arbitrary, your audience and your
journal editors may not feel the same way. You would have to justify
your decision carefully. Why not focus on something like effect
size or, if you are dealing with mortality, look at number needed to
treat?
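
To make the number-needed-to-treat suggestion concrete, here is a minimal Python sketch. The event rates are hypothetical illustration values, not figures from any trial, and the function name is only illustrative.

```python
def number_needed_to_treat(control_risk: float, treated_risk: float) -> float:
    """NNT = 1 / absolute risk reduction (ARR)."""
    arr = control_risk - treated_risk
    if arr <= 0:
        raise ValueError("no risk reduction, so NNT is undefined")
    return 1.0 / arr

# Hypothetical example: mortality falls from 10% to 8%.
nnt = number_needed_to_treat(0.10, 0.08)
print(f"NNT = {nnt:.0f}")
```

An NNT reported this way would normally be accompanied by a confidence interval, since a single number hides the uncertainty in the two risks.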

kornbrot

Dec 10, 2009, 3:35:27 AM
to meds...@googlegroups.com
Current advice is to give the exact p-values and let readers make up their own minds about the importance of the results. In addition, it is essential to give a measure of the effect size. For correlations, r itself is a recommended measure. For the difference between two means, Cohen’s d = difference/SD. Other effect size measures are usually some form of % variance accounted for.

Effect size has the advantage over p-values that it cannot be inflated by large N.
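
A small Python sketch of that last point, under stated assumptions: the means and SD are made-up values, the two groups are taken to be of equal size, and the function names are illustrative. For a fixed d, the t statistic (and hence the p-value) grows with N while d itself stays put.

```python
import math

def cohens_d(mean_a, mean_b, pooled_sd):
    """Cohen's d: difference between two means in units of the pooled SD."""
    return (mean_a - mean_b) / pooled_sd

def t_statistic(d, n_per_group):
    """t for two equal-sized groups, written in terms of d: t = d * sqrt(n / 2)."""
    return d * math.sqrt(n_per_group / 2)

d = cohens_d(5.8, 5.1, 1.4)  # hypothetical means and pooled SD
for n in (20, 200, 2000):
    # d is the same on every row; only t changes with the sample size.
    print(n, round(d, 2), round(t_statistic(d, n), 2))
```

In practice d is itself estimated from the sample, so it still carries sampling error even though its expected size does not depend on N.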

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Second ed.). Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Rosnow, R. L., & Rosenthal, R. (1996). Beginning behavioral research:  A conceptual primer (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall, Inc.
Rosnow, R. L., Rosenthal, R., Huberty, C. J., Morris, J. D., Morris, R. J., Bergan, J. R., et al. (1992). Selected tests and analyses. In A. E. Kazdin (Ed.), Methodological issues & strategies in clinical research. Washington, DC: American Psychological Association.
Wilkinson, L. (1999). Statistical Methods in Psychology Journals:  Guidelines and Explanations from Task Force on Statistical Inference , APA Board of Scientific Affairs. American Psychologist, 54(8), 594–604.
http://en.wikipedia.org/wiki/Effect_size

Amazing how good Wikipedia is on the issue.

Best
Diana



Professor Diana Kornbrot
  email:  d.e.ko...@herts.ac.uk
  web:    http://web.mac.com/kornbrot/iweb/KornbrotHome.html
Work
School of Psychology
University of Hertfordshire
College Lane, Hatfield, Hertfordshire AL10 9AB, UK
    voice:     +44 (0) 170 728 4626
    mobile:   +44 (0) 796 890 2102
    fax          +44 (0) 170 728 5073
Home
19 Elmhurst Avenue
London N2 0LT, UK
   landline: +44 (0) 208 883 3657
   mobile:   +44 (0) 796 890 2102
   fax:         +44 (0) 870 706 4997





Doug Altman

Dec 10, 2009, 4:58:32 AM
to meds...@googlegroups.com, meds...@googlegroups.com
It is unfortunate that the term "effect size" has two meanings, one very general and the other quite specific. Failure to distinguish these can lead to confusion.

In its generic sense the effect size is simply - of course - the size of effect. For a randomised trial (or any comparative study) this can be quantified as absolute or relative differences between two group-specific estimates, varying according to the nature of the data.

In its specific sense, as defined by Cohen, the effect size is indeed the difference between means divided by the standard deviation (there are variants according to which SD is used). Cohen defined small, medium and large effects as 0.2, 0.5 and 0.8 SD. The statement that "Other effect size measures are usually some form of % variance accounted for" is extending the specific idea of Cohen rather than using effect size in its generic sense. That "standardised" approach is common in social sciences, psychology, etc. In the world of clinical trials, however, I believe that while Cohen's effect sizes are occasionally used to calculate sample size where prior information is lacking, they are almost never used to present the results of trials. In meta-analysis, however, standardised effect sizes may be used when combining data from several studies using continuous outcome measures, in particular when the units of measurement vary across studies.


This is all somewhat tangential to the original query about alpha and beta, however. There is a very large literature on sample size calculation, but rather less discussing how to choose alpha and beta. Given the widespread use of conventional levels for these, it is generally necessary to justify fixing power (1 - beta) < 80% or alpha > 5%, which may not be easy.

I do not agree that "with important primary outcomes (e.g. mortality) and minimal associated adverse effects may mean that one can change p<0.05 to p<0.1 so as not to conclude there are no differences between the arms". And in any case, a non-significant result does not allow one to claim no difference between the groups.

The following paper may be useful:
Lenth RV. "Some Practical Guidelines for Effective Sample Size Determination," Am Stat 2001;55:187-193.

See also Lenth's related website http://www.stat.uiowa.edu/~rlenth/Power/

Doug










_____________________________________________________

Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
University of Oxford
Wolfson College Annexe
Linton Road
Oxford OX2 6UD

email:  doug....@csm.ox.ac.uk
Tel:    01865 284400 (direct line 01865 284401)
Fax:    01865 284424
www:     http://www.csm-oxford.org.uk/

EQUATOR Network - resources for reporting research
www: http://www.equator-network.org/



kornbrot

Dec 10, 2009, 6:20:25 AM
to meds...@googlegroups.com
As Doug Altman points out, it is indeed unfortunate that effect size, like many other statistical terms, has diverse meanings.

However, in a general sense, all those effect sizes used by Rosenthal and by Hedges & colleagues have the property that they compare the magnitude of the effect with the amount of variation in the population from which the observations were sampled. The Rosenthal reference gives specific effect sizes for different designs, as does the excellent free software G*Power, http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/. Many statistical packages, e.g. SPSS, give statistical effect sizes in the more general % variance accounted for form. One needs to look at the exact measure in the specifications (e.g. omega-squared or eta-squared), although the details rarely make much difference.

Recent recommendations are for p-values and SOME effect size.
Much less emphasized, but equally important, is to give the POWER of the design to detect an effect of specified magnitude. That magnitude may be expressed in direct terms, e.g. ‘power to detect a shift from 10% mortality to 8% mortality’, or in statistical terms, using some well-defined measure of effect size to avoid ambiguity. The direct form is more meaningful, but obviously requires some prior estimate of the chosen measure (eg mortality) in the target population.
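
As a rough illustration of power stated in direct terms, the following Python sketch approximates the power to detect a shift from 10% to 8% mortality. It assumes a two-sided z-test with an unpooled normal approximation and equal arm sizes; the function name and the sample sizes tried are illustrative choices, and a real design would use dedicated software.

```python
from statistics import NormalDist
import math

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions
    (unpooled normal approximation, equal arm sizes, far tail ignored)."""
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p2) / se - z_crit)

# Power for a 10% -> 8% mortality shift at several hypothetical arm sizes.
for n in (1000, 2000, 3000, 4000):
    print(n, round(power_two_proportions(0.10, 0.08, n), 3))
```

Note that the calculation needs the baseline rate (10% here) as an input, which is the prior estimate referred to above.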

Statistical advice from the [in my view] gold standard of the CONSORT recommendations is excellent on determining sample sizes a priori when DESIGNING a study. It is less detailed on how to report POWER for non-significant results.

When refereeing, I NOW always recommend giving a priori power for specific magnitudes of effect in the design section and re-iterating in the Results section for any effect with p > .05, although I am embarrassed to admit that my own earlier work does not follow these recommendations.

It's an uphill struggle.

Duran, R. P., Eisenhart, M. A., Erickson, F. D., Grant, C. A., Green, J. L., Hedges, L. V., et al. (2006). Standards for Reporting on Empirical Social Science Research in AERA Publication. American Educational Research Association, from http://www.aera.net/uploadedFiles/Opportunities/StandardsforReportingEmpiricalSocialScience_PDF.pdf

Hedges, L. V., Cooper, H., & Bushman, B. J. (1992). Testing the null hypothesis in meta-analysis: a comparison of combined probability and confidence interval procedures. Psychological Bulletin, 111, 188-194.

CONSORT (2007). CONSORT Statement on randomized controlled trials (RCT), from http://www.consort-statement.org/Downloads/download.htm

MOOSE. For meta-analysis of observational studies. Retrieved 1 Aug 2007, from http://www.consort-statement.org/Initiatives/MOOSE/moose.pdf

Piaggio, G., Elbourne, D. R., Altman, D. G., Pocock, S. J., Evans, S. J. W., for the CONSORT Group (2006). Reporting of Noninferiority and Equivalence Randomized Trials: An Extension of the CONSORT Statement. JAMA, 295(10), 1152-1160. http://jama.ama-assn.org/cgi/content/abstract/295/10/1152

QUOROM (2007). The QUOROM Statement on Systematic Reviews in Medicine, from http://www.consort-statement.org/QUOROM.pdf


Best

Diana

Bruce Weaver

Dec 10, 2009, 7:33:20 AM
to MedStats
On Dec 10, 4:58 am, Doug Altman <doug.alt...@csm.ox.ac.uk> wrote:

--- snip ---

> The following paper may be useful:
> Lenth RV."Some Practical Guidelines for Effective
> Sample Size Determination,'' Am Stat 2001;55:187-193.
>
> See also Lenth's related website http://www.stat.uiowa.edu/~rlenth/Power/
>
> Doug

I really like Lenth's stuff on power & sample size estimation. Given
that Cohen's d has figured prominently in this thread, be sure not to
miss Lenth's comments on it in:

www.stat.uiowa.edu/~rlenth/Power/2badHabits.pdf

--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/Home
"When all else fails, RTFM."

John Whittington

Dec 10, 2009, 7:45:42 AM
to meds...@googlegroups.com
At 11:20 10/12/2009 +0000, kornbrot wrote (in part):

Much less emphasized, but equally important, is to give the POWER of the design to detect an effect of specified magnitude. That magnitude may be expressed in direct terms, e.g. ‘power to detect a shift from 10% mortality to 8% mortality’, or in statistical terms, using some well-defined measure of effect size to avoid ambiguity. The direct form is more meaningful, but obviously requires some prior estimate of the chosen measure (eg mortality) in the target population.

The irony here is that this appears to effectively be in direct contradiction/conflict with what Diana wrote in her previous message, namely:


Current advice is to give the exact p-values and let readers make up their own minds about the importance of the results.

... since one can only define power to detect an effect of specified magnitude FOR A GIVEN ALPHA (i.e. applying some arbitrary p-value threshold).

The only way I can think of to satisfy both of Diana's (both very reasonable) recommendations (to present an indication of power whilst not assuming a value for alpha/ p-value thresholds) would be to present power graphically as a function of alpha - but that is something one very rarely sees done in publications.
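
A tabular version of that idea can be sketched in a few lines of Python. This assumes a two-sided z-test and a hypothetical true effect of 2.5 standard errors; both the function name and that value are illustrative, and a plot would simply draw the same numbers as a curve.

```python
from statistics import NormalDist

def power_z_test(standardised_shift, alpha):
    """Power of a two-sided z-test when the true effect equals
    `standardised_shift` standard errors (far tail ignored)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(standardised_shift - z_crit)

shift = 2.5  # hypothetical: true effect is 2.5 standard errors from zero
for alpha in (0.001, 0.01, 0.05, 0.10, 0.20):
    print(f"alpha = {alpha:<5}  power = {power_z_test(shift, alpha):.2f}")
```

Readers can then pick the row matching whatever threshold they prefer, rather than having one alpha imposed on them.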

That's how I see it, anyway.

Kind Regards,


John

----------------------------------------------------------------
Dr John Whittington,       Voice:    +44 (0) 1296 730225
Mediscience Services       Fax:      +44 (0) 1296 738893
Twyford Manor, Twyford,    E-mail:   Joh...@mediscience.co.uk
Buckingham  MK18 4EL, UK            
----------------------------------------------------------------

Ted Harding

Dec 10, 2009, 8:08:47 AM
to meds...@googlegroups.com
Doug Altman's comments about "effect size" (and also about
alpha and beta, see later) are very welcome! Clearly put,
and oriented towards the confusion (and potential misuse)
which this can lead to.

Personally, I see no justification for simply reporting a
"standardised effect size" unless there is a specific and
objective reason for doing so. As Doug says, "effect size
is simply - of course - the size of effect." -- either an
absolute or relative difference of outcome between, say,
a treatment and an absence of treatment.

If you standardise relative to the population SD, then you
can interpret the result relative to the distribution of the
outcome measure in the untreated population. So (if this were
Normally distributed), with an additive treatment effect with
standardised effect size of 1.65, a person at the population
median (= mean here) would be (on average) shifted up to the
95th percentile; a person at the 60th percentile would be
shifted up to about the 97th percentile. And so on.
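
These percentile shifts can be checked with a short Python sketch, assuming (as above) a Normally distributed outcome and a purely additive effect measured in population SDs. The function name is illustrative.

```python
from statistics import NormalDist

def shifted_percentile(start_percentile, standardised_effect):
    """Percentile (within the untreated Normal population) reached after an
    additive shift of `standardised_effect` population SDs."""
    z = NormalDist().inv_cdf(start_percentile / 100)
    return 100 * NormalDist().cdf(z + standardised_effect)

for start in (50, 60):
    print(start, round(shifted_percentile(start, 1.65), 1))
```

Under these assumptions the median moves to roughly the 95th percentile and the 60th percentile to roughly the 97th.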

There may be good external reasons for wanting to express things
in that kind of way and, if so, then the effect size standardised
relative to population SD is a useful tool. But if there is
no real point in expressing things in that way, then why do it?

Standardising with respect to the SE of the estimate of the
effect is a quite different matter, and indeed I can see little
good reason for doing it (though I have often seen it done).
It can be interpreted (albeit potentially loosely) as a surrogate
for a P-value or a confidence interval (it is, in fact, simply
the "t-value"). But in my view it is better to give the full
nitty-gritty: N, estimate of effect (non-standardised), SE,
P-value (and possibly t-value if for some reason it is immediately
relevant -- the reader can otherwise easily find it by dividing
the estimate by its SE), and confidence interval.

The real danger -- the "non-invariance" -- of the SE-standardised
effect size is that it depends on the sample size. The same
trial, on the same population but with 4 times the sample size,
would double the effect size.

One could even imagine that pharma firm A have published a trial
of their Drug A in which they simply quote an SE-standardised
effect size of, say, 4.7 obtained from a trial with 400 patients.
Then pharma firm B, who are developing a rival drug, can simply
say "OK, let's trial ours using 1000 patients and publish the
SE-standardised effect size of our Drug B. We have reason to
suspect that ours is only 70% as effective as theirs, on an
absolute scale, but with this plan we can expect to show that
ours has 10-11% greater effect size than theirs."
(0.7*sqrt(1000/400) = 1.107)
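
The same arithmetic as a Python sketch, assuming the SE of the effect is sd / sqrt(n) and both trials share one outcome SD. The absolute effect 0.235 is a hypothetical value chosen only so that Drug A's SE-standardised effect comes out at 4.7 with 400 patients.

```python
import math

def se_standardised_effect(effect, sd, n):
    """Effect divided by its standard error, taking SE = sd / sqrt(n)."""
    return effect / (sd / math.sqrt(n))

effect_a = 0.235           # hypothetical absolute effect of Drug A (SD units)
effect_b = 0.7 * effect_a  # Drug B assumed 70% as effective on the absolute scale

drug_a = se_standardised_effect(effect_a, sd=1.0, n=400)
drug_b = se_standardised_effect(effect_b, sd=1.0, n=1000)
print(round(drug_a, 2), round(drug_b, 2), round(drug_b / drug_a, 3))
```

The ratio depends only on 0.7 and the two sample sizes, which is exactly why the measure says little about the drugs themselves.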

As to "how to choose alpha and beta", the prevalence of the
practice of (often apparently blindly) adopting conventional
values really does obscure the essence of the matter: that it
is essentially arbitrary and amounts to a "political decision"
unless explicit grounds are given to show that the choice is
in some sense good.

The use of conventional P-values (alpha) like 0.10, 0.05, 0.01
etc. was the result of R.A. Fisher deciding to produce tables
in a more compact form, and requiring less computational effort,
than the then existing tables. The original Biometrika Tables For
Statisticians were published by Karl Pearson from his Biometric
Laboratory in 1914, and presented (for the basic distributions
of statistics like Normal, Chi-squared, T, etc.) the results of
numerical computation over the full range of the variate,
so you could (with interpolation) look up the P-value for any
possible value of the variate, or read the table in reverse to
obtain the variate corresponding to any P-value. Fisher's
tables (and later the Fisher & Yates Statistical Tables for
Biological, Agricultural and Medical Research Workers) did it
in reverse: for a selection of P-values (.99, .98, .90, .80,
.70, .50, .30, .20, .10, .05, .02, .01) the corresponding
values of the variate were tabulated, so the "P-bracket" for
any given value could be readily found. These were much smaller
than Pearson's enormous tomes. See, for instance, pp. 245-247
of "R.A. Fisher: The Life of a Scientist" by Joan Fisher Box.

Along with this went Fisher's own expressed views about the
importance or interpretation of P-values. For instance, the
following can be read in Section 20 (Chapter IV "Goodness
of Fit, Etc.") of his "Statistical Methods for Research Workers":

"In preparing this table we have borne in mind that
in practice we do not want to know the exact value
of P for any observed chi-squared, but, in the first
place, whether or not the observed value is open to
suspicion." [in the context of goodness of fit, here]
"If P is between 0.1 and 0.9 there is certainly no
reason to suspect the hypothesis tested. If it is
below 0.02 it is strongly indicated that the
hypothesis fails to account for the whole of the facts."
[...]
"The actual value of P obtainable from the table by
interpolation indicates the strength of the evidence
against the hypothesis. A value of chi-squared exceeding
the 5 per cent. point is seldom to be disregarded."

Thus Fisher himself was cautious and qualified about the
interpretation of P-values, explicitly describing them
as measures of strength of evidence and justification for
degrees of suspicion, and *not* definitively defining
them as cut-offs for rejection or acceptance.

Nevertheless, the existence of the tables with their selected
"cut-points", and Fisher's enormous methodological influence
on research methods, led to the wide-spread adoption of his
methods and their associated tables. As a result, workers
unwilling (or unable) to appreciate the subtleties and nuances
which Fisher had hinted at began to routinely adopt such P-values
as if they were real cut-offs between the existence of an effect
and the non-existence of an effect. In terms of scientific
inference, this is simply invalid.

It is a different matter when a decision has to be made, based
on the outcome of an uncertain result. "Decision", by etymology,
means "cut-off". It may be a decision as to whether or not to
adopt a different treatment for a disease, or whether to invest
resources in further trials, or even whether or not to convict
for a crime.

For example, the UK statutory limit for driving with alcohol
in the blood is 80mg/100ml. If measured by a laboratory test
on a blood sample, it is (or was) the case that the SE of a
determination was certified to be not greater than 2%. Then
3 times the SE (or 2 units if result < 100) was subtracted
from the result, giving the level quoted in evidence. If this
was more than 80, conviction was certain unless a failure
in technical process could be proved. This gives a numerical
measure for the level of "proof beyond reasonable doubt"
required in criminal law: the corresponding value of alpha
is less than 1/1000.

This gives rise to the corresponding Power Function (1-beta).
If the driver's true value is exactly 80 (so on the limit of
"Not Guilty"), then he is protected at level 999/1000 or more
against false conviction (Power = alpha <= 1/1000 at this level).
The Power rises to 50% at 86mg/100ml, and to 999/1000 or more
at 92mg/100ml. Hence there is a "grey area" from 80-92 over
which the conviction probability rises from negligible to
almost certain.
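
A minimal Python sketch of this power function, assuming Normal measurement error with an SE of exactly 2mg/100ml and a deduction of 3 SE (6mg/100ml) before comparison with the limit. The function and constant names are illustrative.

```python
from statistics import NormalDist

LIMIT = 80.0        # statutory limit, mg/100ml
SE = 2.0            # assumed measurement SE, mg/100ml
DEDUCTION = 3 * SE  # amount subtracted from the lab result

def conviction_probability(true_level, limit=LIMIT, se=SE, deduction=DEDUCTION):
    """P(measured level - deduction > limit) under Normal measurement error."""
    return 1 - NormalDist(mu=true_level, sigma=se).cdf(limit + deduction)

for level in (80, 83, 86, 89, 92):
    print(level, round(conviction_probability(level), 4))
```

With the SE taken as exactly 2, the tail probabilities at 80 and 92 come out at about 0.0013 and 0.9987, i.e. of the order of the 1/1000 and 999/1000 quoted; a certified SE smaller than 2 would make them more extreme.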

In part, the "alpha = 1/1000" arises from the "proof beyond
reasonable doubt" principle -- we need a small risk of false
conviction. In part, it arises from the relatively precise
laboratory determination (not worse than 2%) and the common
sense fact that at 92mg/100ml you are not much more drunk
than at 80mg/100ml, so the "grey area" is acceptable. But
both of these are chosen on "political" grounds. However,
if the laboratory precision, for instance, had been much
worse (say 10%), then with the limit at 80 and the same alpha
you would have 50% chance of acquitting people driving with
104mg/100ml, and would only be almost sure of conviction
when it got up to 128mg/100ml -- which would not be acceptable.

One simple change in that case would be to lower the legal
limit to, say, 57.5mg/100ml. Then you would again have the
same probability (>= 999/1000) of convicting anyone at 92mg/100ml
or more (57.5 + 6*0.1*57.5 = 92).
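
The less precise scenario can be sketched the same way in Python, assuming Normal error with the SE fixed at 10% of the legal limit and the same 3 SE deduction. The function name and that reading of "10%" are assumptions made for illustration.

```python
from statistics import NormalDist

def conviction_probability(true_level, limit, relative_se):
    """Convict when the measurement minus 3 SE exceeds the limit.
    Assumes Normal error with SE fixed at relative_se * limit."""
    se = relative_se * limit
    return 1 - NormalDist(mu=true_level, sigma=se).cdf(limit + 3 * se)

# Limit 80 with a 10% SE: even odds at 104, near certainty only at 128.
print(round(conviction_probability(104, 80, 0.10), 3))
print(round(conviction_probability(128, 80, 0.10), 4))
# Limit lowered to 57.5: near certainty is reached at 92 again.
print(round(conviction_probability(92, 57.5, 0.10), 4))
```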

Back at the 2% SE: the state could alternatively decide that
if the true level is 80 or more then conviction must be almost
certain. This would amount to *adding* 3 SEs to the lab result
and quoting this as evidence. This amounts to adopting
alpha = 0.999, as opposed to the good old alpha = 0.001 -- a
reversal of the "innocent until proved guilty" principle.

Or, after that Dictator is overthrown and a more moderate one
takes over, a compromise might be struck at which anyone exactly
at 80mg/100ml has a 50% chance of conviction, so just quote
the lab result as found. This amounts to choosing alpha = 0.5.
This could be seen as politically acceptable, since anyone
close enough (> 74) to the limit to have > 1/1000 chance of
conviction was on thin ice in the first place, and deserves
what they get!

So, in such a context, alpha is whatever you want it to be,
along with the corresponding beta (or Power). And it all turns
on trade-off between alpha and beta.

The larger alpha (chance of false "rejection") the smaller is
beta (chance of failure to truly "reject"). There will be a
loss associated with either failure. So, from an administrative
point of view, one can adopt an alpha which will minimise the
expected loss, taking account of the prevalence of the state
being tested for in the population. (In the drink-driving
example, one could envisage writing into the Law that tests
carried out at 11pm on a Saturday night are evaluated with
a larger value of alpha than at 11am on a Wednesday ... ).
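
As a sketch of what "minimise the expected loss" could look like, the following Python code searches a grid of alpha values for a one-sided z-test. Everything numeric here is hypothetical: the true effect of 3 SE when the state is present, the two prevalences, and the relative losses (a false rejection is taken to cost twice as much as a miss).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expected_loss(alpha, shift, prevalence, loss_false_reject, loss_miss):
    """Expected loss of a one-sided z-test at level alpha.
    `shift` is the true effect in SE units when the state is present."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    beta = STD_NORMAL.cdf(z_crit - shift)
    return ((1 - prevalence) * alpha * loss_false_reject
            + prevalence * beta * loss_miss)

def best_alpha(shift, prevalence, loss_false_reject, loss_miss):
    """Grid search over alpha = 0.001 ... 0.999 for the smallest expected loss."""
    grid = [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda a: expected_loss(
        a, shift, prevalence, loss_false_reject, loss_miss))

alpha_rare = best_alpha(3.0, prevalence=0.05, loss_false_reject=2, loss_miss=1)
alpha_common = best_alpha(3.0, prevalence=0.30, loss_false_reject=2, loss_miss=1)
print(alpha_rare, alpha_common)
```

With these inputs the loss-minimising alpha is larger when the state is more prevalent, which is the logic behind the Saturday-night example.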

So it's all arbitrary, depending on what you want to achieve
and on how you evaluate the consequences of the decisions which
will be made as a result.

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Dec-09 Time: 13:08:41
------------------------------ XFMail ------------------------------

Braunholtz, David A.

Dec 10, 2009, 8:10:33 AM
to meds...@googlegroups.com
Being a good Bayesian (tautology of course) I have long thought that post hoc (eg in results) knowing the pre-hoc power calculations adds nothing to properly presented estimates with CIs (or likelihoods), except perhaps gives some idea of the competence of the research team ...
 
David
Immpact Project
University of Aberdeen


John Whittington

Dec 10, 2009, 8:20:35 AM
to meds...@googlegroups.com
At 13:10 10/12/2009 +0000, Braunholtz, David A. wrote:

Being a good Bayesian (tautology of course) I have long thought that post hoc (eg in results) knowing the pre-hoc power calculations adds nothing to properly presented estimates with CIs (or likelihoods), except perhaps gives some idea of the competence of the research team ...

Whilst I certainly agree with those conceptual sentiments, I suppose that one does not have to present (and Diana may not have been assuming this, either) the pre-hoc power calculations in the post-hoc situation - one can present post-hoc calculations, based on the observed (rather than guessed) variance of the data.  However, I'm not convinced that to do so is materially different in mathematical terms from presenting a CI, which most people would find easier to interpret.

Doug Altman

Dec 10, 2009, 9:14:15 AM
to meds...@googlegroups.com, meds...@googlegroups.com
There are good grounds for not reporting post hoc power calculations evaluating either the power to detect what has been seen or more likely what was not but might have been seen.

This issue was discussed 2 years ago on this list. The classic paper is

Goodman SN, Berlin JA.
The use of predicted confidence intervals when planning experiments and the
misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200-6.
(Erratum in: Ann Intern Med 1995;122(6):478.)

"Although there is a growing understanding of the importance of statistical power considerations when designing studies and of the value of confidence intervals when interpreting data, confusion exists about the reverse arrangement: the role of confidence intervals in study design and of power in interpretation. Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected, but exactly the opposite procedure is widely practiced. In this commentary, we present the reasons why the calculation of power after a study is over is inappropriate and how confidence intervals can be used during both study design and study interpretation."
 
There are many other papers on this theme including

Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001;55:19-24.

Levine M, Ensom MH. Post hoc power analysis: an idea whose time has passed? Pharmacotherapy 2001;21:405-409.

It is true up to a point that there is no value in knowing the prior sample size calculation when the results are presented. However, the CONSORT group argues that we wish to see this information presented. First, it gives some evidence of the care with which a study was designed (and of the clinical factors considered, including the clinically important treatment effect), and should help demonstrate that the study sample was not determined, for example, by stopping as convenient after multiple examinations of accumulating data. Second, if the ultimate sample size reported is notably different from the planned N we would wish to know why. Discrepancies are common. Charles et al (BMJ 2009) recently reported a study in which they examined trial reports published in high impact general medical journals; they found that "the difference between the sample size reported in the article and the replicated sample size calculation was greater than 10% in 47 (30%) of the 157 reports that gave enough data to recalculate the sample size."

However, Chan et al (BMJ 2008) showed that for many trials the details of the sample size calculation in the article do not match those included in the trial protocol, suggesting perhaps attempts to hide data driven changes.
 
(BTW I don't understand why so many people refer to the initial calculation as a "power calculation" when power is generally fixed and the sample size is calculated. The retrospective calculation is indeed a power calculation.)
 
BW
Doug


Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P.
Reporting of sample size calculation in randomised controlled trials: review.
BMJ. 2009 May 12;338:b1732.

Chan A-W, Hrobjartsson A, Jorgensen KJ, Gotzsche PC, Altman DG. Discrepancies in sample size
calculations and data analyses reported in randomised trials: comparison of publications with protocols.
BMJ 2008;337:a2299.

Brett Magill

Dec 10, 2009, 10:41:08 AM
to meds...@googlegroups.com
On Thu, Dec 10, 2009 at 2:35 AM, kornbrot <d.e.ko...@herts.ac.uk> wrote:
> Effect size has the advantage over p-values that it cannot be inflated by
> large N.

True, but you can also get large effect sizes that are untrustworthy
with small samples and/or large variances. An even better practice
would be to present confidence intervals around your effect sizes.

Of course, confidence intervals incorporate your "significance" or
alpha level as well, so it's really tough to get away from specifying
your acceptable error rate in advance. Though, in my experience folks
tend to do this somewhat more thoughtfully when using CIs. Where,
rather than automagically choosing .05, some tend to consider what an
acceptable error rate is. Though that might be more a matter of
context than the nature of CI vis a vis p values.
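
A short Python sketch of how the chosen level enters a confidence interval, using a Wald interval for a risk difference. The trial counts are hypothetical and the function name is illustrative; the Wald form is only one of several interval methods.

```python
from statistics import NormalDist
import math

def risk_difference_ci(events_a, n_a, events_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference in two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z * se, diff + z * se

# Hypothetical trial: 100/1000 deaths on control, 80/1000 on treatment.
for level in (0.90, 0.95):
    lower, upper = risk_difference_ci(100, 1000, 80, 1000, confidence=level)
    print(f"{level:.0%} CI for risk difference: {lower:.3f} to {upper:.3f}")
```

Moving from 95% to 90% narrows the interval but does not change the estimate, so the reader sees the effect size and its precision whichever level is used.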

Martin Holt

Dec 10, 2009, 11:41:41 AM
to meds...@googlegroups.com
Doug has thoughtfully included a number of references on this subject. There
is one I wish to emphasize, if I remember aright that it is this reference:

Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power
calculations for data analysis. Am Stat 2001;55:19-24

The reason why I thought I'd highlight this one (again, if I'm right, and I
think I am) is that it addresses the logic of what you are doing if you
perform a retrospective power calculation. While it makes sense to include
power and alpha in a prospective sample size calculation, when one
approaches the same calculation retrospectively one knows the results (the
experiment has crystallized into its final structure) so that recalculating
sample size after the event makes no sense.

I hope what I've just written does make sense: I remember the author
acknowledging that his argument against retrospective power calculations did
involve a bit of mind-bending !

Russ Lenth is more straightforward, but I don't remember him dealing with
the "no sense paradigm" of retrospective power calculations in as much
detail as Hoenig, et al.
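
One argument commonly associated with the Hoenig and Heisey paper is that "observed power" (power computed as if the true effect equalled the observed effect) is just a re-expression of the p-value. The sketch below is a minimal illustration of that idea, assuming a two-sided z-test; the function name is hypothetical and the p-values are arbitrary.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Power of a two-sided z-test when the true effect is assumed
    to equal the effect that was actually observed."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # rejection threshold
    # Probability that a new z statistic centred on z_obs lands in either tail.
    return (1 - STD_NORMAL.cdf(z_crit - z_obs)) + STD_NORMAL.cdf(-z_crit - z_obs)

for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

Under these assumptions a result sitting exactly on p = 0.05 has observed power of about one half, and any non-significant result has lower observed power still, so the calculation adds nothing beyond the p-value itself.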

Best Regards,

Martin Holt


Neil Shephard

Dec 10, 2009, 12:03:40 PM
to meds...@googlegroups.com
On Thu, Dec 10, 2009 at 4:41 PM, Martin Holt <m861...@btinternet.com> wrote:
>
> Doug has thoughtfully included a number of references on this subject. There
> is one I wish to emphasize, if I remember aright that it is this reference:
>
> Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power
> calculations for data analysis. Am Stat 2001;55:19-24
>
> The reason why I thought I'd highlight this one (again, if I'm right, and I
> think I am) is that it addresses the logic of what you are doing if you
> perform a retrospective power calculation. While it makes sense to include
> power and alpha in a prospective sample size calculation, when one
> approaches the same calculation retrospectively one knows the results (the
> experiment has crystallized into its final structure) so that recalculating
> sample size after the event makes no sense.
>
> I hope what I've just written does make sense: I remember the author
> acknowledging that his argument against retrospective power calculations did
> involve a bit of mind-bending !

I too think this is a very good paper.

A simple way of characterising the point that post-hoc power tests are
pretty pointless would be the scenario of trying to tell someone who's
just won the main prize in a lottery that they shouldn't bother
playing the lottery because they have a negligible chance of winning.

They won't care because they've already got their result!

Although unfortunately I doubt saying the same thing to someone who
hasn't won would stop a habitual player from playing again!

Neil

--
"The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data." ~ John Tukey (1986), "Sunset salvo". The American
Statistician 40(1).

Email - nshe...@gmail.com
Website - http://slack.ser.man.ac.uk/
Photos - http://www.flickr.com/photos/slackline/

Pedro Emmanuel Alvarenga Americano do Brasil

Dec 10, 2009, 6:28:14 PM
to meds...@googlegroups.com
Sorry I am late to this thread,

The points of view and discussion seem to have gone well beyond the original request, although they were very useful, for me at least. I was not aware that power estimation after the data were collected is not very useful, and I'm sure I will read some of the references posted here to understand this issue further.

The subject of the first email did not specify p-values in clinical trials, and the first thing that came to my mind was meta-analysis. About ten years ago, the recommended 'standard' p-value threshold for the Q statistic used to detect heterogeneity in meta-analysis was 0.1, mainly because of the usually small sample sizes in reviews, so thresholds such as .05 were considered too conservative for rejecting the null hypothesis. Later the I2 index was developed and quickly became more popular for the same purpose, and this recommendation became less important.
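
For readers unfamiliar with the two heterogeneity measures, here is a minimal Python sketch of Cochran's Q and the I2 index, assuming inverse-variance (fixed-effect) weights. The four effect estimates and variances are hypothetical. Q would then be referred to a chi-square distribution with k - 1 degrees of freedom (not computed here, to keep the sketch free of external libraries).

```python
def cochran_q_and_i2(effects, variances):
    """Cochran's Q and the I^2 index using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of the variation in Q beyond what chance alone would give.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2

# Hypothetical log odds ratios and their variances from four small studies.
effects = [0.10, 0.45, -0.20, 0.60]
variances = [0.04, 0.09, 0.06, 0.12]

q, df, i2 = cochran_q_and_i2(effects, variances)
print(f"Q = {q:.2f} on {df} df, I^2 = {100 * i2:.0f}%")
```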

Later in the body of the original message there was a concern about clinical trials. This issue has not been mentioned so far, so here goes the tip: p-value thresholds are usually set very small for interim analyses in clinical trials. There are recommendations and tables in several books on clinical trials, such as Pocock's or Friedman's, for estimating p-value thresholds in interim analyses, depending, of course, on power, expected effect, the number of interim analyses, and the sample size at each analysis. The rationale is to avoid interrupting a trial (either for harm or for benefit) based on spurious findings due to repeated testing of the same data.
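
The effect of repeated testing can be seen in a small simulation. The sketch below is an illustration under simplifying assumptions (normally distributed data with known variance, equally spaced looks, no true effect); the function name, the number of looks, and the stricter threshold of 0.016 are choices made for this example, not values taken from the books mentioned above.

```python
import random
from statistics import NormalDist

def any_look_rejection_rate(n_looks, nominal_alpha, n_trials=20000, per_look=50, seed=1):
    """Share of simulated null trials that cross the threshold at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - nominal_alpha / 2)
    hits = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Each look adds per_look observations; their sum is N(0, per_look).
            total += rng.gauss(0.0, per_look ** 0.5)
            n += per_look
            if abs(total) / n ** 0.5 > z_crit:   # z statistic on accumulated data
                hits += 1
                break
    return hits / n_trials

for looks, alpha in ((1, 0.05), (5, 0.05), (5, 0.016)):
    rate = any_look_rejection_rate(looks, alpha)
    print(f"{looks} look(s) at nominal p < {alpha}: overall false positive rate {rate:.3f}")
```

Testing five times at a nominal 0.05 pushes the overall false positive rate well above 5%, which is why a much smaller per-look threshold is needed to bring it back down.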

But this very seldom reaches the final reader of papers or technical reports.

Finally, although in the majority of cases .05 is widely accepted, pay attention to the particular object being discussed, because it can make a lot of sense to move the p-value decision threshold toward 0 or 1. A careful review of the research methods for the particular object or field may be required.

Kind regards to all,
May the force be with you,
A big hug, and may the force be with you,

Dr. Pedro Emmanuel A. A. do Brasil
Instituto de Pesquisa Clínica Evandro Chagas
Fundação Oswaldo Cruz
Rio de Janeiro - Brasil


--
To post a new thread to MedStats, send email to MedS...@googlegroups.com .
MedStats' home page is http://groups.google.com/group/MedStats .
Rules: http://groups.google.com/group/MedStats/web/medstats-rules


Steve Simon, P.Mean Consulting

Dec 10, 2009, 6:49:17 PM
to meds...@googlegroups.com
knouri wrote:

> Recently, a medical doctor colleague sent me the following query:
>
> “Do you have any stats or epidemiology papers that discuss justification
> of alpha and beta (p values and their interpretation), especially
> conditions that merit altering them from what is done conventionally? My
> recollection is that interventions with important primary outcomes (e.g.
> mortality) and minimal associated adverse effects may mean that one can
> change p<0.05 to p<0.1 so as not to conclude there are no differences
> between the arms”.
>
> I wonder if you could provide me with some information about this.

There should be some good papers about this, but I'm not aware of them.
The choice of alpha, beta, and sample size comes down to economics. I
tell this story to just about every client:

A researcher gets a 6-year, 10-million-dollar grant. At the end of the
6 years, he writes a report that reads "This is a new and innovative
surgical procedure, and we are 95% confident that the cure rate is
somewhere between 3% and 98%."

This emphasizes that alpha, beta, and power should be selected on the
basis of economic considerations. Here's how you do it.

Figure out the cost of a Type I error and a Type II error. Let the
alpha/beta ratio be determined by the ratio of these two costs (if cost
of Type I error is very big, make sure that alpha is very small).

Then, figure out the cost per patient and set the sample size so the
cost is equal to the sum of the costs of the two types of errors times
the probability of getting those errors. You might have to place a prior
on the possible population parameters.

No one does this, but at a minimum, we should choose the ratio of alpha
and beta equal to the severity of the two types of errors. The classic
choices (alpha=.05, beta=.20) presume that a Type I error is four times
as costly as a Type II error. I can imagine lots of situations where
alpha=.20 and beta=.05 would make a lot more sense.
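
A minimal sketch of the cost-based idea, under stated assumptions: a one-sided z-test with a fixed sample size (summarised by a hypothetical noncentrality value of 2.5), a 50:50 prior on the null, and costs expressed in arbitrary units. It picks the alpha that minimises expected error cost, which is one way to formalise the reasoning above rather than the exact rule described in the post.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def beta_for_alpha(alpha, ncp):
    """Type II error rate of a one-sided z-test with noncentrality ncp."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - ncp)

def cheapest_alpha(cost_type1, cost_type2, prior_null=0.5, ncp=2.5):
    """Grid-search the alpha that minimises expected error cost."""
    best = None
    for i in range(1, 500):
        alpha = i / 1000                      # 0.001 ... 0.499
        beta = beta_for_alpha(alpha, ncp)
        cost = prior_null * cost_type1 * alpha + (1 - prior_null) * cost_type2 * beta
        if best is None or cost < best[2]:
            best = (alpha, beta, cost)
    return best

for c1, c2 in ((4, 1), (1, 1), (1, 4)):
    alpha, beta, _ = cheapest_alpha(c1, c2)
    print(f"cost ratio I:II = {c1}:{c2} -> alpha = {alpha:.3f}, beta = {beta:.3f}")
```

In this toy setting, making the Type II error the costlier one pushes the chosen alpha above the chosen beta, which is the kind of situation where something like alpha=.20 and beta=.05 could be defended.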
--
Steve Simon, Standard Disclaimer
Two free webinars coming soon!
"What do all these numbers mean? Odds ratios,
relative risks, and number needed to treat"
Thursday, December 17, 2009, 11am-noon, CST.
"The first three steps in a descriptive
data analysis, with examples in PASW/SPSS"
Thursday, January 21, 2010, 11am-noon, CST.
Details at www.pmean.com/webinars

Neil Shephard

Dec 11, 2009, 5:06:42 AM
to meds...@googlegroups.com
On Thu, Dec 10, 2009 at 11:49 PM, Steve Simon, P.Mean Consulting
<n...@pmean.com> wrote:
> knouri wrote:
>
> The choice of alpha, beta, and sample size comes down to economics.

Out of curiosity...

Whilst the economic aspect is no doubt important how do you calculate
the value of an individual life and factor that into the equation(s)?

John Whittington

Dec 11, 2009, 8:22:20 AM
to meds...@googlegroups.com, meds...@googlegroups.com
At 14:14 10/12/2009 +0000, Doug Altman wrote:
>There are good grounds for not reporting post hoc power calculations
>evaluating either the power to detect what has been seen or more likely
>what was not but might have been seen. This issue was discussed 2 years
>ago on this list. The classic paper is Goodman SN, Berlin JA. ....

Yes, I am aware of that (and have probably participated in the previous
discussions), but ....

>It is true up to a point that there is no value in knowing the prior
>sample size calculation when the results are presented. However, the
>CONSORT group argues that we wish to see this information presented.
>First, it gives some evidence of the care with which a study was designed
>(and clinical factors, including clinically important treatment effect),
>and should help demonstrate that the study sample was not determined for
>example by stopping as convenient after multiple examinations of
>accumulating data. Second, if the ultimate sample size reported is notably
>different from the planned N we would wish to know why. .....

I know I am out on a limb here but, in a similar spirit to what Doug is
saying about the CONSORT group's feelings, I still cannot help but feel
that there is one situation in which a 'post-hoc' power calculation (or
post-hoc retrospective sample size estimation - either would do, as
described below) DOES perhaps have a role to play ....

Researchers may have conscientiously done the best they could to design an
adequate trial (which is what the CONSORT group likes to see) but their a
priori sample size estimation will have had to be based on some estimate of
the variance of the results, an estimate which is sometimes little more
than a 'guess', and which often proves to have been a poor estimate. It is
therefore possible to have a trial which was, in good faith, believed to
have a sufficient sample size to have adequate power but which, in the
event, proves to have had an inadequate sample size to have that power,
because the sample size estimate was based on what proved to be an
appreciable underestimate of variance. In that situation, and IF the study
fails to produce a 'significant' result (power is pretty moot if one has
obtained a 'significant' result), I personally feel that it is helpful to
qualify the presentation of those ('non-significant') results with a
demonstration that, with the variance actually observed, the sample size
used was actually inadequate to provide the desired degree of power. That
can be achieved by undertaking calculations using the observed variance -
either a 'power calculation', to demonstrate the low power with the actual
sample size used, or a retrospective 'sample size estimation' to
demonstrate what the sample size would have had to have been to achieve the
desired power, given the observed variance.

Is that totally unreasonable?
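
The calculation described above can be sketched as follows, assuming a two-arm comparison of means with a normal approximation. All numbers are hypothetical. Note that the clinically important difference stays at its pre-specified value; only the standard deviation is replaced by the observed one, which is different from plugging in the observed effect.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for comparing two means (z approximation)."""
    z_a = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_b = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

def power_at_n(delta, sd, n, alpha=0.05):
    """Approximate power with n per arm, ignoring the negligible far tail."""
    z_a = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(delta / (sd * (2 / n) ** 0.5) - z_a)

delta = 5.0          # clinically important difference (hypothetical)
planned_sd = 10.0    # SD guessed at the design stage (hypothetical)
observed_sd = 16.0   # SD actually seen in the trial (hypothetical)

n_planned = n_per_arm(delta, planned_sd)
print("planned n per arm:", n_planned)
print("n per arm needed given observed SD:", n_per_arm(delta, observed_sd))
print(f"power of planned n given observed SD: {power_at_n(delta, observed_sd, n_planned):.2f}")
```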

>(BTW I don't understand why so many people refer to the initial
>calculation as a "power calculation" when power is generally fixed and the
>sample size is calculated. The retrospective calculation is indeed a power
>calculation.)

I couldn't agree more, and attempt to always use the correct terminology,
although I sometimes go into autopilot and just repeat the incorrect
terminology in the message to which I'm responding! Mind you, as above,
retrospective sample size estimations (as well as power calculations) are
also a possibility.

Bruce Weaver

Dec 11, 2009, 9:11:41 AM
to MedStats
Some of us (including you, John) discussed this very idea in the 2007
thread "retrospective power calculations". I thought it was very
reasonable then, and still do. To find the thread, go to
http://groups.google.com/group/medstats, and use the "Search this
group" button.

Cheers,
Bruce

Braunholtz, David A.

Dec 11, 2009, 9:28:46 AM
to meds...@googlegroups.com
If we use a map as a metaphor for knowledge of the parameter being estimated, then perhaps:

The results tell you the gridref where you are. This doesn't depend on how you got there, of course.

The power / sample size calculations tell you something about how & why you got there, rather than somewhere else where you might rather have been. Which may be relevant for planning future expeditions.

Also perhaps about whether you believe the gridref !


David


John Whittington

Dec 11, 2009, 9:35:59 AM
to meds...@googlegroups.com
At 06:11 11/12/2009 -0800, Bruce Weaver wrote:
>Some of us (including you, John) discussed this very idea in the 2007
>thread "retrospective power calculations". I thought it was very
>reasonable then, and still do. To find the thread, go to
>http://groups.google.com/group/medstats, and use the "Search this
>group" button.

Yes, I recall that discussion, and it is one of very many times over the
years when I have 'said my piece' about this, which has represented my
humble view for a very long time. Whenever the topic comes up for
discussion, there are always some people who voice what appear to be
'blanket' objections to the entire concept of post-hoc power calculations/
sample size estimations. However, as I've tried to illustrate, I believe
that there should probably be some holes in that blanket, in as much as
there are some specific situations in which I believe such calculations can
be helpful - and certainly not harmful.

That's how I see it, anyway!

John Whittington

Dec 11, 2009, 9:46:13 AM
to meds...@googlegroups.com
At 14:28 11/12/2009 +0000, Braunholtz, David A. wrote:
>If we use a map as a metaphor for knowledge of the parameter being estimated,
>then perhaps:
>The results tell you the gridref where you are. This doesn't depend on
>how you got there, of course.
>The power / sample size calculations tell you something about how & why
>you got there, rather than somewhere else where you might rather have
>been. Which may be relevant for planning future expeditions.
>Also perhaps about whether you believe the gridref !

I personally tend to use analogies in terms of the design and
implementation of anything (an IT system, a mechanical tool or whatever)
which requires an estimate of the conditions under which it is to be used,
followed by testing to see if it performs as required. If it fails to
perform satisfactorily in those tests, there are two possibilities:

1. That it was designed on the basis of a reasonably correct estimate
of operational conditions but was simply not good enough.

2. That the estimate of operational conditions was, in fact, not
correct. Although the item/system as designed might well have been OK
under conditions such as those estimated in the design spec, it was not
adequate under the actual conditions encountered.

In such a situation, I would definitely like to know which of (1) and (2)
applies - and the same goes for the situation in which what I thought was
an adequate clinical trial did not produce a 'significant' result.

Martin Holt

Dec 11, 2009, 9:55:51 AM
to meds...@googlegroups.com
=Some of us (including you, John) discussed this very idea in the 2007
=thread "retrospective power calculations". I thought it was very
=reasonable then, and still do. To find the thread, go to
=http://groups.google.com/group/medstats, and use the "Search this
=group" button.

=Cheers,
=Bruce

=--
=Bruce Weaver
=bwe...@lakeheadu.ca
=http://sites.google.com/a/lakeheadu.ca/bweaver/Home
="When all else fails, RTFM."

More intuitively than mathematically, I feel that what you would have then
would be a snapshot, one of "any number" of snapshots, within some limits.
Taking Neil Shephard's excellent metaphor:

"A simple way of characterising the point that post-hoc power tests are
pretty pointless would be the scenario of trying to tell someone who's
just won the main prize in a lottery that they shouldn't bother
playing the lottery because they have a negligible chance of winning.

They won't care because they've already got their result!

Although unfortunately I doubt saying the same thing to someone who
hasn't won would stop a habitual player from playing again!"

Wouldn't the number of possible sample sizes calculated retrospectively
under such a scenario have a very wide coverage, and the snapshot fall
somewhere within there ? If so, use of the snapshot might still be useful,
but it wouldn't make the most use of the information, and might be
misleading.

This is more of a gut reaction of mine, and an ex-boss of mine said "The
trouble with gut reactions is that your guts are too close to your bottom !"
(He wasn't talking specifically to me !)

Best Wishes,

Martin Holt

Martin Holt

Dec 11, 2009, 10:05:01 AM
to meds...@googlegroups.com
Hi David,

Please could you tell me, if you know, why your subject heading has **SPAM**
and (Note X-RBL-Warning ) ?

TIA,
Martin

Braunholtz, David A.

Dec 11, 2009, 10:08:18 AM
to meds...@googlegroups.com

The system here put it on, probably because there were several messages with identical headings. I should have taken it off but it didn't register....

Marc Schwartz

Dec 11, 2009, 10:48:52 AM
to meds...@googlegroups.com
It seems to me that if one actually gets to the end of the study, only
then to realize that there was an insufficient number of subjects
because the underlying assumptions of the study proved false, then it
can be argued that to a large extent, there was a failure in the
safety oversight and review process.

There is an ethical obligation by the PIs and the independent IRBs/
DMCs/DSMBs to periodically review the study and to reasonably ensure
that the study has a chance to achieve its primary objective. If that
is unlikely, then where possible, they are to recommend changes that
do not compromise the scientific validity of the study, or if not
possible, to recommend that the study be stopped for futility.

With appropriate on-going review and decision making, the post-hoc
autopsy on the study characteristics can be implemented much earlier
in the timeline, reducing risks to the subjects (both to those already
enrolled and to those yet to be enrolled) and conserving budgets. In
that case, it also seems to me that knowing post-hoc power is less
helpful than in effect, a root cause analysis of the study failure.
Despite the best of intentions, were invalid assumptions made? Was
there a data quality problem? Were the inclusion/exclusion criteria
poorly specified? Were there too many protocol violations? Was
enrollment not proceeding in a reasonable fashion?

Sample size re-estimation alone is perhaps the easiest of all mid-
course corrections that can be made to a study design, without getting
into full blown prospectively specified adaptive designs. It is one
for which there is meaningful experience with regulatory bodies (i.e.
FDA) in allowing for such changes, even where such changes were not
pre-specified in the study protocol.

In this context, one is not altering the primary hypotheses and
objectives of the study, while preserving the investments already made
during the conduct of the study. Those investments are both human and
financial. Based upon the interim review, decisions can be made such
that if a reasonable (for some definition of 'reasonable') increase in
the sample size will enable the primary objectives to be achieved,
then that change in the protocol and study budget can be implemented.

On the other hand, there is likely to be some threshold beyond which a
sample size increase, along with the attendant budgetary impact, is not
practical and the study should be stopped.

More generally, it is for these reasons, that pre-specified adaptive
study designs and interim stopping rules are becoming more prevalent,
especially in early phase clinical trials, whether for drugs or devices.

HTH,

Marc Schwartz

mcap

Dec 11, 2009, 10:49:08 AM
to MedStats
I am not sure I agree that post hoc power calculations are always
unwarranted:

-There are many situations in which you just cannot estimate the
variance accurately when determining sample size. There are smaller
clinical trials for new interventions or things that haven't been
tested before where there isn't much of a way to estimate variance or
even a range.

-The idea of avoiding post hoc power calculations is based, in part,
on an assumption that the sample size was adequate. For major
projects with adequate resources this is probably the case. But,
there are many, many small projects (I am on an IRB), where the sample
size is determined by project resources or ability to recruit
patients.....not power. This is not ideal....but it is reality.

-The avoidance of post hoc power is probably also based on a good
understanding of significance. The majority of readers of most
journals however, still do not understand NHST. When they see not
significant they think no difference. I think in that case (or at
least certainly on the student level), telling the reader that there
was only a ....% chance of detecting a significant difference even if there
was one, can help to put things in perspective.

Someone used the excellent example of the lottery......

"someone who's just won the main prize in a lottery that they
shouldn't bother playing the lottery because they have a negligible
chance of winning. They won't care because they've already got their
result!"

That may be true. But you are dealing with the person who didn't
win. Power in that case would be like telling them.....

'you may have won, but we didn't have the ability to find out if you
did or not. Sorry.'

It's only in the case of a non significant result that you would be
concerned.

Marc



John Whittington

Dec 11, 2009, 11:00:24 AM
to meds...@googlegroups.com
At 14:55 11/12/2009 +0000, Martin Holt wrote:
>More intuitively than mathematically, I feel that what you would have then
>would be a snapshot, one of "any number" of snapshots, within some limits.
>Taking Neil Shephard's excellent metaphor:
>
>"A simple way of characterising the point that post-hoc power tests are
>pretty pointless would be the scenario of trying to tell someone who's
>just won the main prize in a lottery that they shouldn't bother
>playing the lottery because they have a negligible chance of winning.
>
>They won't care because they've already got their result!

Martin, as I recently wrote, any thought of a post-hoc calculation is
pointless (irrelevant/moot) if one has obtained a 'positive' result
('significant' result of a trial, or a lottery win).

The issue becomes worthy of debate when one does NOT have a positive
result, and may have made one's a priori decisions based on
assumptions/estimates which proved to have been very wrong ....

Imagine that the mechanisms (hence odds of winning) of some
raffle/lottery/whatever were not actually known but, on some basis someone
had estimated/guessed that the chances of winning were, say, 1 in 1000. On
that basis, he undertakes a 'sample size estimation' and concludes that if
he buys 50,000 tickets, he will have a very high probability of
winning. He buys his 50,000 tickets and does not win a main
prize. However, after the event, statistics of that raffle/lottery become
available, indicating that the true chances of winning were actually closer
to 1 in 14 million than 1 in 1000, thereby indicating that his sample size
estimation (and hence the 'power' of his pocketful of 50,000 tickets) was
actually woefully inadequate. In that situation, it seems to me to be
perfectly reasonable to revisit the sample size calculation in order to get
an idea as to whether his failure to win with 50,000 tickets was due to
extremely bad luck (in which case he might well try again), or whether
information available for a retrospective calculation indicated that the
50,000 had been far too low.
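
The arithmetic behind this lottery example can be checked with a few lines of Python. The sketch assumes each ticket is an independent chance of winning with the stated odds, which is a simplification of how a real lottery works.

```python
def chance_of_any_win(tickets, odds_against):
    """P(at least one win), treating tickets as independent 1-in-odds chances."""
    return 1 - (1 - 1 / odds_against) ** tickets

tickets = 50_000
print(f"assumed 1 in 1,000:     {chance_of_any_win(tickets, 1_000):.4f}")
print(f"actual 1 in 14 million: {chance_of_any_win(tickets, 14_000_000):.4f}")
```

Under the guessed odds a win is all but certain, whereas under the true odds the chance is well below 1%, so the failure to win says far more about the planning assumption than about bad luck.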

Returning to trials, to re-iterate, my personal feeling is that the only
situation in which a post-hoc calculation makes sense is if the a priori
sample size estimation had been undertaken using a variance estimate that
was appreciably lower than that observed in the trial AND the trial had
produced a 'non-significant' result. In that situation, I believe that it
can be valuable in helping one to decide 'what to do next'.

John Whittington

Dec 11, 2009, 11:09:22 AM
to meds...@googlegroups.com
At 09:48 11/12/2009 -0600, Marc Schwartz wrote:
>It seems to me that if one actually gets to the end of the study, only
>then to realize that there was an insufficient number of subjects
>because the underlying assumptions of the study proved false, then it
>can be argued that to a large extent, there was a failure in the
>safety oversight and review process.
>
>There is an ethical obligation by the PIs and the independent IRBs/
>DMCs/DSMBs to periodically review the study and to reasonably ensure
>that the study has a chance to achieve its primary objective. If that
>is unlikely, then where possible, they are to recommend changes that
>do not compromise the scientific validity of the study, or if not
>possible, to recommend that the study be stopped for futility.

I agree totally, but (a) we are talking about situations in which that has
not been done and (b) although I am (for these very reasons) a great
believer in sample size re-estimation and other forms of adaptive designs,
such designs remain (at least in my experience) rarities at
present. Goodness knows how many trials I have been involved with, or
aware of, over the past 30 years, but the number of them in which there was
any attempt to look in real time for errors in underlying assumptions
(particularly the variability of emerging results) could undoubtedly be
counted on the fingers of one hand.

Braunholtz, David A.

Dec 11, 2009, 11:26:18 AM
to meds...@googlegroups.com
We are now deep into the general 'significant / not-significant p-value' bear-trap, aren't we? If you want to plan a future study, use the current info (from results of last trial, plus others if any, in a meta-analysis), your now improved understanding of variability, plus your current understanding of a clinically important difference. Then examine how more info would add to the existing (maybe via meta-analysis). How does a power calc (revised or not) for the previous trial help ?

Whether a trial is 'significant' or not does NOT alter the information (likelihood) about the effect, in particular it does not alter the best estimate of the effect.

BW

David

Marc Schwartz

Dec 11, 2009, 11:47:25 AM
to meds...@googlegroups.com
I agree with your sentiments and experience John.

However, in a "post-Vioxx", "post-Avandia", "post-ICD" world, where
there has also been a clear political (and lawsuit driven) shift
towards safety, while still permitting a reasonable review and
approval process, I do think that we are going to see a material shift
in the operational paradigm of the oversight process.

The recent focus on the increasing use of independent DMCs/DSMBs alone
is a significant shift in thinking about the oversight process.

In a forward looking fashion, we are likely to see more and more
rigorous interim reviews of studies for the reasons discussed.

Regards,

Marc

Steve Simon, P.Mean Consulting

Dec 11, 2009, 12:19:38 PM
to meds...@googlegroups.com
Neil Shephard wrote:
> On Thu, Dec 10, 2009 at 11:49 PM, Steve Simon, P.Mean Consulting
> <n...@pmean.com> wrote:
>>
>> The choice of alpha, beta, and sample size comes down to economics.
>
> Out of curiosity...
>
> Whilst the economic aspect is no doubt important how do you calculate
> the value of an individual life and factor that into the equation(s)?

Oh, you're not curious, you just want to see me squirm. <grin>

If you believe in the phrase "a fate worse than death" then you believe
that the value of an individual life can be compared to the value of
other bad outcomes, such as being confined to a hospital bed and
tethered to life support.

The choice is highly individualistic and very personal, but the relative
weighting of outcomes including death needs to be done to make
intelligent decisions about whether to endure a harsh set of anti-cancer
therapies.

Now, I'd have to think a bit about how this relates to statistics. The
last time I looked, no statistician has ever been executed for making a
Type I or a Type II error.

Instead, it is society that suffers. People do die if an ineffective
drug is let onto the market because of a Type I error. They also can die
if an effective drug is kept off the market because of a Type II error.
So somebody has to tally the deaths caused by either type of error,
combine it with other types of misery caused by either type of error and
then make a decision. I see no other way to do this than by assigning
monetary values to all the bad outcomes and minimizing the expected
cost. If that means assigning a dollar value to a human life to allow
comparisons to other bad outcomes, so be it.

Do you have a better solution?
---

John Whittington

Dec 12, 2009, 9:51:39 AM
to meds...@googlegroups.com
I agree, and hope that we do, indeed, see trials improving in this fashion
in the foreseeable future. As I said, I have yet to see appreciable
evidence that it is happening.

Kind Regards,
John

At 10:47 11/12/2009 -0600, Marc Schwartz wrote:
>I agree with your sentiments and experience John.
>
>However, in a "post-Vioxx", "post-Avandia", "post-ICD" world, where
>there has also been a clear political (and lawsuit driven) shift
>towards safety, while still permitting a reasonable review and
>approval process, I do think that we are going to see a material shift
>in the operational paradigm of the oversight process.
>
>The recent focus on the increasing use of independent DMCs/DSMBs alone
>is a significant shift in thinking about the oversight process.
>
>In a forward looking fashion, we are likely to see more and more
>rigorous interim reviews of studies for the reasons discussed.


John Whittington

Dec 12, 2009, 10:02:20 AM
to meds...@googlegroups.com
At 16:26 11/12/2009 +0000, Braunholtz, David A. wrote:
> We are now deep into the general 'significant / not-significant p-value'
> bear-trap, aren't we? If you want to plan a future study, use the
> current info (from results of last trial, plus others if any, in a
> meta-analysis), your now improved understanding of variability, plus your
> current understanding of a clinically important difference. Then examine
> how more info would add to the existing (maybe via meta-analysis). How
> does a power calc (revised or not) for the previous trial help ?

In the real world, a post-hoc power calculation or sample size estimation
might well be a major factor in determining whether ANY future studies were
undertaken following a 'negative' result, or whether the treatment in
question would be 'abandoned'. Such decisions are often largely in the
hands of people who have no 'technical' (statistical, clinical etc.)
expertise. If they are told that a study designed to have adequate power
to detect an effect has failed to detect an effect, they may abandon the
project. However, if they can be shown that, with the benefit of
hindsight, it is apparent that the trial was not adequately powered for
purpose, then they might be more inclined to move forward into more
adequate trials.
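To make that concrete, here is a rough sketch of how the required sample size moves once the guessed SD is replaced by the observed one. It assumes a two-arm comparison of means, a two-sided test and the usual normal approximation; the function name and the numbers (difference of 5, SD of 10 planned versus 15 observed) are purely illustrative and not from any real trial.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical planning assumptions versus what the data later suggested.
planned = n_per_group(delta=5.0, sd=10.0)  # SD guessed at the design stage
revised = n_per_group(delta=5.0, sd=15.0)  # SD observed in the completed trial

print(planned, revised)
```

Because n scales with the square of the SD, a trial planned on too optimistic an SD can be badly undersized, which is the sort of argument that might persuade non-technical decision makers not to abandon a treatment prematurely.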

Chris Everyman

Dec 13, 2009, 6:55:19 AM
to MedStats
Hi Keramat et al.,



Just to elaborate on a few of the comments made by others, one needs
to distinguish between measures of evidence (as advocated by R. A.
Fisher), and control of error rates in decisions/ "inductive
behaviour" (as advocated by J. Neyman and E. Pearson). Under Fisher's
approach, p-values are treated as measures of evidence against the
null hypothesis. How much evidence is considered sufficient to discard
the null hypothesis depends upon the particular research problem.
Under the Neyman-Pearson approach, the researcher needs to weigh the
costs and consequences of making type I and type II errors for a
particular study, and set the alpha and beta levels accordingly when
designing the study. Note that neither approach advocates strict
adherence to specific values across all studies. As Ted described,
specific values (such as .10, .05, and .01) are remnants from the days
of yore when tables of distributions were developed and used. For
further explanation of these issues, see, e.g.,



Gigerenzer, G. (1993). The Superego, the Ego, and the Id in
statistical reasoning. In G. Keren and C. Lewis (Eds.), A handbook for
data analysis in the behavioral sciences: Methodological issues (pp.
311-339). Hillsdale, NJ: Lawrence Erlbaum Associates.

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual:
What you always wanted to know about significance testing but were
afraid to ask. In D. Kaplan (Ed.), The SAGE handbook of quantitative
methodology for the social sciences (pp. 391-408). Thousand Oaks, CA:
SAGE Publications.

Goodman, S. N. (1993). P values, hypothesis tests, and likelihood:
Implications for epidemiology of a neglected historical debate (with
discussion). American Journal of Epidemiology, Vol. 137, pp. 485-500. \
{Additional commentary by S. Greenland in AJE, Vol. 139, pp.
116-118.\}

Goodman, S. N. (1999a). Toward evidence-based medical statistics. 1:
The p value fallacy. Annals of Internal Medicine, Vol. 130, pp.
995-1004. {Available for *free* at http://www.annals.org/content/vol130/issue12/}

Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between
p’s and α’s in psychological research. Theory and
Psychology, Vol. 14, pp. 295-327.

Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of
evidence (p’s) versus errors (α’s) in classical
statistical testing (with discussion). The American Statistician, Vol.
57, pp. 171-182.

Hubbard, R., & Lindsay, R. M. (2008). Why p values are not a
useful measure of evidence in statistical significance testing. Theory
and Psychology, Vol. 18, pp. 69-88.



(Some of the above works also cover the Bayesian and likelihoodist
approaches, which I have neglected to discuss in this post solely for
the sake of brevity.)





Problems with so-called "standardised effect sizes" were described in



Greenland, S., Schlesselman, J. J., & Criqui, M. H. (1986). The
fallacy of employing standardized regression coefficients and
correlations as measures of effect. American Journal of Epidemiology,
Vol. 123, pp. 203-208.

Kim, J.-O., & Ferree, G. D. (1981). Standardization and causal
analysis. Sociological Methods and Research, Vol. 10, pp. 187-210.



(and elsewhere). Tangentially, these days I wonder to what extent the
continued use of "standardised effect sizes" is a reflection of the
failure of certain fields to deal with fundamental problems of
measurement? See,



Michell, J. (2000). Normal science, pathological science and
psychometrics. Theory and Psychology, Vol. 10, pp. 639-667. {Comments
and reply in T&P, Vol. 14, pp. 105-129}

Michell, J. (2003). Measurement: A beginner's guide. Journal of
Applied Measurement, Vol. 4, pp. 298-308.

Michell, J. (2008). Is psychometrics pathological science?
Measurement: Interdisciplinary Research and Perspectives, Vol. 6, pp.
7-24. {Additional commentaries in that same issue.}

Trendler, G. (2009). Measurement theory, psychology and the
revolution that cannot happen. Theory and Psychology, Vol. 19, pp.
579-599.



Best wishes,



Chris

Neil Shephard

Dec 14, 2009, 6:21:53 AM
to meds...@googlegroups.com
On Fri, Dec 11, 2009 at 5:19 PM, Steve Simon, P.Mean Consulting
<n...@pmean.com> wrote:
> Neil Shephard wrote:
>> On Thu, Dec 10, 2009 at 11:49 PM, Steve Simon, P.Mean Consulting
>> <n...@pmean.com> wrote:
>>>
>>> The choice of alpha, beta, and sample size comes down to economics.
>>
>> Out of curiosity...
>>
>> Whilst the economic aspect is no doubt important how do you calculate
>> the value of an individual life and factor that into the equation(s)?
>
> Oh, you're not curious, you just want to see me squirm. <grin>

Not my intention.

> If you believe in the phrase "a fate worse than death" then you believe
> that the value of an individual life can be compared to the value of
> other bad outcomes, such as being confined to a hospital bed and
> tethered to life support.
>
> The choice is highly individualistic and very personal, but the relative
> weighting of outcomes including death need to be done to make
> intelligent decisions about whether to endure a harsh set of anti-cancer
> therapies.

The economics are one factor, but in such a proposed treatment it
might also be worth factoring in the patient's quality of life.
Obviously this is subjective and varies between individuals, but there are
instruments such as EQ-5D which measure an individual's quality of life.

Taking account of the quality of life as well as the economic costs of
a particular treatment regime is important too.

> Now, I'd have to think a bit about how this relates to statistics. The
> last time I looked, no statistician has ever been executed for making a
> Type I or a Type II error.
>
> Instead, it is society that suffers. People do die if an ineffective
> drug is let onto the market because of a Type I error. They also can die
> if an effective drug is kept off the market because of a Type II error.
> So somebody has to tally the deaths caused by either type of error,
> combine it with other types of misery caused by either type of error and
> then make a decision. I see no other way to do this than by assigning
> monetary values to all the bad outcomes and minimizing the expected
> cost. If that means assigning a dollar value to a human life to allow
> comparisons to other bad outcomes, so be it.
>
> Do you have a better solution?

Nope, but I've never heard anyone explicitly put a dollar value on a
human life in the sample size calculations.
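For what it is worth, the "minimise the expected cost" idea can be sketched in a few lines. Everything below is an assumption made for illustration: the prior probability that the treatment works, the relative costs of the two errors, the effect size, SD and n are all invented, and the power formula is the normal approximation for a two-arm comparison of means (ignoring the negligible opposite tail). It is not a method anyone in this thread has proposed in this form.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def type2_rate(alpha, delta, sd, n_per_arm):
    """Approximate beta for a two-sided test of a difference in means."""
    se = sd * sqrt(2 / n_per_arm)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    power = Z.cdf(delta / se - z_crit)  # opposite tail ignored (assumption)
    return 1 - power


def expected_cost(alpha, p_effective, cost_type1, cost_type2,
                  delta, sd, n_per_arm):
    """Expected cost of a wrong decision, weighting each error by its prior."""
    beta = type2_rate(alpha, delta, sd, n_per_arm)
    return ((1 - p_effective) * alpha * cost_type1
            + p_effective * beta * cost_type2)


def best_alpha(**kwargs):
    """Grid search over alpha in (0, 0.5) for the minimum expected cost."""
    grid = [i / 1000 for i in range(1, 500)]
    return min(grid, key=lambda a: expected_cost(a, **kwargs))


# Hypothetical design: 50 per arm, difference of 5, SD of 10, 50:50 prior.
design = dict(p_effective=0.5, delta=5.0, sd=10.0, n_per_arm=50)

alpha_equal_costs = best_alpha(cost_type1=1.0, cost_type2=1.0, **design)
alpha_costly_miss = best_alpha(cost_type1=1.0, cost_type2=5.0, **design)
alpha_costly_false_alarm = best_alpha(cost_type1=5.0, cost_type2=1.0, **design)

print(alpha_equal_costs, alpha_costly_miss, alpha_costly_false_alarm)
```

The point of the sketch is only the direction of the effect: the more costly a missed effective treatment is relative to a false positive, the larger the alpha that minimises expected cost, and vice versa. The hard part, as discussed, is agreeing on the cost inputs.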

kornbrot

Oct 5, 2010, 3:21:58 AM
to meds...@googlegroups.com
Agree that post hoc power is uninformative and often misleading; I can see no role whatsoever for it.
Also agree that the main usefulness of a priori power is at the design stage.

My point is:
  1. when reporting results with p values above the conventional .05, it is useful to REMIND readers of the a priori power that was built into the design. This is particularly true for factorial designs, where the design is often geared to main effects. In such situations the power to detect interactions may be quite low.
  2. when evaluating the results of investigations that have already been conducted, a priori power is important for interpretation. Of course meta-analysis weighting procedures that include N are in fact including a priori power.

It is also important to distinguish presentation issues from completeness issues.
For post hoc group comparisons, means, SDs and Ns are SUFFICIENT; so are means, Ns and confidence intervals, or means, SEs and Ns. Which triple is used is a matter of PRESENTATION, as one can easily transform between triples.
The confidence interval triple is most compelling to many people for a 2 group comparison, or for chosen contrasts (binary comparisons) in a factorial design. Confidence intervals get muddier, especially when presented graphically, when one has multiple groups. This is particularly difficult in mixed designs, where the SD, and hence any confidence interval, for a between-group factor is different from that for the within-group factor.
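As a small illustration of how interchangeable the triples are, here is a sketch of the conversions for a single group. It assumes normal-based intervals (for small N one would use the t quantile instead), and the function names and the example summary (mean 12, SD 4, N 64) are made up.

```python
from math import sqrt
from statistics import NormalDist


def z_for(level=0.95):
    """Two-sided normal quantile for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def sd_to_se(sd, n):
    return sd / sqrt(n)


def se_to_sd(se, n):
    return se * sqrt(n)


def se_to_ci(mean, se, level=0.95):
    half_width = z_for(level) * se
    return mean - half_width, mean + half_width


def ci_to_se(lower, upper, level=0.95):
    return (upper - lower) / (2 * z_for(level))


# Hypothetical group summary: (mean, SD, N)
mean, sd, n = 12.0, 4.0, 64

se = sd_to_se(sd, n)                           # (mean, SE, N)
lower, upper = se_to_ci(mean, se)              # (mean, CI, N)
sd_back = se_to_sd(ci_to_se(lower, upper), n)  # and back to the SD

print(se, (lower, upper), sd_back)
```

Since each step is invertible given N, the choice between the three really is only about which form readers find easiest to digest.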


By a priori power, people usually mean power calculated on the basis of a particular N, a DESIRED minimum change and an ESTIMATED standard deviation.
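In code, that definition amounts to something like the sketch below. It assumes two equal-sized groups, a two-sided test and the normal approximation; the function name and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def a_priori_power(n_per_group, min_change, sd_estimate, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means."""
    se = sd_estimate * sqrt(2 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = min_change / se
    # Probability of landing in either rejection region under the alternative.
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


# Hypothetical design: 63 per group, minimum change 5, estimated SD 10.
power_main = a_priori_power(n_per_group=63, min_change=5.0, sd_estimate=10.0)

# The same N gives much less power for an effect half that size.
power_half_effect = a_priori_power(n_per_group=63, min_change=2.5, sd_estimate=10.0)

print(round(power_main, 3), round(power_half_effect, 3))
```

The second call is only a loose stand-in for the interaction point above: a design sized for the main effect has far less power for a smaller effect at the same N.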

Whether one chooses to PRESENT the statistical



On 10/12/2009 14:14, "Doug Altman" <doug....@csm.ox.ac.uk> wrote:

There are good grounds for not reporting post hoc power calculations evaluating either the power to detect what has been seen or more likely what was not but might have been seen.

This issue was discussed 2 years ago on this list. The classic paper is

Goodman SN, Berlin JA.
The use of predicted confidence intervals when planning experiments and the
misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200-6.
(Erratum in: Ann Intern Med 1995;122(6):478.)

"Although there is a growing understanding of the importance of statistical power considerations when designing studies and of the value of confidence intervals when interpreting data, confusion exists about the reverse arrangement: the role of confidence intervals in study design and of power in interpretation. Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected, but exactly the opposite procedure is widely practiced. In this commentary, we present the reasons why the calculation of power after a study is over is inappropriate and how confidence intervals can be used during both study design and study interpretation."
 
There are many other papers on this theme including

Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001;55:19-24.

Levine M, Ensom MH. Post hoc power analysis: an idea whose time has passed? Pharmacotherapy 2001;21:405-409.
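One way to see why "power to detect what has been seen" adds nothing is that, for a simple z test, it is a fixed function of the p value. The sketch below assumes a two-sided z test and treats the observed effect as if it were the true effect; the function name is made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist()


def observed_power(p_value, alpha=0.05):
    """'Observed' power of a two-sided z test, computed from the p value alone."""
    z_obs = Z.inv_cdf(1 - p_value / 2)   # |z| implied by the two-sided p value
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(z_obs - z_crit) + Z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))
```

A result with p exactly equal to alpha always has observed power of about one half, and any "non-significant" result necessarily has observed power below that, so the calculation cannot tell the reader anything the p value has not already said.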

It is true up to a point that there is no value in knowing the prior sample size calculation when the results are presented. However, the CONSORT group argues that we wish to see this information presented. First, it gives some evidence of the care with which a study was designed (and clinical factors, including clinically important treatment effect), and should help demonstrate that the study sample was not determined, for example, by stopping as convenient after multiple examinations of accumulating data. Second, if the ultimate sample size reported is notably different from the planned N we would wish to know why. Discrepancies are common. Charles et al (BMJ 2009) recently reported a study in which they examined trial reports published in high impact general medical journals; they found that "the difference between the sample size reported in the article and the replicated sample size calculation was greater than 10% in 47 (30%) of the 157 reports that gave enough data to recalculate the sample size."

However, Chan et al (BMJ 2008) showed that for many trials the details of the sample size calculation in the article do not match those included in the trial protocol, suggesting perhaps attempts to hide data-driven changes.

 
(BTW I don't understand why so many people refer to the initial calculation as a "power calculation" when power is generally fixed and the sample size is calculated. The retrospective calculation is indeed a power calculation.)
 
BW
Doug


Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P.
Reporting of sample size calculation in randomised controlled trials: review.
BMJ. 2009 May 12;338:b1732.

Chan A-W, Hrobjartsson A, Jorgensen KJ, Gotzsche PC, Altman DG. Discrepancies in sample size
calculations and data analyses reported in randomised trials: comparison of publications with protocols.
BMJ 2008;337:a2299.


At 13:20 10/12/2009, John Whittington wrote:
At 13:10 10/12/2009 +0000, Braunholtz, David A. wrote:

Being a good Bayesian (tautology of course) I have long thought that post hoc (eg in results) knowing the pre-hoc power calculations adds nothing to properly presented estimates with CIs (or likelihoods), except perhaps gives some idea of the competence of the research team ...

Whilst I certainly agree with those conceptual sentiments, I suppose that one does not have to present (and Diana may not have been assuming this, either) the pre-hoc power calculations in the post-hoc situation - one can present post-hoc calculations, based on the observed (rather than guessed) variance of the data.  However, I'm not convinced that to do so is materially different in mathematical terms from presenting a CI, which most people would find easier to interpret.


Kind Regards,


John

----------------------------------------------------------------
Dr John Whittington,      Voice:    +44 (0) 1296 730225
Mediscience Services      Fax:      +44 (0) 1296 738893
Twyford Manor, Twyford,    E-mail:  Joh...@mediscience.co.uk
Buckingham  MK18 4EL, UK            
----------------------------------------------------------------

_____________________________________________________

Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
University of Oxford
Wolfson College Annexe
Linton Road
Oxford OX2 6UD

email:  doug....@csm.ox.ac.uk
Tel:    01865 284400 (direct line 01865 284401)
Fax:    01865 284424
www:    http://www.csm-oxford.org.uk/

EQUATOR Network - resources for reporting research
www: http://www.equator-network.org/





Professor Diana Kornbrot
email:  d.e.ko...@herts.ac.uk
web:    http://web.me.com/kornbrot/KornbrotHome.html
Work
School of Psychology
 University of Hertfordshire
 College Lane, Hatfield, Hertfordshire AL10 9AB, UK
 voice:   +44 (0) 170 728 4626
   fax:     +44 (0) 170 728 5073
Home
 
19 Elmhurst Avenue
 London N2 0LT, UK
    voice:   +44 (0) 208 883  3657
    mobile: +44 (0) 796 890 2102
   fax:      +44 (0) 870 706 4997




