One tailed vs. Two tailed test

auda

unread,

Mar 11, 2001, 9:01:43 PM3/11/01

to

Hi, all,
We are testing a group of subjects on their performance in two different
conditions (say, A and B), and we are testing them individually. We have an
alternative hypothesis that reaction time in condition A should be longer
than in condition B, so we perform a one-tailed t test. However, for some
subjects, they showed the pattern reverse to our alternative hypothesis--RT
B> RT A, and the p value is significant under one tailed test.

Could we claimed that these "reversed" subjects showed "significant" results
in the opposite direction, or we should treat them as non-significant
results?

Thanks,
Erik

Thom Baguley

unread,

Mar 12, 2001, 12:20:48 PM3/12/01

to

If you do a one-tailed test, no. The fact that you are entertaining this
possibility suggests you should be using a two-tailed test. The one-tailed
test has no power to detect differences in the discounted (non-predicted)
direction hence should only be used when you would reject such a finding a priori.

I'm a bit puzzled as to why you test each participant individually? You'd
expect (unless the effect is huge) for some participants to go against the
average pattern. If you do need to test each person individually you need to
use the two-tailed non-directional test and use a correction for multiple
testing (e.g., Bonferonni or similar).

Thom

Jerry Dallal

unread,

Mar 12, 2001, 2:34:40 PM3/12/01

to

Don't do one-tailed tests.

Will Hopkins

unread,

Mar 12, 2001, 9:43:00 PM3/12/01

to

At 7:34 PM +0000 12/3/01, Jerry Dallal wrote:
>Don't do one-tailed tests.

If you are going to do any tests, it makes more sense to one-tailed
tests. The resulting p value actually means something that folks can
understand: it's the probability the true value of the effect is
opposite to what you have observed.

Example: you observe an effect of +5.3 units, one-tailed p = 0.04.
Therefore there is a probability of 0.04 that the true value is less
than zero.

There was a discussion of this notion a month or so ago. A Bayesian
on this list made the point that the one-tailed p has this meaning
only if you have absolutely no prior knowledge of the true value.
Sure, no problem.

But why test at all? Just show the 95% confidence limits for your
effects, and interpret them: "The effect could be as big as <upper
confidence limit>, which would mean.... Or it could be <lower
confidence limit>, which would represent... Therefore... " Doing it
in this way automatically addresses the question of the power of your
study, which reviewers are starting to ask about. If your study turns
out to be underpowered, you can really impress the reviewers by
estimating the sample size you would (probably) need to get a
clear-cut effect. I can explain, if anyone is listening...

Will
--
Will G Hopkins, PhD FACSM
University of Otago, Dunedin NZ
Sportscience: http://sportsci.org
A New View of Statistics: http://newstats.org
Sportscience Mail List: http://sportsci.org/forum
ACSM Stats Mail List: http://sportsci.org/acsmstats
----------------------------------------------------
Be creative: break rules.

=================================================================
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
http://jse.stat.ncsu.edu/
=================================================================

Donald Burrill

unread,

Mar 12, 2001, 11:46:28 PM3/12/01

to

On Tue, 13 Mar 2001, Will Hopkins wrote in part:

> Example: you observe an effect of +5.3 units, one-tailed p = 0.04.
> Therefore there is a probability of 0.04 that the true value is less
> than zero.

Sorry, that's incorrect. The probability is 0.04 that you would find an
effect as large as +5.3 units (or more), if (a) the true value is zero
and (b) the sampling distribution of the test statistic is what you think
it is. (The probability of finding an effect this large, in this
direction, is less than 0.04 if the true value is less than zero (and
your sampling distribution is correct).)

< snip >

> But why test at all? Just show the 95% confidence limits for your
> effects, and interpret them: "The effect could be as big as <upper
> confidence limit>, which would mean.... Or it could be <lower
> confidence limit>, which would represent... Therefore... " Doing it
> in this way automatically addresses the question of the power of your
> study, which reviewers are starting to ask about. If your study turns
> out to be underpowered, you can really impress the reviewers by
> estimating the sample size you would (probably) need to get a
> clear-cut effect. I can explain, if anyone is listening...

You had in mind, I trust, the _two-sided_ 95% confidence interval!
-- Don.
----------------------------------------------------------------------
Donald F. Burrill dbur...@xtdl.com
348 Hyde Hall, Plymouth State College, dbur...@mail.plymouth.edu
MSC #29, Plymouth, NH 03264 (603) 535-2597
184 Nashua Road, Bedford, NH 03110 (603) 471-7128

Herman Rubin

unread,

Mar 13, 2001, 10:02:27 AM3/13/01

to

In article <p0433010fb6d329af7d2d@[139.80.121.126]>,

Will Hopkins <will.h...@otago.ac.nz> wrote:
>At 7:34 PM +0000 12/3/01, Jerry Dallal wrote:
>>Don't do one-tailed tests.

>If you are going to do any tests, it makes more sense to one-tailed
>tests. The resulting p value actually means something that folks can
>understand: it's the probability the true value of the effect is
>opposite to what you have observed.

>Example: you observe an effect of +5.3 units, one-tailed p = 0.04.
>Therefore there is a probability of 0.04 that the true value is less
>than zero.

This is certainly not the case, except under highly dubious
Bayesian assumptions.

>There was a discussion of this notion a month or so ago. A Bayesian
>on this list made the point that the one-tailed p has this meaning
>only if you have absolutely no prior knowledge of the true value.
>Sure, no problem.

This is not possible; the idea of "insufficient reason" is
full of contradictions, and is a major reason for the failure
of Bayesian inference to be pursued in the 19th century.

There is generally a prior probability that the effect will
be small. Unless there are enough observations that the
scale of "small" is so spread out that it looks large, the
probability statement you have made does not have any real
justification. Also, should you care if the difference is
that small?

>But why test at all? Just show the 95% confidence limits for your
>effects, and interpret them: "The effect could be as big as <upper
>confidence limit>, which would mean.... Or it could be <lower
>confidence limit>, which would represent...

Fixed coverage confidence limits, either classical or
Bayesian, likewise are not appropriate from the real
problem, which is what action to take.

Therefore... " Doing it
>in this way automatically addresses the question of the power of your
>study, which reviewers are starting to ask about. If your study turns
>out to be underpowered, you can really impress the reviewers by
>estimating the sample size you would (probably) need to get a
>clear-cut effect. I can explain, if anyone is listening...

This is more than one can do. Consider ALL the consequences
of the action; you only look at some of them. Also, do this
in ALL the states of nature.
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

dennis roberts

unread,

Mar 13, 2001, 10:12:33 AM3/13/01

to

we have to first separate out 2 things:

1. some test statistics are naturally (the way they work anyway) ONE sided
with respect to retain/reject decisions

example: chi square test for independence ... we reject ONLY when chi
square is LARGER than some CV ... to put a CV at the lower end of the
relevant chi square distribution makes no sense

2. whether for our research hypothesis ... rejection of the null is
something that makes sense to BE ABLE to do regardless if the evidence
suggests that the effect is LESS than the null or MORE than the null

example: typical treatments could have positive or negative effects (even
though obviously, we predict + effects) ... thus, when doing a typical two
sample t test (if you are interested in differences in means) ... we make
both an upper AND lower rejection region ... ie, two tailed TEST

but, in some cases, it might be totally unthinkable for one end of the
statistical distribution to be "useful" in a given case ... say we have a
weight loss regimen program ... consisting of diet and exercise ... and
want to know if it works ... ie, people lose weight ... now, in this case
(it could be) one might argue that it is difficult to conceptualize that
the regimen would actually "cause" one to GAIN weight ... so, to put some
rejection area on that end of the t distribution would seem silly ... thus,
we might be able to make the case that it is perfectly legitimate to use a
one tailed test in this case ... (done BEFORE hand of course ... not just
after the fact because your 2 tailing approach failed to allow you to
reject the null)

At 03:08 PM 3/13/01 +1300, Will Hopkins wrote:
>At 7:34 PM +0000 12/3/01, Jerry Dallal wrote:
>>Don't do one-tailed tests.
>
>If you are going to do any tests, it makes more sense to one-tailed
>tests. The resulting p value actually means something that folks can
>understand: it's the probability the true value of the effect is opposite
>to what you have observed.

=================================================================

Jerry Dallal

unread,

Mar 13, 2001, 10:59:07 AM3/13/01

to

Will Hopkins wrote:
>
> At 7:34 PM +0000 12/3/01, Jerry Dallal wrote:
> >Don't do one-tailed tests.
>
> If you are going to do any tests, it makes more sense to one-tailed
> tests.

If you're doing a 1 tailed test, why test at all? Just switch from
standard treatment to the new one. Can't do any harm. Every field
is littered with examples where one-tailed tests would have led to
disasters (harmful treatments missed, etc.) had they been used.

Robert J. MacG. Dawson

unread,

Mar 13, 2001, 11:03:04 AM3/13/01

to

dennis roberts wrote:
>
> we have to first separate out 2 things:
>
> 1. some test statistics are naturally (the way they work anyway) ONE sided
> with respect to retain/reject decisions
>
> example: chi square test for independence ... we reject ONLY when chi
> square is LARGER than some CV ... to put a CV at the lower end of the
> relevant chi square distribution makes no sense

I don't know about that... In the "sterile" conditions assumed in intro
textbooks ("Doubt that the stars are fire...but never doubt the validity
of the assumed model") it makes no sense; however in practice it makes
plenty of sense to stop and reconsider the whole model - thus
effectively rejecting the null hypothesis, albeit in favor of the third
hypothesis

H_oz: Toto, we're not in Kansas anymore.

If you want to apply the hypothesis testing approach to this
consistently rather than Trusting the Force, you have to decide just how
low chi-squared can be before you decide that This Cannot Happen. Then
that's a critical value.

You could even modify your alpha, representing a probability under the
reinforced null hypothesis

H_oo: all the assumptions hold and moreover we have independence.

of rejecting at one end or the other.

Just a thought...

-Robert

RD

unread,

Mar 13, 2001, 2:50:22 PM3/13/01

to

On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:

>1. some test statistics are naturally (the way they work anyway) ONE sided
>with respect to retain/reject decisions
>
>example: chi square test for independence ... we reject ONLY when chi
>square is LARGER than some CV ... to put a CV at the lower end of the
>relevant chi square distribution makes no sense
>

Hmm... do not want to start flame war but just can not go by such HUGE
misconception about chi squared test. Indeed exactly reverse is true :
chi squred test is always two tailed. There is nothing to prove just
look at the definition : Khi^2(n)=sum(Z^2).
Altogether with many other answers I saw on sci.stat.* this makes grow
my desire to unsubscribe.
Now getting back to original question. If you declared to carry on one
tailed test and this was not significant your conclusion is simple as
follows: "We could not show that reaction time in condition A is
longer than in condition B.(full stop)". That is your main conclusion.
Now on the side you can play around to try to explain this (as in your
case it appears that the reason was in small subset). And conclude
that to show this you are going to start another study.
Finally on the subject of your message. My ansver is : ALWAYS DO TWO
TAILED TESTS. In a nutshell there are two major resons to do two
tailed tests. First your problem is a good example - you tested if A
was superior to B instead to test the difference and you failed.
Second imagine your test reached 5% barreer. In this case you will
probably give the reader mean difference with its confidence interval.
This CI is 95% and may contain 0. Seems weird isn't it?
Incidentally my opinion agrees with international harmonisation
guidelines. Just dig FDA site to find them. There are half-page
additional explanations why one tailed tests with 5% are unacceptable.
The result you can not submit a drug for approval based on studies
with one tailed 5% rate tests.

I am dermatologist not statistitian and all those questions seems
obvious to me. I am disappointed.

st...@mimosa.csv.warwick.ac.uk

unread,

Mar 13, 2001, 4:18:38 PM3/13/01

to

In article <33ssatgue6q2iqdlm...@4ax.com>,

RD <al...@mail.ru> writes:
>On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:
>
>>1. some test statistics are naturally (the way they work anyway) ONE sided
>>with respect to retain/reject decisions
>>
>>example: chi square test for independence ... we reject ONLY when chi
>>square is LARGER than some CV ... to put a CV at the lower end of the
>>relevant chi square distribution makes no sense
>>
>Hmm... do not want to start flame war but just can not go by such HUGE
>misconception about chi squared test. Indeed exactly reverse is true :
>chi squred test is always two tailed. There is nothing to prove just
>look at the definition : Khi^2(n)=sum(Z^2).

Please amplify what you mean by "just look at the definition":
If you mean that positive and negative residuals (Obs_i - Exp_i)
both increase the "lack-of-fit", then that is universally recognised.

For a chi-squared test, "two-tailed" means that you are interested
not only in lack-of-fit [(Obs_i - Exp_i) is typically too far from zero],
but also in too good a fit [(Obs_i-Exp_i) is almost invariably too small].
The lack-of-fit is usually of most interest, given that scientists are
allegedly honest and fairly objective, albeit optimistic about their own
pet theory/treatment. The suspiciously-good-fit looks for scientific fraud
(e.g. Mendel, or Mendel's assistants, produced unreasonably good fits to
his theories on wrinkly green peas etc.; Cyril Burt's "data sets" produced
unreasonably good fits to his theories on IQ and genetic inheritance).

Usually in the "one- vs two-tailed debate", people are talking about
t-tests or similar, where deviations from the null hypothesis in two
opposing directions (new treatment best / standard treatment best)
are both of interest. This is totally different from traditional
chi-squared or similar tests.

>Altogether with many other answers I saw on sci.stat.* this makes grow
>my desire to unsubscribe.

Why?

>Now getting back to original question. If you declared to carry on one
>tailed test and this was not significant your conclusion is simple as
>follows: "We could not show that reaction time in condition A is
>longer than in condition B.(full stop)". That is your main conclusion.

An alternative main conclusion is: "What a total klutz I was to apply
a ridiculous 1-tailed test when I could have applied a slightly less
ridiculous 2-tailed test").

>Now on the side you can play around to try to explain this (as in your
>case it appears that the reason was in small subset). And conclude
>that to show this you are going to start another study.
>Finally on the subject of your message. My ansver is : ALWAYS DO TWO
>TAILED TESTS. In a nutshell there are two major resons to do two
>tailed tests. First your problem is a good example - you tested if A
>was superior to B instead to test the difference and you failed.
>Second imagine your test reached 5% barreer. In this case you will
>probably give the reader mean difference with its confidence interval.
>This CI is 95% and may contain 0. Seems weird isn't it?
>Incidentally my opinion agrees with international harmonisation
>guidelines. Just dig FDA site to find them. There are half-page
>additional explanations why one tailed tests with 5% are unacceptable.
>The result you can not submit a drug for approval based on studies
>with one tailed 5% rate tests.
>
>I am dermatologist not statistitian and all those questions seems
>obvious to me. I am disappointed.

For me, the only important practical (as opposed to theoretical)
objection to carrying out a 1-tailed test is ethical. If an amateur
statistician decides that applying 10mg Cu per square metre is no
better for wheat yield than applying 10mg K per square metre,
then deciding to apply 10mg Cu/m^2 is their prerogative, their problem,
and an example of evolution in action. However, if they chose to
apply poison to my grandmother because it is no better than medically-
accepted standard treatment for multiple sclerosis, then I would object.
Forcibly. See "Decision Theory".

More importantly, I would say: DON'T DO TESTS. Instead, try to find
models that you would be prepared to use to predict the response
in as-yet untried circumstances.
--
J.E.H.Shaw [Ewart Shaw] st...@uk.ac.warwick TEL: +44 2476 523069
Department of Statistics, University of Warwick, Coventry CV4 7AL, U.K.
http://www.warwick.ac.uk/statsdept/Staff/JEHS/
yacc - the piece of code that understandeth all parsing

Herman Rubin

unread,

Mar 13, 2001, 4:32:15 PM3/13/01

to

In article <33ssatgue6q2iqdlm...@4ax.com>,

RD <al...@mail.ru> wrote:
>On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:

>>1. some test statistics are naturally (the way they work anyway) ONE sided
>>with respect to retain/reject decisions

>>example: chi square test for independence ... we reject ONLY when chi
>>square is LARGER than some CV ... to put a CV at the lower end of the
>>relevant chi square distribution makes no sense

>Hmm... do not want to start flame war but just can not go by such HUGE
>misconception about chi squared test. Indeed exactly reverse is true :
>chi squred test is always two tailed. There is nothing to prove just
>look at the definition : Khi^2(n)=sum(Z^2).

There is a way of looking at the chi-squared test otherwise.

In fact, a low chi-squared would constitute a question of
whether what purport to be random numbers really are.

RD

unread,

Mar 13, 2001, 5:53:17 PM3/13/01

to

On 13 Mar 2001 21:18:38 GMT, st...@mimosa.csv.warwick.ac.uk () wrote:

>In article <33ssatgue6q2iqdlm...@4ax.com>,
> RD <al...@mail.ru> writes:
>>On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:
>>
>>>1. some test statistics are naturally (the way they work anyway) ONE sided
>>>with respect to retain/reject decisions
>>>
>>>example: chi square test for independence ... we reject ONLY when chi
>>>square is LARGER than some CV ... to put a CV at the lower end of the
>>>relevant chi square distribution makes no sense
>>>
>>Hmm... do not want to start flame war but just can not go by such HUGE
>>misconception about chi squared test. Indeed exactly reverse is true :
>>chi squred test is always two tailed. There is nothing to prove just
>>look at the definition : Khi^2(n)=sum(Z^2).
>
>Please amplify what you mean by "just look at the definition":
>If you mean that positive and negative residuals (Obs_i - Exp_i)
>both increase the "lack-of-fit", then that is universally recognised.
>

The definition of chi squared density function is probability to get a
sum of squared random variables which follow centered normal
distribution. As they are squared this means that when we look at the
right of our value on the chi squared density graph we look for 1
minus cumulative density which basically corresponds to two tailed
test.
You states that this is universally recognied but take a look at
Dennis Roberts message quoted above to see that you are wrong.

large snip

>For me, the only important practical (as opposed to theoretical)
>objection to carrying out a 1-tailed test is ethical. If an amateur
>statistician decides that applying 10mg Cu per square metre is no
>better for wheat yield than applying 10mg K per square metre,
>then deciding to apply 10mg Cu/m^2 is their prerogative, their problem,
>and an example of evolution in action. However, if they chose to
>apply poison to my grandmother because it is no better than medically-
>accepted standard treatment for multiple sclerosis, then I would object.
>Forcibly. See "Decision Theory".

There is a huge gap between test and decision. I do not think that
your example is a good one. No better vs no difference ie 1tailed vs
2tailed does not make difference at all because in medicine these are
never used by decision making authorities. So you won't have to
object.
Indeed in this particular case you are talking about a slightly
different problem. In fact we are faced to dilemma. Usually any
treatment is tested against placebo. Thus if it is not different or no
better with2.5% we throw that molecule to recycle bin. Things are
quite different if there is already an effective (although never
perfect) treatement for your grand mom's multiple sclerosis. Helsinki
declaration exlicitly prohibits testing against placebo where such
treatment exists. If we think that a new treatment is just as
efficatious as the old one we have to use the so called equivalence
tests wich are far from perfect.
But this is another discussion which is far from our original subject.

>More importantly, I would say: DON'T DO TESTS. Instead, try to find
>models that you would be prepared to use to predict the response
>in as-yet untried circumstances.

For me "TO DO TEST" means to test my model. Yet that concept is never
tought at medical school. We are usually tought some cabbalistic
calculations then compare result to that table. This is probably why
some people may think that when the result in table of chi squred
table corresponds to cumulative distribution we are doing one tailed
test.
Alexandre Kaoukhov

dennis roberts

unread,

Mar 13, 2001, 5:23:04 PM3/13/01

to

well, help me out a bit

i give a survey and ... have categorized respondents into male and females
... and also into science major and non science majors ... and find a data
table like:

MTB > chisquare c1 c2

Chi-Square Test: C1, C2

Expected counts are printed below observed counts

non science science
C1 C2 Total
M 1 24 43 67
32.98 34.02

F 2 39 22 61
30.02 30.98

Total 63 65 128

Chi-Sq = 2.444 + 2.368 +
2.684 + 2.601 = 10.097
DF = 1, P-Value = 0.001

when we evaluate THIS test ... with the chi square test statistic we use in
THIS case ... in what sense would this be considered to be a TWO tailed
test? would we still be using say ... the typical value of .05 to make a
decision to retain or reject? would we be asking the tester to look up both
lower and upper CVs from a chi square distribution with 1 df ... and really
ask him/her to consider rejecting if the obtained chi squared value is
smaller than the lower CV?

in this case ... minitab is finding the area ABOVE 10.097 in a chi square
distribution with 1 df ... and recording it as the P value ...

of course, in a simple hypothesis test for a single population mean ... like

Test of mu = 31 vs mu not = 31

Variable N Mean StDev SE Mean
C5 20 28.10 6.71 1.50

Variable 95.0% CI T P
C5 ( 24.96, 31.24) -1.93 0.068

the p value that is listed is found by taking the area TO THE LEFT of -1.93
and to the RIGHT of +1.93 in a t distribution with 19 df ... and adding
them together

At 08:50 PM 3/13/01 +0100, RD wrote:
>On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:
>
> >1. some test statistics are naturally (the way they work anyway) ONE sided
> >with respect to retain/reject decisions
> >
> >example: chi square test for independence ... we reject ONLY when chi
> >square is LARGER than some CV ... to put a CV at the lower end of the
> >relevant chi square distribution makes no sense
> >
>Hmm... do not want to start flame war but just can not go by such HUGE
>misconception about chi squared test.

>Now getting back to original question.

>Incidentally my opinion agrees with international harmonisation

>guidelines. Just dig FDA site to find them. There are half-page
>additional explanations why one tailed tests with 5% are unacceptable.
>The result you can not submit a drug for approval based on studies
>with one tailed 5% rate tests.

agreement with another position is not sufficient evidence to discard the
notion that one tailed tests can be legitimate in some cases

are you suggesting that the model for drug research is always correct?

>I am dermatologist not statistitian and all those questions seems
>obvious to me. I am disappointed.
>
>

>=================================================================
>Instructions for joining and leaving this list and remarks about
>the problem of INAPPROPRIATE MESSAGES are available at
> http://jse.stat.ncsu.edu/
>=================================================================

_________________________________________________________
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:d...@psu.edu
http://roberts.ed.psu.edu/users/droberts/drober~1.htm

Will Hopkins

unread,

Mar 13, 2001, 5:54:52 PM3/13/01

to

Responses to various folks. And to everyone touchy about one-tailed
tests, let me make it quite clear that I am only promoting them as a
way of making a sensible statement about probability. A two-tailed p
value has no real meaning, because no real effects are ever null. A
one-tailed p value, for a normally distributed statistic, does have a
real meaning, as I pointed out. But precision of
estimation--confidence limits--is paramount. Hypothesis testing is
passe.

Donald Burrill queried my assertion about one-tailed p values
representing the probability that the true value is opposite in sign
to what you observed. Don restated what a one-tailed p represents,
as it is defined by hypothesis testers, but he did not show that my
assertion was false. He did point out that I have to know the
sampling distribution of the statistic. Yes, of course. I assumed a
normal (or t) distribution.

Here's one proof of my assertion, using arbitrary real values. I
always find these confidence-limit machinations a bit tricky. If
someone has a better way to prove this, please let me know.

Suppose you observe a value of 5.3 for some normally distributed
outcome statistic X, and suppose the one-tailed p is 0.04.

Therefore the sampling distribution is such that, when the true value
is 0, the observed values will be greater than 5.3 for 4% of the time.

Therefore, when the true value is not 0 but something else, T say,
then X-T will be greater than 5.3 for 4% of the time. (This is the
tricky bit. Don't leap to deny it without a lot of thought. It
follows, because the sampling distribution is normal. It doesn't
follow for sampling distributions like the non-central t.)

But if X-T > 5.3 for 4% of the time, then rearranging, T < 5.3-X for
4% of the time. But our observed value is 5.3, so T < 0 for 4% of the
time. That is, there is a 4% chance that the true value is less than
zero. QED.

Don also wrote

>You had in mind, I trust, the _two-sided_ 95% confidence interval!

Of course. I only thing I've got against 95% confidence intervals is
that they are too damn conservative, by half. The default should be
90% confidence intervals. I think being wrong about something (here,
the true value) 10% of the time is more realistic in human affairs.
But obviously, in any specific instance, it depends on the cost of
being wrong.

Dennis Roberts wrote:
>1. some test statistics are naturally (the way they work anyway) ONE
>sided with respect to retain/reject decisions

Look, forget test statistics. What matters is the precision of the
estimate of the EFFECT statistics. If you keep that in front of
everything else, the question of hypothesis testing with any number
of tails just vanishes into thin air. The only use for a test
statistic is to help you work out a confidence interval. Don't ever
report them in your papers.

Herman Rubin wrote about my assertion:

>This is certainly not the case, except under highly dubious
>Bayesian assumptions.

Herman, see above. And the only Bayesian assumption is what you
might call the null Bayesian: that there is no prior knowledge of
the true value. But any Bayesian- vs frequentist-type arguments here
are academic.

Jerry Dallal wrote, ironically:

>If you're doing a 1 tailed test, why test at all? Just switch from
>standard treatment to the new one. Can't do any harm. Every field
>is littered with examples where one-tailed tests would have led to
>disasters (harmful treatments missed, etc.) had they been used.

As you well know, Jerry, 5% is arbitrary.

Will

Alan McLean

unread,

Mar 13, 2001, 7:53:24 PM3/13/01

to

st...@mimosa.csv.warwick.ac.uk wrote:
>

>
> More importantly, I would say: DON'T DO TESTS. Instead, try to find
> models that you would be prepared to use to predict the response
> in as-yet untried circumstances.
> --

Hypothesis testing is simply one useful method of identifying 'models

that you would be prepared to use to predict the response

in as-yet untried circumstances.'

Any method has to use past experience ('sample data') to identify models
and choose between them.

Hypothesis testing is restricted in its use, but within its limitations
it is very useful.

Alan

--
Alan McLean (alan....@buseco.monash.edu.au)
Department of Econometrics and Business Statistics
Monash University, Caulfield Campus, Melbourne
Tel: +61 03 9903 2102 Fax: +61 03 9903 2007

Alan McLean

unread,

Mar 13, 2001, 8:26:52 PM3/13/01

to

Will Hopkins wrote:
>
> Responses to various folks. And to everyone touchy about one-tailed
> tests, let me make it quite clear that I am only promoting them as a
> way of making a sensible statement about probability. A two-tailed p
> value has no real meaning, because no real effects are ever null. A
> one-tailed p value, for a normally distributed statistic, does have a
> real meaning, as I pointed out. But precision of
> estimation--confidence limits--is paramount. Hypothesis testing is
> passe.
>

...............................

The only use for a test
> statistic is to help you work out a confidence interval. Don't ever
> report them in your papers.
>

This is arguably the case for research matters when estimating/testing a
mean - a confidence interval and a test are two ways of approaching the
same thing. Even there, the hypothesis testing approach is a useful way
of thinking. It is exactly the scientific method writ small. I also
happen to think that all tests should be one tailed, but almost
certainly not for the same reasons as Will's.

In 'practical statistics' such as quality control, one is only
interested if the sample mean is sufficiently close to what it should be
that one can proceed as if it does equal what it should - that is,
accept the null model and proceed - or not. If it is not, the 'true
value' (meaningless phrase!) is of no interest, so obtaining a
confidence interval is a waste of time. It could be done, but offers
nothing.

Hypothesis testing is essentially a method of selecting between models.
Should I use the model with mu = 0, or a model with mu not= 0? If the
latter, what value of mu should I use?

A more illuminating example is simple linear regression. Should I use
the model with beta = 0 (that is, the 'constant mean' model, Y = mu +
epsilon) or the model with beta not= 0 (that is, the varying mean model,
Y = alpha + beta*X)? This is clearly a choice between two different
models. Again one can resolve it by using a test statistic or by
calculating a confidence interval, but in both cases you are doing the
same thing - deciding between the two models.

The questionable thing about hypothesis testing is the fact that the
null model is privileged over the alternative. But this is resolved as
follows: if a test statistic is not significant (or equivalently, if the
confidence interval includes zero) then it does not matter which model
you choose. But you do have to choose, at least tentatively. (In a
quality control application you have to decide really; in research, you
choose tentatively.) All this means is that you make your decision on
some other basis than the statistics. For the regression example, we
would decide on the basis of simplicity. In a court case we decide on
the basis of fairness. In the case of research we decide on the basis of
accepted theory.

Hypothesis testing is certainly not passe!

Regards,

Jerry Dallal

unread,

Mar 14, 2001, 10:39:49 AM3/14/01

to

Will Hopkins wrote:
>
> Jerry Dallal wrote, ironically:
> >If you're doing a 1 tailed test, why test at all? Just switch from
> >standard treatment to the new one. Can't do any harm. Every field
> >is littered with examples where one-tailed tests would have led to
> >disasters (harmful treatments missed, etc.) had they been used.
>
> As you well know, Jerry, 5% is arbitrary.
>

It wasn't ironically and has nothing to do with 5%. As Marvin Zelen
has pointed out, one-tailed tests are unethical from a human
subjects perspective because they state that the difference can go
in only one direction (we can argue about tests that are similar on
the boundary, but I'm talking about how they are used in practice).
If the investigator is *certain* that the result can go in only one
direction, then s/he is ethically bound not to give a subject a
treatment that is inferior to another.

Consider yourself or someone near and dear with a fatal condition.
You go to a doc who says, "I can give you A with P(cure) in your
case of 20% or I can give you B for which P(cure) can't be less than
20% and might be higher. In fact, I wouldn't even consider B if
there weren't strong reasons to suspect it might be higher. And
let's not forget it can't be lower than 20%. I just flipped a
coin. YOU CAN'T HAVE "B"!"

st...@mimosa.csv.warwick.ac.uk

unread,

Mar 14, 2001, 10:55:34 AM3/14/01

to

In article <3AAEB9E8...@buseco.monash.edu.au>,

alan....@buseco.monash.edu.au (Alan McLean) writes:
>st...@mimosa.csv.warwick.ac.uk wrote:
>>
>> More importantly, I would say: DON'T DO TESTS. Instead, try to find
>> models that you would be prepared to use to predict the response
>> in as-yet untried circumstances.
>> --
>
>Hypothesis testing is simply one useful method of identifying 'models
>that you would be prepared to use to predict the response
> in as-yet untried circumstances.'

>...

Yes. Mea culpa. Of course I myself do tests (not least informal ones
during model-building), but I became carried away while responding after
a hard day at the office.

I've just given lectures including what I would call two-tailed
chi^2 and F tests, analysing well-known data from Georg Mendel
and from Cyril Burt, and resulting in very small values of the
corresponding test statistics, strongly suggesting fiddled data
[more accurately, strongly suggesting analysing further data sets
to see if the evidence of fiddling/massaging becomes overhelming].

Are some people using "two tailed" to mean something other than
looking for extreme values in both tails of the test statistic?

-- Ewart Shaw

J E H Shaw

unread,

Mar 14, 2001, 11:48:59 AM3/14/01

to

Thanks for your e-mail (which arrived much later than your post to
the newsgroup). I've already posted an apology and half-retraction
for saying something I didn't really mean!
-- Ewart

J.E.H.Shaw [Ewart Shaw] st...@uk.ac.warwick TEL: +44 2476 523069
Department of Statistics, University of Warwick, Coventry CV4 7AL, U.K.
http://www.warwick.ac.uk/statsdept/Staff/JEHS/
yacc - the piece of code that understandeth all parsing

=================================================================

Herman Rubin

unread,

Mar 14, 2001, 12:21:39 PM3/14/01

to

In article <p04330102b6d43bd5b798@[139.80.121.126]>,

This is one of the standard fallacies. The statement that
T < 5.3-X for 4% of the time is valid before X is observed,
but not after; this is true of all of the other statements
as well. It is approximately true after the observation if
T has a prior almost uniform distribution over a rather
large range, so the density of T can be assumed constant
in the calculation of the posterior distribution.

................

>Herman Rubin wrote about my assertion:
>>This is certainly not the case, except under highly dubious
>>Bayesian assumptions.

>Herman, see above. And the only Bayesian assumption is what you
>might call the null Bayesian: that there is no prior knowledge of
>the true value. But any Bayesian- vs frequentist-type arguments here
>are academic.

The "null Bayesian" is an EXTREMELY strong assumption, and
it is even somewhat contradictory, and the uniform distribution
over the real line is a much odder beast than even most who
understand mathematics think it is. A posterior cannot be
obtained from it by any legitimate mathematical operation;
this is not hard to prove. It is not at all surprising that
the attempted use of the "null Bayesian" assumption did not
foster the use of Bayesian procedures. It MAY be, as indicated
above, a reasonable approximation, but only that.

Herman Rubin

unread,

Mar 14, 2001, 12:53:01 PM3/14/01

to

In article <3AAEB9E8...@buseco.monash.edu.au>,

Alan McLean <alan....@buseco.monash.edu.au> wrote:
>st...@mimosa.csv.warwick.ac.uk wrote:

>> More importantly, I would say: DON'T DO TESTS. Instead, try to find
>> models that you would be prepared to use to predict the response
>> in as-yet untried circumstances.
>> --

>Hypothesis testing is simply one useful method of identifying 'models
>that you would be prepared to use to predict the response
> in as-yet untried circumstances.'

>Any method has to use past experience ('sample data') to identify models
>and choose between them.

>Hypothesis testing is restricted in its use, but within its limitations
>it is very useful.

I suggest that practitioners of statistics abandon their
current RELIGION and look at the problems. Testing is
needed, but the real problem is when one should accept
a hypothesis known to be false.

In some cases, this can be approximated by a point null,
but one should still use a decision approach to the problem.
The "significance level" should depend rather heavily on
both the problem and the sample size.

I have only seen one way of using fixed level testing
which I consider to be somewhat sensible; a mathematical
psychologist told me this one. He produces a model, and
collects data until it is rejected at the .05 level.
Then he looks at the fit, and decides whether to accept
his model as an approximation. The point of sampling
until rejection is to avoid accepting on a small number
of observations which accidentally fit.

Herman Rubin

unread,

Mar 14, 2001, 12:57:53 PM3/14/01

to

In article <3AAEC0E0...@buseco.monash.edu.au>,

Alan McLean <alan....@buseco.monash.edu.au> wrote:
>Will Hopkins wrote:

>> Responses to various folks. And to everyone touchy about one-tailed
>> tests, let me make it quite clear that I am only promoting them as a
>> way of making a sensible statement about probability. A two-tailed p
>> value has no real meaning, because no real effects are ever null. A
>> one-tailed p value, for a normally distributed statistic, does have a
>> real meaning, as I pointed out. But precision of
>> estimation--confidence limits--is paramount. Hypothesis testing is
>> passe.

...............................

< The only use for a test
<> statistic is to help you work out a confidence interval. Don't ever
<> report them in your papers.

>This is arguably the case for research matters when estimating/testing a
>mean - a confidence interval and a test are two ways of approaching the
>same thing. Even there, the hypothesis testing approach is a useful way
>of thinking. It is exactly the scientific method writ small. I also
>happen to think that all tests should be one tailed, but almost
>certainly not for the same reasons as Will's.

It is not the case even there. The scientific method is
concerned with making approximations, and it may be necessary
to accept approximations which clearly do not fit.

>In 'practical statistics' such as quality control, one is only
>interested if the sample mean is sufficiently close to what it should be
>that one can proceed as if it does equal what it should - that is,
>accept the null model and proceed - or not. If it is not, the 'true
>value' (meaningless phrase!) is of no interest, so obtaining a
>confidence interval is a waste of time. It could be done, but offers
>nothing.

This is correct, not only for quality control, but elsewhere.
The actual observations do not come from the null model. This
is a decision problem, and classical statistics is not appropriate
here, nor is it in scientific inference.

RD

unread,

Mar 14, 2001, 1:09:23 PM3/14/01

to

On 13 Mar 2001 14:23:04 -0800, d...@PSU.EDU (dennis roberts) wrote:

>well, help me out a bit
>
>i give a survey and ... have categorized respondents into male and females
>... and also into science major and non science majors ... and find a data
>table like:
>
>MTB > chisquare c1 c2
>
>Chi-Square Test: C1, C2
>
>
>Expected counts are printed below observed counts
>
>
> non science science
> C1 C2 Total
>M 1 24 43 67
> 32.98 34.02
>
>F 2 39 22 61
> 30.02 30.98
>
>Total 63 65 128
>
>Chi-Sq = 2.444 + 2.368 +
> 2.684 + 2.601 = 10.097
>DF = 1, P-Value = 0.001
>
>when we evaluate THIS test ... with the chi square test statistic we use in
>THIS case ... in what sense would this be considered to be a TWO tailed
>test?

Yes.

> would we still be using say ... the typical value of .05 to make a
>decision to retain or reject?

Yes most often. But as in any other case investigator is free to
choese a lesser value.

> would we be asking the tester to look up both
>lower and upper CVs from a chi square distribution with 1 df ... and really
>ask him/her to consider rejecting if the obtained chi squared value is
>smaller than the lower CV?

No

>in this case ... minitab is finding the area ABOVE 10.097 in a chi square
>distribution with 1 df ... and recording it as the P value ...

This is the right way to do. Yet I must emphasise that looking at one
side of distribution density does not mean the test is one sided.

>of course, in a simple hypothesis test for a single population mean ... like
>
>Test of mu = 31 vs mu not = 31
>
>Variable N Mean StDev SE Mean
>C5 20 28.10 6.71 1.50
>
>Variable 95.0% CI T P
>C5 ( 24.96, 31.24) -1.93 0.068
>
>the p value that is listed is found by taking the area TO THE LEFT of -1.93
>and to the RIGHT of +1.93 in a t distribution with 19 df ... and adding
>them together

If only you had more cases you could test normal distribution. Then
for (-1.93;1.93) p=0.054. Now look at this: 1.93^2=3.72;
P(Khi2>3.72)=0.054.

>>Incidentally my opinion agrees with international harmonisation
>>guidelines. Just dig FDA site to find them. There are half-page
>>additional explanations why one tailed tests with 5% are unacceptable.
>>The result you can not submit a drug for approval based on studies
>>with one tailed 5% rate tests.
>
>agreement with another position is not sufficient evidence to discard the
>notion that one tailed tests can be legitimate in some cases

Sure but you removed my other arguments... In SOME cases one tailed
test may be legitimate. I just do not see any. Some time ago I was
also puzzled why people do not use one tailed test which seem
intuitively obvious for a beguinner.
Now about agreement... It appears that this agreement is an
internationall consensus. In medicine consensus is one of the methods
to elaborate evidency. Although not perfect.

>are you suggesting that the model for drug research is always correct?

What model? Two tailed?

RD

unread,

Mar 14, 2001, 1:09:24 PM3/14/01

to

On 13 Mar 2001 16:32:15 -0500, hru...@odds.stat.purdue.edu (Herman
Rubin) wrote:

>In article <33ssatgue6q2iqdlm...@4ax.com>,
>RD <al...@mail.ru> wrote:
>>On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:
>
>>>1. some test statistics are naturally (the way they work anyway) ONE sided
>>>with respect to retain/reject decisions
>
>>>example: chi square test for independence ... we reject ONLY when chi
>>>square is LARGER than some CV ... to put a CV at the lower end of the
>>>relevant chi square distribution makes no sense
>
>>Hmm... do not want to start flame war but just can not go by such HUGE
>>misconception about chi squared test. Indeed exactly reverse is true :
>>chi squred test is always two tailed. There is nothing to prove just
>>look at the definition : Khi^2(n)=sum(Z^2).
>
>There is a way of looking at the chi-squared test otherwise.
>
>In fact, a low chi-squared would constitute a question of
>whether what purport to be random numbers really are.

What do you exactly mean by that?

Robert J. MacG. Dawson

unread,

Mar 14, 2001, 1:02:19 PM3/14/01

to

Jerry Dallal wrote:

>
> It wasn't ironically and has nothing to do with 5%. As Marvin Zelen
> has pointed out, one-tailed tests are unethical from a human
> subjects perspective because they state that the difference can go
> in only one direction (we can argue about tests that are similar on
> the boundary, but I'm talking about how they are used in practice).
> If the investigator is *certain* that the result can go in only one
> direction, then s/he is ethically bound not to give a subject a
> treatment that is inferior to another.

Basically correct, but I would argue that "inferior" can only
reasonably be interpreted in a braod context,in which it is not
synonymous to "more effective". A treatment may be slightly less
effective but superior because of lesser side effects (it is said that
castration, done early enough, will cure male pattern baldness - any
takers? Or how would you feel about a treatment involving major thoracic
surgery with a 100% success rate against the common cold?)

It may interfere less in quality of life (think of some dialysis
techniques that require somebody to spend several hours a day on a
machine). It may be ethically preferable for other reasons (consider the
debate about the use of fetal cells in the treatment of Parkinson's).

A slightly less effective treatment may be within somebody's means
whereas they simply cannot afford the absolutely most effective
treatment. It is said that the sort of intensive
one-physician-one-patient medical care enjoyed by many heads of state
has a significant health benefit. Should we all have our own private
doctors and who will care for _them_? It is nice to say that nobody
should be refused the best medical treatment because of cost, but there
seems no limit to the expensive medical procedures that can be developed
if there is a will to pay for them. Realistically, the human race is not
going to dedicate all its resources to health care; and even if we did
there would _still_ not be enough to go around.

Finally, a treatment may be more effective but unusable because its
effectiveness has not been demonstrated in a generally accepted way in
clinical trials, so that it is not permitted in some jurisdiction.

All this said, such "pidgin Bayesian" one-tailed tests are indeed
silly. A one-tailed test may be done where there is an _interest_ in one
tail, as in acceptance sampling, but not to artificially exaggerate a
p-value.

-Robert Dawson

jim clark

unread,

Mar 14, 2001, 3:11:55 PM3/14/01

to

Hi

On 13 Mar 2001, dennis roberts wrote:
> i give a survey and ... have categorized respondents into male and females
> ... and also into science major and non science majors ... and find a data
> table like:

> non science science
> C1 C2 Total
> M 1 24 43 67
> 32.98 34.02
>
> F 2 39 22 61
> 30.02 30.98
>
> Total 63 65 128
>
> Chi-Sq = 2.444 + 2.368 +
> 2.684 + 2.601 = 10.097
> DF = 1, P-Value = 0.001
>
> when we evaluate THIS test ... with the chi square test statistic we use in
> THIS case ... in what sense would this be considered to be a TWO tailed
> test?

In the sense that Chi^2 would give you the same value if your
cells had been 43 24 for Males and 22 39 for females (i.e., the
reverse direction of relationship between gender and
science). This is most easily demonstrated by calculating the
z-test equivalent to your chi^2 test.

z = (.6418-.3607)/[sqrt{.5078*.4922( (1/67) + (1/61))}]
= .2811 / .08847
= 3.1772

z^2 = chi^2

The p-value for z that corresponds to the chi^2 p-value is

p(z<= -3.1772 or z>= +3.1772) i.e., a "two-tailed" probability

If the proportions were reversed, z would become negative. This
makes it easier to see that there are directional (i.e.,
"one-tailed") and non-directional (i.e., "two-tailed") hypotheses
being tested by the chi^2.

The chi^2 distribution is equivalent to the z distribution
"folded over" so that both negative and positive tails of z are
in the upper (i.e., positive) tail of chi^2. The same
relationship holds between t and F. As we saw recently on this
(or another stats list), there is much confusion between
"one-tailed" in the sense of a directional test (which concerns
the direction of differences or correlations) and "one-tailed" in
the narrower sense of tail of distribution (e.g., chi^2). These
uses are _not_ equivalent. Perhaps less confusing if we use
"directional" or some other term besides "one-tailed" for the
first sense.

Best wishes
Jim

============================================================================
James M. Clark (204) 786-9757
Department of Psychology (204) 774-4134 Fax
University of Winnipeg 4L05D
Winnipeg, Manitoba R3B 2E9 cl...@uwinnipeg.ca
CANADA http://www.uwinnipeg.ca/~clark
============================================================================

dennis roberts

unread,

Mar 14, 2001, 2:50:39 PM3/14/01

to

At 03:39 PM 3/14/01 +0000, Jerry Dallal wrote:

>It wasn't ironically and has nothing to do with 5%. As Marvin Zelen
>has pointed out, one-tailed tests are unethical from a human
>subjects perspective because they state that the difference can go
>in only one direction (we can argue about tests that are similar on
>the boundary, but I'm talking about how they are used in practice).
>If the investigator is *certain* that the result can go in only one
>direction, then s/he is ethically bound not to give a subject a
>treatment that is inferior to another.
>
>Consider yourself or someone near and dear with a fatal condition.
>You go to a doc who says, "I can give you A with P(cure) in your
>case of 20% or I can give you B for which P(cure) can't be less than
>20% and might be higher. In fact, I wouldn't even consider B if
>there weren't strong reasons to suspect it might be higher. And
>let's not forget it can't be lower than 20%. I just flipped a
>coin. YOU CAN'T HAVE "B"!"

what can i say ... marvin zelen is wrong ...

it would only be unethical if a better alternative were available ... or
even a possibly better alternative were available ... and the investigator
or the one making the decision to give or not to give ... KNOWS this ...
AND HAS the ability to give this treatment to the patient ... and does NOT
do it

because a treatment might be known to be better, through a logical
deductive process or experimentation ... or potentially better ... does NOT
lead to unethical practice if this treatment is not adopted ...

implementations of treatments have consequences ... other than impact of
treatments ... there are COSTS ASSOCIATED WITH TREATMENTS and these costs
have to be weighed in from a cost/benefit perspective (maybe even take into
account IF the public WANTS this to be done) ... it is irresponsible NOT to
take other things into consideration

if the costs associated with treatments are so high compared to the (albeit
true) benefits ... one has to consider whether it would actually be
UNethical to go ahead and order up full implementation ... when society has
to shell out the $$$$

one vivid example: we KNOW for a fact that ... if we reduced the national
speed limit to 45 ... it would save thousands of lives ... though drivers
would be hopping mad (and road rage might cause some accidents ... the
reduction still would save many many lives) ...

are politicians, who make these decisions, acting in an unethical way NOT
to lower the national speed limit to 45? i don't think so

decisions to implement or not implement (regardless of evidence) in most
cases are some compromise between what we know MIGHT happen if we go
direction A ... but, we make a tempered decision to go in direction B ...
because of the realities of the overall situation

hypothesis testing ... is NO different

>=================================================================
>Instructions for joining and leaving this list and remarks about
>the problem of INAPPROPRIATE MESSAGES are available at
> http://jse.stat.ncsu.edu/
>=================================================================

_________________________________________________________

dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:d...@psu.edu
http://roberts.ed.psu.edu/users/droberts/drober~1.htm

=================================================================

Alan McLean

unread,

Mar 14, 2001, 6:15:47 PM3/14/01

to

Apart from making the observation that there are many applications of
tests that do not involve ethical considerations, I am not at all clear
how this example relates to one or two tailed testing.

There is certainly an argument that when trialling a new treatment (I
initially used the word 'testing' here, but figure that it may be
confused with the statistical test of the resultant data) it is
presumably expected to work. Consequently, if a person in the trial is
given a placebo, there is a clear expectation that he or she is being
disadvantaged - given an inferior treatment.

On the other hand, if a placebo is not used, the results of the trial
will be unclear. This will presumably disadvantage Society. The ethical
choice is then between disadvantaging a number of individuals by giving
them a treatment which is expected to be inadequate (rather than a
treatment which is expected to be better - but may not be!) and
disadvantaging society by reducing the increase in knowledge - which is
expected to advantage many people in the future.

This is certainly an ethical question (though I might argue that neither
choice is unethical if the choice is made ethically!) But I don't see
how the type of statistical test done in analysing the resultant data
can be ethical or not.

Regards,
Alan

> =================================================================
> Instructions for joining and leaving this list and remarks about
> the problem of INAPPROPRIATE MESSAGES are available at
> http://jse.stat.ncsu.edu/
> =================================================================

--

Alan McLean (alan....@buseco.monash.edu.au)
Department of Econometrics and Business Statistics
Monash University, Caulfield Campus, Melbourne
Tel: +61 03 9903 2102 Fax: +61 03 9903 2007

Alan McLean

unread,

Mar 14, 2001, 6:36:47 PM3/14/01

to

As we saw recently on this
> (or another stats list), there is much confusion between
> "one-tailed" in the sense of a directional test (which concerns
> the direction of differences or correlations) and "one-tailed" in
> the narrower sense of tail of distribution (e.g., chi^2). These
> uses are _not_ equivalent. Perhaps less confusing if we use
> "directional" or some other term besides "one-tailed" for the
> first sense.
>

Years ago the terms 'one sided' and 'two sided' were used rather than
'one tailed' and 'two tailed'. To me, this emphasises the statement of
the alternative hypothesis (that is, the directionality) rather than the
ends of the distribution used.

Alan

--
Alan McLean (alan....@buseco.monash.edu.au)
Department of Econometrics and Business Statistics
Monash University, Caulfield Campus, Melbourne
Tel: +61 03 9903 2102 Fax: +61 03 9903 2007

Jerry Dallal

unread,

Mar 14, 2001, 8:14:37 PM3/14/01

to

dennis roberts (d...@PSU.EDU) wrote:

: it would only be unethical if a better alternative were available ... or

: even a possibly better alternative were available ... and the investigator
: or the one making the decision to give or not to give ... KNOWS this ...
: AND HAS the ability to give this treatment to the patient ... and does NOT
: do it

Anyone proposing a 1-tailed test is doing *exactly* this.

: because a treatment might be known to be better, through a logical

: deductive process or experimentation ... or potentially better ... does NOT
: lead to unethical practice if this treatment is not adopted ...

I would disagree. Once it's known to be better, everything else is
sub-standard.

: implementations of treatments have consequences ... other than impact of

: treatments ... there are COSTS ASSOCIATED WITH TREATMENTS and these costs
: have to be weighed in from a cost/benefit perspective (maybe even take into
: account IF the public WANTS this to be done) ... it is irresponsible NOT to
: take other things into consideration

I'm waiting for the thank you from Professor Rubin. FINALLY, people are
starting to take all possible consequences of their decisions into
consideration! I confess. I can't recall ever having factored costs into
account when performing a significance test. Maybe it's the nature of the
work I do.

Herman Rubin

unread,

Mar 14, 2001, 8:14:45 PM3/14/01

to

In article <ahvuatobk5taovols...@4ax.com>,

RD <al...@mail.ru> wrote:
>On 13 Mar 2001 16:32:15 -0500, hru...@odds.stat.purdue.edu (Herman
>Rubin) wrote:

>>In article <33ssatgue6q2iqdlm...@4ax.com>,
>>RD <al...@mail.ru> wrote:
>>>On 13 Mar 2001 07:12:33 -0800, d...@PSU.EDU (dennis roberts) wrote:

>>>>1. some test statistics are naturally (the way they work anyway) ONE sided
>>>>with respect to retain/reject decisions

>>>>example: chi square test for independence ... we reject ONLY when chi
>>>>square is LARGER than some CV ... to put a CV at the lower end of the
>>>>relevant chi square distribution makes no sense

>>>Hmm... do not want to start flame war but just can not go by such HUGE
>>>misconception about chi squared test. Indeed exactly reverse is true :
>>>chi squred test is always two tailed. There is nothing to prove just
>>>look at the definition : Khi^2(n)=sum(Z^2).

>>There is a way of looking at the chi-squared test otherwise.

>>In fact, a low chi-squared would constitute a question of
>>whether what purport to be random numbers really are.

>What do you exactly mean by that?

Suppose one has random numbers supposedly independent
and equally likely to be any of 0, 1, ..., k-1. Then
the chi-squared statistic has, for large sample size n,
approximately a chi-squared distribution with k-1 df.
Now suppose that the numbers are not independent, but
the process producing them tends very strongly to even
the proportions out. the statistic will tend to be much
smaller. In the extreme case in which the numbers are
produced in sets of k, with one of each in each set,
then none of the discrepancies between observed and
predicted is as large as 1.

Jerry Dallal

unread,

Mar 14, 2001, 8:26:29 PM3/14/01

to

Alan McLean (alan....@buseco.monash.edu.au) wrote:

: There is certainly an argument that when trialling a new treatment (I

: initially used the word 'testing' here, but figure that it may be
: confused with the statistical test of the resultant data) it is
: presumably expected to work. Consequently, if a person in the trial is
: given a placebo, there is a clear expectation that he or she is being
: disadvantaged - given an inferior treatment.

Placebo controlled trials are unethical and often illegal when placebo
means withholding standard medical care.

: On the other hand, if a placebo is not used, the results of the trial

: will be unclear. This will presumably disadvantage Society. The ethical
: choice is then between disadvantaging a number of individuals by giving
: them a treatment which is expected to be inadequate (rather than a
: treatment which is expected to be better - but may not be!) and
: disadvantaging society by reducing the increase in knowledge - which is
: expected to advantage many people in the future.

Controls need not be placebos. The usual control is standard medical
care. Standard medical care changes. For example, it olden times (like
10 years ago) you could do a placebo controlled trial in subjects with
cholesterols of 240 mg/dl. Today, they get referred to their physicians!

: This is certainly an ethical question (though I might argue that neither

: choice is unethical if the choice is made ethically!) But I don't see
: how the type of statistical test done in analysing the resultant data
: can be ethical or not.

A one-tailed test (as usually proposed) presumes that the difference, if
there is one, can only be it a specified direction.
This violates the principle of equipoise.

Rich Ulrich

unread,

Mar 16, 2001, 4:14:23 PM3/16/01

to

Sides? Tails?

There are hypotheses that are one- or two-sided.
There are distributions (like the t) that are sometimes
folded over, in order report "two tails" worth of p-level
for the amount of the extreme.

I don't like to write about these, because it is so easy
to be careless and write it wrong -- there is not an official
terminology.

On Thu, 15 Mar 2001 14:29:04 GMT, Jerry Dallal
<gda...@hnrc.tufts.edu> wrote:

> We don't really disagree. Any apparent disagreement is probably due
> to the abbreviated kind of discussion that takes place in Usenet.
> See http://www.tufts.edu/~gdallal/onesided.htm
>
> Alan McLean (alan....@buseco.monash.edu.au) wrote:
>
> > My point however is still true - that the person who receives
> > the control treatment is presumably getting an inferior treatment. You
> > certainly don't test a new treatment if you think it is worse than
> > nothing, or worse than current treatments!
>
> Equipoise demands the investigator be uncertain of the direction.
> The problem with one-tailed tests is that they imply the irrelevance
> of differences in a particular direction. I've yet to meet the
> researcher who is willing to say they are irrelevant regardless of
> what they might be.
[ ... ]

"Equipoise"? I'm not familiar with that as a principle, though I
would guess....

When I was taught testing, I was taught that using *one* tail
of a distribution is what is statistically intelligible, or natural.
Adding together the opposite extremes of the CDF, as with a
"two-tailed t-test," is an arbitrary act. It seems to be justified
or explained by pointing to the relation between tests on
two means, t^2 = F. Is that explanation enough?

Technically speaking (as I was taught, and as it still
seems to me), there is nothing wrong with electing to
take 4.5% from one tail, and 0.5% from the other tail.
Someone has complained about this: that is "really"
what some experimenters do. They say they plan a
one-tailed t- test of a one-sided hypothesis. However,
they do not *dismiss* a big effect in the wrong direction,
but they want to apply different values to it. I say, This
does make sense, if you set up the tests like I just said.

That is: I ask, What is believable?
Yes, to a 4.4% test (for instance) in the expected direction.
No, to a test of 2% or 1% or so, in the other direction;
- but: Pay attention, if it is EXTREME enough.

Notice, you can take out a 0.1% test and leave the main
test as 4.9%, which is not effectively different from 5%.

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

Jerry Dallal

unread,

Mar 16, 2001, 6:40:07 PM3/16/01

to

Rich Ulrich (wpi...@pitt.edu) wrote:

: Notice, you can take out a 0.1% test and leave the main

: test as 4.9%, which is not effectively different from 5%.

I've no problem with having different probabilities in the
two tails as long as they're specified up front. I say
so on my web page about 1-sided tests. I have concerns about
getting investigators to settle on anything other than
equal tails, but that's a separate issue.
The thing I've found interesting about
this thread is that everyone who seems to be defending
one-tailed tests is proposing something other than a
standard one-tailed test!

FWIW, for large samples, 0.1% in the unexpected tail
corresponds to a t statistic of 3.09. I'd love to
be a fly on the wall while someone is explaining to
a client why that t = 3.00 is non-significant! :-)

dennis roberts

unread,

Mar 16, 2001, 11:32:40 PM3/16/01

to

At 04:14 PM 3/16/01 -0500, Rich Ulrich wrote:
>Sides? Tails?
>
>There are hypotheses that are one- or two-sided.
>There are distributions (like the t) that are sometimes
>folded over, in order report "two tails" worth of p-level
>for the amount of the extreme.

seems to me when you fold over (say) a t distribution ... you don't have a
t distribution anymore ... mighten you have a chi square if before you fold
it over you square the values?

unless there is something really funny about a distribution that i have
been unable to identify in a picture ... all of them have two ends ...
tails ... whether they stretch out alot or ... bunch up on the left like
chisquare 1

a test STATISTIC is not a distribution ... so, we need to keep what the
test STATISTIC does ... how it works ... APART from some distribution ...
which it might follow

all i know is that there seems to be considerably confusion/differential
use ... call it whatever but ... our terminology on this one is NOT clear ...

especially when we relate the test statistic ... the statistical
distribution ... AND the null/and research hypothesis we might have in some
particular investigation

i was hoping that our list might help reduce this confusion ... by
advancing some more specific uses of terms

Alan McLean

unread,

Mar 15, 2001, 12:07:53 AM3/15/01

to

Jerry Dallal wrote:
>
> Alan McLean (alan....@buseco.monash.edu.au) wrote:
>
> : There is certainly an argument that when trialling a new treatment (I
> : initially used the word 'testing' here, but figure that it may be
> : confused with the statistical test of the resultant data) it is
> : presumably expected to work. Consequently, if a person in the trial is
> : given a placebo, there is a clear expectation that he or she is being
> : disadvantaged - given an inferior treatment.
>
> Placebo controlled trials are unethical and often illegal when placebo
> means withholding standard medical care.
>
> : On the other hand, if a placebo is not used, the results of the trial
> : will be unclear. This will presumably disadvantage Society. The ethical
> : choice is then between disadvantaging a number of individuals by giving
> : them a treatment which is expected to be inadequate (rather than a
> : treatment which is expected to be better - but may not be!) and
> : disadvantaging society by reducing the increase in knowledge - which is
> : expected to advantage many people in the future.
>
> Controls need not be placebos. The usual control is standard medical
> care. Standard medical care changes. For example, it olden times (like
> 10 years ago) you could do a placebo controlled trial in subjects with
> cholesterols of 240 mg/dl. Today, they get referred to their physicians!

Hi Jerry,

I was using 'placebo' as an example - although not a medical researcher,
I realise that the control is in some sense or other the 'not different'
treatment. My point however is still true - that the person who receives

the control treatment is presumably getting an inferior treatment. You
certainly don't test a new treatment if you think it is worse than
nothing, or worse than current treatments!

As far as I can make out, this is in line with what you were saying
anyway.

>
> : This is certainly an ethical question (though I might argue that neither
> : choice is unethical if the choice is made ethically!) But I don't see
> : how the type of statistical test done in analysing the resultant data
> : can be ethical or not.
>
> A one-tailed test (as usually proposed) presumes that the difference, if
> there is one, can only be it a specified direction.
> This violates the principle of equipoise.
>

I am not sure what you mean by the principle of equipoise. However, my
statement that the lack of ethics, if it is there, is in giving the
control treatment to a person, not in the statistical test, remains
true.

A one tailed test does not 'presume a difference can only be in a
specified direction'. What it does is to consider only differences in
one direction to be significant. If I am trialling a new treatment, and
I measure some variable such that the mean mu of that variable is
positive if the treatment 'works'. By 'works' I mean that this new
treatment is better than past ones, or that it does something different
from current ones, or whatever is relevant. If there is a control group,
mu will be the difference in means.

For the sample data I compute xbar (the difference of sample means if
there is a control group). There are three possibilities.

1. xbar is negative
2. xbar is positive, but small
3. xbar is positive, and large

Given that we would normally only trial a treatment if we expected it to
improve things, either 2 or 3 is likely to be the result. But 1 can
happen!

If 1 does happen, we would conclude either that the new treatment is no
better than the control, and may be worse. In either case we junk the
new treatment.

If 2 happens, we would conclude that the new treatment is no better than
the control (so we might as well stick with current practice - and junk
the new treatment.

If 3 happens, we would conclude that the new treatment is better than
the control - so it might well replace current treatments (maybe after a
lot more testing....)

Nowhere in this statistical testing is there an ethical issue. (Except,
if you like, that it is ethically correct to require the new treatment
to be shown to be better than the current treatment before it is
accepted - that is, take the 'no better' as the null hypothesis.)

Regards,
Alan

Thom Baguley

unread,

Mar 15, 2001, 10:00:40 AM3/15/01

to

jim clark wrote:
> The chi^2 distribution is equivalent to the z distribution
> "folded over" so that both negative and positive tails of z are
> in the upper (i.e., positive) tail of chi^2. The same
> relationship holds between t and F. As we saw recently on this
> (or another stats list), there is much confusion between
> "one-tailed" in the sense of a directional test (which concerns
> the direction of differences or correlations) and "one-tailed" in
> the narrower sense of tail of distribution (e.g., chi^2). These
> uses are _not_ equivalent. Perhaps less confusing if we use
> "directional" or some other term besides "one-tailed" for the
> first sense.

Yes, especially if the discussion has extended from the t to other tests and distributions!

Thom

Jerry Dallal

unread,

Mar 15, 2001, 9:29:04 AM3/15/01

to

We don't really disagree. Any apparent disagreement is probably due
to the abbreviated kind of discussion that takes place in Usenet.
See http://www.tufts.edu/~gdallal/onesided.htm

Alan McLean (alan....@buseco.monash.edu.au) wrote:

> My point however is still true - that the person who receives
> the control treatment is presumably getting an inferior treatment. You
> certainly don't test a new treatment if you think it is worse than
> nothing, or worse than current treatments!

Equipoise demands the investigator be uncertain of the direction.

The problem with one-tailed tests is that they imply the irrelevance
of differences in a particular direction. I've yet to meet the
researcher who is willing to say they are irrelevant regardless of
what they might be.

> For the sample data I compute xbar (the difference of sample means if

> there is a control group). There are three possibilities.
>
> 1. xbar is negative

> If 1 does happen, we would conclude either that the new treatment is no

> better than the control, and may be worse. In either case we junk the
> new treatment.

The question is, do you look to see how much worse? If the answer
is no, then I've no argument. But everyone looks. It's unethical not
to!

--Jerry

Alan McLean

unread,

Mar 15, 2001, 5:55:00 PM3/15/01

to

I agree that it's the detail about which we disagree! However, one
detail is pretty important - I still think you are confusing the trial
and the statistical test. The same confusion is shown on the web site.

I agree totally that if the treatment appears to be significantly worse
than the control treatment (as in your last paragraph below, and as you
illustrate with an example on the web page) you have to do something
about it. But - this 'something' is quite different from the 'something'
you do if you conclude that the treatment is significantly better than
the control.

In essence, you are setting up a second question - that is, a second
pair of hypotheses. The primary question is: Is the new treatment better
than the control? (This has to be the primary question in most such
research - it would certainly be unethical to trial a treatment that you
think is worse than the control.) The secondary question is: Is the new
treatment worse than the control?

Actually the secondary question is: If the new treatment is no better,
is it worse than the control?

I concede that you can view these two questions as one, but I think that
that is confusing and (therefore) not good design.

Regards,
Alan

Jerry Dallal

unread,

Mar 16, 2001, 8:12:23 AM3/16/01

to

I've thought about your proposal. Pages of mathematics with sups over
composite parameter spaces reduce to this: The two-stage procedure is
equivalent to a two-sided test. That is, from his/her behavior, it would
be impossible to tell whether someone was acting according to your
proposed two-stage procedure or a usual two-sided test. In both cases,
the ultimate outcome (whether it's done in one step or two) is that an
action is taken whether the difference between treatment and control is
large positive or large negative.

Alan McLean (alan....@buseco.monash.edu.au) wrote:

: I agree totally that if the treatment appears to be significantly worse

Vit Drga

unread,

Mar 18, 2001, 3:43:28 PM3/18/01

to

On Fri, 16 Mar 2001 23:40:07 -0000, gda...@world.std.com (Jerry
Dallal) wrote:

>FWIW, for large samples, 0.1% in the unexpected tail
>corresponds to a t statistic of 3.09. I'd love to
>be a fly on the wall while someone is explaining to
>a client why that t = 3.00 is non-significant! :-)

What if you had an effect that when it does happen is pretty obvious
(e.g. H_1 results in a std t-distn mean-shifted to mean = 10)? An
observed t-value of 3 may be statistically significant at the 0.1%
level and yet should still count as evidence for the null hypothesis
rather than against it. But, of course, in situations like that there
is no need to run a statistical test...

Vit D.

Rich Ulrich

unread,

Mar 19, 2001, 5:16:11 PM3/19/01

to

On 16 Mar 2001 20:32:40 -0800, d...@PSU.EDU (dennis roberts) wrote:

[ ... ]

> seems to me when you fold over (say) a t distribution ... you don't have a
> t distribution anymore ... mighten you have a chi square if before you fold
> it over you square the values?

[ ... snip, rest ]

You are forgetting? normal z^2 is chi^squared.

And t^2 with xxx degrees of freedom, is equal to F(1,xxx) d.f.

--
Rich U.
http://www.pitt.edu/~wpilib/index.html

Rich Ulrich

unread,

Mar 28, 2001, 5:23:23 PM3/28/01

to

- I finally get back to this topic -

On Fri, 16 Mar 2001 23:40:07 GMT, gda...@world.std.com (Jerry Dallal)
wrote:

= concerning the 5.1% solution; asymmetrical testing
with 0.05 as a one-sided, nominal level of significance,
and 0.001 as the other side (as a precaution).

Jerry,
In that last line, you are jumping to a conclusion.
Aren't you jumping to a conclusion?

If the Investigator was seriously headed toward a 1-sided
test -- which (as I imagine it) is how it must have been, that
he could have been talked-around to the prospect of a 5.1%
combined test instead -- then he won't be eager to
jump on t= 3.00 as significant.

I mean, it can be easier to "publish" if you pass magic size,
but it is easier to avoid "perishing" in the long run, with a series
of connected hypotheses.

I think of the Investigator as torn three ways.

a) Stick to the plan; ignore the t=3.0, which is *not quite* 0.001.
'It did not reach the previously stated, 0.001 nominal level, and I
still don't believe it. (And I don't want to furnish ammunition for
arguments for other things.)' Practically speaking, the risk of
earning blame for stonewalling like that is not high.

b) Run with it; claim that a two-sided test always *did* make
sense and the statistician was to blame for brain-fever, for
wanting 1-tailed in the first place. (Or, never mention it.)
The fly on the wall probably would not see this.
The statistician should have already quit.

c) Report the outcome in the same diffident style as would have
been earned by a 0.06 result in the other direction, "not quite
meeting the preset nominal test size, but it is suggestive."
Unlike the 6% result, this one is unwelcome.

T=3.00 will stir up investigation to try to undermine the implication
(such as it is).

I have trouble taking the imagined outcome much further without
speculating about where you have the trade-off between effect-size
and N; and whether the "experimental design" was thoroughly robust --
and there's a different slant to the arguments if you are explaining
or explaining-away the results of uncontrolled observation.