Why use the chi-square test at all? Can you justify that the expected
counts in sufficiently many of the cells
are above 5, or did you simply look at your data and see that this was the
case (the latter is not appropriate)? How can you justify that
your data are normally distributed? Do you just assume this, or do you
trust the law of large numbers without knowing how large
the sample has to be? Did you test for normality?
Don't get me wrong, I am genuinely interested in how you assemble the
assumptions that lead you to choose this test.
Perhaps - or maybe even definitely - I am missing something.
Best, Karl
--
---------------------------------------------------------------------
Karl Schlag
Professor Tel: +34 93 542 1493
Department of Economics and Business Fax: +34 93 542 1746
Universitat Pompeu Fabra email: karl....@upf.edu
Ramon Trias Fargas 25-27 www.iue.it/Personal/Schlag/
Barcelona 08005, Spain room: 20-221 Jaume I
http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8838.html
BW,
Martin
----- Original Message -----
From: "Frank" <f.ha...@vanderbilt.edu>
To: "MedStats" <MedS...@googlegroups.com>
Sent: Saturday, April 18, 2009 2:19 PM
Subject: {MEDSTATS} Re: Corrected vs uncorrected chi square values
One cannot recommend appropriate methods without knowing about the design, the data, and the intent of the researcher.
Dr. Paul R. Swank,
Professor and Director of Research
Children's Learning Institute
University of Texas Health Science Center-Houston
the final recommendation in the forwarded posting was "use the 'N-1' chi-square
test when all expected frequencies are at least 1". Is there a reference
that justifies this formally? It seems that the discussion posted is of
the "heuristic" type, arguing that the rule "works well". I belong to the
statistics camp that prefers p values that are correct (i.e., "exact").
Note that I see no practical value in the condition "expected
frequencies are at least 1", since this can be true for all cells and yet,
with positive probability, each column can contain a zero entry,
in which case the practitioner would not use the test.
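Karl's point can be made concrete with a small sketch (the numbers below are invented for illustration, not from any dataset in this thread): with two groups of n = 10 and a common success probability of 0.1, every expected success count equals exactly 1, so the "at least 1" rule holds, yet the probability that both groups show zero successes, putting a zero in each column of the 2x2 table, is far from negligible.

```python
# Illustrative numbers only: two groups of n = 10, true success
# probability p = 0.1 in each group (both invented for this sketch).
n, p = 10, 0.1

# Expected count in each "success" cell is n * p = 1, so the rule
# "all expected frequencies at least 1" is satisfied.
expected_success_count = n * p

# Probability that BOTH groups observe zero successes, i.e. that each
# column of the 2x2 table contains a zero entry.
prob_both_zero = (1 - p) ** n * (1 - p) ** n

print(expected_success_count)    # 1.0
print(round(prob_both_zero, 3))  # 0.122
```

So roughly one table in eight would be one the practitioner refuses to test, even though the rule's condition is met.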
I agree with Paul's posting that it is difficult to make a general
recommendation without knowing the application. I am more modest: I am
interested in the existence of an application, where normality is not
assumed, in which the chi-square test is formally justified for the given finite
sample. Note that I do not think this is possible, but I hope to be
proven wrong by seeing a formal reference.
Best, Karl
Universitat Pompeu Fabra email: karl....@upf.edu
In "Practical Statistics for Medical Research", under #10.6.8, Doug Altman
states the following under "Sample Size":
"The guidelines, attributed to the statistician W.G. Cochran, are that 80%
of the cells in the table should have expected frequencies greater than 5,
and all cells should have expected frequencies greater than 1." No reference
is given, and I have not been able to tie this down to one
reference... maybe Doug/Frank could help. The text does not state what size
of table these guidelines apply to.
But I think the point is... it's not just that all cells should have
expected frequencies greater than 1, but also that (80% of) the expected
frequencies should be greater than 5. I think the latter requirement
addresses the normality assumption issue.
Looking back at the specific part of Frank's notes that you refer to, he
doesn't include this latter requirement, but then he simply says, "use the
'N-1' chi-square test when all expected frequencies are at least 1,
otherwise use the Fisher-Irwin test...snip", which is not controversial, is
it? Except maybe for 2x2 tables, where "80%" is an odd figure to apply. I
am left wondering: with 2x2 tables, is there an additional guideline beyond "all
cells should have expected frequencies greater than 1" when choosing to
perform a chi-square test? Is it that all four cells should have expected
frequencies greater than 5?
I took responsibility for answering you, Karl, because it was I who
forwarded the posting in question and I started off thinking that I had an
answer for you...but there does seem to be a gap.
Comments, anybody?
There has been no response, so I've looked into it some more. Specifically,
to answer your question: is there a paper supporting "no cell having an
expected frequency less than 1" as the sole criterion for performing a
chi-square test on a 2x2 table?
The abstract for the paper Frank referenced can be found here:
http://www3.interscience.wiley.com/journal/114125487/abstract?CRETRY=1&SRETRY=0
It reads, in part: "The optimum test policy was found to be analysis by the
N-1 chi-squared test when the minimum expected number is at least 1, and
otherwise, by the Fisher-Irwin test by Irwin's rule (taking the total
probability of tables in either tail that are as likely as, or less likely
than the one observed.) This policy was found to have increased power
compared to Cochran's recommendations."
It states that, "Computer-intensive techniques were used in this study to
compare seven two-sided tests of two-by-two tables in terms of their Type I
errors."
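For readers who want to see the recommended policy in code, here is a minimal sketch (my own, not Campbell's program; the example table is invented). It applies the 'N-1' chi-square when the minimum expected count is at least 1 and otherwise falls back to the Fisher-Irwin test; SciPy's two-sided `fisher_exact` follows Irwin's rule of summing the probabilities of tables as likely as, or less likely than, the one observed.

```python
import numpy as np
from scipy.stats import chi2, fisher_exact

def two_by_two_test(table):
    """Sketch of Campbell's policy: 'N-1' chi-square if min expected
    count >= 1, otherwise the Fisher-Irwin test (Irwin's rule)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / N.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    if expected.min() >= 1:
        pearson = ((table - expected) ** 2 / expected).sum()
        stat = pearson * (n - 1) / n        # the 'N-1' correction
        return "N-1 chi-square", chi2.sf(stat, df=1)
    return "Fisher-Irwin", fisher_exact(table)[1]

# Invented example table: rows are groups, columns success / failure.
print(two_by_two_test([[10, 5], [4, 12]]))
```

Note this sketch still reports the chi-square approximation's p value; Karl's later point in the thread is that the p value of the *whole* two-branch procedure is what formally matters.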
Ian Campbell, who wrote the paper, has an excellent site at:
http://www.iancampbell.co.uk/twobytwo/background.htm
and here, just above "Versions of the Fisher-Irwin's test", it says: "It
seems that little work has been done to investigate whether this rule can be
relaxed, although in the related problems of tests of goodness-of-fit and
contingency tables with more than one degree of freedom, it seems that a
minimum expected number as low as 1 can be allowed (Cochran, 1942)."
Cochran, W.G. (1942) The chi-square correction for continuity. Iowa State
Coll. Jour. Sci., 16, 421-436.
So it would appear that there are at least two papers that address this. I
admit to being surprised, because there is a plethora of guidelines just on
how to use "5", and on which test to use... Ian Campbell says discussion over
the chi-square test has been going on for over 100 years.
I'll leave it to you to assess whether the papers come up to scratch in
terms of your standards :)
Best Wishes,
Martin
----- Original Message -----
From: "Karl Schlag" <karl....@upf.edu>
To: <MedS...@googlegroups.com>
Sent: Monday, April 20, 2009 8:03 AM
The difference between theory and heuristics is clearest when it is
stated that a test for nominal size 0.05 is not unacceptable if the true
p values do not lie above 0.06. To a theorist, a p value makes a
statement that is not simply an approximation, so here I interpret the
paper as investigating tests that have level 0.06.
Suissa and Shuster (JRSS 1985) already pointed out the value of exact
testing in this context. With modern software it is even simple to
implement. The exact version of the test proposed by Campbell (N-1 chi-square)
is identical to that of the chi-square, which is identical to the
exact version of the Z test with pooled variance.
Note that parts of Figure 2 in Campbell (2007) are misleading. The
important graph in this figure is the one that shows the rejection
probability for all tables that have an expected count above 1. So what
does the statistician do if the expected count is below 1? The paper
recommends using the Fisher-Irwin test by Irwin's rule. So the recommended
test consists of two tests; in particular, the FI test will also
sometimes wrongly reject the null hypothesis. Hence the figure has to be
adjusted upward. The authors only mention in 3.4 that a formal inclusion
of using the FI test when the expected count is below 1 "does not lead
to an unacceptable level" of the combined test. What does "unacceptable"
mean here formally? In short, the graph in Figure 2 does not show the
level of the suggested test, as it ignores what the statistician does
when the expected count is below 1.
The power comparisons are not quite fair, as tests with the same level
are not being compared. Yes, the N-1 version rejects more often than
four other tests. But the Pearson chi-square has an even larger critical
region, so by their own argument the Pearson chi-square is better
than the N-1 version they are trying to sell. Moreover, the power
comparisons do not formally reflect the suggested procedure, as one needs
to include the option of reverting to the FI test when the expected count is <1
when comparing critical regions. Or did they include the FI test?
A theorist's recommendation for a simple test: use the exact version of the
chi-squared test when the sample is more or less balanced; use the test
of Boschloo otherwise; in either case, select between the two using power
analyses. This is very simple to compare.
So much for now, to keep the feedback quick. Thanks again for the
reference, Karl
Thank you very much for taking the time to give this feedback. Off-line,
I've had a lot of interest expressed in this thread, and would be grateful
if you could comment on my queries below, where some help clarifying a
couple of issues would be much appreciated.
On Weds, April 22, 2009 8:00 PM Karl Schlag wrote:
>
> Dear Martin, here my comments on Campbell (Statist. Med. 2007;
> 26:3661–3675) as a theorist. I focus on comparative trials, so where
> equality of binomial proportions coming from two independent samples is to
> be tested as null hypothesis.
>
> The difference between theory and heuristics is clearest when it is stated
> that a test for nominal size 0.05 is not unacceptable if the true p values
> do not lie above 0.06. To a theorist, a p value makes a statement that is
> not simply an approximation, so here I interpret the paper as
> investigating tests that have level 0.06.
Yes, it is stated that Cochran arbitrarily suggested that the actual Type I
error could be allowed to be as high as 0.06, in assessing the tests. Like
you, I would assume that the author followed this, but I note that I cannot
find a positive statement that he did. If he did, the tests/guidelines would
still have been compared under the same conditions, but at a maximum of 0.06
rather than 0.05. The nature of the study would have been the same as if
0.05 had been used. So yes, there is a small deviation (0.06 rather than
0.05), but is this enough to call the study "heuristic"?
>
> Suissa and Shuster (JRSS 1985) already pointed out the value of exact
> testing in this context. With modern software it is even simplee to
> implement. The exact version of the test proposed by Campbell (N-1 chi
> square) is identical to that of the chi square which is identical to the
> exact version of the Z test with pooled variance.
Just to be clear: are you saying that the exact versions of the N-1 and the
N chi-square tests are identical?
>
> Note that parts of Figure 2 in Campbell (2007) are misleading. The
> important graph in this figure is the one that shows the rejection
> probability for all tables that have an expected count above 1. So what
> does the statistician do if the expected count is below 1? The paper
> recommends to use Fisher-Irwin test by Irwin's rule. So the recommended
> test consists of two tests, in particular the FI test will also sometimes
> wrongly reject the null hypothesis. Hence the figure has to be adjusted
> upward. The authors only mention in 3.4 that a formal inclusion of using
> the FI test when the expected count is below 1 "does not lead to an
> unacceptable level" of the combined test. What does "unacceptable" mean
> here formally? In short, the graph in Figure 2 does not show the level of
> the suggested test as it ignores what the statistician does when the
> expected count is below 1.
I would follow this happily if the procedure being followed were
first to do one test and then to do the other. What is actually happening is
to calculate the expected values and then decide which of the tests to do.
Only one test is done, as I understand it. So isn't this OK? If not, the
same criticism applies to the use of any guideline that says, "Check this
assumption before choosing a test", doesn't it? The author does, as you
say, state that this policy results in small increases in the maximum
Type I error rates, but does not state over what... you are left to assume
over not doing the FI test at all if expected frequencies are not over 1. (I
think "unacceptable" would come under the Cochran suggestion mentioned
above and at the start of the paper (that the actual Type I error could be
allowed to be as high as 0.06, in assessing the tests), but it's not 100%
clear.)
>
> The power comparisons are not quite fair as tests with the same level are
> not being compared. Yes, the N-1 version rejects more often than four
> other tests. But the Pearson chi square has an even larger critical
> region. So it seems by their argument that Pearson chi square is better
> than the N-1 version they are trying to sell.
Yes, the author does state this quite clearly. I can't think why N-1 is
recommended over N except for the theoretical argument supporting it.
Moreover, they do not
> formally reflect power comparisons of the suggested tests as one needs to
> include the option of reverting to FI test when expected count <1 when
> comparing critical regions. Or did they include the FI test?
The author includes the FI test for comparison with the other tests. This
relates to my point above, doesn't it? You wouldn't do one test and then
revert to FI; you would look at the expected values and choose which test to
do.
>
> A theorist's recommendation for a simple test: use the exact version of chi
> squared test when the sample is more or less balanced, use the test of
> Boschloo otherwise, in any case select between these two using power
> analyses. This is very simple to compare.
>
> So much for now, to keep the feedback quick. Thanks again for the
> reference, Karl
Thanks very much, Karl.
Best Wishes,
Martin Holt
Martin Holt wrote:
>
> Dear Karl,
>
> Thank you very much for taking the time to give this feedback.
> Off-line, I've had a lot of interest expressed in this thread, and
> would be grateful if you could comment on my queries below, where some
> help clarifying a couple of issues would be much appreciated.
>
> On Weds, April 22, 2009 8:00 PM Karl Schlag wrote:
>
>>
>> Dear Martin, here my comments on Campbell (Statist. Med. 2007;
>> 26:3661–3675) as a theorist. I focus on comparative trials, so where
>> equality of binomial proportions coming from two independent samples
>> is to be tested as null hypothesis.
>>
>> The difference between theory and heuristics is clearest when it is
>> stated that a test for nominal size 0.05 is not unacceptable if the
>> true p values do not lie above 0.06. To a theorist, a p value makes a
>> statement that is not simply an approximation, so here I interpret
>> the paper as investigating tests that have level 0.06.
>
> Yes, it is stated that Cochran arbitrarily suggested that the actual
> Type I error could be allowed to be as high as 0.06, in assessing the
> tests. Like you, I would assume that the author followed this, but I
> note that I cannot find a positive statement that he did. If he did,
> the tests/guidelines would still have been compared under the same
> conditions, but at a maximum of 0.06 rather than 0.05. The nature of
> the study would have been the same as if 0.05 had been used. So yes,
> there is a small deviation (0.06 rather than 0.05), but is this enough
> to call the study "heuristic"?
I use the term heuristic (or "heuristic flavor" if you prefer) to refer
to the fact that concepts are not used in their original definition.
A level of 0.05 does not mean that the p value is below 0.06. One might
instead use the term "misleading" when Figure 2 shows something that is not
the p value of the recommended test. One might use the term "vague" when
the term "not unacceptable" is used in a central part of the paper (end
of the first paragraph in Section 3.4). It is a valuable paper, just not
one written for a formal audience.
>> Suissa and Shuster (JRSS 1985) already pointed out the value of exact
>> testing in this context. With modern software it is even simple to
>> implement. The exact version of the test proposed by Campbell (N-1
>> chi square) is identical to that of the chi square which is identical
>> to the exact version of the Z test with pooled variance.
>
>
> Just to be clear: are you saying that the exact versions of the N-1
> and the N chi-square tests are identical?
Correct. This is because N-1 and N are just multipliers that lead to
different cutoffs but to the same critical regions. For the same reason,
the exact version of the Z test with pooled variance is identical to the
chi-squared test, even though the statistic of the one is the square root
of the other with some additional multipliers.
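This equivalence is easy to check numerically. The sketch below (with made-up counts) verifies that the pooled-variance Z statistic squared equals the Pearson chi-square statistic, and that the 'N-1' statistic differs from it only by the constant factor (N-1)/N, which is why all three order the tables identically and their exact versions coincide.

```python
import numpy as np

# Invented 2x2 counts: rows are groups, columns are success / failure.
a, b = 12, 8
c, d = 5, 15
n1, n2, n = a + b, c + d, a + b + c + d

# Z test for two proportions with pooled variance.
p1, p2, pool = a / n1, c / n2, (a + c) / n
z = (p1 - p2) / np.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))

# Pearson chi-square statistic from the same table.
table = np.array([[a, b], [c, d]], dtype=float)
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
chisq = ((table - expected) ** 2 / expected).sum()

chisq_n1 = chisq * (n - 1) / n  # the 'N-1' statistic

print(np.isclose(z ** 2, chisq))                  # True: Z^2 = chi-square
print(np.isclose(chisq_n1 / chisq, (n - 1) / n))  # True: constant factor
```

Because a monotone transformation of the statistic leaves the ordering of tables, and hence the exact critical region, unchanged, the exact versions of all three tests are the same test.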
The suggested procedure is to look at the minimum expected count and to
apply the N-1 chi-square if the expected number is greater than 1 and to
apply the FI test otherwise. So only one test is applied in the end. But
it is not the p value of that test which has to be calculated, but the p
value of the entire procedure. The authors realize this, although they
hide it in Section 3.4.
Correct. Any guideline that involves looking at the data to determine
which test to use is part of the test procedure and has to enter into the
calculation of the p value. Of course, many or even most ignore this. The
common example is to first test for normality (a so-called pre-test) and
then, if normality is not rejected, apply the t test. Formally one has to
adjust the p values of the t test when using this pre-test, but most do
not make this adjustment.
The statement on "increases in maximum Type I error rates" in Section
3.4 can only be referring to the following procedure: apply the N-1
chi-square if the expected number is above 1; do not reject if the expected
number is below 1. The maximum Type I error of this procedure is the one
plotted in Figure 2 (I did my own calculations to be sure that this is in
fact true).
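The kind of calculation Karl describes can be sketched by full enumeration (my own sketch, with illustrative group sizes and null proportion, not the paper's figures): sum the null probabilities of all tables where the policy "N-1 chi-square if minimum expected count >= 1, otherwise do not reject" rejects at the nominal level.

```python
import numpy as np
from scipy.stats import binom, chi2

def exact_size(n1, n2, p, alpha=0.05):
    """Exact Type I error of: reject iff min expected count >= 1 AND the
    'N-1' chi-square statistic exceeds the nominal critical value."""
    crit = chi2.ppf(1 - alpha, df=1)
    n = n1 + n2
    size = 0.0
    for x1 in range(n1 + 1):          # successes in group 1
        for x2 in range(n2 + 1):      # successes in group 2
            table = np.array([[x1, n1 - x1], [x2, n2 - x2]], dtype=float)
            expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
            if expected.min() < 1:
                continue              # the procedure does not reject here
            stat = ((table - expected) ** 2 / expected).sum() * (n - 1) / n
            if stat > crit:
                size += binom.pmf(x1, n1, p) * binom.pmf(x2, n2, p)
    return size

# Illustrative only: balanced groups of 10, common null proportion 0.3.
print(round(exact_size(10, 10, 0.3), 4))
```

Repeating this over a grid of null proportions and sample sizes gives the kind of maximum-Type-I-error curve that Figure 2 plots.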
>> The power comparisons are not quite fair as tests with the same level
>> are not being compared. Yes, the N-1 version rejects more often than
>> four other tests. But the Pearson chi square has an even larger
>> critical region. So it seems by their argument that Pearson chi
>> square is better than the N-1 version they are trying to sell.
>
>
> Yes, the author does state this quite clearly. I can't think why N-1
> is recommended over N except for the theoretical argument supporting it.
The reason why N-1 is recommended over N is that the version with N has
a Type I error sometimes above 0.06, as shown in Figure 2 (look at the
dotted line associated with cutoff 1).
> Moreover, they do not
>> formally reflect power comparisons of the suggested tests as one
>> needs to include the option of reverting to FI test when expected
>> count <1 when comparing critical regions. Or did they include the FI
>> test?
>
> The author includes the FI test for comparison with the other tests.
> This relates to my point above, doesn't it? You wouldn't do one test
> and then revert to FI; you would look at the expected values and
> choose which test to do.
Correct; as written above, you do either the N-1 test or the FI test,
depending on the expected count.
>>
>> A theorist's recommendation for a simple test: use the exact version
>> of chi squared test when the sample is more or less balanced, use the
>> test of Boschloo otherwise, in any case select between these two
>> using power analyses. This is very simple to compare.
>>
>> So much for now, to keep the feedback quick. Thanks again for the
>> reference, Karl
>
> Thanks very much, Karl.
>
> Best Wishes,
> Martin Holt
--
I find Bayesian statistics very valuable, in particular when the
environments are complex. However, they do not "Let the data speak!", as
their inference depends on the prior.
For simple settings, like comparing binomial proportions of two
independent samples, one can get solutions that are excellent without
worrying about loose ends. For instance, consider the case of a balanced
sample of 20 observations each. It is hard to argue against using the
exact version of the chi-squared test.
The difference between the Type II error of this test and the
theoretically possible minimal Type II error is 0.034 for testing
mean1>=mean2 against mean1<=mean2+theta at level 0.05 for all theta>0,
provided the theoretically possible minimal Type II error is below 0.8.
In other words, nobody can construct a test that outperforms the
exact version of the chi-squared test in a balanced sample of 20
observations each in terms of power by more than 0.034, unless the null
and the alternative hypothesis are so close that any test has a Type II
error above 0.8.
Karl
> Frank
>
> ...
Best Regards,
Martin
----- Original Message -----
From: "Karl Schlag" <karl....@upf.edu>
To: <MedS...@googlegroups.com>
Sent: Friday, April 24, 2009 10:09 AM
Subject: {MEDSTATS} Re: Corrected vs uncorrected chi square values