how to deal with rare events with chi-squared test


k claffy

Feb 5, 1993, 6:02:35 PM
help.

i'm trying to do a chi-squared
goodness of fit test with many bins
[like, thousands], and i have lots of
bins with zeros in them.
since the chi-square test doesn't like
having expected bin counts < 5, i'm wondering
how i can deal with this [there's
no easy way for me to collapse my bins;
they are totally independent. there's
also no easy way to turn this into
some Kolmogorov-Smirnov test, as the
bins are not really ordered in such
a way that any kind of 'distribution'
emerges.]

i guess my basic problem is i don't
know what people do when they really
need a chi-squared goodness-of-fit test
when they're gonna have empty bins.

can anyone help me?

e-mail much appreciated,
k

Herman Rubin

Feb 6, 1993, 10:33:24 AM

The chi-square test is not a good test in most situations, anyhow.
Even the approximation to the chi-square distribution is not that
good.

To come up with something reasonable in this situation requires that
the real problem be formulated without regard to statistical procedures.
Only then can a reasonable statistical procedure be found.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317)494-6054
hru...@snap.stat.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!snap.stat!hrubin(UUCP)

Anders Holtsberg

Feb 8, 1993, 4:58:37 AM

>i'm trying to do a chi-squared
>goodness of fit test with many bins
>[like, thousands], and i have lots of
>bins with zeros in them.
>since the chi-square test doesn't like
>having expected bin counts < 5, i'm wondering
>how i can deal with this


The following brute force method isn't bad.
Compute the chi-square statistic for your
outcome. Write a program that simulates tables
with the same margins. Let the computer work overnight.
In the morning, see how many tables have been
simulated and how many got a chi-square statistic
worse than the one for your observed table, and there
you have a perfectly good p-value.

I believe that the program StatXact can do this if you don't
want to write the simulation code yourself.
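
For the goodness-of-fit setting in the original post (thousands of bins, fixed
total count), a rough sketch of this simulation idea might look like the
following. This is only an illustration, not StatXact's algorithm; it assumes
the null hypothesis specifies a probability for every bin (called p0 here),
and the names are made up for the example.

import numpy as np

def monte_carlo_chi2_pvalue(observed, p0, n_sim=10000, seed=None):
    # Monte Carlo p-value for Pearson's chi-square goodness-of-fit test.
    # observed : observed bin counts (zeros are fine)
    # p0       : hypothesized bin probabilities under the null (must sum to 1)
    # n_sim    : number of simulated multinomial samples
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    n = int(observed.sum())
    expected = n * p0

    def chi2_stat(counts):
        # Pearson's statistic; fine as long as every entry of p0 is positive.
        return np.sum((counts - expected) ** 2 / expected, axis=-1)

    stat_obs = chi2_stat(observed)
    # Simulate bin counts under the null, keeping the same total count n.
    sims = rng.multinomial(n, p0, size=n_sim)
    stats = chi2_stat(sims)
    # Fraction of simulated statistics at least as large as the observed one;
    # the +1 keeps the estimated p-value away from exactly zero.
    return (np.sum(stats >= stat_obs) + 1) / (n_sim + 1)

# Example: 1000 equally likely bins, only 500 observations, so most bins are empty.
rng = np.random.default_rng(0)
p0 = np.full(1000, 1.0 / 1000)
observed = rng.multinomial(500, p0)
print(monte_carlo_chi2_pvalue(observed, p0, n_sim=20000, seed=1))

The reference distribution here comes from the simulation itself, so the usual
"expected counts >= 5" rule of thumb for the chi-square approximation is not needed.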

..............................................
: Anders Holtsberg :..
: Department of Mathematical Statistics : :
: Lund Institute of Technology : :
: Box 118 : :
: S-221 00 Lund, Sweden : :
:............................................: :
:............................................:

Clark D. Thomborson

Feb 9, 1993, 2:38:14 PM

In article <C219v...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:

>>i'm trying to do a chi-squared
>>goodness of fit test with many bins
>>[like, thousands], and i have lots of
>>bins with zeros in them.

>The chi-square test is not a good test in most situations, anyhow.
>Even the approximation to the chi-square distribution is not that
>good.

I've looked at this problem intensively, working from a STAT101
background and my research experience in theoretical computer science.

My summary reading of the current literature: few mathematical
statisticians "believe" in hypothesis testing in the first place,
and chi-square testing is almost beneath contempt. On the other hand,
most non-experts behave as though Pearson's old chi-square test has
some utility. It's in most textbooks and statistical software, after
all, rarely with as much as a hint of its manifold defects.

I believe it's possible to make a good argument for Pearson's
chi-square test as a first step in analyzing whether a dataset "fits"
a hypothesis of a symmetric multinomial distribution. Non-symmetric
distributions are especially tricky. Zelterman [JASA 82:398, pp.
624-629, 1987] gives some practical ideas on how to handle the sparse
non-symmetric case in the context of 2-D contingency tables. I
can offer an unpublished manuscript, which I'm in the process of
revising, describing how you can calculate an approximate confidence
interval for the sparse symmetric case. You can retrieve my
manuscript, along with C-language source code, by anonymous ftp from
theory.lcs.mit.edu in directory pub/cthombor/Mrandom.

Anders Holtsberg's suggestion to you (of making a simulation study) is
another good method for studying the behavior of the chi-square
statistic under your null hypothesis.

Bayesians have made, repeatedly, the point that it doesn't make much
sense to "reject" a null hypothesis in favor of an unspecified
"non-null" hypothesis. This is, of course, what a chi-squared test
purports to do, and this is the fundamental difficulty with hypothesis
testing in general.

Best wishes.

--
Clark

Luke Whitaker

Feb 10, 1993, 5:47:19 AM

>Bayesians have made, repeatedly, the point that it doesn't make much
>sense to "reject" a null hypothesis in favor of an unspecified
>"non-null" hypothesis. This is, of course, what a chi-squared test
>purports to do, and this is the fundamental difficulty with hypothesis
>testing in general.

Could you expand on that a bit - why doesn't it make sense to reject a
null hypothesis? I work in medical statistics and take a pragmatic
approach - say, for example, I have two samples with xbar = 50 and 60,
both with SEM 1; isn't it reasonable to reject H0 of equal means?
(assuming all the usual (normal!) assumptions - or is this the problem?)
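
Just to make the numbers in that example concrete (a back-of-the-envelope
check, not part of the original question):

from math import sqrt
from scipy.stats import norm

xbar1, xbar2 = 50.0, 60.0   # the two sample means
sem1, sem2 = 1.0, 1.0       # standard error of each mean

# z statistic for the difference in means, treating the SEMs as known
z = (xbar2 - xbar1) / sqrt(sem1**2 + sem2**2)   # about 7.07
p = 2 * norm.sf(abs(z))                         # two-sided p-value, roughly 1.5e-12
print(z, p)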

Of course real life is never this simple anyway!

Luke

Herman Rubin

Feb 10, 1993, 3:19:39 PM

There are very few cases in which the point null hypothesis can be
EXACTLY true, so that the question is not about rejection but about
acceptance! However, if one puts positive probability on the truth
of the null, a Bayesian approach would correspond to significance
levels changing drastically with sample size; this is not surprising
as with a given significance level, the type II risk is decreasing
as the sample size increases. Sethuraman and I treated this problem
in 1965; the results appear in _Sankhya_.

But that is not the real question; the question is whether one should
act as if the null hypothesis is close enough to true; for sample sizes
small enough that the width of the parameter region in which one would
want to accept is small compared to the standard deviation of the usual
estimate, the problem becomes roughly as above. But if the sample size
gets larger, the form of the prior becomes more critical, and very
simple forms give different solutions.
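
A small illustration of the sample-size point (a sketch of the standard power
calculation, not taken from the Rubin-Sethuraman paper): at a fixed
significance level, the power of a one-sided z-test against even a
practically negligible deviation from the point null climbs toward 1 as the
sample size grows, so a difference too small to matter is eventually declared
"significant".

from math import sqrt
from scipy.stats import norm

alpha = 0.05
delta = 0.01    # a deviation from the point null far too small to matter in practice
sigma = 1.0
z_alpha = norm.isf(alpha)   # one-sided critical value, about 1.645

for n in (100, 10000, 1000000, 100000000):
    # Power of the one-sided z-test of H0: mu = 0 when the true mean is delta.
    power = norm.sf(z_alpha - delta * sqrt(n) / sigma)
    print(n, round(power, 4))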

Hume Winzar

Feb 11, 1993, 1:06:53 PM
In article <1993Feb8.0...@lth.se> and...@maths.lth.se (Anders Holtsberg) writes:

>The following brute force method isn't bad.
>Compute the chi-square statistic for your
>outcome. Write a program that simulates tables
>with the same margins. Let the computer work overnight.
>In the morning, see how many tables have been
>simulated and how many got a chi-square statistic
>worse than the one for your observed table, and there
>you have a perfectly good p-value.

>I believe that the program StatXact can do this if you don't
>want to write the simulation code yourself.

This sounds like the kind of program I need for a set of simulations, and
I'd just as soon not have to learn to write my own code. I was going to set
up some routines using SHAZAM, which I have used in the past with some
econometric and regression problems, but I'm sure there must be some less
clunky packages around.

Do readers have details of StatXact and its capabilities, and its publisher/
distributors (especially in Australia)? Price could be an issue too.

Thanks in advance.

- - - - - - - - -
| _--_|\ | Hume Winzar
| / \ | Commerce School,
| *_.--._/ | Murdoch University,
| v | Perth, Western Australia
- - - - - - - - -
E_Mail win...@csuvax1.csu.murdoch.edu.au
Phone: (09) 310 7389
Fax: (09) 310 5004

Jerry Dallal

Feb 12, 1993, 11:27:03 AM
In article <C291s...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> There are very few cases in which the point null hypothesis can be
> EXACTLY true, so that the question is not about rejection but about
> acceptance! However, if one puts positive probability on the truth
> of the null, a Bayesian approach would correspond to significance
> levels changing drastically with sample size; this is not surprising
> as with a given significance level, the type II risk is decreasing
> as the sample size increases. Sethuraman and I treated this problem
> in 1965; the results appear in _Sankhya_.
>

Another good article is Edwards, Lindman, and Savage (1963), Psychological
Review, 193-242.
