My students have TI-83 or TI-84 calculators, which can make normal probability plots. The idea is to test whether the data (probably) came from a normal distribution; the closer the plot is to a straight line, the more likely that they did.
But it can be hard with a small sample to see whether the line is straight. Minitab (which we don't have) plots boundary curves, and if the points are all inside those bounds then we say that the data were (probably) normal.
1. I've done quite a lot of Googling, but have been unable to discover how Minitab computes those bounds. Can anyone state clearly how they are computed?
2. Is there any theoretical justification for the bounds that Minitab computes?
3. Some authors instead suggest computing the correlation coefficient of the plot, and comparing it to a critical value. If the correlation coefficient is below the critical value, we reject the hypothesis of normality. The trouble is that different authors give different critical values. Two examples are at http://www.itl.nist.gov/div898/handbook/eda/section3/eda3676.htm and http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/normal_probability_plots.pdf (on page 6).
I *think* that they are giving critical values for the same computation, but are they? Ryan and Joiner (1976), the second reference, say that their critical values come from Monte Carlo simulations; the NIST (first reference) refers to simulations by Filliben and Devaney. How is one to choose which to use (if either)?
-- Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com Shikata ga nai...
Hi Stan. I don't have an answer to your question. I'm just wondering *why* you want to test for normality. As George Box said,
“…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” (JASA, 1976, Vol. 71, 791-799)
So if you're working with real data (as opposed to simulated data), the population is *not* normally distributed.
<the_stan_br...@fastmail.fm> wrote:
>My students have TI-83 or TI-84 calculators, which can make normal probability plots. The idea is to test whether the data (probably) came from a normal distribution; the closer the plot is to a straight line, the more likely that they did.
>But it can be hard with a small sample to see whether the line is straight. Minitab (which we don't have) plots boundary curves, and if the points are all inside those bounds then we say that the data were (probably) normal.
>1. I've done quite a lot of Googling, but have been unable to discover how Minitab computes those bounds. Can anyone state clearly how they are computed?
>2. Is there any theoretical justification for the bounds that Minitab computes?
>3. Some authors instead suggest computing the correlation coefficient of the plot, and comparing it to a critical value.
That sounds to me like the essence of the Shapiro-Wilk statistic for normality. That's a test that has a very good reputation for its overall generality and power.
>If the correlation coefficient is below the critical value, we reject the hypothesis of normality. The trouble is that different authors give different critical values. Two examples are at http://www.itl.nist.gov/div898/handbook/eda/section3/eda3676.htm and http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/normal_probability_plots.pdf (on page 6).
>I *think* that they are giving critical values for the same computation, but are they? Ryan and Joiner (1976), the second reference, say that their critical values come from Monte Carlo simulations; the NIST (first reference) refers to simulations by Filliben and Devaney. How is one to choose which to use (if either)?
I might trust the name-fame of NIST over Minitab, speaking as a person who knows very little about either.
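On whether the two tables are for the same computation: as far as I can tell, the two references do not plot against quite the same normal scores. My understanding is that Ryan and Joiner use Blom-style plotting positions, (i - 3/8)/(n + 1/4), while the Filliben version on the NIST page uses approximate medians of the uniform order statistics. Here is a minimal Python sketch of both, so the two correlation coefficients can be compared on the same data. The function names and the sample values are made up for illustration, and the formulas should be checked against the references before anyone relies on them.

```python
from math import sqrt
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Ordinary Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def blom_positions(n):
    """Blom-style plotting positions (assumed to match Ryan and Joiner)."""
    return [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]


def filliben_positions(n):
    """Approximate uniform order statistic medians (assumed to match Filliben)."""
    p = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    p[-1] = 0.5 ** (1.0 / n)   # largest observation
    p[0] = 1.0 - p[-1]         # smallest observation
    return p


def ppcc(data, positions):
    """Correlation between the sorted data and the matching normal scores."""
    ys = sorted(data)
    zs = [NormalDist().inv_cdf(p) for p in positions(len(ys))]
    return pearson_r(zs, ys)


# Made-up sample, for illustration only.
sample = [4.1, 5.3, 5.9, 6.2, 6.8, 7.4, 7.7, 8.5, 9.0, 10.6]
r_blom = ppcc(sample, blom_positions)
r_fill = ppcc(sample, filliben_positions)
print(f"Blom positions:     r = {r_blom:.4f}")
print(f"Filliben positions: r = {r_fill:.4f}")
```

If the two values come out very close, the tables are answering nearly the same question, and small differences between the published critical values could plausibly come from the plotting positions plus simulation noise.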
The problem with any simulation is the same factor that creates the gain: the usefulness depends on whether the simulated set of alternatives is a match for your data. However, I do not find a reference for evaluating the S-W test except for the original S-W 1965 paper (which, I now guess, used simulations). Simulations in 1965 used smaller Ns.
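To make the simulation idea concrete, here is a rough Python sketch of how a Monte Carlo critical value for the plot correlation could be produced: draw many samples of size n from a normal distribution, compute the correlation for each, and take the lower alpha quantile. This is only a guess at the general recipe, not the procedure that either reference actually followed. The Blom-style plotting positions, the replication count, the seed, and all the function names are assumptions of mine.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def normal_scores(n):
    """Normal scores from Blom-style plotting positions (an assumption)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


def plot_correlation(sorted_data, scores):
    """Pearson correlation between sorted data and normal scores."""
    mx, my = mean(scores), mean(sorted_data)
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, sorted_data))
    sxx = sum((x - mx) ** 2 for x in scores)
    syy = sum((y - my) ** 2 for y in sorted_data)
    return sxy / sqrt(sxx * syy)


def simulated_critical_value(n, alpha=0.05, reps=5000, seed=1):
    """Lower alpha quantile of the correlation under true normality."""
    rng = random.Random(seed)
    scores = normal_scores(n)
    rs = sorted(
        plot_correlation(sorted(rng.gauss(0.0, 1.0) for _ in range(n)), scores)
        for _ in range(reps)
    )
    return rs[int(alpha * reps)]


crit_10 = simulated_critical_value(10)
print(f"n = 10, alpha = 0.05: reject normality if r < {crit_10:.3f}")
```

Different seeds and replication counts move the result a little, which may be part of why published tables disagree in the third decimal place.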
The Wikipedia page on tests of normality includes tests and authorities that I'm not familiar with, but S-W is still rated high.
Agreed that no real data are ever exactly normally distributed, or at least that an exact normal distribution would warrant a very hard look. I think the real criterion is "close enough to normal that we can use a normal approximation without throwing the p-value off by very much."
Regarding Shapiro-Wilk, when I was googling before posting, it looked like a different computation. Instead of correlating the order statistics with normal scores, as a normal probability plot does, it uses coefficients generated from the "means, variances, and covariances" of the order statistics, according to NIST. But given Rich's remark about power, I guess I should get hold of S-W's paper and look into their method in more detail.
-- Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com Shikata ga nai...
> Agreed that no real data are ever exactly normally distributed, or at
> least that an exact normal distribution would warrant a very hard
> look. I think the real criterion is "close enough to normal that we
> can use a normal approximation without throwing the p-value off by
> very much."
--- snip ---
And of course, in linear models (including t-tests, ANOVA, etc), it is the *errors* that are assumed to be normal, not the dependent variable.
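A small made-up illustration of that point, as a Python sketch: two groups with different means and idealized normal errors. The pooled dependent variable is bimodal and gives a poor normal-plot correlation, while the residuals from the group means look fine. The data are constructed for the example (the errors are simply normal scores), and the Blom-style plotting positions and the function names are my assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean


def normal_scores(n):
    """Normal scores from Blom-style plotting positions (an assumption)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


def plot_correlation(data):
    """Correlation between the sorted data and the matching normal scores."""
    ys = sorted(data)
    zs = normal_scores(len(ys))
    mz, my = mean(zs), mean(ys)
    szy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
    szz = sum((z - mz) ** 2 for z in zs)
    syy = sum((y - my) ** 2 for y in ys)
    return szy / sqrt(szz * syy)


# Idealized errors: the normal scores themselves (constructed, not real data).
errors = normal_scores(10)
group_a = [10 + e for e in errors]   # group mean about 10
group_b = [30 + e for e in errors]   # group mean about 30

y = group_a + group_b                # pooled dependent variable
residuals = ([v - mean(group_a) for v in group_a]
             + [v - mean(group_b) for v in group_b])

r_raw = plot_correlation(y)
r_resid = plot_correlation(residuals)
print(f"dependent variable: r = {r_raw:.3f}")
print(f"residuals:          r = {r_resid:.3f}")
```

So a normality check on the raw dependent variable can fail badly even when the model's error assumption is perfectly reasonable.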