I have read that the sample size can affect the significance of
correlation. Can someone explain why this is so? Thanks.
Statistical significance = effect size x sample size
e.g., in concrete form
t = d x (df^0.5)/2 ... (for independent t using Cohen's d)
... as d or df increases t (and statistical significance) increases ...
This is taken straight from Robert Rosenthal's book chapter in the
Handbook of Research Synthesis.
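A minimal sketch of that relation in Python, not taken from Rosenthal's
chapter. The function name t_from_d is made up for illustration, and I am
assuming the equal-group-size case, which as far as I know is where
t = d x sqrt(df)/2 applies. It only shows how t moves with df when d is
held fixed:

```python
import math

def t_from_d(d, df):
    """t for an independent-groups t test, assuming equal group sizes:
    t = d * sqrt(df) / 2  (hypothetical helper, for illustration only)."""
    return d * math.sqrt(df) / 2

# Same effect size (d = 0.5), growing degrees of freedom.
for df in (18, 98, 398):
    print(f"d = 0.5, df = {df:>3}: t = {t_from_d(0.5, df):.2f}")
```

Quadrupling df doubles t for the same d, so t depends on the square root
of the sample size rather than on the sample size itself.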
>Statistical significance = effect size x sample size
A far more correct statement is that the "statistical
significance" is approximately a function of the effect
size multiplied by the square root of the sample size.
Even this is not a good guide as to what action should
be taken. To get that, one should look upon the
problem as a decision problem.
>e.g., in concrete form
>t = d x (df^0.5)/2 ... (for independent t using Cohen's d)
>... as d or df increases t (and statistical significance) increases ...
>This is taken straight from Robert Rosenthal's book chapter in the
>Handbook of Research Synthesis.
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558
That's certainly true, but I think the spirit of Rosenthal's conceptual
equation is to remind/educate researchers that statistical significance
is a function of both effect size and sample size.
Both are incorrectly stated because of the undefined
nature of "statistical significance". (See below)
This question/answer was given in the thread in June:
" Do the critical values of linear correlation depend on sample size?"
In the above article in that thread, my stated result was
r is significant (two-tailed) if
RF> |R|* sqrt((n-2)/(1 - R*R)) > t(1-alpha/2;(n-2)).
RF> or equivalently, if |R| > t /(sqrt((n-2) + t*t))
where t is the critical value at alpha/2 for t with (n-2) df.
Since sqrt((n-2) + t*t) is approximately sqrt(n) for large n,
an easy mnemonic device (using the asymptotic approx,)
is to think of the standard error of r as 1/sqrt(n).
Thus, the r is statistically significant at the 95% level if
|r| > 2/sqrt(100) = 0.2 if n = 100
and |r| > 2/sqrt(10000) = 0.02 if n = 10,000
and so on.
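As a rough check of the two forms of the test and of the 2/sqrt(n)
mnemonic, here is a small Python sketch. The function names are made up
for illustration, and the critical t values are assumed inputs read from
a two-tailed 5% t table (about 1.984 for 98 df and about 1.960 for
9998 df), not computed by the code:

```python
import math

def t_from_r(r, n):
    """t statistic for a correlation r from n pairs (n - 2 df)."""
    return abs(r) * math.sqrt((n - 2) / (1 - r * r))

def critical_r(n, t_crit):
    """Smallest |r| that reaches t_crit with n - 2 df:
    |r| = t / sqrt((n - 2) + t*t)."""
    return t_crit / math.sqrt((n - 2) + t_crit * t_crit)

def rule_of_thumb_r(n):
    """Large-sample mnemonic: two standard errors, SE(r) ~ 1/sqrt(n)."""
    return 2 / math.sqrt(n)

# t_crit values are assumed table look-ups (two-tailed, alpha = 0.05).
for n, t_crit in ((100, 1.984), (10_000, 1.960)):
    exact = critical_r(n, t_crit)
    approx = rule_of_thumb_r(n)
    print(f"n = {n:>6}: critical |r| = {exact:.4f}, 2/sqrt(n) = {approx:.4f}")
```

Feeding critical_r(n, t) back into t_from_r returns t, which is just the
equivalence of the two inequalities stated above. The mnemonic comes out
slightly larger than the exact critical value, so it errs on the
conservative side.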
> Even this is not a good guide as to what action should
> be taken. To get that, one should look upon the
> problem as a decision problem.
It is always a bad idea to make any decision based on the
value OR significance of ANY correlation coefficient!
I have shown elsewhere that a highly significant r , with
p value smaller than 0.0001 say, could be a completely
USELESS result in a regression problem. (The actual
example was the SPSS Manual data re-analyzed).
RF> In the case of a regression, testing for the significance of R is
RF> ALWAYS the WRONG QUESTION. John Tukey said something
RF> to the effect that using R is to sweep the data under the rug
RF> with a vengeance.
I'm a big fan of Tukey - especially about correlation coefficients, but
I doubt he'd have agreed "It is always a bad idea to make any decision
based on the value OR significance of ANY correlation coefficient"
given that he doesn't come across in his writing as dogmatic. As a
trivial counter-argument when r = 0 or 1 it is often possible to make
sensible decisions (though they might be trivial in some cases). I
think one could reasonably extend this to values very close to 1 or 0
in many cases.
You're quite right in a sense that I'll be glad to elaborate.
First, people have always ridiculed my use of CAPS for emphasis.
In my paraphrase of what Tukey said, I did NOT emphasize the
word "always", but "ANY" and "value OR significance".
Second, it is unfortunate (in more ways than one) that Tukey is no
longer with us, to be able to say whether he would have agreed
with my statement of what he said.
I have absolutely no doubt that I correctly conveyed his message,
though I am not quite sure if that was what he SAID (in person),
in the research conference sponsored by NSF, held in the late
1970s, or was in something WRITTEN, whether in one of his
publications or in one of his unpublished notes (plenty of them
handed out in said conference).
One thing I distinctly remembered, when the conference started,
was that it was going to be the FIRST, in any of Tukey's conferences
of that kind (for researchers in academia) that the material in the
5-day conference would be published in the form of a Proceedings.
That "proceedings" turned out to be another one of the rumors
about Tukey's material to be in print. What the participants
received in print, instead, were pre-publication galley prints of
his book on Data Analysis, and the regression book he co-authored
with Fred Mosteller.
On the serious side of that comment, I am absolutely sure that the
quoted phrase "sweeping under the rug with a vengeance" was
correct, because I could hardly come up with such a powerful
phrase myself. :-)
I have already indicated the two extremes (near 0 and near 1) at which
a statistically "significant" correlation could be completely USELESS.
r = 0.02 is statistically highly significant for a sample of size
(That's the main answer to the SUBJECT of this thread). By
increasing the sample size to 100,000, the 0.02 correlation
will be even more significant with a much smaller p-value.
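To put rough numbers on this, here is a hedged Python sketch. It assumes
n = 10,000 (the size used with the 2/sqrt(n) rule earlier in the thread)
and n = 100,000, and it uses a normal approximation to the t
distribution for the p-value, which should be adequate at these sample
sizes. The function names are made up for illustration:

```python
import math

def t_from_r(r, n):
    """t statistic for a correlation r from n pairs (n - 2 df)."""
    return abs(r) * math.sqrt((n - 2) / (1 - r * r))

def approx_two_sided_p(r, n):
    """Two-sided p-value from a normal approximation to t.
    P(|Z| > z) = erfc(z / sqrt(2)); an approximation, not the exact t tail."""
    return math.erfc(t_from_r(r, n) / math.sqrt(2))

r = 0.02
for n in (10_000, 100_000):  # assumed sample sizes
    print(f"n = {n:>7}: t = {t_from_r(r, n):.2f}, "
          f"p ~ {approx_two_sided_p(r, n):.2g}, r^2 = {r * r:.4f}")
```

The p-value shrinks as n grows, while r^2 stays at 0.0004 whatever the
sample size. The significance changes; the strength of the relationship
does not.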
On the "near one" side of the uselessness, I think the example
in the 1975 SPSS Manual is hard to beat, as an ACTUAL example,
with real data, and published! The Multiple R was so statistically
significant that the writer of that example completely overlooked
two HUGE typos in the data; and when those typos were
corrected, that multiple regression was even more significant.
The balloon was punctured by me in MY Data Analysis Lecture
Notes that the SPSS multiple regression did a WORSE job than
my SIMPLE regression with only one of the three independent
variables, and that ALL of the fitted models were useless for
any practical purposes.
Details of that example (even with participation by several
readers in sci.stat.math for trying out their hands at the data)
can be found in the archives.
So, I stand on my understanding and paraphrased quote of
Tukey, because one can easily extrapolate from the two
examples above that a correlation coefficient of 0.002
could be highly statistically significant for a very large sample;
and that a correlation of 0.999 could be just as spurious or
useless as the SPSS Manual example.
> As a
> trivial counter-argument when r = 0 or 1 it is often possible to make
> sensible decisions (though they might be trivial in some cases).
These two cases of "exactly 0" and "exactly 1" never came up
in any of Tukey's or my discussion of correlations, and I am not
prepared to argue about these pedantically trivial and pathological
cases, other than saying that would be like arguing how small
is "epsilon" or how many angels can dance on the tip of a pin.
> think one could reasonably extend this to values very close to 1 or 0
> in many cases.
To which I would strongly disagree. The examples I've seen and
used above were "very close" (by ordinary standards of the
meaning of the term) to 1 or 0, and yet they are both clearly USELESS.
Thanks for your "devil's advocate" challenge of my statement of
Tukey's view, which allowed me to elaborate on several aspects
of the quoted statement, as well as re-iterating on the actual
USELESS examples of correlations very close to 0 or 1.
-- Reef Fish Bob.
In Las Vegas pondering on how small the House has "epsilon"
advantage on me in a game of Blackjack, given the rules of
blackjack in Vegas, and the lack of opportunity in card counting
in single-deck deals; as well as the not-barred card counting
play in 6-8 deck dealer-shuffled (as opposed to machine or
continuous shuffle) shoes.
``Like the late Charles P. Winsor, a statistician far ahead of his
time, I find the use of a correlation coefficient a dangerous symptom.''
J.W. Tukey (1969)
BTW Tukey was a member of Charles Winsor's Society for the Suppression
of the Correlation Coefficient.
See Brillinger (2001). "John Tukey and the correlation coefficient",
Computing Science and Statistics, 33.
I can't speak for Tukey, but I don't think so.
In ALL multiple regression, the coefficient of Xj is r times sy (std.
dev. of y) divided by the std dev. of Xj, where r is either the
Pearson r for simple regression or the partial correlation between
Y and Xj given the other X's in the equation.
Standardized or not, it doesn't take a genius to convert the
unstandardized coef. to a standardized one or vice versa. But
the point remains that the correlations (simple or partial) are
themselves rather useless in the sense already discussed.
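For the simple-regression case only, the relation between the slope and
r can be checked with a short Python sketch. The data are made up for
illustration and the function name is hypothetical; the multiple
regression case, with partial correlations, is not covered here:

```python
import math

def slope_two_ways(x, y):
    """Return (least-squares slope, r * sy / sx, r) for simple regression."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    slope_ls = sxy / sxx                 # usual least-squares slope
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation
    sx = math.sqrt(sxx / (n - 1))        # sample std. dev. of x
    sy = math.sqrt(syy / (n - 1))        # sample std. dev. of y
    return slope_ls, r * sy / sx, r

# Made-up illustrative data, not from any source in this thread.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
slope_ls, slope_from_r, r = slope_two_ways(x, y)
print(f"least squares: {slope_ls:.6f}, r*sy/sx: {slope_from_r:.6f}, r = {r:.4f}")
```

The two slopes agree up to rounding, so going between the
unstandardized coefficient and r is only a matter of rescaling by sy/sx.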
I was a bit careless in not stating explicitly the "partial standard
deviation" in the case of multiple regression coefficients. The detail about "all
simple, partial, and multiple correlations are SIMPLE correlations"
can be seen in several of my posts earlier this year, including :
I thought I had better include that reference because of my
ambiguous use of the "std dev. of Xj" in the preceding post.