Do the critical values of linear correlation depend on sample size?


A Guy

Jun 9, 2005, 7:58:52 PM
I have been having an argument about this.

For this question, we're assuming linear regression is run on n data
points. We get a correlation of R^2, where R is the slope of the best
fit line with normalized (mean of 0, standard deviation of 1)
coordinates.

If you think that the 95% confidence critical value is always an R^2 of
95%, you should choose the NO option.

If you think that the critical value of the R^2 correlation depends on
how many points n you have, like in this chart, then you should choose
the YES option. The critical values are listed in terms of |R|, BTW. Of
course |R| = sqrt(R^2).
http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htm

Choose YES or NO or INVALID QUESTION

Jerry Dallal

Jun 9, 2005, 7:43:58 PM
The question is not perfectly posed, but the answer to what I believe
you are intending to ask is 'yes'. The more data, the better you have
estimated the parameter of interest, the tighter the confidence
interval, and the less variable the sample quantity will be around the
corresponding population value.

BTW, the critical value you speak of is usually the value needed to
reject a null hypothesis of a population value of 0. If the underlying
value is 0, the more data, the harder it is for the sample value to
exceed any 0<k<1.

Unless I missed something...

Data Matter

Jun 9, 2005, 9:50:57 PM

What do you mean by "the 95% confidence critical value is always an R^2
of 95%"? What does "R^2 of 95%" mean? I have never heard of such a
thing.

Reef Fish

Jun 10, 2005, 12:20:03 AM

A Guy wrote:
> I have been having an argument about this.
>
> For this question, we're assuming linear regression is run on n data
> points. We get a correlation of R^2, where R is the slope of the best
> fit line with normalized (mean of 0, standard deviation of 1)
> coordinates.

Just say your sample correlation between X and Y is R.

You want to test the hypothesis Ho: rho = 0, and you want to know,
if you test it at some alpha level, how large |R| must be before Ho
can be rejected at that alpha level in a two-tailed test.

The answer depends on the sample size n and alpha.

The critical values of |R| are given in the table in the link you gave:

> http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htm

For example, at alpha = 0.05 and various df (n-2) your critical values
are

df        20    40    60    80   100

|R| >   .423  .304  .250  .217  .195

e.g., if n = 102 (df = 100), you reject if |R| is greater than .195,
and if n = 22 (df = 20), you need a correlation coefficient of
|R| > .423 before you can reject.

This is WHY. If (X,Y) comes from a bivariate normal, then under the
null hyp. of rho = 0, the statistic R* sqrt((n-2)/(1-R*R)) has a
T distribution with (n-2) df. Thus, the null hypothesis is rejected if

|R| * sqrt((n-2)/(1 - R*R)) > t(1-alpha/2; n-2),

or equivalently, if |R| > t / sqrt((n-2) + t*t).

Values of the right-hand side above, for various combinations
of alpha and d.f. (= n-2), are given in the web link.
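
For readers who want the step between the two forms of the rejection
rule spelled out, here is a short derivation sketch. It assumes
t = t(1-alpha/2; n-2) > 0 and R^2 < 1, so squaring both sides and
multiplying by (1 - R^2) preserve the inequality.

```latex
\begin{align*}
|R|\sqrt{\frac{n-2}{1-R^2}} > t
  &\iff \frac{R^2\,(n-2)}{1-R^2} > t^2 \\
  &\iff R^2\,(n-2) > t^2\,(1-R^2) \\
  &\iff R^2\,\bigl[(n-2) + t^2\bigr] > t^2 \\
  &\iff |R| > \frac{t}{\sqrt{(n-2) + t^2}}
\end{align*}
```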

Exercise: verify any of the values in the table by the formula above
---------------------------------------------------------------------
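
One possible way to do the exercise is sketched below in Python,
using only the standard library. The helper names (t_cdf, t_quantile,
critical_r) are made up for this illustration, and the t quantile is
obtained by crude numerical integration plus bisection rather than by
a statistics package, so treat it as a check on the table and not as
a reference implementation.

```python
import math

def t_cdf(x, df, steps=1000):
    """Student-t CDF for x >= 0, via Simpson's rule on [0, x] (steps even)."""
    norm = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    norm /= math.sqrt(df * math.pi)

    def pdf(u):
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, by bisection on the CDF.

    The search interval [0, 50] is an assumption that is ample for the
    df and alpha values used here, but not for extreme cases.
    """
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def critical_r(df, alpha=0.05):
    """Two-tailed critical |R| for Ho: rho = 0, i.e. t / sqrt(df + t*t)."""
    t = t_quantile(1 - alpha / 2, df)
    return t / math.sqrt(df + t * t)

for df in (20, 40, 60, 80, 100):
    print(df, round(critical_r(df), 3))
```

If the sketch is right, the printed values should agree with the
alpha = 0.05 row quoted above (.423, .304, .250, .217, .195) to three
decimals.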

Having given you the solution, I should caution that George Box said
in his (1978 JASA) paper "it is better to get an approximate answer
to the right question than an exact answer to the wrong question".

In the case of a regression, testing for the significance of R is
ALWAYS the WRONG QUESTION. John Tukey said something to the effect
that using R is sweeping the data under the rug with a vengeance.

A statistically significant R may be utterly useless if the sample
size is large, and the relation between X and Y may look like a
shot gun blast. For n=10,000 it takes only a correlation of .02
for it to be statistically significant.

Use prediction intervals and other means to look for the PRACTICAL
significance of any regression result.

A final note is an easy and useless result for large samples.
If the sample size is large, the APPROXIMATE critical value for |R|
is Z/sqrt(n), because T --> Z, and (n-2) + Z*Z ---> n (approx).

For example, for Z = 1.96 and n = 10,000, the critical value
is approx. 1.96/100 = 0.0196 or 0.02.

The exact critical value is 1.9602/(9998 + 1.9602**2) = 0.01960.

If you're good at doing square roots of large numbers in your head
you can estimate the critical value of R for large samples to be
2/n for alpha .05, and win a bar bet or impress some friends.

-- Bob.

Reef Fish

Jun 11, 2005, 9:29:11 PM

I found two fairly obvious typos.


>
> The exact critical value is 1.9602/(9998 + 1.9602**2) = 0.01960.

The denominator should be sqrt(9998 + 1.9602^2).


>
> If you're good at doing square roots of large numbers in your head
> you can estimate the critical value of R for large samples to be
> 2/n for alpha .05, and win a bar bet or impress some friends.

2/n should be 2/sqrt(n).
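
With both corrections applied, the large-sample arithmetic can be
checked with a few lines of Python. This is only a sketch: the
function names are invented for the illustration, the normal quantile
comes from the standard library's statistics.NormalDist, and the
value t = 1.9602 for 9998 df is taken from the post rather than
recomputed.

```python
import math
from statistics import NormalDist

def approx_critical_r(n, alpha=0.05):
    """Large-sample approximation Z / sqrt(n) for the critical |R|."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / math.sqrt(n)

def critical_r_from_t(n, t):
    """Corrected exact form t / sqrt((n - 2) + t*t) for a given t quantile."""
    return t / math.sqrt((n - 2) + t * t)

n = 10_000
print(round(approx_critical_r(n), 5))          # Z / sqrt(n)
print(round(critical_r_from_t(n, 1.9602), 5))  # t / sqrt(9998 + t^2)
print(2 / math.sqrt(n))                        # the "bar bet" rule 2 / sqrt(n)
```

All three numbers should land at about 0.02, which is the point of
the approximation: for n = 10,000 the rule of thumb 2/sqrt(n) is off
only in the fourth decimal place.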

glenb...@geocities.com

Jun 14, 2005, 2:49:16 AM

This post seems to confuse the r-squared statistic (expressed as a
percentage) with (1-alpha, also expressed as a percentage) for some
hypothesis test.

Reef Fish

Jun 14, 2005, 8:15:30 AM

At first I thought your "This post" referred to the post by the OP,
"A Guy" who expressed his question poorly, as has already been noted by
Jerry Dallal and Data Matter. But my post was the only one that
mentioned alpha or 1-alpha, so you must have meant my post.

BTW, it would be helpful in your future posts if you can identify
the post to which you refer and be more specific about your comment,
relative to the post to which you followed up.

In this case, the confusion is all YOURS.


Given the TABLE provided by "A Guy" for his ill-posed question, I
merely asked the appropriate question for him for which the table
provided the answers, as well as providing the THEORY from which
the values in the table were computed.

R, the "multiple correlation coefficient", is the same as |r|, the
absolute value of the Pearson correlation, when there is a single
regressor.

The R^2 statistic is NOT a percent, but can be expressed as a percent
by multiplying by 100. It is the percent of variation of Y FITTED
(not explained -- r or R doesn't explain anything) by the X.
Specifically, R^2 = Regression SS / Corrected Total SS
                  = 1 - Residual SS / Corrected Total SS.
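
As a small numerical illustration of the two expressions for R^2
(and of R^2 equalling the squared Pearson r when there is one
regressor), here is a Python sketch. The function name and the data
are invented for the example and do not come from the thread.

```python
import math

def r_squared_three_ways(x, y):
    """Return (Regression SS / Total SS, 1 - Residual SS / Total SS, r**2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)  # corrected total SS
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Least-squares line for a single regressor.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]

    regression_ss = sum((fi - mean_y) ** 2 for fi in fitted)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return regression_ss / syy, 1 - residual_ss / syy, r * r

# Made-up illustrative data, not from the thread.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(r_squared_three_ways(x, y))
```

The three returned numbers should agree up to floating-point
rounding, which is all the identity above claims.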

This is YOUR (Glen Barnett's) first confusion.

Next, you are confused about the ALPHA given in the web page TABLE.

This was how the web page explained on "how to use this table"

#> Once you have learned the correlation coefficient (r) for your
#> sample, you need to determine what the likelihood is that the
#> relationship you found in your sample really exist in the
#> population or were your results a fluke? -- OR -- In the case
#> of a t-test, did the difference between the two means in your
#> sample occurred by chance and not really exist in your population.

which of course is the verbose way of saying, for the correlation r
case, that alpha is the significance level for testing Ho: rho = 0.

Alpha is a PROBABILITY. It can be expressed as a percent by
multiplying by 100, so I equated alpha to the level of significance
p (a fraction) given in the table for a two-tailed test of the
hypothesis Ho: rho = 0.


The THEORY I provided was derived from standard textbook material
on how r or R can be tested, though I did not have to refer to any
textbook.

The approximation results are in my unpublished lecture notes on
Testing Correlation Coefficients, which emphasized

RF> In the case of a regression, testing for the significance of R is
RF> ALWAYS the WRONG QUESTION. John Tukey said something to the effect
RF> that using R is sweeping the data under the rug with a vengeance.

which I also included in my post.


Now you can go unconfuse yourself by picking up an elementary textbook
on the subject. My explanation in THIS post should help those who
might be somewhat confused, but not as confused as you were, as to
where EVERYTHING came from, in my post:

http://groups-beta.google.com/group/sci.stat.math/msg/601db8302f0f2b2a?hl=en

-- Bob.

Jerry Dallal

Jun 14, 2005, 12:08:07 PM
Reef Fish wrote:
>
> The R^2 statistic is NOT a percent, but can be expressed as a percent
> by multiplying by 100. It is the percent of variation of Y FITTED
> (not explained -- r or R doesn't explain anything) by the X.

I have no problem parading my ignorance. It's one of the ways I
learn. Often some kind soul takes pity...

Of course, I always tell my students and colleagues that the regression
model doesn't "explain" anything and that the *only* thing "explained"
has going for it is that it is pithy and entrenched.

I've tried "reduction in variability". I never liked "accounted for"
because it's too close to "explained". I'd been using variations on the
words "predicted by". But, "Fitted by"! Just the phrase I've been
looking for for the past quarter century!

Richard Ulrich

Jun 14, 2005, 12:32:33 PM
On 14 Jun 2005 05:15:30 -0700, "Reef Fish"
<Large_Nass...@Yahoo.com> wrote:

>
> glenb...@geocities.com wrote:
> > This post seems to confuse the r-squared statistic (expressed as a
> > percentage) with (1-alpha, also expressed as a percentage) for some
> > hypothesis test.
>
> At first I thought your "This post" referred to the post by the OP,
> "A Guy" who expressed his question poorly, as has already been noted by
> Jerry Dallal and Data Matter. But my post was the only one that
> mentioned alpha or 1-alpha, so you must have meant my post.

Bob is reading badly again. Or remembering wrong.
Neither Jerry nor Data Matter hit the point directly.
The OP confused the two. He wrote,

"If you think that the 95% confidence critical value is always an R^2
of 95%, you should choose the NO option."

My newsreader shows by outline form that Glen's post
was a "Reply" to the OP. That's what it makes sense as, too.

>
> BTW, it would be helpful in your future posts if you can identify
> the post to which you refer and be more specific about your comment,
> relative to the post to which you followed up.

Yes, citing a bit can be helpful.

[snip, gratuitous abuse of Glen, and explanations of R]

--
Rich Ulrich, wpi...@Pitt.edu
http://www.pitt.edu/~wpilib/index.html

Reef Fish

Jun 14, 2005, 1:25:28 PM

Glad to hear someone who thinks the distinction is important.

For OVER a quarter of a century, I've been lecturing AGAINST the
sloppy language used by many textbooks AND statistical packages
that term R^2 as the "percent of variation explained by the
regression," on one misdemeanor and one felony charge. :-)

Sloppy language promotes sloppy (and worse, WRONG) ideas.
The use of "explain" for "fitted" is the felony that inadvertently
but nearly always encourages the unwary user to think that
somehow the FITTED model actually "explained" some phenomenon
between the dependent variable and its regressors or independent
variables.

The misdemeanor is the "percent". It is of course the fraction or
the proportion, but not the "percent". After the Miami politicians
passed in their legislature a law to reduce the real estate property
tax to 0.5% (when they intended to mean 50%) of the previous value,
to the delight of the tax payers I am sure, their red faces made
national news when they had to repeal what they passed immediately,
for their faux pas in the careless use of the word "percent".

In my own unpublished Lecture Notes (since 1970), I have used
"proportion (or fraction) of the variation FITTED by the regression"
as my standard terminology, together with the explanation of WHY
it should be expressed that way, instead of the way it's normally
expressed in many books and computer packages.

In the book I co-authored with Harry Roberts (1982 McGraw-Hill/
Scientific Press), Harry (who wrote most of the text) put it
this way (page 17-21):

1. The word "explained" is sometimes erroneously thought to
connote causation whereas it refers only to deviations
of fitted values from the overall mean, without any
implication that the regression model that produced
these fitted values has captured any causal scheme
underlying the data.

2. "Variation" or "variance" is often misunderstood. It
refers to a sum of squared deviations ... not to be
confused with "mean square" to be explained below.


Careful use of terminology promotes and breeds clear thinking
and the PROPER application of regression methods.

-- Bob.

Reef Fish

Jun 14, 2005, 1:49:15 PM

Richard Ulrich wrote:
> On 14 Jun 2005 05:15:30 -0700, "Reef Fish"
> <Large_Nass...@Yahoo.com> wrote:
>
> >
> > glenb...@geocities.com wrote:
> > > This post seems to confuse the r-squared statistic (expressed as a
> > > percentage) with (1-alpha, also expressed as a percentage) for some
> > > hypothesis test.
> >
> > At first I thought your "This post" referred to the post by the OP,
> > "A Guy" who expressed his question poorly, as has already been noted by
> > Jerry Dallal and Data Matter. But my post was the only one that
> > mentioned alpha or 1-alpha, so you must have meant my post.
>
> Bob is reading badly again. Or remembering wrong.

How could I remember wrong? I had ALL the posts of the thread, in
order of their appearance, in front of me, when I posted. I read
from groups.google.com which is always threaded that way.


> Neither Jerry nor Data Matter hit the point directly.

Both of them pointed to the OP's poor wording.


> The OP confused the two. He wrote,
>
> "If you think that the 95% confidence critical value is always an R^2
> of 95%, you should choose the NO option."
>
> My newsreader shows by outline form that Glen's post
> was a "Reply" to the OP. That's what it makes sense as, too.

I actually LOOKED at the original message (from google) and saw

From: glenbarn...@geocities.com
Newsgroups: sci.math,sci.stat.edu,sci.stat.math
Subject: Re: Do the critical values of linear correlation depend on
sample size?
Date: 13 Jun 2005 23:49:16 -0700
Organization: http://groups.google.com
Lines: 5
Message-ID: <1118731756....@g14g2000cwa.googlegroups.com>
References: <1118361532.9...@f14g2000cwb.googlegroups.com>


but was unable to retrieve the "reference" post to know which one it
was. That was why I based my inference on the (1 - alpha) mention by
Glen, because:

NOWHERE in the OP's post NOR in the web link given there
is (1 - alpha) referred to!!


> > BTW, it would be helpful in your future posts if you can identify
> > the post to which you refer and be more specific about your comment,
> > relative to the post to which you followed up.
>
> Yes, citing a bit can be helpful.
>
> [snip, gratuitous abuse of Glen, and explanations of R]
>
> --
> Rich Ulrich, wpi...@Pitt.edu
> http://www.pitt.edu/~wpilib/index.html

So, what have YOU, Richard Ulrich, contributed to the OP's post, or to
my more than complete answer to the OP's intended question, which
he never asked, nor knew exactly WHAT to ask?

Glen certainly did not make it clear himself either, in his cryptic
three lines:


GB> This post seems to confuse the r-squared statistic (expressed
GB> as a percentage) with (1-alpha, also expressed as a percentage)
GB> for some hypothesis test.

Where did Glen Barnett get his "1-alpha" and "expressed as a
percentage" from the OP's ("A Guy") post?

Richard, you remain a NUISANCE GNAT of this newsgroup -- you never have
ANYTHING worthwhile to contribute whenever you follow up on MY posts,
only gratuitous ad hominem remarks,

RU> Bob is reading badly again. Or remembering wrong.

and

RU> My newsreader shows by outline form that Glen's post
RU> was a "Reply" to the OP. That's what it makes sense as, too.

Then why didn't YOU explain

Where did Glen Barnett get his "1-alpha" and "expressed
as a percentage" from the OP's ("A Guy") post?


Above all, Richard Ulrich the resident gnat -- you have added
NOTHING to the substance or knowledge about the statistical question
of "A Guy" which I answered fully. You have added plenty of NOISE,
pure unadulterated NOISE of vacuous statistical substance.

-- Bob.
