
Jun 9, 2005, 7:58:52 PM

I have been having an argument about this.

For this question, we're assuming linear regression is run on n data
points. We get a correlation of R^2, where R is the slope of the best
fit line with normalized (mean of 0, standard deviation of 1)
coordinates.

If you think that the 95% confidence critical value is always an R^2 of

95%, you should choose the NO option.

If you think that the critical value of the R^2 correlation depends on

how many points n you have, like in this chart, then you should choose

the YES option. The critical values are listed in terms of |R|, BTW. Of

course |R| = sqrt(R^2).

http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htm

Choose YES or NO or INVALID QUESTION
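[Editorial aside: the OP's premise, that R equals the slope of the least-squares line on standardized coordinates, can be checked numerically. A minimal stdlib-only Python sketch; the data points are made up for illustration.]

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardized_slope(x, y):
    """Least-squares slope of y on x after both are normalized
    to mean 0 and (sample) standard deviation 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    zx = [(a - mx) / sx for a in x]
    zy = [(b - my) / sy for b in y]
    # least-squares slope through the standardized points
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(x, y), standardized_slope(x, y))  # the two values agree
```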

Jun 9, 2005, 7:43:58 PM

The answer to the question you are intending to ask is 'yes'. The more data, the better you have

estimated the parameter of interest, the tighter the confidence

interval, and the less variable the sample quantity will be around the

corresponding population value.

BTW, the critical value you speak of is usually the value needed to

reject a null hypothesis of a population value of 0. If the underlying

value is 0, the more data, the harder it is for the sample value to

exceed any 0<k<1.

Unless I missed something...

Jun 9, 2005, 9:50:57 PM

What do you mean by "the 95% confidence critical value is always an R^2

of 95%"? What does "R^2 of 95%" mean? I have never heard of such a

thing.

Jun 10, 2005, 12:20:03 AM

A Guy wrote:

> I have been having an argument about this.

>

> For this question, we're assuming linear regression is run on n data

> points. We get a correlation of R^2, where R is the slope of the best

> fit line with normalized (mean of 0, standard devitation of 1)

> coordinates.

Just say your sample correlation between X and Y is R.

You want to test the hypothesis Ho: rho = 0, and you want to know,
if you test it at some alpha level, how large R must be before Ho
can be rejected at that alpha level for a two-tailed test.

The answer depends on the sample size n and alpha.

The critical values of |R| are given in the table in the link you gave:

> http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htm

For example, at alpha = 0.05 and various df (= n-2), your critical
values are:

df     20    40    60    80    100

|R| > .423  .304  .250  .217  .195

e.g., if n=102, you reject if |R| is greater than .195, and if n=20,
you need a correlation coefficient of |R| > .423 before you can
reject.

This is WHY. If (X,Y) comes from a bivariate normal, then under the
null hypothesis of rho = 0, the statistic R * sqrt((n-2)/(1-R*R)) has a
T distribution with (n-2) df. Thus, the null hypothesis is rejected if

|R| * sqrt((n-2)/(1 - R*R)) > t(1-alpha/2; (n-2)),

or equivalently, if |R| > t / sqrt((n-2) + t*t).

The right-hand-side expression above, for various combinations
of alpha and d.f. (= n-2), is given in the web link.

Exercise: verify any of the values in the table by the formula above
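[Editorial aside: a stdlib-only Python sketch of that exercise. The two-tailed t critical values below are hard-coded from a standard t table (an assumption; with SciPy available one would compute them as scipy.stats.t.ppf(0.975, df)).]

```python
import math

# t(0.975; df), i.e., two-tailed alpha = 0.05, from a standard t table
T_CRIT = {20: 2.086, 40: 2.021, 60: 2.000, 80: 1.990, 100: 1.984}

def r_critical(df, t):
    """Critical |R| implied by the rule |R| > t / sqrt(df + t*t)."""
    return t / math.sqrt(df + t * t)

for df, t in sorted(T_CRIT.items()):
    print(f"df={df:3d}  |R| > {r_critical(df, t):.3f}")
# reproduces the table row: .423  .304  .250  .217  .195
```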

---------------------------------------------------------------------

Having given you the solution, I should caution that George Box said

in his (1978 JASA) paper "it is better to get an approximate answer

to the right question than an exact answer to the wrong question".

In the case of a regression, testing for the significance of R is

ALWAYS the WRONG QUESTION. John Tukey said something to the effect

that using R is to sweep the data under the rug with a vengeance.

A statistically significant R may be utterly useless if the sample

size is large, and the relation between X and Y may look like a

shotgun blast. For n=10,000 it takes only a correlation of .02

for it to be statistically significant.

Use prediction intervals and other means to look for the PRACTICAL

significance of any regression result.

A final note is an easy and useless result for large samples.

If the sample size is large, the APPROXIMATE critical value for |R|

is Z/sqrt(n), because T --> Z, and (n-2) + Z*Z ---> n (approx).

For example, for Z = 1.96 and n = 10,000, the critical value

is approx. 1.96/100 = 0.0196 or 0.02.

The exact critical values is 1.9602/(9998 + 1.9602**2) = 0.01960.

If you're good at doing square roots of large numbers in your head

you can estimate the critical value of R for large samples to be

2/n for alpha .05, and win a bar bet or impress some friends.

-- Bob.

Jun 11, 2005, 9:29:11 PM

I found two fairly obvious typos.

>

> The exact critical values is 1.9602/(9998 + 1.9602**2) = 0.01960.

The denominator should be sqrt(9998 + 1.9602^2).

>

> If you're good at doing square roots of large numbers in your head

> you can estimate the critical value of R for large samples to be

> 2/n for alpha .05, and win a bar bet or impress some friends.

2/n should be 2/sqrt(n).
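[Editorial aside: with both corrections applied, the numbers work out; a quick stdlib-only check, using the t value 1.9602 quoted in the post being corrected.]

```python
import math

n = 10_000
t = 1.9602  # t(0.975; 9998), essentially the normal value 1.96

exact = t / math.sqrt((n - 2) + t * t)  # corrected exact critical |R|
approx = 1.96 / math.sqrt(n)            # Z / sqrt(n) approximation
bar_bet = 2 / math.sqrt(n)              # corrected rule of thumb, 2/sqrt(n)

print(exact, approx, bar_bet)  # all close to 0.0196, 0.0196, 0.02
```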

Jun 14, 2005, 2:49:16 AM

This post seems to confuse the r-squared statistic (expressed as a

percentage) with (1-alpha, also expressed as a percentage) for some

hypothesis test.

Jun 14, 2005, 8:15:30 AM

At first I thought your "This post" referred to the post by the OP,
"A Guy," who expressed his question poorly, as has already been noted
by Jerry Dallal and Data Matter. But my post was the only one that
mentioned alpha or 1-alpha, so you must have meant my post.

BTW, it would be helpful in your future posts if you could identify
the post to which you refer and be more specific about your comment
relative to the post to which you followed up.

In this case, the confusion is all YOURS.

Given the TABLE provided by "A Guy" for his ill-posed question, I
merely asked the appropriate question for him, for which the table
provided the answers, as well as providing the THEORY from which
the values in the table were computed.

R, the "multiple correlation coefficient" is the same as |r| the

Pearson correlation.

The R^2 statistic is NOT a percent, but can be expressed as a percent
by multiplying by 100. It is the proportion of variation of Y FITTED
(not explained -- r or R doesn't explain anything) by the X.

Specifically, R^2 = Regression SS / Corrected Total SS

                  = 1 - Residual SS / Corrected Total SS.
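[Editorial aside: both forms are easy to verify numerically; a minimal pure-Python sketch with invented data, fitting a simple least-squares line.]

```python
# Check that R^2 = Regression SS / Corrected Total SS
#                = 1 - Residual SS / Corrected Total SS

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# least-squares fit y = b0 + b1*x
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx
fitted = [b0 + b1 * a for a in x]

ss_total = sum((b - my) ** 2 for b in y)               # corrected total SS
ss_reg = sum((f - my) ** 2 for f in fitted)            # regression SS
ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))  # residual SS

r2_a = ss_reg / ss_total
r2_b = 1 - ss_res / ss_total
print(r2_a, r2_b)  # the two definitions agree
```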

This is YOUR (Glen Barnett's) first confusion.

Next, you are confused about the ALPHA given in the web page TABLE.

This was how the web page explained "how to use this table":

#> Once you have learned the correlation coefficient (r) for your

#> sample, you need to determine what the likelihood is that the

#> relationship you found in your sample really exist in the

#> population or were your results a fluke? -- OR -- In the case

#> of a t-test, did the difference between the two means in your

#> sample occurred by chance and not really exist in your population.

which of course is the verbose way of saying, for the correlation r
case, that alpha is the significance level for testing Ho: rho = 0.

Alpha is a PROBABILITY, though it can be expressed as a percent
by multiplying by 100. I equated alpha to the level of significance
p (fraction) given in the table for a two-tailed test of
the hypothesis Ho: rho = 0.

The THEORY I provided was derived from standard textbook material
on how r or R can be tested, though I did not have to refer to any
textbook.

The approximation results are in my unpublished lecture notes on

Testing Correlation Coefficients, which emphasized

RF> In the case of a regression, testing for the significance of R is

RF> ALWAYS the WRONG QUESTION. John Tukey said something to the effect

RF> that using R is to sweep the data under the rug with a vengeance.

which I also included in my post.

Now you can go unconfuse yourself by picking up an elementary textbook

on the subject. My explanation in THIS post should help those who

might be somewhat confused, but not as confused as you were, as to

where EVERYTHING came from, in my post:

http://groups-beta.google.com/group/sci.stat.math/msg/601db8302f0f2b2a?hl=en

-- Bob.

Jun 14, 2005, 12:08:07 PM

Reef Fish wrote:

>

> The R^2 statistic is NOT a percent, but can be expressed as a percent

> by multiplying by 100. It is the percent of variation of Y FITTED

> (not explained -- r or R doesn't explain anything) by the X.


I have no problem parading my ignorance. It's one of the ways I

learn. Often some kind soul takes pity...

Of course, I always tell my students and colleagues that the regression
model doesn't "explain" anything, and that the *only* thing "explained"
has going for it is that it is pithy and entrenched.

I've tried "reduction in variability". I never liked "accounted for"

because it's too close to "explained". I'd been using variations on the

words "predicted by". But, "Fitted by"! Just the phrase I've been

looking for for the past quarter century!

Jun 14, 2005, 12:32:33 PM

On 14 Jun 2005 05:15:30 -0700, "Reef Fish"

<Large_Nass...@Yahoo.com> wrote:

>

> glenb...@geocities.com wrote:

> > This post seems to confuse the r-squared statistic (expressed as a

> > percentage) with (1-alpha, also expressed as a percentage) for some

> > hypothesis test.

>

> At first I thought your "This post" referred to the post by the OP,

> "A Guy," who expressed his question poorly, as has already been noted

> Jerry Dallal and Data Matter. But my post was the only one that

> mentioned alpha or 1-alpha, so you must have meant my post.

Bob is reading badly again. Or remembering wrong.

Neither Jerry nor Data Matter hit the point directly.

The OP confused the two. He wrote,

"If you think that the 95% confidence critical value is always an R^2

of 95%, you should choose the NO option."

My newsreader shows by outline form that Glen's post

was a "Reply" to the OP. That's what it makes sense as, too.

>

> BTW, it would be helpful in your future posts if you can identity

> the post to which you refer and be more specific about your comment,

> relative to the post to which you followed up.

Yes, citing a bit can be helpful.

[snip, gratuitous abuse of Glen, and explanations of R]

--

Rich Ulrich, wpi...@Pitt.edu

http://www.pitt.edu/~wpilib/index.html

Jun 14, 2005, 1:25:28 PM

Glad to hear someone who thinks the distinction is important.

For OVER a quarter of a century, I've been lecturing AGAINST the
sloppy language used by many textbooks AND statistical packages
that term R^2 the "percent of variation explained by the
regression," on one misdemeanor and one felony charge. :-)

Sloppy language promotes sloppy (and worse, WRONG) ideas.

The use of "explained" for "fitted" is the felony that inadvertently
but nearly always encourages the unwary user to think that
somehow the FITTED model actually "explained" some phenomenon
between the dependent variable and its regressors or independent
variables.

The misdemeanor is the "percent". It is of course the fraction or
the proportion, but not the "percent". After the Miami politicians
passed in their legislature a law to reduce the real estate property
tax to 0.5% (when they intended to mean 50%) of the previous value,
to the delight of the taxpayers I am sure, their red faces made
national news when they had to immediately repeal what they had
passed, for their faux pas in the careless use of the word "percent".

In my own unpublished Lecture Notes (since 1970), I have used

"proportion (or fraction) of the variation FITTED by the regression"

as my standard terminology, together with the explanation of WHY

it should be expressed that way, instead of the way it's normally
expressed in many books and computer packages.

In the book I co-authored with Harry Roberts (1982, McGraw-Hill/
Scientific Press), Harry (who wrote most of the text) put it
this way (pages 17-21):

1. The word "explained" is sometimes erroneously thought to
connote causation, whereas it refers only to deviations
of fitted values from the overall mean, without any
implication that the regression model that produced
these fitted values has captured any causal scheme
underlying the data.

2. "Variation" or "variance" is often misunderstood. It

refers to a sum of squared deviations ... not to be

confused with "mean square" to be explained below.

Careful use of terminology promotes and breeds clear thinking

and the PROPER application of regression methods.

-- Bob.

Jun 14, 2005, 1:49:15 PM

Richard Ulrich wrote:

> On 14 Jun 2005 05:15:30 -0700, "Reef Fish"

> <Large_Nass...@Yahoo.com> wrote:

>

> >

> > glenb...@geocities.com wrote:

> > > This post seems to confuse the r-squared statistic (expressed as a

> > > percentage) with (1-alpha, also expressed as a percentage) for some

> > > hypothesis test.

> >

> > At first I thought your "This post" referred to the post by the OP,

> > "A Guy," who expressed his question poorly, as has already been noted

> > Jerry Dallal and Data Matter. But my post was the only one that

> > mentioned alpha or 1-alpha, so you must have meant my post.

>

> Bob is reading badly again. Or remembering wrong.

How could I remember wrong? I had ALL the posts of the thread, in

order of their appearance, in front of me, when I posted. I read

from groups.google.com which is always threaded that way.

> Neither Jerry nor Data Matter hit the point directly.

Both of them pointed to the OP's poor wording.

> The OP confused the two. He wrote,

>

> "If you think that the 95% confidence critical value is always an R^2

> of 95%, you should choose the NO option."

>

> My newsreader shows by outline form that Glen's post

> was a "Reply" to the OP. That's what it makes sense as, too.

I actually LOOKED at the original message (from google) and saw

From: glenbarn...@geocities.com

Newsgroups: sci.math,sci.stat.edu,sci.stat.math

Subject: Re: Do the critical values of linear correlation depend on

sample size?

Date: 13 Jun 2005 23:49:16 -0700

Organization: http://groups.google.com

Lines: 5

Message-ID: <1118731756....@g14g2000cwa.googlegroups.com>

References: <1118361532.9...@f14g2000cwb.googlegroups.com>

but was unable to retrieve the "reference" post to know which one it

was.

That was why I based my inference on the (1 - alpha) mention by Glen,
because

NOWHERE in the OP's post NOR in the web link there
is (1 - alpha) referred to!!

> > BTW, it would be helpful in your future posts if you could identify

> > the post to which you refer and be more specific about your comment,

> > relative to the post to which you followed up.

>

> Yes, citing a bit can be helpful.

>

> [snip, gratuitous abuse of Glen, and explanations of R]

>

> --

> Rich Ulrich, wpi...@Pitt.edu

> http://www.pitt.edu/~wpilib/index.html

So, what have YOU, Richard Ulrich, contributed to the OP's post,
or to my more than complete answer to the OP's intended question,
which he never asked, or knew exactly WHAT to ask?

Glen certainly did not make it clear himself either, in his cryptic
three lines:

GB> This post seems to confuse the r-squared statistic (expressed

GB> as a percentage) with (1-alpha, also expressed as a percentage)

GB> for some hypothesis test.

Where did Glen Barnett get his "1-alpha" and "expressed as a
percentage" from the OP's ("A Guy") post?

Richard, you remain a NUISANCE GNAT of this newsgroup -- never had

ANYTHING worthwhile to contribute whenever you follow-up on MY posts,

but also with gratuitous ad hominem remarks,

RU> Bob is reading badly again. Or remembering wrong.

and

RU> My newsreader shows by outline form that Glen's post

RU> was a "Reply" to the OP. That's what it makes sense as, too.

Then why didn't YOU explain

Where did Glen Barnett get his "1-alpha" and "expressed
as a percentage" from the OP's ("A Guy") post?

Above all, Richard Ulrich, the resident gnat -- you have added
NOTHING to the substance or knowledge about the statistical question
of "A Guy", which I answered fully. You have added plenty of NOISE,

pure unadulterated NOISE of vacuous statistical substance.

-- Bob.
