
Nov 23, 2005, 12:43:46 AM

Hi,

I have read that the sample size can affect the significance of a

correlation. Can someone explain why this is so? Thanks.

Nov 23, 2005, 7:41:44 AM

The SE of r is SE(r) = SQRT[(1-r^2)/(n-2)]. The significance of r is

tested with a t-test on n-2 df.

t = (r-0)/SE(r)

So, as n gets larger, SE(r) gets smaller, t gets larger, and p gets smaller.
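A quick numerical illustration of that (a Python sketch using scipy; the value r = 0.20 is just an assumption for the example, not anything from the original question):

```python
from math import sqrt
from scipy import stats

r = 0.20  # assumed correlation, held fixed across sample sizes
results = {}
for n in (10, 50, 200, 1000):
    se = sqrt((1 - r**2) / (n - 2))       # SE(r) = SQRT[(1-r^2)/(n-2)]
    t = r / se                            # t = (r - 0)/SE(r)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p on n-2 df
    results[n] = (t, p)
    print(f"n={n:4d}  SE(r)={se:.4f}  t={t:.3f}  p={p:.4f}")
```

With r fixed, SE(r) shrinks roughly like 1/sqrt(n), so t grows and p falls -- which is the whole effect being asked about.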

--

Bruce Weaver

bwe...@lakeheadu.ca

www.angelfire.com/wv/bwhomedir

Nov 24, 2005, 9:52:09 AM

Conceptually:

Statistical significance = effect size x sample size

e.g., in concrete form

t = d x (df^0.5)/2 ... (for independent t using Cohen's d)

... as d or df increases t (and statistical significance) increases ...

This is taken straight from Robert Rosenthal's book chapter in the

Handbook of Research Synthesis.
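A minimal sketch of that relation (Python; the d and df values below are arbitrary illustrations, not from Rosenthal):

```python
from math import sqrt

def t_from_d(d, df):
    # Rosenthal's conceptual form for the independent-groups t with
    # equal group sizes, using Cohen's d:  t = d * (df^0.5) / 2
    return d * sqrt(df) / 2

# holding effect size fixed, t grows with df ...
for df in (8, 38, 198):
    print(f"d=0.5, df={df:3d}  ->  t = {t_from_d(0.5, df):.2f}")

# ... and holding df fixed, t grows with d
for d in (0.2, 0.5, 0.8):
    print(f"d={d}, df=38   ->  t = {t_from_d(d, 38):.2f}")
```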

Thom

Nov 24, 2005, 10:55:32 AM

In article <1132843929.0...@f14g2000cwb.googlegroups.com>,

Thom <t.s.b...@lboro.ac.uk> wrote:

>Conceptually:


>Statistical significance = effect size x sample size

A far more correct statement is that the "statistical

significance" is approximately a function of the effect

size multiplied by the square root of the sample size.

Even this is not a good guide as to what action should

be taken. To get that, one should look upon the

problem as a decision problem.

>e.g., in concrete form

>t = d x (df^0.5)/2 ... (for independent t using Cohen's d)

>... as d or df increases t (and statistical significance) increases ...

>This is taken straight from Robert Rosenthal's book chapter in the

>Handbook of Research Synthesis.

>Thom

--

This address is for information only. I do not claim that these views

are those of the Statistics Department or of Purdue University.

Herman Rubin, Department of Statistics, Purdue University

hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Nov 25, 2005, 6:32:07 AM

>A far more correct statement is that the "statistical

>significance" is approximately a function of the effect

>size multiplied by the square root of the sample size.

That's certainly true, but I think the spirit of Rosenthal's conceptual

equation is to remind/educate researchers that statistical significance

is a function of both effect size and sample size.

Thom

Nov 28, 2005, 10:51:56 AM

Herman Rubin wrote:

> In article <1132843929.0...@f14g2000cwb.googlegroups.com>,

> Thom <t.s.b...@lboro.ac.uk> wrote:

> >Conceptually:

>

> >Statistical significance = effect size x sample size

>

> A far more correct statement is that the "statistical

> significance" is approximately a function of the effect

> size multiplied by the square root of the sample size.

Both are incorrectly stated because of the undefined

nature of "statistical significance". (See below)

This question/answer was given in the thread in June:

" Do the critical values of linear correlation depend on sample size?"

http://groups.google.com/group/sci.stat.math/msg/601db8302f0f2b2a?hl=en

In the above article in that thread, my stated result was

r is significant (two-tailed) if

RF> |R|* sqrt((n-2)/(1 - R*R)) > t(1-alpha/2;(n-2)).

RF> or equivalently, if |R| > t / sqrt((n-2) + t*t)

where t is the critical value at alpha/2 for t with (n-2) df.

Since sqrt((n-2) + t*t) is approximately sqrt(n) for large n,

an easy mnemonic device (using the asymptotic approximation)

is to think of the standard error of r as 1/sqrt(n).

Thus, r is statistically significant at the 5% level (two-tailed) if

|r| > 2/sqrt(100) = 0.2 if n = 100

and |r| > 2/sqrt(10000) = 0.02 if n = 10,000

and so on.
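A sketch checking the mnemonic against the exact critical value (Python with scipy; alpha = 0.05 two-tailed assumed):

```python
from math import sqrt
from scipy import stats

def critical_r(n, alpha=0.05):
    # exact two-tailed critical |r|:  t / sqrt((n-2) + t*t),
    # with t the upper alpha/2 point of Student's t on n-2 df
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t / sqrt((n - 2) + t * t)

for n in (100, 10_000):
    print(f"n={n:6d}  exact critical |r| = {critical_r(n):.4f}  "
          f"mnemonic 2/sqrt(n) = {2 / sqrt(n):.4f}")
```

For n = 100 the exact value is about 0.197 against the mnemonic's 0.2, and the agreement only improves as n grows.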

> Even this is not a good guide as to what action should

> be taken. To get that, one should look upon the

> problem as a decision problem.

It is always a bad idea to make any decision based on the

value OR significance of ANY correlation coefficient!

I have shown elsewhere that a highly significant r , with

p value smaller than 0.0001 say, could be a completely

USELESS result in a regression problem. (The actual

example was the SPSS Manual data re-analyzed).

RF> In the case of a regression, testing for the significance of R is

RF> ALWAYS the WRONG QUESTION. John Tukey said something

RF> to the effect that using R is sweeping the data under the rug

RF> with a vengeance.

-- Bob.

Dec 1, 2005, 8:57:12 AM

Reef Fish wrote:

> It is always a bad idea to make any decision based on the

> value OR significance of ANY correlation coefficient!

>

> I have shown elsewhere that a highly significant r , with

> p value smaller than 0.0001 say, could be a completely

> USELESS result in a regression problem. (The actual

> example was the SPSS Manual data re-analyzed).

>

> RF> In the case of a regression, testing for the significance of R is

> RF> ALWAYS the WRONG QUESTION. John Tukey said something

> RF> to the effect that using R is sweeping the data under the rug

> RF> with a vengeance.


I'm a big fan of Tukey - especially about correlation coefficients, but

I doubt he'd have agreed "It is always a bad idea to make any decision

based on the value OR significance of ANY correlation coefficient"

given that he doesn't come across in his writing as dogmatic. As a

trivial counter-argument, when r = 0 or 1 it is often possible to make

sensible decisions (though they might be trivial in some cases). I

think one could reasonably extend this to values very close to 1 or 0

in many cases.

Thom

Dec 1, 2005, 11:25:00 AM

Thom wrote:

> Reef Fish wrote:

> > It is always a bad idea to make any decision based on the

> > value OR significance of ANY correlation coefficient!

> >

> > I have shown elsewhere that a highly significant r , with

> > p value smaller than 0.0001 say, could be a completely

> > USELESS result in a regression problem. (The actual

> > example was the SPSS Manual data re-analyzed).

> >

> > RF> In the case of a regression, testing for the significance of R is

> > RF> ALWAYS the WRONG QUESTION. John Tukey said something

> > RF>to the effect that using R is sweep the data under the rug

> > RF> with a vengeance.

>

> I'm a big fan of Tukey - especially about correlation coefficients, but

> I doubt he'd have agreed "It is always a bad idea to make any decision

> based on the value OR significance of ANY correlation coefficient"

> given that he doesn't come across in his writing as dogmatic.

You're quite right in a sense that I'll be glad to elaborate.

First, people have always ridiculed my use of CAPS for emphasis.

In my paraphrase of what Tukey said, I did NOT emphasize the

word "always", but "ANY" and "value OR significance".

Second, it is unfortunate (in more ways than one) that Tukey is no

longer with us, to be able to say whether he would have agreed

with my statement of what he said.

I have absolutely no doubt that I correctly conveyed his message,

though I am not quite sure whether that was what he SAID (in person)

at the research conference sponsored by NSF, held in the late

1970s, or whether it was something WRITTEN, in one of his

publications or in one of his unpublished notes (plenty of them

handed out in said conference).

One thing I distinctly remembered, when the conference started,

was that it was going to be the FIRST, in any of Tukey's conferences

of that kind (for researchers in academia) that the material in the

5-day conference would be published in the form of a Proceedings.

That "proceedings" turned out to be another one of the rumors

about Tukey's material to be in print. What the participants

received in print, instead, were pre-publication galley prints of

his book on Data Analysis, and the regression book he co-authored

with Frederick Mosteller.

On the serious side of that comment, I am absolutely sure the

quoted phrase "sweeping under the rug with a vengeance" was

correct, because I could hardly come up with such a powerful

phrase myself. :-)

I have already indicated the two extremes (near 0) or (near 1)

that a statistically "significant" correlation could be completely

useless:

r = 0.02 is statistically highly significant for a sample of size

10,000.

(That's the main answer to the SUBJECT of this thread). By

increasing the sample size to 100,000, the 0.02 correlation

will be even more significant with a much smaller p-value.
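Putting p-values on those two sample sizes (a Python sketch using the usual t-test for r described earlier in the thread):

```python
from math import sqrt
from scipy import stats

def p_of_r(r, n):
    # two-tailed p for H0: rho = 0, via t = r*sqrt((n-2)/(1-r^2)) on n-2 df
    t = abs(r) * sqrt((n - 2) / (1 - r * r))
    return 2 * stats.t.sf(t, df=n - 2)

for n in (10_000, 100_000):
    print(f"r = 0.02, n = {n:6d}  ->  p = {p_of_r(0.02, n):.2e}")
```

The same r = 0.02 crosses the 0.05 threshold at n = 10,000 and becomes overwhelmingly "significant" at n = 100,000, while remaining just as practically useless.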

On the "near one" side of the uselessness, I think the example

in the 1975 SPSS Manual is hard to beat, as an ACTUAL example,

with real data, and published! The Multiple R was so statistically

significant that the writer of that example completely overlooked

two HUGE typos in the data; and when those typos were

corrected, that multiple regression was even more significant.

The balloon was punctured by me in MY Data Analysis Lecture

Notes that the SPSS multiple regression did a WORSE job than

my SIMPLE regression with only one of the three independent

variables, and that ALL of the fitted models were useless for

any practical purposes.

Details of that example (even with participation by several

readers in sci.stat.math for trying out their hands at the data)

can be found in the archives.

So, I stand on my understanding and paraphrased quote of

Tukey, because one can easily extrapolate from the two

examples above that a correlation coefficient of 0.002

could be highly statistically significant for a very large sample;

and that a correlation of 0.999 could be just as spurious or

useless as the SPSS Manual example.

> As a

> trivial counter-argument when r = 0 or 1 it is often possible to make

> sensible decisions (though they might be trivial in some cases).

These two cases of "exactly 0" and "exactly 1" never came up

in any of Tukey's or my discussion of correlations, and I am not

prepared to argue about these pedantically trivial and pathogical

cases, other than saying that would be like arguing how small

is "epsilon" or how many angels can dance on the tip of a pin.

> I think one could reasonably extend this to values very close to 1

> or 0 in many cases.

To which I would strongly disagree. The examples I've seen and

used above were "very close" (by ordinary standards of the

meaning of the term) to 1 or 0, and yet they are both clearly

USELESS.

> Thom

Thanks for your "devil's advocate" challenge of my statement of

Tukey's view, which allowed me to elaborate on several aspects

of the quoted statement, as well as re-iterating on the actual

USELESS examples of correlations very close to 0 or 1.

-- Reef Fish Bob.

In Las Vegas, pondering how small an "epsilon" advantage the House

has on me in a game of Blackjack, given the rules of blackjack in

Vegas, and the lack of opportunity for card counting in single-deck

deals; as well as the not-barred card-counting play in 6-8 deck

dealer-shuffled (as opposed to machine or continuous shuffle) shoes.

Dec 2, 2005, 10:03:23 AM

I agree that in many cases (but not always) r is pretty useless and

that can include r = 0 or r = 1. My sense of Tukey's dislike of r is

that he wasn't keen on the emphasis that many researchers place on

standardized versus unstandardized regression coefficients. In general

r is less interesting, informative and useful than the unstandardized

slope. If r = 0 and r = 1 represent particularly interesting values

then r may be as useful as (or possibly better than) the unstandardized

slope. Reliability seems a case in point, as r = 1 is a sensible

aspiration and the proximity of the observed r to 1 is informative and

easily compared between different variants of the measure. If your

argument is that r is often (mostly?) useless or inferior to other

easily obtained statistics then I'd agree.

``Like the late Charles P. Winsor, a statistician far ahead of his

time, I find the use of a correlation coefficient a dangerous

symptom.''

J.W. Tukey (1969)

BTW Tukey was a member of Charles Winsor's Society for the Suppression

of the Correlation Coefficient.

See Brillinger (2001). "John Tukey and the correlation coefficient",

Computing Science and Statistics, 33.

Thom


Dec 3, 2005, 6:23:20 PM

Thom wrote:

> I agree that in many cases (but not always) r is pretty useless and

> that can include r = 0 or r = 1. My sense of Tukey's dislike of r is

> that he wasn't keen on the emphasis that many researchers place on

> standardized versus unstandardized regression coefficients.

I can't speak for Tukey, but I don't think so.

In ALL multiple regression, the coefficient of Xj is r times sy (std.

dev. of y) divided by the std dev. of Xj, where r is either the

Pearson r for simple regression or the partial correlation between

Y and Xj given the other X's in the equation.

Standardized or not, it doesn't take a genius to convert the

unstandardized coef. to a standardized one or vice versa. But

the point remains that the correlation (simple or partial) are

themselves rather useless in the sense already discussed.
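A quick numerical check of the simple-regression case (Python with NumPy; the data are made up for illustration):

```python
import numpy as np

# simulated data: y roughly linear in x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)  # b = r * sy / sx
slope_ols = np.polyfit(x, y, 1)[0]                # least-squares slope

print(slope_from_r, slope_ols)  # the two agree
```

The agreement is exact up to rounding, which is why converting between standardized and unstandardized coefficients is trivial -- but also why neither form rescues r itself.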

-- Bob.

Dec 3, 2005, 9:46:25 PM

Reef Fish wrote:

> Thom wrote:

> > I agree that in many cases (but not always) r is pretty useless and

> > that can include r = 0 or r = 1. My sense of Tukey's dislike of r is

> > that he wasn't keen on the emphasis that many researchers place on

> > standardized versus unstandardized regression coefficients.

>

> I can't speak for Tukey, but I don't think so.

>

> In ALL multiple regression, the coefficient of Xj is r times sy (std.

> dev. of y) divided by the std dev. of Xj, where r is either the

> Pearson r for simple regression or the partial correlation between

> Y and Xj given the other X's in the equation.

I was a bit careless not stating explicitly the "partial standard

deviations" in the case of multiple regression coefficients. The detail about "all

simple, partial, and multiple correlations are SIMPLE correlations"

can be seen in several of my posts earlier this year, including:

I thought I had better include that reference because of my

ambiguous use of the "std dev. of Xj" in the preceding post.
