Testing sample > 5,000 data points for normality by testing only subsample?

Christoph Ruehlemann

Jan 24, 2020, 8:19:01 AM
to corplin...@googlegroups.com
Hi all,

I have a number of samples that I would like to test for normality. One of the samples exceeds 5,000 data points, the limit up to which the Shapiro-Wilk test accepts samples. This is the data:

c1 <- exp(rnorm(505))
c2 <- exp(rnorm(550))
c3 <- exp(rnorm(5500))

cluster.data <- c(c1, c2, c3)
cluster.factors <- c(rep("Cluster_1", length(c1)), 
                     rep("Cluster_2", length(c2)),
                     rep("Cluster_3", length(c3)))

# set up data for test:
cluster.df <- data.frame(cluster.data, cluster.factors)

To get around the 5,000-observation limit, would it be statistically acceptable to run the test only on smaller subsamples of the data? Here, for example, I draw a subsample of size 500 from each of the three clusters:

tapply(cluster.df[,1], cluster.df[,2], function(x) shapiro.test(sample(x, 500)))

And the test returns significant results for all three:

$Cluster_1

    Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.59561, p-value < 2.2e-16


$Cluster_2

    Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.57891, p-value < 2.2e-16


$Cluster_3

    Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.67686, p-value < 2.2e-16
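
A single subsample of 500 is only one random draw, so one way to judge the approach is to repeat the draw many times and see how often the conclusion changes. A minimal sketch, assuming the Cluster_3 recipe from above; the seed, the number of repetitions (`n.reps`) and the 5% threshold are arbitrary choices for illustration, not part of the original question:

```r
# Sketch only: repeat the subsample test instead of relying on one draw.
set.seed(1)                  # assumed seed, for reproducibility only
c3 <- exp(rnorm(5500))       # same recipe as Cluster_3 above

n.reps <- 200                # number of subsamples (arbitrary choice)

# p-value of the Shapiro-Wilk test for each random subsample of size 500
p.values <- replicate(n.reps, shapiro.test(sample(c3, 500))$p.value)

# share of subsamples in which normality is rejected at the 5% level
reject.rate <- mean(p.values < 0.05)
reject.rate
```

If the rejection rate is close to 0 or 1, the verdict does not depend much on which 500 points happened to be drawn; a rate in between would mean a single subsample is not a stable basis for a decision.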

Best
Chris
--
Albert-Ludwigs-Universität Freiburg
Principal investigator, DFG project "Analyse multimodaler Interaktion im Geschichtenerzählen" (Analysis of multimodal interaction in storytelling)
ἰχθύς

Stefan Th. Gries

Jan 24, 2020, 10:43:32 AM
to CorpLing with R
If you really need a significance test for normality, why not just use
the ks.test, much simpler:

qwe <- runif(5001)
ks.test(qwe, "pnorm", mean=mean(qwe), sd=sd(qwe)) # SFLWR2: p. 164
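
Applied to the large cluster from the original question, the same call might look like the sketch below. The seed is an assumption added for reproducibility. One caveat from the `ks.test` help page: for the one-sample test the parameters are supposed to be pre-specified rather than estimated from the data, so with `mean(c3)` and `sd(c3)` plugged in, the p-value is best read as approximate.

```r
# Sketch only: ks.test has no 5,000-observation limit, so the full
# Cluster_3 sample can be tested in one go.
set.seed(1)                  # assumed seed, for reproducibility only
c3 <- exp(rnorm(5500))       # same recipe as Cluster_3 above

# Compare the sample with a normal distribution whose mean and sd
# are estimated from the sample itself (hence only approximate).
ks.c3 <- ks.test(c3, "pnorm", mean = mean(c3), sd = sd(c3))

ks.c3$statistic              # D: largest gap between the two CDFs
ks.c3$p.value
```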

Christoph Ruehlemann

Jan 24, 2020, 11:44:23 AM
to corplin...@googlegroups.com
Are normal quantile plots also an acceptable (visual) method to assess the degree of (non-)normality?

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CAFrBz2%3DLc4PiqixKgiPLJzx8XXBApkgsEHYkYVrL%2BP6YooP_bA%40mail.gmail.com.

Christoph Ruehlemann

Jan 24, 2020, 11:47:04 AM
to corplin...@googlegroups.com
And what should I make of the warning messages that ks.test may give:

Warning messages:
1: In ks.test(x, y = pnorm) :
  ties should not be present for the Kolmogorov-Smirnov test


Alex Perrone

Jan 24, 2020, 1:27:06 PM
to corplin...@googlegroups.com
If you're testing for normality, the variable should be a continuous random variable, so ties (duplicate values in your sample) shouldn't be present. Do you have an idea why they are? Are your values discrete, perhaps? In theory, ties occur with probability zero for a continuous r.v. Since they are not strictly impossible in practice, especially with finite precision, a warning rather than an error makes sense.

Maybe check out these resources? 



Doesn't hurt to do a QQ plot! With the qqnorm function (plus qqline for the reference line)...
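
For what it's worth, a normal QQ plot in base R could be sketched as follows. This uses `qqnorm()` together with `qqline()`, since base R's `qqplot()` expects two samples to compare; the data recipe is the Cluster_3 one from the original question, and the seed and plot titles are assumptions for illustration.

```r
# Sketch only: normal QQ plots for the large cluster, raw and logged.
set.seed(1)                  # assumed seed, for reproducibility only
c3 <- exp(rnorm(5500))       # same recipe as Cluster_3 above

par(mfrow = c(1, 2))         # two panels side by side

# Raw values: points bending away from the line indicate non-normality.
qq <- qqnorm(c3, main = "Cluster_3 (raw)")
qqline(c3)

# Logged values: normal by construction here, so the points should
# follow the reference line closely.
qqnorm(log(c3), main = "Cluster_3 (log)")
qqline(log(c3))
```

`qqnorm()` also returns the plotted coordinates invisibly (stored in `qq` above), which is handy if you want to inspect the largest deviations numerically.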


Stefan Th. Gries

Jan 24, 2020, 1:29:52 PM
to CorpLing with R
> Are Normal Quantile plots also an acceptable (visual) method to assess the degree of (non-)normality?
Well, I see people using them although I do not always find it easy to
decide when the deviations from the dotted line become 'problematic'.

> And what to make of the Warning messages the ks.test may give:
You could jitter the points and do so multiple times to see what
happens when the ties are dealt with.
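
A minimal sketch of that jittering idea, under made-up assumptions: the tied data are simulated here by rounding normal values to one decimal place, and the number of repetitions (100) and `jitter()`'s default noise amount are arbitrary choices.

```r
# Sketch only: break ties by jittering, repeat, and look at the spread
# of the resulting p-values.
set.seed(1)                                    # assumed seed
x <- round(rnorm(1000, mean = 10, sd = 2), 1)  # rounding creates ties
has.ties <- anyDuplicated(x) > 0               # TRUE if any value repeats

jittered.p <- replicate(100, {
  xj <- jitter(x)            # add a small amount of uniform noise
  ks.test(xj, "pnorm", mean = mean(xj), sd = sd(xj))$p.value
})

summary(jittered.p)          # how stable is the p-value across jitters?
```

If the p-values land on the same side of your threshold in nearly all repetitions, the ties are probably not driving the result.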

Martin Schweinberger

Jan 25, 2020, 5:19:14 AM
to corplin...@googlegroups.com
Hi all,

Just adding my 2 cents: We know that both Kolmogorov-Smirnov and Shapiro-Wilk are unreliable when it comes to testing for normality in small and in large samples. You can try that out yourselves by simulating normally distributed data and using both tests to check for normality (here is a video doing that in SPSS for a sample size of 1000 - sorry that the video is in German). With large samples both tests are too harsh, and with small samples they are too lenient. I have not found the reference again (I guess it was somewhere on StackOverflow), but I remember reading that neither test should be used for samples smaller than 50 or larger than 200.

Another reason why I usually use visual inspection (qqplot) rather than statistical tests for normality is that, according to Zuur, Hilbe & Ieno (2015: A Beginner's Guide to GLM and GLMM with R. A frequentist and Bayesian perspective for ecologists. Highland Statistics Ltd: Newburgh), Cook and Weisberg (1982. Residuals and Influence in Regression. Chapman & Hall: London) have shown that statistical tests do not outperform visual inspection for normality assessment. But, as I said, you can check that yourself using simulation - maybe a paper in waiting for someone here ;).
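
The simulation suggested above could be sketched in R roughly like this. Everything here is an illustrative assumption (the seed, the sample sizes, 200 simulations per cell, the 5% threshold, and the helper names `reject.rate`, `sw` and `ks`), and the sketch makes no claim about what the table will show:

```r
# Sketch only: how often does each test reject normality when the data
# really are normal, at different sample sizes?
set.seed(1)                                   # assumed seed
sample.sizes <- c(20, 50, 200, 1000, 5000)    # arbitrary choices
n.sims <- 200                                 # simulations per sample size

# hypothetical helpers: each returns the p-value of one test
sw <- function(x) shapiro.test(x)$p.value
ks <- function(x) ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value

# proportion of simulated normal samples of size n rejected at 5%
reject.rate <- function(n, test.fun) {
  mean(replicate(n.sims, test.fun(rnorm(n)) < 0.05))
}

results <- data.frame(
  n       = sample.sizes,
  shapiro = sapply(sample.sizes, reject.rate, test.fun = sw),
  ks      = sapply(sample.sizes, reject.rate, test.fun = ks)
)
results
```

Swapping `rnorm(n)` for a mildly non-normal generator would show the other side of the question, namely how often each test detects a real departure at each sample size.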

Cheers,
Martin

p.s.: As far as I know, the relevant issue is typically normality of the residuals and not normality of the dependent variable. Just sayin'
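
To illustrate the point about residuals, here is a minimal sketch with made-up data: the predictor `x`, the response `y` and the linear model are hypothetical and only show where the residuals would come from.

```r
# Sketch only: check the residuals of a fitted model rather than the
# raw dependent variable.
set.seed(1)                        # assumed seed
x <- runif(300, 0, 10)             # hypothetical predictor
y <- 2 + 0.5 * x + rnorm(300)      # hypothetical response, normal errors

m   <- lm(y ~ x)                   # simple linear model
res <- residuals(m)

shapiro.test(res)                  # formal test on the residuals
qqnorm(res, main = "Residuals of lm(y ~ x)")
qqline(res)                        # visual check on the residuals
```

Note that `y` itself need not look normal for the model to be fine: its distribution is a mixture over the values of `x`, while the assumption concerns the errors around the fitted line.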
=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/


