Chi-square or others?


Leo

Apr 23, 2013, 6:13:29 PM
to statforli...@googlegroups.com
Hi all,

My question may seem naive, but I really need your help.

I have two corpora, of 1,010,910 and 1,017,190 words.

With the help of TreeTagger, I have computed the lemma totals for the two corpora: 51,480 and 44,640 respectively.

I now want to test whether the lemma totals of the two corpora differ.

Is it acceptable to use the chi-square test to compare the two totals?

If so, this is what I have in mind. It does not take the sizes of the corpora into account and may not be correct:

> chisq.test(c( 51480, 44640), p=c(0.5, 0.5), correct=F)

    Chi-squared test for given probabilities

data:  c(51480, 44640)
X-squared = 486.7416, df = 1, p-value < 2.2e-16

Thank you very much.

Leo

Stefan Th. Gries

Apr 23, 2013, 7:38:14 PM
to statforli...@googlegroups.com
Why not this?

chisq.test(c(51480, 44640), p=c(1010910, 1017190)/sum(c(1010910, 1017190)), correct=F)

Of course it's significant, given the sample sizes - the effect size
seems to be small, though. Plus, comparing only the type frequencies
may be using very little of the information of all the token
frequencies, but all that depends on your theoretical goals.
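
A minimal sketch of what the call above does, assuming the two lemma totals are tested against expected proportions derived from the corpus sizes. The variable names (`observed`, `sizes`, `w`) are illustrative, and Cohen's w is only one possible effect-size measure, picked here as an assumption rather than as the measure necessarily meant above:

```r
# Observed lemma totals and corpus sizes, taken from the thread
observed <- c(51480, 44640)
sizes    <- c(1010910, 1017190)

# Goodness-of-fit test: expected shares are proportional to corpus size
res <- chisq.test(observed, p = sizes / sum(sizes))

res$expected   # lemma totals expected if both corpora behaved alike
res$statistic  # chi-squared value
res$p.value

# Cohen's w = sqrt(chi-squared / N); values around 0.1 are conventionally "small"
w <- sqrt(unname(res$statistic) / sum(observed))
w
```

With samples this large, the p-value alone says little; comparing `w` against the conventional benchmarks gives a rough sense of how big the difference is.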

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Leo Lei Lei

Apr 24, 2013, 1:15:51 AM
to statforli...@googlegroups.com

Professor Gries,
Thank you very much for your help.
Yes, I am planning to compare both the frequencies of the tokens and the types, and one of the aims of the study is to see whether there is any difference in the results of the two chi-square measures.
Thanks again.
All the Best,
Leo

Leo

May 6, 2013, 12:38:02 AM
to statforli...@googlegroups.com

Dear Professor Gries,
To compute the effect size as you suggested, is the following approach correct? If not, how can I compute the effect size?
Thank you very much!
Leo



> library(vcd)
> assocstats(matrix(c(51480, 44640, 1010910, 1017190), ncol=2, byrow=T))
                    X^2 df P(> X^2)
Likelihood Ratio 506.45  1        0
Pearson          506.04  1        0

Phi-Coefficient   : 0.015         # the effect size is very small
Contingency Coeff.: 0.015
Cramer's V        : 0.015
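
The phi coefficient in this output can be cross-checked in base R, as a sketch: for a 2x2 table, phi is the square root of the Pearson chi-squared statistic divided by the table total. The names `tab`, `x2`, and `phi` are illustrative only.

```r
# Same 2x2 table as passed to assocstats() above
tab <- matrix(c(51480, 44640, 1010910, 1017190), ncol = 2, byrow = TRUE)

# Pearson chi-squared without continuity correction
x2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# phi = sqrt(chi-squared / N), where N is the sum of all four cells
phi <- sqrt(x2 / sum(tab))
round(phi, 3)
```

The rounded value should agree with the Phi-Coefficient that `assocstats()` reports for the same table.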

Stefan Th. Gries

May 6, 2013, 12:38:49 AM
to statforli...@googlegroups.com
that works, yes
STG


Leo

May 6, 2013, 12:40:52 AM
to statforli...@googlegroups.com

Thank you for your immediate reply!