Hi all,
My question may seem naive, but I would really appreciate your help.
I have two corpora. Their sizes are
1,010,910 words and
1,017,190 words respectively.
With the help of TreeTagger, I have computed the total number of lemmas in each corpus. The totals are
51,480 and
44,640 respectively.
Now I want to compare these totals to see whether the two corpora differ in their number of lemmas.
Is it acceptable to use the chi-square test to compare the two totals?
If so, here is what I have tried. It does not take the sizes of the corpora into account, so it may not be correct:
> chisq.test(c(51480, 44640), p = c(0.5, 0.5), correct = FALSE)
Chi-squared test for given probabilities
data: c(51480, 44640)
X-squared = 486.7416, df = 1, p-value < 2.2e-16
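For comparison, a variant that does take the corpus sizes into account might look like the sketch below. It assumes that, if the corpora did not differ, each corpus's share of the lemmas would be proportional to its word count. The variable names (sizes, lemmas, expected_p, per_1000) are only for illustration, and whether a chi-square test suits lemma totals at all is still my open question.

```r
# Corpus sizes (in words) and lemma totals, as given above
sizes  <- c(1010910, 1017190)
lemmas <- c(51480, 44640)

# Assumption: under the null hypothesis, each corpus's share of the
# lemmas is proportional to its share of the words
expected_p <- sizes / sum(sizes)

# Goodness-of-fit test against the size-adjusted proportions
res <- chisq.test(lemmas, p = expected_p)
print(res)

# Lemmas per 1,000 words, as a simple descriptive comparison
per_1000 <- lemmas / sizes * 1000
print(round(per_1000, 2))
```

Because the two corpora are almost the same size, expected_p is close to c(0.5, 0.5), so I would expect the conclusion to be similar to the one above.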
Thank you very much.
Leo