chi-squared for keyness

Warren Tang

May 22, 2011, 10:26:17 PM
to AntConc-discussion
Hi Laurence,
A question about chi-squared: what settings (parameters) does AntConc
use for the chi-squared test?

Computing chi-squared manually in R, I got results that differ from the
numbers AntConc reports, and I was unable to find the parameters in the
"readme".

I tried chi-squared in R both with and without the correction, but the
numbers are still slightly different from the AntConc output.

Thanking you in advance.


Warren

Stefan Th. Gries

May 23, 2011, 8:11:58 AM
to AntConc-discussion
For what it's worth, I think a chi-square test is not the best test to
be used for key-words in general: given its great sensitivity to low
expected frequencies, it's gonna inflate the values for low-frequency
items. Why not use Damerau's relative frequency ratio (discussed in
Manning and Schütze and in
<http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_UsefulStats4CorpLing_MosaicCorpLing.pdf>)?
It also avoids the perennial critique of significance tests
(because it isn't one).
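
To make the low-frequency point concrete, here is a minimal Perl sketch. It assumes a two-cell chi-squared (observed vs. expected count of the word in each corpus) as the keyness score. The counts and corpus sizes are hypothetical, chosen only for illustration, and chi_squared_two_cell is an illustrative name, not an AntConc function.

```perl
use strict;
use warnings;

# Two-cell chi-squared: observed vs. expected count of one word in each corpus.
# Returns the score and the expected count in the target corpus.
sub chi_squared_two_cell {
    my ($target_freq, $target_size, $ref_freq, $ref_size) = @_;
    my $total_freq = $target_freq + $ref_freq;
    my $total_size = $target_size + $ref_size;
    my $target_expected = $target_size * $total_freq / $total_size;
    my $ref_expected    = $ref_size * $total_freq / $total_size;
    my $score = ($target_freq - $target_expected)**2 / $target_expected
              + ($ref_freq - $ref_expected)**2 / $ref_expected;
    return ($score, $target_expected);
}

# Hypothetical counts (target hits, reference hits) for a rare and a frequent
# word, in corpora of 263 and 8870 tokens (sizes are assumptions as well).
my @cases = ( [ 'rare', 2, 1 ], [ 'frequent', 23, 551 ] );
for my $case (@cases) {
    my ($label, $target_hits, $ref_hits) = @$case;
    my ($score, $expected) =
        chi_squared_two_cell($target_hits, 263, $ref_hits, 8870);
    printf "%-8s expected=%.3f chi-squared=%.2f\n", $label, $expected, $score;
}
```

With these made-up counts, the word seen only three times in total has an expected target frequency well below 1 and still ends up with a far larger score than the frequent word.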

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Laurence Anthony

May 23, 2011, 2:38:07 PM
to ant...@googlegroups.com
Hi Warren,

Sorry for the delay. I thought I had documented the statistics
measures in the Readme file. I will add them later.

For your reference, the Chi-Squared measure I use is the standard "Chi
Square Goodness of Fit" measure.

I couldn't find an easy to use online calculator for you to test, but
perhaps the site below can help:
http://math.hws.edu/javamath/ryan/ChiSquare.html

The one value that is not explained well is the expected value.
However, an explanation of this can be found here:
http://ucrel.lancs.ac.uk/llwizard.html

Note that this site can also be used to replicate the AntConc log
likelihood results for keyness.

For your reference, here is the Perl code I use (with some demo values):

use strict;
use warnings;

my $wordlist_word_freq   = 23;
my $wordlist_total_words = 263;

my $ref_corpus_word_freq   = 551;
my $ref_corpus_total_words = 8870;

my $chi_score;

my $total_word_freq = $wordlist_word_freq + $ref_corpus_word_freq;
my $wordlist_and_ref_corpus_total_words =
    $wordlist_total_words + $ref_corpus_total_words;

# calc expected values of each word in target files
my $wordlist_word_expected_value =
    ( $wordlist_total_words * $total_word_freq ) /
    $wordlist_and_ref_corpus_total_words;
print "target_expected_value:$wordlist_word_expected_value\n";

my $ref_corpus_word_expected_value =
    ( $ref_corpus_total_words * $total_word_freq ) /
    $wordlist_and_ref_corpus_total_words;
print "reference_expected_value:$ref_corpus_word_expected_value\n";

# calc chi-squared
$chi_score =
    ( ( ( $wordlist_word_freq - $wordlist_word_expected_value )**2 ) /
        $wordlist_word_expected_value ) +
    ( ( ( $ref_corpus_word_freq - $ref_corpus_word_expected_value )**2 ) /
        $ref_corpus_word_expected_value );

print "chi-squared:$chi_score\n";
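
One possible source of the mismatch with R, sketched under an assumption the thread does not confirm: that the R figures came from chisq.test() on a 2x2 table (word vs. all other words, target vs. reference). That test sums over four cells, and for 2x2 tables R applies Yates' continuity correction by default, whereas the code above sums over the two word-frequency cells only. chi_squared_2x2 is a hypothetical helper written for this comparison; it is not part of AntConc.

```perl
use strict;
use warnings;

# Pearson chi-squared over all four cells of the 2x2 table
# (word / other words) x (target / reference).
# A true $yates applies the continuity correction: |O - E| is reduced
# by 0.5, but never below zero.
sub chi_squared_2x2 {
    my ($target_freq, $target_size, $ref_freq, $ref_size, $yates) = @_;
    my $total_size = $target_size + $ref_size;
    my $word_total = $target_freq + $ref_freq;
    my @row_totals = ( $word_total, $total_size - $word_total );
    my @col_totals = ( $target_size, $ref_size );
    my @observed   = (
        [ $target_freq,                $ref_freq ],
        [ $target_size - $target_freq, $ref_size - $ref_freq ],
    );
    my $score = 0;
    for my $i ( 0, 1 ) {
        for my $j ( 0, 1 ) {
            my $expected = $row_totals[$i] * $col_totals[$j] / $total_size;
            my $diff     = abs( $observed[$i][$j] - $expected );
            if ($yates) {
                $diff = $diff > 0.5 ? $diff - 0.5 : 0;
            }
            $score += $diff**2 / $expected;
        }
    }
    return $score;
}

# Same demo values as above.
printf "four cells, uncorrected: %.4f\n",
    chi_squared_2x2( 23, 263, 551, 8870, 0 );
printf "four cells, Yates:       %.4f\n",
    chi_squared_2x2( 23, 263, 551, 8870, 1 );
```

Both four-cell values differ from the two-cell chi-squared printed above, so neither the corrected nor the uncorrected R result would be expected to match it.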

I hope that helps!

Laurence.

Laurence Anthony

May 24, 2011, 7:39:21 AM
to ant...@googlegroups.com
Hi Stefan,

> For what it's worth, I think a chi-square test is not the best test to
> be used for key-words in general: given its great sensitivity to low
> expected frequencies, it's gonna inflate the values for low-frequency
> items. Why not use Damerau's relative frequency ratio (discussed in
> Manning and Schütze and in
> <http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_UsefulStats4CorpLing_MosaicCorpLing.pdf>),
> which also avoids the perennial critique of significance tests
> (because it isn't one).
>

The relative frequency ratio would be very easy to implement and, as
you say, would avoid the need for a significance test. How would you
recommend we deal with zero frequency items in the reference corpus?
In your paper, you mention the Good-Turing estimation, but Damerau
(1993) simply follows Church and Hanks (1989) and avoids calculating
the relative frequency for any item pairs with a freq < 5.
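
For illustration, the cut-off approach could be sketched as below. This is only a sketch: relative_frequency_ratio and its $min_freq parameter are hypothetical names, the ratio is taken as target relative frequency divided by reference relative frequency, and skipping a word when either count falls below the threshold is just one reading of the freq < 5 rule.

```perl
use strict;
use warnings;

# Relative frequency ratio: (target_freq / target_size) / (ref_freq / ref_size).
# Returns undef (word skipped) when either count is below $min_freq, which
# also covers words with zero frequency in the reference corpus.
sub relative_frequency_ratio {
    my ($target_freq, $target_size, $ref_freq, $ref_size, $min_freq) = @_;
    $min_freq = 5 unless defined $min_freq;
    return if $target_freq < $min_freq || $ref_freq < $min_freq;
    return ( $target_freq / $target_size ) / ( $ref_freq / $ref_size );
}

# First pair: the demo values from the chi-squared code.
# Second pair: a hypothetical word that never occurs in the reference corpus.
for my $pair ( [ 23, 551 ], [ 7, 0 ] ) {
    my ($target_hits, $ref_hits) = @$pair;
    my $ratio = relative_frequency_ratio( $target_hits, 263, $ref_hits, 8870 );
    printf "target=%d reference=%d ratio=%s\n", $target_hits, $ref_hits,
        defined $ratio ? sprintf( '%.3f', $ratio ) : 'skipped';
}
```

The cut-off sidesteps division by zero, at the cost of silently dropping words that occur only in the target corpus.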

Laurence.
