Sorry for the delay. I thought I had documented the statistical
measures in the Readme file. I will add them later.
For your reference, the Chi-Squared measure I use is the standard "Chi
Square Goodness of Fit" measure.
I couldn't find an easy-to-use online calculator for you to test with, but
perhaps the site below can help:
http://math.hws.edu/javamath/ryan/ChiSquare.html
The one value that is not explained well is the expected value.
However, an explanation of this can be found here:
http://ucrel.lancs.ac.uk/llwizard.html
Note that this site can also be used to replicate the AntConc log
likelihood results for keyness.
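In case it helps, the expected value and the chi-squared score can be written out as below. The notation is my own shorthand and is not taken from either site: O_1 and O_2 are the observed frequencies of the word in the target word list and the reference corpus, and N_1 and N_2 are the total word counts of each.

```latex
E_i = \frac{N_i \,(O_1 + O_2)}{N_1 + N_2}, \qquad
\chi^2 = \sum_{i=1}^{2} \frac{(O_i - E_i)^2}{E_i}
```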
For your reference, here is the Perl code I use (with some demo values):
use strict;
use warnings;

# Demo values: frequency of the word and total word count in each corpus
my $wordlist_word_freq     = 23;
my $wordlist_total_words   = 263;
my $ref_corpus_word_freq   = 551;
my $ref_corpus_total_words = 8870;

# Combined frequency of the word and combined size of both corpora
my $total_word_freq = $wordlist_word_freq + $ref_corpus_word_freq;
my $wordlist_and_ref_corpus_total_words =
    $wordlist_total_words + $ref_corpus_total_words;

# Expected frequency of the word in the target files
my $wordlist_word_expected_value =
    ( $wordlist_total_words * $total_word_freq ) /
    $wordlist_and_ref_corpus_total_words;
print "target_expected_value:$wordlist_word_expected_value\n";

# Expected frequency of the word in the reference corpus
my $ref_corpus_word_expected_value =
    ( $ref_corpus_total_words * $total_word_freq ) /
    $wordlist_and_ref_corpus_total_words;
print "reference_expected_value:$ref_corpus_word_expected_value\n";

# Chi-squared: sum of (observed - expected)^2 / expected over both corpora
my $chi_score =
    ( ( ( $wordlist_word_freq - $wordlist_word_expected_value )**2 ) /
        $wordlist_word_expected_value ) +
    ( ( ( $ref_corpus_word_freq - $ref_corpus_word_expected_value )**2 ) /
        $ref_corpus_word_expected_value );
print "chi-squared:$chi_score\n";
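The same expected values also feed the log likelihood measure mentioned above. What follows is only a minimal sketch, assuming the two-cell formula LL = 2 * (a * ln(a/E1) + b * ln(b/E2)) as described on the UCREL page. The sub name log_likelihood and its argument order are made up for illustration and are not the actual AntConc code.

```perl
use strict;
use warnings;

# Hypothetical helper (not AntConc's own code): two-cell log likelihood
# for one word, given its frequency and the total word count in the
# target word list (a) and in the reference corpus (b).
sub log_likelihood {
    my ( $freq_a, $total_a, $freq_b, $total_b ) = @_;

    my $total_freq  = $freq_a + $freq_b;
    my $total_words = $total_a + $total_b;

    # Same expected values as in the chi-squared calculation
    my $expected_a = ( $total_a * $total_freq ) / $total_words;
    my $expected_b = ( $total_b * $total_freq ) / $total_words;

    # A zero observed frequency contributes nothing (0 * ln 0 taken as 0)
    my $ll = 0;
    $ll += $freq_a * log( $freq_a / $expected_a ) if $freq_a > 0;
    $ll += $freq_b * log( $freq_b / $expected_b ) if $freq_b > 0;

    return 2 * $ll;
}

# Same demo values as in the chi-squared code
printf "log-likelihood:%.4f\n", log_likelihood( 23, 263, 551, 8870 );
```

Perl's log is the natural logarithm, which is what the formula needs. The guard on zero frequencies is my own assumption about how to handle empty cells.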
I hope that helps!
Laurence.
> For what it's worth, I think a chi-square test is not the best test to
> be used for key-words in general: given its great sensitivity to low
> expected frequencies, it's gonna inflate the values for low-frequency
> items. Why not use Damerau's relative frequency ratio (discussed in
> Manning and Schütze and in
> <http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_UsefulStats4CorpLing_MosaicCorpLing.pdf>),
> which also avoids the perennial critique of significance tests
> (because it isn't one).
>
The relative frequency ratio would be very easy to implement and, as
you say, would avoid the need for a significance test. How would you
recommend we deal with zero frequency items in the reference corpus?
In your paper, you mention the Good-Turing estimation, but Damerau
(1993) simply follows Church and Hanks (1989) and skips the
calculation for any item pair with a frequency below 5.
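To make the question concrete, here is a rough sketch of the ratio with the cut-off approach. It is not an existing AntConc feature. The sub name relative_frequency_ratio is hypothetical, and applying the cut-off of 5 to both the target and the reference frequency is my assumption about how the threshold would be used.

```perl
use strict;
use warnings;

# Hypothetical helper: relative frequency ratio of a word in the target
# word list (a) versus the reference corpus (b). Returns undef when
# either raw frequency is below $min_freq (assumed default of 5), so
# zero frequency items in the reference corpus are simply skipped.
sub relative_frequency_ratio {
    my ( $freq_a, $total_a, $freq_b, $total_b, $min_freq ) = @_;
    $min_freq = 5 unless defined $min_freq;

    return if $freq_a < $min_freq || $freq_b < $min_freq;

    my $rel_a = $freq_a / $total_a;    # relative frequency in target
    my $rel_b = $freq_b / $total_b;    # relative frequency in reference
    return $rel_a / $rel_b;
}

# Same demo values as in the chi-squared code
my $ratio = relative_frequency_ratio( 23, 263, 551, 8870 );
print "relative_frequency_ratio:",
    ( defined $ratio ? $ratio : "skipped" ), "\n";
```

A ratio above 1 means the word is relatively more frequent in the target word list. The drawback of the cut-off is that words absent from the reference corpus, which may be the most distinctive ones, never get a score.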
Laurence.