Tim, hi.
You are right about flukes. That is why the default significance
threshold set in the software is conservatively put at 1 in a million.
If you have a million (or half a million, etc.) types to compare you
will certainly expect some false hits, as you say. In usual practice
one might have only, say, 1,000 or 2,000 types being compared against
a reference set.
You might have 750 types in a text of 1,600 words and 1,500 types in a
text of just under 5,000 words.
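To put rough numbers on that, here is a minimal sketch in Python. It
assumes every type is tested separately and that none of them genuinely
differs from the reference set, so every "hit" is a fluke. The function
name is my own illustration, not anything from the software.

```python
def expected_false_hits(n_types: int, p_threshold: float) -> float:
    """Expected number of chance hits if no type truly differs.

    Assumes one test per type; this is simple arithmetic, not the
    software's own procedure.
    """
    return n_types * p_threshold


# Compare a conventional 5% threshold with 1 in a million.
for n_types in (1_000, 2_000, 1_000_000):
    for p in (0.05, 1e-6):
        flukes = expected_false_hits(n_types, p)
        print(f"{n_types:>9,} types at p={p:g}: {flukes:g} expected flukes")
```

With a couple of thousand types the expected number of flukes at 1 in
a million is a small fraction of one word; it only approaches a whole
word when the number of types gets close to a million.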
Another reason for caution is the assumption that words occur at
random: of course they do not, otherwise why would we do linguistics?
Yet statistical procedures have always been based on assumptions such
as that the phenomena investigated are normally distributed.
Third, the KW procedure does not claim the whole set of KWs (13 in
eldivenci's study) is statistically believable, only that each one
is.
Fourth, think about the fact that words do cluster non-randomly in
text.
In 1997 I wrote this ("PC Analysis of Key Words and Key Key Words",
System, Vol. 25, No. 2, p. 243):
...The misgivings have to do with the skewed nature of types in a
corpus and the very high incidence of singleton items. If there were
10 occurrences of "beetroot" in a 1000 word text on gardening, and
also 10 occurrences of "beetroot" in a 1 000 000 word corpus of
general texts, then that item would be 1000 times more frequent than
expected on a chance basis and chi-square would be a reasonable way of
saying that the difference is believable. If the occurrences were
1/1000 and 1/1 000 000, respectively, the same logic applies but the
confidence in results differs, because 1/1 000 000 suggests the item
is very rare and very rare items will not be spread around all
possible corpora very very thinly (at a uniform rate of one per
million words), but will crop up occasionally in relation to some sort
of topicality or stylistic factor.
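The beetroot arithmetic can be sketched in Python as follows. This is
a plain Pearson chi-square on a 2x2 table with no continuity
correction; I am not claiming it is the exact calculation the software
performs, and the function names are made up for the illustration.

```python
def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Pearson chi-square for a word's counts in two texts (no correction)."""
    a, b = hits_a, size_a - hits_a   # text A: the word, all other words
    c, d = hits_b, size_b - hits_b   # text B: the word, all other words
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def frequency_ratio(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """How many times more frequent the word is in text A than in text B."""
    return (hits_a / size_a) / (hits_b / size_b)


# (hits in small text, small text size, hits in corpus, corpus size)
cases = {
    "10 vs 10 occurrences": (10, 1_000, 10, 1_000_000),
    "1 vs 1 occurrence": (1, 1_000, 1, 1_000_000),
}
for label, counts in cases.items():
    print(label,
          "ratio:", round(frequency_ratio(*counts)),
          "chi-square:", round(chi_square_2x2(*counts), 1))
```

In both cases the word is 1000 times more frequent in the small text,
but the chi-square value is much lower when there is only one
occurrence on each side, which is the difference in confidence the
passage describes.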
> correlated? If you do a hundred tests then by fluke alone you should
> expect 5 results that are statistically significant to 95%. If you do
> thousands upon thousands of tests then the flukes will be all over the
> place. All of them statistically significant. All of them potential
> new superstitions.
Finally, please can you explain about the *sample*, Tim? Your story
about the Israeli pilots does imply one must be cautious but how are
you to know which words to take for the sample? And how will you avoid
your own biases in choosing? If eldivenci wants to study his two sets
of 250-word reports, are you concluding that he should sample some of
the words or some of the reports? How will he decide which? If he goes
for educationally-loaded words, ones his theories suggested are likely
to relate to learning, he'd be doing some sort of corpus-based study
but he'd find it hard to be sure his choices weren't biased. If he
chose random words, will he find anything out that interests him? My
suggestion is for him to use all the words in all the reports he has,
and let the software suggest some pointers, as I said earlier.
Cheers -- Mike