Howdy Folks,
I'm curious about criteria for setting minimum frequencies in lexical analysis. Beyond rules of thumb or intuition, is there empirical work that would help guide decisions about setting minimum frequencies? Thanks!
--
You received this message because you are subscribed to the Google Groups "AntConc-discussion" group.
--
1. For N-grams/clusters/lexical bundles: Biber et al. (2004) adopt a "conservative approach" with 40 occurrences per million words as a frequency threshold.
2. Log-likelihood: 6.635 as "interesting," with a minimum of 5 occurrences as a meaningful pattern (Hardy, 2007, p. 98).
3. In keyword analysis, the reference corpus should be at least 5x the size of the target corpus (Berber-Sardinha, 2000).
4. For collocate testing, one study supported a frequency threshold of 20 and an LL score threshold of 10.83 (the smallest corpus in the study was 18 million words) (Diwersy, 2014).
5. For small, specialized corpora (e.g. under 250k words), size is less important than the design criteria (e.g. purpose, source context, genre & register), so relatively small corpora (e.g. 25k-50k words) can still produce valid results. Situational representativeness is the most critical factor, and it requires judgment on the part of the researcher (Koester, 2010). For the most common words, register and genre patterns are stable across samples as small as 1,000 words drawn from 5-10 sources (Biber, 1990).
6. MI and LL work well given data sparseness and are accurate for smaller corpora (e.g. 77k words), but compared to other measures (e.g. MI3, log-Dice, minimum sensitivity) they perform poorly on large corpora (e.g. 50 million words) (Alrabiah et al., 2014).
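As a rough sketch of how a couple of these thresholds could be applied in practice: the snippet below uses the standard two-corpus log-likelihood calculation (expected frequencies from the combined corpora) together with Hardy's cut-offs (LL >= 6.635, minimum 5 occurrences) and the per-million normalization behind Biber et al.'s 40-per-million bundle threshold. Function names and the example figures are my own, purely illustrative, not taken from the papers cited above.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-corpus log-likelihood: observed frequency of a word in a
    target corpus vs. a reference corpus, with expected frequencies
    taken from the two corpora combined."""
    total = size_target + size_ref
    expected_target = size_target * (freq_target + freq_ref) / total
    expected_ref = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

def per_million(freq, corpus_size):
    """Normalized frequency, e.g. for Biber et al.'s 40-per-million
    lexical-bundle threshold."""
    return freq * 1_000_000 / corpus_size

def is_interesting(freq_target, size_target, freq_ref, size_ref,
                   ll_cutoff=6.635, min_freq=5):
    """Hardy's (2007) thresholds: LL of at least 6.635 AND at least
    5 occurrences in the target corpus."""
    return (freq_target >= min_freq and
            log_likelihood(freq_target, size_target,
                           freq_ref, size_ref) >= ll_cutoff)
```

For example, a word occurring 100 times in a 100k-word target corpus but only 100 times in a 500k-word reference corpus clears both cut-offs, while a word occurring in exactly the same proportion in both corpora yields LL = 0.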
Alrabiah, Maha, et al. "An empirical study on the Holy Quran based on a large classical Arabic corpus." International Journal of Computational Linguistics (IJCL) 5.1 (2014): 1-13.
Berber-Sardinha, Tony. "Comparing corpora with WordSmith Tools: How large must the reference corpus be?" Proceedings of the Workshop on Comparing Corpora, Volume 9, pp. 7-13. Association for Computational Linguistics, 2000.
Biber, Douglas, Susan Conrad, and Viviana Cortes. "If you look at...: Lexical bundles in university teaching and textbooks." Applied Linguistics 25.3 (2004): 371-405.
Hardy, Donald E. The Body in Flannery O'Connor's Fiction: Computational Technique and Linguistic Voice. Univ of South Carolina Press, 2007.
Koester, Almut. "Building a Small Specialized Corpus." In O'Keeffe, Anne, and Michael McCarthy, eds. The Routledge Handbook of Corpus Linguistics. Routledge, 2010.
Diwersy, Sascha. "The Varitext platform and the Corpus des variétés nationales du français (CoVaNa-FR) as resources for the study of French from a pluricentric perspective." Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 48-57, Dublin, Ireland, August 2014.