Criteria for Minimum Frequency Threshold


Bill Marcellino

Jan 28, 2015, 3:55:54 PM
to ant...@googlegroups.com
Howdy Folks,

  I'm curious about criteria for setting minimum frequencies in lexical analysis.  Beyond rules of thumb or intuition, is there empirical work that would help guide decisions about setting minimum frequencies?  Thanks!

JFlorian

Jan 28, 2015, 4:57:24 PM
to ant...@googlegroups.com
Bill,


If you Google "lexical" + "criteria for minimum frequency threshold", there are other texts too.

Judy


Bill Marcellino

Jan 28, 2015, 6:49:03 PM
to ant...@googlegroups.com
Hi Judy,

  Thanks so much for replying.  I did indeed Google just that :)  I didn't find anything, though, that laid out criteria, i.e. a rationale for a certain threshold--it was either "you can" or "we did" mentions.  I went back and took another look at the article you linked, which doesn't explicitly make a claim about rationale--in section 5 they describe three different minimums: none, 1/100,000, and 10/100,000.  How do you interpret that section, or what practical guidance would you pull from it?  The middle criterion of 10/100,000 produced the highest matches between LL and bootstrap tests--is that good?

Any insight would be welcome!

  -Bill

JFlorian

Jan 28, 2015, 9:45:42 PM
to ant...@googlegroups.com
Hi Bill,

Wait for the Professor to answer because I'm just a learner.

However, from what I'm reading, I will hazard two guesses: thresholds are set individually based on the researcher's goals, and by using tools which I don't understand (e.g. chi-squared or log-likelihood statistical measures).  Now, we'll see if I'm right when Prof. Anthony replies to you.

Here's one from Dr Anthony:
Here is one item that talks about bundles:

This one is a download--see if you can Google for etd-0819110-005045.pdf, which is:

CORPUS-BASED LEXICAL ANALYSIS OF

by C.T. Sun (2010)

He seems to say the literature uses inconsistent criteria based on research design and goals.

See what Laurence Anthony tells you...
Judy

Laurence Anthony

Jan 28, 2015, 9:58:48 PM
to ant...@googlegroups.com
I was enjoying watching the discussion develop, but now that Judy has mentioned me by name, I'll add a few comments.

First, Bill, what are you referring to when you say "lexical analysis"? Are you referring to KWIC analysis, collocates, keywords, or some other analysis? Each of these would probably produce a different answer to your question.

Also, have you read any corpus-based statistics books or papers? This is a common theme and you will find that the answers quickly get complicated.

The question is actually probably beyond the scope of the AntConc discussion group, as it's a general corpus linguistics question. But I'll see if I can answer it anyway.

Laurence.


JFlorian

Jan 28, 2015, 10:07:12 PM
to ant...@googlegroups.com
Laurence Anthony wrote: I was enjoying watching the discussion develop...

----
I'm smiling.... yes, I can imagine it's sometimes hard being the #1 go-to person for answers.

I enjoy learning from the list and seeing how much I recall from last year's class, but I didn't want to lead Bill to the wrong answers.    Will look forward to how this develops.

Judy

Bill Marcellino

Jan 29, 2015, 2:51:18 PM
to ant...@googlegroups.com
1. Judy, thanks again for taking the time to respond--I very much appreciate it.

2. Laurence, thanks also for responding.  When I say lexical analysis I mean analysis at the textual level of lexis, as distinct, for example, from lexicogrammatical analysis at the level of style.  So, for example, genres are distinct at the lexical level if we look at KW testing, but also distinct, with a completely different set of features, at the lexicogrammatical level if we analyze style, and yet again distinct in features at the human reading level of textual themes (http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf).  From that perspective, any analysis of lexis--e.g. KW and collocate measures--I think of as lexical analysis.  I may be coming from a slightly different disciplinary background, so perhaps I'm using that term inelegantly here.

3. The primary textbooks in corpus linguistics I've learned from are Biber, Conrad and Reppen's Corpus Linguistics, and McEnery, Xiao & Tono's Corpus-Based Language Studies.  I'm more familiar with lexicogrammatical analysis (e.g. http://dis.sagepub.com/content/16/3/385.refs) and with difference-testing criteria--bootstrap KS, ANOVA, and EFA testing.

4. Your point that different kinds of tests would have different thresholds is well taken.  The primary two tests I want to do are for KW and collocates, to try and make "aboutness" claims.  I appreciate that this is indeed more of a CL question than an AntConc question--if you do have some thoughts, I'd be grateful.  Thanks,

  -Bill

JFlorian

Jan 30, 2015, 8:41:04 AM
to ant...@googlegroups.com, wmmarc...@gmail.com
Bill,

Like I said, I'm a learner and I haven't yet done a study, so I might not give you the right info.

I think I remember... KW frequencies or "keyness" use the chi-squared or log-likelihood statistical measures.  But that's where I got lost/didn't understand what the instructors were talking about.

I looked back through the notes from a FutureLearn class (www.futurelearn.com) I took that included AntConc.  You could see if they are running the course again--it's free.  The notes below might be too basic for you, but perhaps you'll see something that helps you.

Here is one article:
Martin Wynne (University of Oxford), "Searching and Concordancing."  Pre-publication draft, to appear in Handbook of Corpus Linguistics, edited by Merja Kytö & Anke Lüdeling, Mouton de Gruyter, 2007.

The notes and class comments I saved about collocates from Week 2 mix instructors and students.  The instructors' 'voices' come across quite differently; "Tony", mentioned in one of the comments, was an instructor from the UK.  I removed the students' last names.

Note the instructor used a min. freq of 10.  But if I recall correctly, researchers determine the methods/criteria they will use (sentence boundaries, spans, frequencies, and measures) based on the data they have, what they hope to prove (while being open to finding a lack of proof), and a solid understanding of the methods and math, so they can see when either one might accidentally misconstrue the findings.

Collocates- "the company they keep"
=================
for collocates, collocation
Span set at +-5 words
Min. Freq 10
Option: sentence boundaries or not

example
keeps company
company keeps
==
function, grammatical words - high on list
---
manipulate frequencies

1. Mutual information - how closely words are associated, how powerfully associated.

By looking at how often these words occur in the context of one another, relative to how frequently they occur without one another, we can start to generate a statistic that shows us how closely these words might be associated, and also which words are more powerfully associated with the word that we're interested in.

The higher the mutual information score, the more affinity these words have for one another.  Mutual information, however, is known to not be very accurate if you're dealing with low-frequency data.  It can overstate the importance of an association between two words, for example.

2. Dice coefficient - if we use something called the Dice coefficient, again, we can rank these words in terms of their strength of association, but that has different pluses and minuses than mutual information. So we can use different scores.

The Dice coefficient has been reported not to overestimate the collocational strength of low-frequency items the way MI does.

Comments:
Very revealing as to which measure produces the best results. 

**** But what is considered the 'best' measure is up to the analyst and depends on their purpose for creating the collocate list: http://nlp.fi.muni.cz/raslan/2008/papers/13.pdf
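To make the contrast in these notes concrete, here is a small sketch of the two measures in Python (my own illustration, not from the course; the MI formula here is the simple pointwise version, ignoring window size, and all inputs are raw corpus counts):

```python
import math

def mi(pair_freq, node_freq, coll_freq, corpus_size):
    # Pointwise mutual information: log2(observed / expected co-occurrence),
    # where expected assumes the two words are independent.
    expected = node_freq * coll_freq / corpus_size
    return math.log2(pair_freq / expected)

def dice(pair_freq, node_freq, coll_freq):
    # Dice coefficient: 2*f(x,y) / (f(x) + f(y)).
    # Bounded between 0 and 1, so it cannot blow up for rare pairs the way MI can.
    return 2 * pair_freq / (node_freq + coll_freq)

# A hapax pair (each word occurs once, always together) gets a huge MI score...
print(round(mi(1, 1, 1, 1_000_000), 2))        # 19.93
# ...while a genuinely strong, frequent pair scores far lower:
print(round(mi(100, 200, 200, 1_000_000), 2))  # 11.29
# Dice stays on its 0-1 scale for both:
print(dice(1, 1, 1), dice(100, 200, 200))      # 1.0 0.5
```

This is exactly the low-frequency inflation the instructor warns about: the one-off pair outranks the well-attested one on MI.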


PART 2
Part 2: collocation, colligation and related features  (Comments)

Akasha (removed)
As Keith said, both semantic prosody and discourse prosody describe the way in which certain seemingly neutral words can be perceived with positive or negative associations through frequent occurrence with particular collocations.

Sean (removed)
Great lecture. Tony talks about 'discourse prosody' and fellow participants and mentors below refer to 'semantic prosody'. I think the two terms imply slightly different concepts which are worth differentiating. I think the former implies an evaluative relation (as in 'cause + problems'), whereas the latter implies a simple meaning relation (as in 'glass of + drinkable liquid'). 
I'd be interested to hear what others think.
Keith (removed)
Hello Sean, here is my take on it. Semantic prosody and discourse prosody are the same thing. But some people prefer the term discourse prosody because it refers to issues of pragmatics: how a text is 'loaded' with more than what the pure linguistic material shows. In this way discourse prosody is like discourse analysis, looking at how the text actually 'comes across' to the reader/listener. This explains why the word 'discourse' is preferred by some rather than 'semantic', for what is essentially a pragmatic concern.

Discourse (semantic) prosody can be considered a part of the wider semantic preference of the keyword. This means that the list of collocates and colligations often reveal a semantic preference (i.e. 'bad things' in the case of 'cause') and then looking at this semantic preference(s) can reveal deeper discourse prosody (evaluation) in the collocates (e.g. A negative prosody relating to the bad things revealed in the semantic preference of the keyword 'cause').

My research is moving down this fascinating route of semantic preference/prosody so I am just getting to grips with the terms myself. What do you think of my take on it?
Beth McCarthy (Mentor)
Hi Sean. I think you're conflating (what is generally thought of as) 'semantic prosody' with (what is generally thought of as) 'semantic preference'. Different researchers use the terms in different ways, but generally speaking, 'discourse prosody', you're right, is when patterns in discourse can be found between a word, phrase or lemma and a set of related words which suggest a discourse: the way that words in a corpus can collocate with a related set of words or phrases, often revealing (hidden) attitudes.

'Semantic preference' is between a lemma/word-form and a set of semantically related words. 'Semantic prosody', on the other hand, is a term which has been used by researchers in ways which make it seem synonymous with discourse prosody, for instance Cotterill's famous analysis of the language used in the O.J. Simpson trial (2001).

Beth McCarthy (Mentor): thanks for explaining the reasoning behind the distinction between 'semantic' and 'discourse'. I hadn't come across this before and found it very enlightening.
Václav (removed)
I personally perceive a difference between discourse and semantics. Semantic preference may be associated with a kind of logical relationship between words. Discourse is a higher level, connected with culture. We might (semantically) connect causes with pleasant happenings as strongly as we do with unpleasant ones. But there is a cultural (or psychological) tendency to ascribe negative effects to outer causes and positive effects to inner causes, which may be demonstrated at the level of discourse. This is very interesting, and there may be several explanations for why it is so. But maybe I see it wrongly.
Uschi Maden-Weinberger (Mentor)
I think that's a good point - it's not that our intuition is "usually" wrong, and if the question is asked in that way, i.e. "do you think cause collocates more frequently with positive or negative contexts", a native speaker's intuition would probably be right. However, two points here: only corpus linguistics can provide EMPIRICAL evidence to reveal semantic prosody patterns, and, secondly, this type of question about negative/neutral/positive semantic prosody only actually started being asked because corpus linguists noticed these patterns when dealing with collocations.

pros·o·dy (noun)
1. the patterns of rhythm and sound used in poetry.

==
Where there's this strong affinity between a word and a particular grammatical class, for example, we often call that colligation, and distinguish it from collocation, which is associated more directly with meaning rather than grammar.

Sometimes the types of words a word is associated with might, you could say, characterise a speaker's attitude in some sense.  Discourse prosodies express speaker attitude. And you can see how this would be important for discourse analysis, for example, as we'll see later in the course.


The word "cause" has this negative discourse prosody, as in "cause an accident".

Stronger modal verbs, "shall", "must", "should". Why are they declining?

We are less comfortable with imposing strong obligations on people. Hence, these words are declining in frequency: more key in the past, less key in the present.  And also, longer forms are contracting. Could this be a further example of what we've seen in language, and certainly in English: this desire to squeeze as much as you can into as short a space as possible?

Wh- words--who, what, whether, where--the words which typically introduce questions. Body parts, also interestingly, get mentioned just as much now as they did in the past. And some other nouns: "money" is an example I've given already, but you've also got "life", "world", and "government".

Writing numbers in numeric form rather than spelling them out fully: again, this links through to the densification hypothesis. But, interestingly, social terms are actually mentioned more frequently--they're more key, if you like--in the present than they were in the past. Words like "family", "children", "child", "people", "social", "health", and "help".

stable keywords and emerging keywords
moral panics around a word?
social reasons?

 Sorry - should have been clearer. The word 'half' is in decline while putting the fraction or 0.5 is rising. I should probably change the slide next time around, for consistency's sake.

'Decoding argumentation strategies': does that refer to understanding rhetoric, and how speakers or writers try to use language to their own purposes?
Uschi Maden-Weinberger (Mentor)
yes, in a nutshell :-)
----------------
You looked explicitly at all layers of the research: RQ, data collection and data, method, findings, applicability & impact... a very useful reminder for anyone interested in doing research.
===end FutureLearn Class notes=========


I hope something in this gives you a tidbit of direction, Bill, so you can go find what you need.  

Judy

Laurence Anthony

Jan 31, 2015, 9:12:25 AM
to ant...@googlegroups.com
Hi Bill,

I'll reply much more briefly than Judy!

For keywords, the standard is to use log-likelihood. For this measure, the statistic uses a contingency table of frequencies, where each cell value should be greater than 5 for the assumptions of the test to be valid (although there are cases where a lower value may be OK). In short, this means that the frequencies of the word in the target corpus and in the reference corpus should each be greater than 5. The keyness values for log-likelihood at different significance levels are in the AntConc help (repeated below):

95th percentile; 5% level; p < 0.05; critical value = 3.84
99th percentile; 1% level; p < 0.01; critical value = 6.63
99.9th percentile; 0.1% level; p < 0.001; critical value = 10.83
99.99th percentile; 0.01% level; p < 0.0001; critical value = 15.13
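For concreteness, the two-cell log-likelihood calculation behind these keyness values can be sketched as follows (a minimal illustration; the function and variable names are mine, not AntConc's internals):

```python
import math

def ll_keyness(freq_target, freq_ref, target_size, ref_size):
    # Expected frequencies under the null hypothesis that the word is
    # equally frequent (per token) in both corpora.
    total = target_size + ref_size
    e_target = target_size * (freq_target + freq_ref) / total
    e_ref = ref_size * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / e_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll

# A word occurring 20 times in a 10k-word target corpus but only 5 times
# in a same-sized reference corpus:
print(round(ll_keyness(20, 5, 10_000, 10_000), 2))  # 9.64 -> key at p < 0.01
```

Comparing the returned value against the critical values above gives the significance level; here 9.64 clears the 6.63 cutoff for p < 0.01 but not the 10.83 cutoff for p < 0.001.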

For collocation, the standard in AntConc is to use MI (mutual information). This is an effect-size measure (not a statistical-significance measure), so there are no significance thresholds. But a 'large' effect is conventionally taken to be 8 times greater than expected, which corresponds to an MI score of 3 (since log2 8 = 3). However, the MI score is sensitive to low-frequency items, so it is important to use frequency floors. My colleague Andrew Hardie (here at Lancaster) has suggested frequency floors of 10 for the target word, the collocate, and the combination, but obviously this also depends on the size of the corpus.
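Putting the MI cutoff of 3 and the suggested frequency floors together, a collocate filter might look like this (just a sketch under those assumptions, using simple span-ignoring pointwise MI, not AntConc's actual implementation):

```python
import math

def mi(pair_freq, node_freq, coll_freq, corpus_size):
    # Pointwise MI; log2(8) = 3, hence the 'large effect' cutoff of 3.
    return math.log2(pair_freq * corpus_size / (node_freq * coll_freq))

def collocates(candidates, corpus_size, floor=10, mi_cutoff=3.0):
    # candidates: iterable of (collocate, pair_freq, node_freq, coll_freq).
    # Apply the frequency floors before trusting the MI score.
    results = []
    for word, pair_f, node_f, coll_f in candidates:
        if min(pair_f, node_f, coll_f) < floor:
            continue  # too rare for MI to be reliable
        score = mi(pair_f, node_f, coll_f, corpus_size)
        if score >= mi_cutoff:
            results.append((word, round(score, 2)))
    return sorted(results, key=lambda x: -x[1])

cands = [("rain", 15, 40, 120), ("the", 30, 40, 50_000), ("rare", 3, 40, 3)]
print(collocates(cands, 100_000))
# [('rain', 8.29)] -- 'the' fails the MI cutoff, 'rare' fails the floor
```

The counts here are invented; in practice they would come from the corpus itself, and the floor would be tuned to corpus size as noted above.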

I hope that helps.

Laurence.

Bill Marcellino

Feb 4, 2015, 11:44:20 AM
to ant...@googlegroups.com
1. Judy, thanks so much for responding and sharing.  I don't think you need to apologize for being a learner--I'm certainly striving to be a learner ;)

2. Laurence, thanks for responding also, and for the very helpful answer.  If I could ask, can you point me to references I can cite for a) >5 as a threshold for LL calculation of KW, and b) Hardie's guidance of 10 as a frequency floor for collocates?  Thank you.

I've been reading a lot on various thresholds, as well as corpus size considerations, and thought it might be helpful to share some take-aways:

1.     For n-grams/clusters/lexical bundles: Biber et al. (2004) adopt a "conservative approach" with 40 occurrences per million words as a frequency threshold.

2.    Log-likelihood: 6.635 as "interesting" and 5 minimum occurrences as a meaningful pattern (Hardy, 2007, p. 98).

3.    In keyword analysis, the reference corpus should be 5x the size of the target corpus (Berber-Sardinha, 2000).

4.     For collocate testing, one study supported a frequency threshold of 20 and an LL score threshold of 10.83 (the smallest corpus in the study was 18 million words) (Diwersy, 2014).

5.     For small, specialized corpora (e.g. under 250k words), size is less important than the design criteria (e.g. purpose, source context, genre & register), and so relatively small corpora (e.g. 25k-50k words) can still produce valid results.  Situational representativeness is the most critical factor, which requires judgment on the part of the researcher (Koester, 2010).  For the most common words, register and genre are stable across samples as small as 1,000 words, from 5-10 sources (Biber, 1990).

6.     MI and LL work well given data sparseness and are accurate for smaller corpora (e.g. 77k words), but compared to other measures (e.g. MI3, log-Dice, minimum sensitivity) they perform poorly on large corpora (e.g. 50 million words) (Alrabiah et al., 2014).
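Since several of these thresholds are stated as normalized rates (per million or per 100,000 words), converting one to a raw minimum count for a given corpus is simple arithmetic, e.g.:

```python
import math

def raw_threshold(rate, per, corpus_size):
    # Convert a normalized frequency threshold (e.g. 40 per million words)
    # into the minimum raw count for a corpus of a given size,
    # rounding up so the threshold is never undershot.
    return math.ceil(rate * corpus_size / per)

# Biber et al.'s 40-per-million bundle threshold in a 250k-word corpus:
print(raw_threshold(40, 1_000_000, 250_000))  # 10
# A 10-per-100,000 minimum in a 77k-word corpus:
print(raw_threshold(10, 100_000, 77_000))     # 8
```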


Alrabiah, Maha, et al. "An empirical study on the Holy Quran based on a large classical Arabic corpus." International Journal of Computational Linguistics (IJCL) 5.1 (2014): 1-13.

 

Berber-Sardinha, Tony. "Comparing corpora with WordSmith Tools: How large must the reference corpus be?" Proceedings of the Workshop on Comparing Corpora, Volume 9, pp. 7-13. Association for Computational Linguistics, 2000.

 

Biber, Douglas, Susan Conrad and Viviana Cortes. "If you look at . . .: Lexical bundles in university teaching and textbooks." Applied Linguistics 25.3 (2004): 371-405.

 

Hardy, Donald E. The Body in Flannery O'Connor's Fiction: Computational Technique and Linguistic Voice. University of South Carolina Press, 2007.

 

Koester, Almut. "Building a small specialized corpus." In O'Keeffe, Anne, and Michael McCarthy, eds. The Routledge Handbook of Corpus Linguistics. Routledge, 2010.

 

Diwersy, Sascha. "The Varitext platform and the Corpus des variétés nationales du français (CoVaNa-FR) as resources for the study of French from a pluricentric perspective." Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pp. 48-57. Dublin, Ireland, August 2014.


