Collocate Likelihood and Effect


Kierin Mackenzie

May 11, 2022, 7:24:15 PM
to AntConc-Discussion
Hi everyone,

I'm sure this is written somewhere that I'm missing, but I can't seem to find it. How are Likelihood and Effect calculated for the Collocate tool?

Thanks,

Kierin

Laurence Anthony

May 12, 2022, 3:26:44 AM
to ant...@googlegroups.com
Hi Kierin,

The statistical measures used can be found listed in the Tool Settings menu. For the precise details of how each statistic is calculated, you can refer to the standard literature. Unfortunately, I don't know of a single (published) paper that lists them all. I referred to an unpublished paper by Andrew Hardie of Lancaster University, when building AntConc 4. Perhaps I can ask him for permission to print the equations used in the paper. But, as I say, the statistics are all fairly standard in the field.

Regards,

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


Kierin Mackenzie

May 12, 2022, 3:43:39 AM
to AntConc-Discussion
Hi Laurence,

Thank you for your reply! It looks like I was using MI. Now I'm trying out the rest.

I found some of the equations here in Table 3.3

https://www.cambridge.org/core/books/statistics-in-corpus-linguistics/semantics-and-discourse/3CC9D42A719A484A565BC139E9353A2C

Best,

Kierin

Laurence Anthony

May 12, 2022, 4:08:14 AM
to ant...@googlegroups.com
Hi Kierin,

Yes, the default effect size measure for collocations is MI (with a Log-Likelihood threshold cutoff).

I have also contacted Andrew Hardie and asked if I could publish the list of equations on my site from his paper (or if he could put them on the Lancaster site).

Laurence.

Hanna Schmueck

Jun 13, 2022, 1:16:30 PM
to AntConc-Discussion
Hi both,
I had a related question that I couldn't find an answer to in the documentation. Could you let us know what the contingency tables used to calculate the scores are based on in AntConc 4? I'm particularly interested in what O12 and O21 are based on and what count is used for N (is it the total number of words in the corpus, like in WordSmith?). I've tried manually calculating some MI scores for a very small toy corpus to compare them to the AntConc scores and must have made a mistake somewhere, because they don't quite match up.
Thanks!
Hanna

Laurence Anthony

Jun 13, 2022, 8:39:16 PM
to ant...@googlegroups.com
Hi Hanna,

Which statistic are you referring to? 

Laurence.

Hanna Schmueck

Jun 14, 2022, 7:56:38 AM
to AntConc-Discussion
So sorry, let me be a bit clearer:

I was referring to the values in a contingency table that are usually used to produce association measures, I believe the standard notation is as follows:

So what I am asking is -  for instance when working with the standard notation for MU (MU = O11/E11) - what values AntConc will use to calculate E11. Are they derived from the frequencies of the constituent words in the entire corpus - the frequency of the collocation? (This is what WordSmith would use as the basis for their calculations, but there is also software out there that uses other counts). 
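
For reference, the textbook version of this calculation can be sketched as follows. This is a minimal sketch, assuming E11 is derived from the whole-corpus frequencies of the two words (E11 = R1 * C1 / N). Nothing in this thread confirms that AntConc computes E11 this way, and the function names and toy numbers are made up for illustration.

```python
def expected_o11(node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """Expected co-occurrence count under independence: R1 * C1 / N.

    Assumption: R1 and C1 are whole-corpus frequencies of the node and the
    collocate, and N is the total number of tokens in the corpus.
    """
    return node_freq * collocate_freq / corpus_size


def mu(o11: int, node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """MU = O11 / E11, the ratio of observed to expected co-occurrences."""
    return o11 / expected_o11(node_freq, collocate_freq, corpus_size)


# Toy numbers, invented for illustration only.
score = mu(o11=5, node_freq=20, collocate_freq=50, corpus_size=10_000)
print(score)
```

Some tools also scale E11 by the size of the collocation window, which is one reason manually calculated scores can differ from a tool's output.
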
Sorry for the lengthy question, I hope I managed to clarify what I meant now.

Thanks so much for your help!
Hanna

Laurence Anthony

Jun 14, 2022, 8:04:43 AM
to ant...@googlegroups.com
Hi,

I know the O11 and O21 notation. So, do you just want to know about MU? AntConc has many, many statistics. Some use O11 and O21 and some don't. 

Laurence.


Laurence Anthony

Jun 14, 2022, 8:08:16 AM
to ant...@googlegroups.com
Just a quick follow on... Are you only asking about the MU measure as used in the Collocate tool? Some statistics are used in multiple tools, where the O11 and O21 values will be determined differently.

Hanna Schmueck

Jun 14, 2022, 8:52:28 AM
to AntConc-Discussion
Thanks so much, this is already nice to know. 
My question is more general, I am interested in what O11 and O21, N etc. values are used in which calculations.
Thanks again
Hanna

Laurence Anthony

Jun 14, 2022, 9:42:06 AM
to ant...@googlegroups.com
Generally speaking, the values are as you expect. So, O11 is the observed frequency in set A (e.g. the frequency of a word in the target corpus, or the frequency of a target word appearing in a bigram with word X) and O21 is the observed frequency in set B (e.g. the frequency of a word in the reference corpus, or the frequency of a target word not appearing with word X).

As I wrote earlier, the 'observation' that is counted in O11 or O21 is different depending on the tool being used (e.g. observations of collocates in a window span of 5 or words in a target vs a reference corpus).
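
As a rough illustration of the keyword case, the sketch below builds a 2x2 table from O11 (frequency in set A) and O21 (frequency in set B) and computes the standard log-likelihood (G2) statistic. It assumes that O12 and O22 are the remaining tokens in each set; that layout, the function name, and the numbers are assumptions for illustration, not a description of AntConc's internals.

```python
import math


def log_likelihood(o11: int, o21: int, size_a: int, size_b: int) -> float:
    """G2 for a word occurring o11 times in set A and o21 times in set B.

    Assumed table layout: row i is a set (A or B), column 1 is the word,
    column 2 is every other token in that set.
    """
    o12 = size_a - o11          # other tokens in set A
    o22 = size_b - o21          # other tokens in set B
    n = size_a + size_b
    rows = (size_a, size_b)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))

    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:           # a zero cell contributes nothing to the sum
                e = rows[i] * cols[j] / n
                total += o * math.log(o / e)
    return 2 * total


# Toy counts, invented for illustration only.
print(log_likelihood(o11=30, o21=10, size_a=1000, size_b=1000))
```

When the word has the same relative frequency in both sets, every observed cell equals its expected value and the statistic is zero.
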

I hope that helps!

Laurence.

Hanna Schmueck

Jun 14, 2022, 11:51:07 AM
to AntConc-Discussion
Thank you so much for your speedy reply. That makes total sense. Sorry I wasn't very precise with my question before; I had only thought about collocations and not keywords.
One more follow-up question (collocations only): You said O21 is the "frequency of a target word not appearing with word X". Would this be calculated using the total number of generated windows that contain the target word (but not word X) in any position, or just the total number of generated windows that contain the target word where the target word is not preceded by word X?
Thanks so much for your help and patience.
Hanna

Laurence Anthony

Jun 14, 2022, 12:01:59 PM
to ant...@googlegroups.com
For collocations, the window span complicates things. The easiest way to understand O11 and O21 in this context is to think of all the words within the span as a kind of target corpus and all the words in the corpus as a whole as a reference corpus. Then, O21 and O22 are calculated exactly as they would be for keywords.
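
One possible reading of this description can be sketched in code: collect every token that falls inside the span around the node word, treat those tokens as a small target corpus, and compare each word's frequency there with its frequency in the whole corpus. The sketch below computes an MI-style score, log2(O11 / E11), on that basis. The function names, the toy sentence, and the handling of overlapping windows (tokens are counted once per window they fall in) are all assumptions, so the scores will not necessarily match AntConc's output.

```python
import math
from collections import Counter


def window_collocates(tokens, node, span=5):
    """Count tokens within `span` positions to the left and right of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = max(0, i - span)
            right = min(len(tokens), i + span + 1)
            counts.update(tokens[left:i])        # words before the node
            counts.update(tokens[i + 1:right])   # words after the node
    return counts


def mi_scores(tokens, node, span=5):
    """MI = log2(O11 / E11), with the window tokens as the 'target corpus'."""
    corpus_freq = Counter(tokens)
    n = len(tokens)
    in_window = window_collocates(tokens, node, span)
    window_size = sum(in_window.values())        # size of the 'target corpus'
    scores = {}
    for word, o11 in in_window.items():
        # Expected count if the word were spread evenly over the whole corpus.
        e11 = window_size * corpus_freq[word] / n
        scores[word] = math.log2(o11 / e11)
    return scores


# Toy corpus, invented for illustration only.
toy = "the cat sat on the mat the dog sat on the rug".split()
print(mi_scores(toy, node="sat", span=1))
```
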
