Dealing with zero frequency items in effect size calculation, keyword analysis


Joe Geluso

Apr 1, 2020, 5:10:42 AM
to AntConc-Discussion
Hello all,

I have found different methods in the literature for dealing with zero frequencies when calculating most effect sizes (e.g., odds ratio, log ratio) in keyword analysis. For example, one approach is to add 1 to the frequency of every word in both the target and reference corpora, and another is to replace 0 with a super small number like 0.00000001 (Gabrielatos, 2018).
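
To make the two approaches concrete, here is a minimal Python sketch using log ratio (the binary log of the ratio of relative frequencies). The function names, the corpus sizes, and the word counts are hypothetical, and leaving the corpus sizes unchanged when adding 1 is an assumption; this is not taken from anyone's actual scripts.

```python
import math

def log_ratio(f_target, n_target, f_ref, n_ref):
    """Binary log of the ratio of relative frequencies."""
    return math.log2((f_target / n_target) / (f_ref / n_ref))

def log_ratio_add_one(f_target, n_target, f_ref, n_ref):
    # Approach 1: add 1 to the word's frequency in both corpora
    # (corpus sizes are left unchanged here, which is an assumption).
    return log_ratio(f_target + 1, n_target, f_ref + 1, n_ref)

def log_ratio_tiny(f_target, n_target, f_ref, n_ref, eps=1e-8):
    # Approach 2: swap a zero reference frequency for a tiny constant;
    # non-zero frequencies are used as they are.
    return log_ratio(f_target, n_target, f_ref or eps, n_ref)

# Hypothetical word: 50 hits in the target corpus, none in the reference.
args = (50, 1_000_000, 0, 1_000_000)
try:
    log_ratio(*args)
except ZeroDivisionError:
    print("uncorrected: undefined (division by zero)")
print("add one:      ", round(log_ratio_add_one(*args), 2))
print("tiny constant:", round(log_ratio_tiny(*args), 2))
```

For this made-up word the add-one score is about 5.7 and the tiny-constant score is about 32.2, so the choice of correction dominates the result whenever the reference frequency is zero.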

I have tried some of these different methods with my own scripts calculating effect sizes, and the results are noticeably different depending on the method. How does AntConc handle zero frequencies in the reference corpus?

Apologies in advance if this question is addressed in the documentation and I missed it.

Best wishes,

Joe

Daniel HENKEL

Apr 1, 2020, 8:29:41 AM
to ant...@googlegroups.com, Joe Geluso

Hello Joe,

I ran into the same problem when I manually calculated some Odds Ratios about a year ago and got somewhat different results.  As nearly as I could reckon, AntConc uses a sort of quasi Haldane-Anscombe correction, which involves adding 0.5 when there are 0 frequency items.  I say "quasi" as true HA correction would entail adding 0.5 to all values, whereas AntConc seems to substitute 0.5 for whichever frequency is 0 but leaves the other values unchanged.  The discrepancy, however, is negligible unless all the other frequencies are also quite low, in which case the +0.5 starts to make a difference.
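
The distinction can be sketched in Python as follows. This only illustrates the two corrections as described above; it is not AntConc's confirmed implementation, and the function names and cell counts are hypothetical. The cells of the 2x2 table are a (word in target), b (all other tokens in target), c (word in reference), and d (all other tokens in reference).

```python
def odds_ratio(a, b, c, d):
    """Odds of the word in the target corpus over its odds in the reference."""
    return (a * d) / (b * c)

def haldane_anscombe(a, b, c, d):
    # True Haldane-Anscombe correction: add 0.5 to every cell.
    return odds_ratio(a + 0.5, b + 0.5, c + 0.5, d + 0.5)

def zero_only(a, b, c, d):
    # "Quasi" variant: substitute 0.5 for any zero cell, leave the rest alone.
    a, b, c, d = (x if x > 0 else 0.5 for x in (a, b, c, d))
    return odds_ratio(a, b, c, d)

def relative_gap(cells):
    # How far apart the two corrections are, relative to the zero-only value.
    full, quasi = haldane_anscombe(*cells), zero_only(*cells)
    return abs(full - quasi) / quasi

large = (50, 999_950, 0, 1_000_000)  # hypothetical: other counts are high
small = (1, 99, 0, 100)              # hypothetical: every count is low

print("large counts:", round(relative_gap(large), 3))
print("small counts:", round(relative_gap(small), 3))
```

With the large counts the two corrections differ by roughly 1%, while with the small counts they differ by roughly 50%, which matches the observation that the +0.5 only matters when the other frequencies are also low.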

Best,

Daniel

--
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antconc/1e4b43eb-5fdc-4748-baa1-1fc194a3705c%40googlegroups.com.

Joe Geluso

Apr 1, 2020, 9:22:36 AM
to Daniel HENKEL, ant...@googlegroups.com
Hi Daniel,

Thanks for your reply. Your hypothesis about a quasi Haldane-Anscombe correction sounds reasonable to me. I compared AntConc's keyness figures with those from my own scripts: one simply adds 1 to everything, and the other substitutes a small number like 0.000001 for zeros in the reference corpus (following Gabrielatos's suggestion). The figures are very similar when the reference corpus frequencies are higher, say in the hundreds. When the reference corpus frequency is zero, AntConc's figure lands in between those of my two scripts.
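
That pattern is what one would expect if 0.5 is substituted for zeros, as the following rough Python sketch suggests. It uses log ratio as the effect size and hypothetical corpus sizes and frequencies; the "half_for_zero" strategy stands in for the hypothesised AntConc behaviour and is an assumption, not a confirmed description of the software.

```python
import math

N_TARGET = N_REF = 1_000_000  # hypothetical corpus sizes

def log_ratio(f_target, f_ref):
    """Binary log of the ratio of relative frequencies."""
    return math.log2((f_target / N_TARGET) / (f_ref / N_REF))

# Each strategy maps raw (target, reference) frequencies to adjusted ones.
STRATEGIES = {
    "add_one": lambda t, r: (t + 1, r + 1),
    "half_for_zero": lambda t, r: (t, r if r > 0 else 0.5),
    "tiny_for_zero": lambda t, r: (t, r if r > 0 else 1e-6),
}

def compare(f_target, f_ref):
    # Score the same word under every zero-handling strategy.
    return {name: log_ratio(*fix(f_target, f_ref))
            for name, fix in STRATEGIES.items()}

# One word with a reference frequency in the hundreds, one with zero.
for f_target, f_ref in [(500, 300), (50, 0)]:
    scores = compare(f_target, f_ref)
    print(f_target, f_ref, {k: round(v, 2) for k, v in scores.items()})
```

When the reference frequency is 300, the three scores agree to within about 0.002. When it is zero, the 0.5 substitution falls between the add-one score and the tiny-constant score.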

Best wishes,

Joe

On Wed, Apr 1, 2020 at 7:25 PM Daniel HENKEL <daniel...@univ-paris8.fr> wrote:

--
Daniel HENKEL
Maître de Conférences (Linguistique et Traduction)
UFR5 LLCE-LEA • EA1569 TransCrit

Université Paris 8 Vincennes-St-Denis

“one cannot draw up a typology of translations, but at most a typology of different ways of translating, negotiating each time the aim one has set, and each time discovering that the ways of translating are more numerous than we suspect.”
U. Eco


--
Joe Geluso, PhD
Nihon University, College of Law
Areas of Interest: Corpus Linguistics, Phraseology, SLA, Technology and Language Learning, Applied Linguistics as a Discipline, Quantitative Research Methods
joegeluso.coffee