%DIFF Formula in AntConc (Developmental version)


Vanlee

Sep 23, 2017, 3:07:03 PM
to AntConc-Discussion

Dear Prof. Anthony,

I would be very grateful for your help and advice. I have been using AntConc (Developmental version) for my research project and would like to ask the following questions:

1)      What common base do you use for relative frequencies in the keyword calculation?

2)     One word in my corpus A has a total frequency of 197 but does not appear in corpus B (the reference corpus). How is %DIFF calculated for it in the program? In Gabrielatos and Marchi's (2011) formula, 0.000000000000000001 is substituted for a zero frequency in the reference corpus.

Note: The size of my corpus A is 31,130, and that of corpus B is 61,117. When I ran the keyword function with LL (4-term) and %DIFF, this word had a keyness of 436.92 and a %DIFF of 79688.27.

I ask question 2 because I would like to calculate the % difference manually from the frequency lists for the words at the top of the two corpora's lists. Some of these words are not keywords, so they do not appear in the Keyword List results.

I hope my questions are clear and relevant to the group discussion. Your advice and clarification would be very much appreciated. Thank you very much for your time.

Best wishes,

Vanlee

Laurence Anthony

Sep 23, 2017, 9:41:59 PM
to ant...@googlegroups.com
Hi,

I was originally going to use that (0.000000000000000001), but I changed it to 0.5 after reading another paper on the topic. Below is the actual code I use.

Laurence.


sub DIFF {
    my $tar_freq  = shift;
    my $ref_freq  = shift;
    my $tar_total = shift;
    my $ref_total = shift;

    my $o11 = $tar_freq;
    my $o21 = $ref_freq;

    #my $o12 = $tar_total - $tar_freq;
    #my $o22 = $ref_total - $ref_freq;

    my $r1 = $tar_total;    #$o11 + $o12
    my $r2 = $ref_total;    #$o21 + $o22

    #my $c1 = $o11 + $o21;
    #my $c2 = $o12 + $o22;
    #my $n = $r1 + $r2;

    if ( $o21 == 0 ) {
        #$o21 = 0.000000000000000001;
        $o21 = 0.5;
    }

    my $DIFF = ( ( 100 * ( $r2 * $o11 ) ) / ( $r1 * $o21 ) ) - 100;

    return ( sprintf( "%.4f", $DIFF ) );
}
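
For anyone who wants to check the numbers by hand, here is a minimal Python sketch of the same calculation. It is an illustrative port of the Perl sub above, not code taken from AntConc; the function name `pct_diff` and the `zero_sub` parameter are made up for this example. Note that no normalisation base appears anywhere: the formula works on raw frequencies and corpus totals, so any common base (per 1,000, per million) would cancel out.

```python
def pct_diff(tar_freq, ref_freq, tar_total, ref_total, zero_sub=0.5):
    """Illustrative port of the Perl DIFF sub (hypothetical helper, not AntConc's API)."""
    # A zero reference frequency is replaced by a small constant (0.5 in the posted code)
    # so that the division below is defined.
    if ref_freq == 0:
        ref_freq = zero_sub
    return 100 * (ref_total * tar_freq) / (tar_total * ref_freq) - 100


# Counts quoted in the first post: frequency 197 in corpus A, 0 in corpus B.
score = pct_diff(197, 0, 31130, 61117)
print(f"{score:.4f}")
```

With the counts quoted in the first post this prints about 77253.35, not the 79688.27 reported there. One possible reason (an assumption, the thread does not confirm it) is that the token totals counted by the program differ from the quoted corpus sizes.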

Regards,

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################

--
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+unsubscribe@googlegroups.com.
To post to this group, send email to ant...@googlegroups.com.
Visit this group at https://groups.google.com/group/antconc.
For more options, visit https://groups.google.com/d/optout.

Vanlee

Sep 24, 2017, 2:50:11 AM
to AntConc-Discussion


On Saturday, 23 September 2017 at 20:07:03 UTC+1, Vanlee wrote:
Hi Laurence Sensei,

Thank you very much for your reply. As I'm not very good at maths, I'll first try to work out what this code does by reproducing it in Excel. Could you also tell me the title of the paper on the statistics that you mentioned?

Your help is much appreciated.

Best wishes,
Vanlee

 

Elvis Coimbra Gomes

Apr 3, 2020, 11:30:58 AM
to AntConc-Discussion
Dear Laurence,

Thank you for posting the formula here. I spent the whole day trying to understand why AntConc wasn't giving me the same %DIFF scores as LancsBox, until I found this page.

As I understand it, you use a completely different formula from Gabrielatos and Marchi (2012), although AntConc's Keyness settings label it as theirs. Your formula will give a different ranking of keywords from Gabrielatos and Marchi's. I suggest you rename it in the settings, because this can confuse people. It would also be highly appreciated if the software's webpage offered a document listing all the mathematical formulas used in AntConc.

Thanks for your great work!

Best,
Elvis

Laurence Anthony

Apr 3, 2020, 12:09:25 PM
to ant...@googlegroups.com
Dear Elvis,

As far as I understand, the %DIFF measure doesn't specify what the small number substituted for a zero reference corpus frequency should be. The authors recommend 0.000000000000000001, but I think the assumption is that any small number is fine. As you say, though, I should document this more clearly. On the other hand, I'm not sure why you would say the calculation in AntConc is a "completely different formula". Are you suggesting that this ad hoc value to account for zero values in the reference corpus is crucial to the calculation and hugely affects the results? If so, I think this point alone would seriously call into question the validity of the equation itself.
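
To make the question concrete, the effect of the substituted constant can be written out. This is a sketch based on the code posted earlier in this thread, with c standing for the constant (0.5 or 0.000000000000000001), AF_t for the target frequency, and N_t, N_r for the sizes of the target and reference corpora; these symbols are shorthand introduced here, not AntConc's notation. For a word with zero reference frequency:

```latex
\[
\%\mathrm{DIFF}(c) = \frac{100\, N_r\, AF_t}{N_t\, c} - 100,
\qquad
\frac{\%\mathrm{DIFF}(10^{-18}) + 100}{\%\mathrm{DIFF}(0.5) + 100}
  = \frac{0.5}{10^{-18}} = 5 \times 10^{17}
\]
```

Only words absent from the reference corpus are affected, and all of them by the same factor, so their order among themselves is unchanged. What can change is their position relative to words with a nonzero reference frequency.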

In the next version of AntConc, I'll address the issue you raised.

I hope that helps.

Laurence.



Elvis Coimbra Gomes

Apr 3, 2020, 3:18:29 PM
to AntConc-Discussion
Dear Laurence,

Thanks for the quick reply!

As I understand it, Gabrielatos and Marchi substitute 0.000000000000000001 to represent the absence of a word, i.e. a value as close as possible to 0. By contrast, 0.5 is as near to 1 as to 0, so it does not really stand in for the absence of a word and will yield a slightly differently ranked keyword list.

Gabrielatos and Marchi calculate %DIFF as follows (NF = normalized frequency of the word; AF = absolute frequency of the word):

(NF target corpus - NF reference corpus) * 100 / NF reference corpus

As I interpret the code you posted on this page, AntConc seems to calculate %DIFF as:

100 * (Total reference corpus * AF target corpus) / (Total target corpus * AF reference corpus) - 100

From my understanding, effect-size metrics that are based on corpus size and word frequency (e.g., ratio, odds ratio, log ratio, %DIFF, difference coefficient) should rank keywords in the same way even though their scores differ. However, the two calculations, as implemented in LancsBox and AntConc, give different %DIFF scores and ranks.
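
To pin down where the two calculations can diverge, the two expressions can be compared algebraically. The following is a short derivation, assuming NF = k · AF / N for some common base k, with N_t and N_r as shorthand (not used elsewhere in this thread) for the sizes of the target and reference corpora:

```latex
\begin{align*}
\%\mathrm{DIFF}
  &= \frac{(NF_t - NF_r) \times 100}{NF_r}
   = 100\,\frac{NF_t}{NF_r} - 100 \\
  &= 100\,\frac{k\,AF_t / N_t}{k\,AF_r / N_r} - 100
   = \frac{100\,(N_r \cdot AF_t)}{N_t \cdot AF_r} - 100
\end{align*}
```

So whenever AF_r > 0 the two forms return the same value, and the base k drops out. A difference can only arise for words with AF_r = 0, where the result depends on the constant substituted for zero.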

For instance, when I compare the %DIFF scores from AntConc and LancsBox for corpora that have exactly the same number of tokens and types in both programs, I get different scores. In LancsBox, Gabrielatos and Marchi's %DIFF puts the words that are completely absent from the reference corpus at the top, whereas AntConc's list also brings in some words that appear once in the reference corpus. As a result, the ranking of the keyword list differs. This ranking seems to be due to AntConc replacing 0 with 0.5. I do not know how AntConc's other effect-size measures deal with 0, but they all seem to replace it with 0.5, so the different effect sizes in AntConc produce keyword lists that are ranked the same. Compared with other programs, however, the results differ.
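
The ranking effect described above can be reproduced with a small sketch. The words and counts below are hypothetical, both corpora are assumed to contain 10,000 tokens, and `pct_diff` is an illustrative helper modelled on the Perl code posted earlier in this thread, not a function from AntConc or LancsBox:

```python
def pct_diff(tar_freq, ref_freq, tar_total, ref_total, zero_sub):
    """%DIFF with a configurable substitute for a zero reference frequency."""
    if ref_freq == 0:
        ref_freq = zero_sub
    return 100 * (ref_total * tar_freq) / (tar_total * ref_freq) - 100


# Hypothetical (target frequency, reference frequency) pairs.
words = {"alpha": (3, 0), "beta": (40, 1), "gamma": (5, 0)}
N_T = N_R = 10_000  # assumed corpus sizes


def ranking(zero_sub):
    """Return the words sorted from highest to lowest %DIFF."""
    scores = {w: pct_diff(t, r, N_T, N_R, zero_sub) for w, (t, r) in words.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(ranking(0.5))    # "beta" (seen once in the reference corpus) comes first
print(ranking(1e-18))  # the two words absent from the reference corpus come first
```

With 0.5, a frequent word that occurs once in the reference corpus can outrank words that never occur there; with 0.000000000000000001, every word absent from the reference corpus ranks above every word present in it.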

I find the same when I calculate keywords manually with Paul Rayson's spreadsheet (http://ucrel.lancs.ac.uk/people/paul/SigEff.xlsx). Interestingly, Log Ratio is the only metric there that gives me the same ranking as AntConc, precisely because it uses 0.5 as a zero adjustment.
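
This agreement is what one would expect if Log Ratio is defined in the usual way, as the binary logarithm of the ratio of normalized frequencies (an assumption here; the spreadsheet's exact formula is not quoted in this thread). A sketch of the relationship, with NF_t and NF_r as shorthand for the normalized frequencies in the target and reference corpora:

```latex
\[
\text{Log Ratio} = \log_2 \frac{NF_t}{NF_r}
                 = \log_2\!\left( \frac{\%\mathrm{DIFF}}{100} + 1 \right)
\]
```

Since this is a strictly increasing function of %DIFF, the two measures order words identically whenever they use the same zero adjustment.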

To sum up, my suggestion would be to collect all the equations used in AntConc in a document available on its webpage (for the sake of transparency), and perhaps to rename %DIFF for the sake of consistency with Gabrielatos and Marchi's proposal. It is frustrating for linguists who want to replicate findings when different programs claim to use the same statistics but in the end use modified versions that yield different results. This clearly does not apply only to AntConc; it is a widespread problem in corpus linguistics as a research field.

Please let me know if there is something I misunderstood or if anything is unclear.

As always, thank you very much for all you do! I can imagine how difficult it must be to keep track of all the details in a piece of software.

Best,
Elvis