Questions about TAALES: Calculating Association Scores


Yan, Kexin

Dec 3, 2022, 8:48:14 PM
to 'Yan, Kexin' via Suite of automatic linguistic analysis tools
Hi everyone,

Thanks for your reply. I have another question to ask.

In some published papers, the authors look up bigrams in a large national corpus to obtain their frequencies, and then decide whether to calculate the MI score and t-score.

For example, in this paper (https://doi.org/10.1515/iral-2014-0011), the authors search for bigrams in the BNC. If a bigram occurs fewer than 5 times, it is placed in the below-threshold group and is not assigned association scores. If a word sequence reaches the threshold of 5 occurrences in the BNC, it is assigned an MI score and a t-score. In the below-threshold group, the number of occurrences can therefore be 4, 3, 2, 1, or even 0, where 0 means that the sequence is absent from the BNC. So even a word sequence that occurs 4 times is not assigned association scores.

I also have questions about how frequency is handled when TAALES 2.0 calculates association strength. Here are some calculation results copied from TAALES as an example.
Bigram        COCA_academic_bi_MI
how is        -0.436766314163
is going      2.4478375438
you lately    N/A
you want      4.46709863611
about with    -2.59387173594
was helped    N/A

Clearly, different bigrams have different values for the COCA_academic_bi_MI index. I take N/A to mean that two of the bigrams ("you lately" and "was helped") cannot be found in the COCA academic corpus. My question is about the other 4 bigrams, which were assigned MI scores. Is a word sequence assigned association scores automatically as long as it can be found in the COCA academic corpus? Does this hold even if the sequence occurs only once, which would mean it is rare in that corpus?

By the way, after the COCA calculation there is a Microsoft Excel file named "results", which contains the results for my self-built target corpus. In this file, does the index "COCA_academic_bi_MI" give the mean score for one txt file (or for the txt files as a whole), based on the bigram MI scores from the COCA academic corpus?

Thank you for your reply. 

Kind regards,

Kexin

Kristopher Kyle

Dec 5, 2022, 3:08:43 PM
to Yan, Kexin, 'Yan, Kexin' via Suite of automatic linguistic analysis tools
Hi Kexin,

I am not 100% sure that I understand your question, but I will attempt to answer it here.

First, yes, we use a similar arbitrary cut-off of 5 occurrences. This helps filter out infrequent typos and other weirdness.
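
For readers following along, here is a minimal Python sketch of how a frequency cut-off like this interacts with MI and t-scores. It uses the common textbook definitions, MI = log2(observed / expected) and t = (observed - expected) / sqrt(observed), where expected = f(w1) * f(w2) / N. The exact formulas and counts that TAALES uses are not stated in this thread, so the function name, parameters, and example frequencies below are assumptions for illustration only.

```python
import math

MIN_FREQ = 5  # cut-off of 5 occurrences mentioned in this thread


def association_scores(bigram_freq, w1_freq, w2_freq, corpus_size, min_freq=MIN_FREQ):
    """Return (MI, t-score) for a bigram, or None if it is below the cut-off.

    Textbook definitions, assumed here; not confirmed to match TAALES.
    """
    if bigram_freq < min_freq:
        return None  # reported as N/A: no association score is assigned
    # Co-occurrence count expected if the two words were independent.
    expected = w1_freq * w2_freq / corpus_size
    mi = math.log2(bigram_freq / expected)
    t_score = (bigram_freq - expected) / math.sqrt(bigram_freq)
    return mi, t_score


# Hypothetical counts, not taken from COCA or the BNC.
below = association_scores(4, 100, 100, 1_000_000)
above = association_scores(8, 1000, 1000, 1_000_000)
print(below)     # None, because 4 < 5
print(above[0])  # MI is 3.0, since observed = 8 and expected = 1
```

Under these definitions, a bigram that meets the cut-off can still receive a negative MI, as "how is" and "about with" do in the question: it simply co-occurs less often than the frequencies of its two words would predict.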

As you note, "COCA_academic_bi_MI" means that this index is the mean score for bigrams in one text file, based on norms derived from the academic portion of COCA.

The individual values that you show in your email are the scores for each bigram in a text.
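
To make the relationship between the per-bigram scores and the text-level index concrete, here is a small Python sketch. It assumes that bigrams without a norm (N/A) are left out of the mean and that repeated bigrams count each time they occur; neither detail is confirmed in this thread, and the function and variable names are hypothetical. The MI values are the ones quoted in the question.

```python
def mean_bigram_mi(text_bigrams, mi_norms):
    """Average the MI norms of the bigrams in one text.

    Bigrams with no norm are skipped (an assumption about N/A handling).
    Returns None if no bigram in the text has a norm.
    """
    scores = [mi_norms[b] for b in text_bigrams if b in mi_norms]
    return sum(scores) / len(scores) if scores else None


# Per-bigram MI norms quoted in the question; N/A bigrams have no entry.
mi_norms = {
    ("how", "is"): -0.436766314163,
    ("is", "going"): 2.4478375438,
    ("you", "want"): 4.46709863611,
    ("about", "with"): -2.59387173594,
}

# Bigrams of one hypothetical text file, including the two N/A cases.
text_bigrams = [
    ("how", "is"), ("is", "going"), ("you", "lately"),
    ("you", "want"), ("about", "with"), ("was", "helped"),
]

text_score = mean_bigram_mi(text_bigrams, mi_norms)
print(round(text_score, 3))  # 0.971
```

In this sketch, each row of the results spreadsheet would correspond to one call of `mean_bigram_mi` on one text file.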

Best,

Kris

--
Kristopher Kyle
Associate Professor
Department of Linguistics
University of Oregon