Questions about TAALES: Calculating Association Scores

TAALES

Yan, Kexin

Dec 3, 2022, 8:48:14 PM

to 'Yan, Kexin' via Suite of automatic linguistic analysis tools

Hi everyone,

Thanks for your reply. I have another question to ask.

In some published papers, the authors look up each bigram in a large reference corpus to check its frequency, and then decide whether to calculate an MI score and a t-score for it.

For example, in this paper, https://doi.org/10.1515/iral-2014-0011, the authors search for bigrams in the BNC. If a bigram's frequency is below 5, it is placed in the below-threshold group and is not assigned association scores. If a word sequence reaches the frequency threshold of 5 occurrences in the BNC, it is assigned an MI score and a t-score. In the below-threshold group, then, the number of occurrences can be 4, 3, 2, 1, or even 0, where 0 means that the word sequence is absent from the BNC. So even a word sequence with a frequency of 4 is not assigned association scores.
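The threshold procedure described above can be sketched in a few lines. This is only an illustration: the thread does not give the formulas used in the paper or in TAALES, so the sketch assumes the commonly used definitions MI = log2(O / E) and t = (O - E) / sqrt(O), where O is the observed bigram frequency and E = f(w1) * f(w2) / N is the expected frequency. The function name, parameter names, and the counts in the usage lines are hypothetical.

```python
import math
from typing import Optional, Tuple

MIN_FREQ = 5  # frequency threshold mentioned in this thread


def association_scores(
    bigram_freq: int,
    w1_freq: int,
    w2_freq: int,
    corpus_size: int,
    min_freq: int = MIN_FREQ,
) -> Optional[Tuple[float, float]]:
    """Return (MI, t-score), or None if the bigram is below the threshold.

    Assumes the common definitions; the paper and TAALES may differ in detail.
    """
    if bigram_freq < min_freq:
        # Below-threshold group: 0 to min_freq - 1 occurrences, no scores.
        return None
    expected = w1_freq * w2_freq / corpus_size
    mi = math.log2(bigram_freq / expected)
    t_score = (bigram_freq - expected) / math.sqrt(bigram_freq)
    return mi, t_score


# Invented counts, for illustration only.
print(association_scores(4, 100, 100, 10_000))  # below threshold
print(association_scores(8, 100, 100, 10_000))  # at or above threshold
```

With these definitions, a bigram that occurs 4 times gets no scores at all, however strongly its two words attract each other.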

In TAALES 2.0, I also have a question about frequencies in the association strength calculation. I have copied some TAALES output as an example.

COCA_academic_bi_MI	how is	-0.436766314163
	is going	2.4478375438
	you lately	N/A
	you want	4.46709863611
	about with	-2.59387173594
	was helped	N/A

It is clear that under the COCA_academic_bi_MI index, different bigrams have different MI values. N/A means that those two bigrams cannot be found in the COCA academic corpus. My question is about the other four bigrams, which were assigned MI scores. In the COCA academic corpus, is a word sequence assigned association scores automatically as long as it can be found there? Would that hold even if the sequence occurs only once in COCA academic, meaning that it is rare in that corpus?

By the way, after the COCA calculation, there is a Microsoft Excel file named results, which contains the results for my self-built target corpus. In this Excel file, does the index "COCA_academic_bi_MI" represent the mean score for each txt file, based on the bigram MI scores from the COCA academic corpus?

Thank you for your reply.

Kind regards,

Kexin

Kristopher Kyle

Dec 5, 2022, 3:08:43 PM

to Yan, Kexin, 'Yan, Kexin' via Suite of automatic linguistic analysis tools

Hi Kexin,

I am not 100% sure that I understand your question, but I will attempt to answer it here.

First, yes, we use a similar arbitrary cut-off of 5 occurrences. This helps filter out infrequent typos and other weirdness.

As you note, "COCA_academic_bi_MI" means that this index is the mean score for bigrams in one text file, based on norms derived from the academic portion of COCA.

The individual values that you show in your email are the scores for each bigram in a text.
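As a rough illustration of how the per-bigram scores relate to the text-level index, the sketch below averages the values quoted earlier in this thread. It assumes that bigrams reported as N/A are simply left out of the mean. The thread does not confirm how TAALES treats them, so that choice, along with the function name, is an assumption.

```python
from statistics import mean
from typing import List, Optional, Tuple

# Per-bigram values copied from the output quoted earlier in this thread.
# None stands for "N/A".
bigram_mi: List[Tuple[str, Optional[float]]] = [
    ("how is", -0.436766314163),
    ("is going", 2.4478375438),
    ("you lately", None),
    ("you want", 4.46709863611),
    ("about with", -2.59387173594),
    ("was helped", None),
]


def text_level_mean(scores: List[Tuple[str, Optional[float]]]) -> Optional[float]:
    """Mean over the bigrams that have a score; None if no bigram has one.

    Assumption: N/A bigrams are excluded rather than counted as zero.
    """
    values = [score for _, score in scores if score is not None]
    return mean(values) if values else None


print(text_level_mean(bigram_mi))
```

Under this assumption the mean is taken over the four scored bigrams only, not over all six.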

Best,

Kris

Kristopher Kyle

Associate Professor

Department of Linguistics

University of Oregon

www.kristopherkyle.com
