Question about TAALES.

Yan, Kexin

Nov 30, 2022, 12:11:18 AM

to linguistic-a...@googlegroups.com

Hi everyone,

I would like to ask a question about TAALES 2.0. Take one of the COCA indices, for example COCA_academic_bigram_frequency: if the reported frequency of a bigram is 17.03, has it been normalised automatically? The index name gives no indication of normalisation. If so, is the normalisation per million words?

Thank you for your reply.

Kind regards,

Kexin Yan

Kristopher Kyle

Nov 30, 2022, 7:15:15 PM

to Yan, Kexin, linguistic-a...@googlegroups.com

Hi Kexin,

The frequency values are normed per million words.
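As an illustration of what "normed per million words" means, here is a minimal Python sketch. The function name and the counts are hypothetical and are not taken from COCA or from the TAALES source code; the numbers were chosen only so that the result lands near the 17.03 mentioned in the question.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Return a raw frequency normed per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical values for illustration only (not real COCA counts):
# a bigram seen 1,703 times in a 100-million-word corpus.
raw_count = 1703
corpus_size = 100_000_000
print(round(per_million(raw_count, corpus_size), 2))
```

Under this reading, a normed value of 17.03 means roughly 17 occurrences per million words of the corpus, whatever the corpus's total size.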

Best,

Kris

--

Kristopher Kyle

Associate Professor

Department of Linguistics

University of Oregon

www.kristopherkyle.com

Yan, Kexin

Dec 1, 2022, 10:35:33 PM

to Kristopher Kyle, linguistic-a...@googlegroups.com

Hi Kristopher,

Thanks for your reply. I have another question to ask.

In some published papers, the authors look up bigrams in a large national corpus to check their frequency, and then decide whether to calculate MI scores and t-scores.

For example, in this paper, https://doi.org/10.1515/iral-2014-0011, the authors search for bigrams in the BNC. Bigrams with a frequency below 5 are placed in a below-threshold group and are not assigned association scores. Word sequences that reach the threshold of 5 occurrences in the BNC are assigned MI scores and t-scores. In the below-threshold group, then, the number of occurrences can be 4, 3, 2, 1 or even 0, where 0 means that the word sequence is absent from the BNC. So even a sequence that occurs 4 times is not assigned association scores.
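The thresholding procedure described above can be sketched roughly as follows in Python. This uses the commonly cited textbook definitions, MI = log2(O / E) and t = (O - E) / sqrt(O), with E = f(w1) * f(w2) / N. The function name, parameters, and counts are hypothetical, and I am not claiming that the paper or TAALES computes the scores in exactly this way.

```python
import math
from typing import Optional, Tuple


def association_scores(
    bigram_freq: int,
    w1_freq: int,
    w2_freq: int,
    corpus_size: int,
    min_freq: int = 5,
) -> Optional[Tuple[float, float]]:
    """Return (MI, t-score), or None if the bigram is below the threshold."""
    # Below-threshold (or absent) bigrams receive no association scores.
    if bigram_freq == 0 or bigram_freq < min_freq:
        return None
    # Expected co-occurrence count if the two words were independent.
    expected = w1_freq * w2_freq / corpus_size
    mi = math.log2(bigram_freq / expected)
    t_score = (bigram_freq - expected) / math.sqrt(bigram_freq)
    return mi, t_score


# Hypothetical counts for illustration only.
print(association_scores(4, 100, 100, 1_000_000))    # below threshold
print(association_scores(8, 1000, 1000, 1_000_000))  # scored
```

With min_freq set to 1 instead of 5, any bigram attested at least once would receive scores, which is the behaviour my question below is about.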

I have a similar question about frequencies in the calculation of association strength in TAALES 2.0. I have copied some TAALES output below as an example.

Index: COCA_academic_bi_MI

Bigram        MI
how is        -0.436766314163
is going      2.4478375438
you lately    N/A
you want      4.46709863611
about with    -2.59387173594
was helped    N/A

Clearly, different bigrams receive different MI values in the COCA_academic_bi_MI index. N/A means that the two bigrams in question cannot be found in the COCA academic corpus. My question concerns the other four bigrams, which were assigned MI scores. Is any word sequence that can be found in the COCA academic corpus automatically assigned association scores? Does this hold even if the sequence occurs only once, that is, even if it is rare in the COCA academic corpus?

Thank you for your reply.

Kind regards,

Kexin

From: linguistic-a...@googlegroups.com <linguistic-a...@googlegroups.com> on behalf of Kristopher Kyle <kristop...@gmail.com>
Sent: Thursday, December 1, 2022 12:15:02 AM
To: Yan, Kexin <ky...@exeter.ac.uk>
Cc: linguistic-a...@googlegroups.com <linguistic-a...@googlegroups.com>
Subject: Re: Question about TAALES.
