Questions about the corpus version of TAALES 2.2


Yan Kexin

Aug 20, 2023, 8:14:49 PM
to Suite of automatic linguistic analysis tools
Hi Kristopher and Scott,

I am Kexin Yan, a PhD student in corpus linguistics, and I am using TAALES 2.2 in my study. I would like to ask some questions about it.

First, regarding word frequency: TAALES 2.2 includes at least nine written word frequency indices, drawing on corpora that range from the BNC, COCA (4 sub-corpora), Kucera-Francis, SUBTLEXus, and Thorndike-Lorge to the Brown corpus. I know that some corpora, such as COCA, are updated over time. For example, the COCA page (https://www.english-corpora.org/coca/) says the current version contains "25+ million words each year 1990-2019 from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, TV and Movies subtitles, blogs, and other web pages." Could you tell me which version of COCA is used in TAALES, including its word count and the years the data were collected (e.g., 2001 to 2005)?

By the way, I have looked at the user manuals for TAALES 2.5 and 2.8, and the two versions define content words and function words differently. However, the user manual for TAALES 2.2 is no longer available on the website. Could you describe how content words and function words are defined in TAALES 2.2?

Thank you so much.

Kind regards,

Kexin





Kristopher Kyle

Aug 21, 2023, 1:45:09 PM
to Yan Kexin, Suite of automatic linguistic analysis tools
Hi Kexin,

With the exception of the polysemy and hypernymy indices, TAALES 2.2 uses a function-word stop list to differentiate between content words and function words. This was updated in subsequent versions of TAALES through the use of a POS tagger and dependency parser.

See attached (final row) for the function word stop list used by TAALES 2.2.
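
As a rough illustration of the stop-list approach (not the actual TAALES code), the split can be sketched in Python as follows. The word set, the function name, and the regex tokenization are all hypothetical placeholders chosen for the example; the real function word list is the one in the attachment.

```python
import re

# Hypothetical, abbreviated stop list for illustration only.
# The list actually used by TAALES 2.2 is in the attached file.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "but", "to", "is", "it"}


def split_content_function(text, stop_list=FUNCTION_WORDS):
    """Split tokens into (content_words, function_words) using a stop list.

    Assumption: simple lowercase regex tokenization; a real tool may
    tokenize differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    content = [t for t in tokens if t not in stop_list]
    function = [t for t in tokens if t in stop_list]
    return content, function


content, function = split_content_function("The cat sat on the mat and purred.")
print(content)   # words not found in the stop list
print(function)  # words found in the stop list
```

Under this approach a word's category depends only on whether it appears in the list, not on how it is used in the sentence, which is the main difference from a tagger- or parser-based definition.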

Best,

Kris 



--
Kristopher Kyle
Associate Professor
Department of Linguistics
University of Oregon
master_word_list.txt