TAALES: Interpreting output

Shireen

Mar 15, 2023, 3:31:51 PM

to Suite of automatic linguistic analysis tools

Hi all,

When I run TAALES, I get two different output files. One is "results" and the other is "results_index_coverage." Could someone explain to me the difference between the two files?

Many thanks in advance.

Masaki Eguchi

Mar 15, 2023, 3:53:02 PM

to Shireen, Suite of automatic linguistic analysis tools

Hi Shireen,

I am responding here from what I know.

- results.csv (the default name) is the output you would normally look at. It contains the results of the TAALES analysis.

- results_index_coverage.csv reports how many of the tokens in each text file are accounted for in each index. It is usually used only as a sanity check and is rarely of interest in its own right. It shows how many of your words are actually reflected in each score, so extremely low coverage means you should interpret the main output with caution, because many words in your text files were outside the vocabulary of the TAALES indices.
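
As a rough sketch of that sanity check, assuming the coverage file has a filename column followed by one column per index holding a proportion between 0 and 1. The column names, values, and the 0.5 threshold below are made up for illustration and are not taken from real TAALES output, so check your own file first.

```python
import csv
import io

# Hypothetical excerpt of results_index_coverage.csv; the real column
# names and value scale may differ.
SAMPLE = """Filename,COCA_Academic_Frequency_Log_CW,COCA_academic_bi_T
essay_01.txt,0.93,0.71
essay_02.txt,0.41,0.18
"""

def low_coverage(csv_file, threshold=0.5):
    """Return (filename, index, coverage) for every value below threshold."""
    flagged = []
    for row in csv.DictReader(csv_file):
        name = row.pop("Filename")
        for index_name, value in row.items():
            if float(value) < threshold:
                flagged.append((name, index_name, float(value)))
    return flagged

flagged = low_coverage(io.StringIO(SAMPLE))
for name, index_name, value in flagged:
    print(f"{name}: {index_name} covers only {value:.0%} of tokens")
```

To run this on a real file, pass an open file handle instead of the in-memory sample.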

I hope that helps.

Let us know if you have any other questions!

Best,

Masaki

--
You received this message because you are subscribed to the Google Groups "Suite of automatic linguistic analysis tools" group.
To unsubscribe from this group and stop receiving emails from it, send an email to linguistic-analysi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/linguistic-analysis-tools/ab7c09df-d59b-43cc-a707-23ca6ce49a6an%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shireen

Mar 15, 2023, 7:08:47 PM

to Suite of automatic linguistic analysis tools

Dear Masaki,

Many thanks for your quick reply and explanation! I think I understand now.

I am using 4 indices in my research but I'm not sure I understand how they are calculated. The indices are:

1. COCA_Academic_Frequency_Log_CW

2. COCA_academic_bi_T

3. COCA_academic_tri_MI

4. Written_AFL_Normed

I was hoping someone could explain to me how each index is calculated. I've tried reading all the material I can find, but I'm still confused. I appreciate any help you can give me.

Thanks again in advance,

Shireen

Masaki Eguchi

Mar 15, 2023, 11:34:40 PM

to Shireen, Suite of automatic linguistic analysis tools

Dear Shireen,

Thanks for the follow-up questions. I will answer these from what I know!

For the COCA indices, the analysis has two parts: calculating the reference norms and analyzing the input texts.

1. COCA_Academic_Frequency_Log_CW

In the reference norm calculation, "COCA Academic Frequency Log" means that the word frequencies in the Academic section of COCA are log-transformed. The CW suffix means the index is computed over content words only.

When analyzing your input text, TAALES averages these log frequency values, using the number of words that were assigned a frequency score as the denominator. Words that are not in the COCA reference norms are therefore excluded from the index score. This is relevant to your original question about the coverage file, FYI.
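
The two steps can be sketched as follows. This is only an illustration under stated assumptions: the reference frequencies are made up, the log base (10) is an assumption, and the function name is hypothetical rather than part of TAALES.

```python
import math

# Made-up raw frequencies standing in for the COCA Academic word list.
reference_freq = {"analysis": 1000, "data": 10000, "suggest": 100}

# Reference step: log-transform each frequency (base 10 is an assumption).
reference_log = {w: math.log10(f) for w, f in reference_freq.items()}

def mean_log_frequency(tokens, norms):
    """Average the log frequency over the tokens found in the norms only.

    Returns (score, coverage); score is None when no token is covered.
    """
    scores = [norms[t] for t in tokens if t in norms]
    if not scores:
        return None, 0.0
    coverage = len(scores) / len(tokens)
    return sum(scores) / len(scores), coverage

# "blorf" is not in the norms, so it is left out of the denominator.
tokens = ["data", "analysis", "suggest", "blorf"]
score, coverage = mean_log_frequency(tokens, reference_log)
print(score, coverage)
```

The out-of-vocabulary token lowers the coverage value but does not affect the score itself, which is why the coverage file is worth checking.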

2. COCA_academic_bi_T

In the reference norm calculation, bigrams are extracted from the Academic section of COCA and their T-scores are calculated.

When analyzing the input text, TAALES averages the T-scores over the bigrams that have a T-score assigned.
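
A minimal sketch of this, assuming the usual collocation T-score, (observed - expected) / sqrt(observed), with the expected count taken from the two unigram frequencies. The counts and corpus size are made up, and whether TAALES uses exactly this formulation should be checked against the papers cited below.

```python
import math

# Made-up reference counts standing in for the COCA Academic section.
CORPUS_SIZE = 1_000_000
unigram_freq = {"in": 20000, "other": 2000, "words": 1000}
bigram_freq = {("in", "other"): 400, ("other", "words"): 100}

def t_score(bigram):
    """T-score = (observed - expected) / sqrt(observed)."""
    w1, w2 = bigram
    observed = bigram_freq[bigram]
    expected = unigram_freq[w1] * unigram_freq[w2] / CORPUS_SIZE
    return (observed - expected) / math.sqrt(observed)

def mean_bigram_t(tokens):
    """Average T-score over the text bigrams that exist in the reference."""
    bigrams = zip(tokens, tokens[1:])
    scores = [t_score(b) for b in bigrams if b in bigram_freq]
    return sum(scores) / len(scores) if scores else None

# ("words", "matter") has no reference T-score, so it is skipped.
text_score = mean_bigram_t(["in", "other", "words", "matter"])
print(text_score)
```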

3. COCA_academic_tri_MI

Trigrams are a little more complicated, because T-scores and MI are generally calculated between two linguistic units. The trigram indices therefore come in two variants, and you seem to be using the first. If I remember correctly, the first variant (e.g., COCA_academic_tri_MI) calculates the association between the first unigram and the following bigram (e.g., in + other words). The second variant, marked with _2_ in the index name (e.g., COCA_academic_tri_2_MI), calculates the association between the first bigram and the following unigram (e.g., in other + words).
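
To make the two variants concrete, here is a small sketch, assuming MI is log2(observed / expected) and that the split points are as described above (which is itself recalled from memory). All counts are made up, and the function names are hypothetical.

```python
import math

# Made-up reference counts standing in for the COCA Academic section.
CORPUS_SIZE = 1_000_000
unigram_freq = {"in": 20000, "words": 1000}
bigram_freq = {("in", "other"): 400, ("other", "words"): 100}
trigram_freq = {("in", "other", "words"): 80}

def mi(observed, freq_a, freq_b, n=CORPUS_SIZE):
    """Mutual information between two units: log2(observed / expected)."""
    expected = freq_a * freq_b / n
    return math.log2(observed / expected)

def tri_mi(trigram):
    """First variant: unigram + following bigram (in + other words)."""
    w1, w2, w3 = trigram
    return mi(trigram_freq[trigram], unigram_freq[w1], bigram_freq[(w2, w3)])

def tri_2_mi(trigram):
    """Second variant: bigram + following unigram (in other + words)."""
    w1, w2, w3 = trigram
    return mi(trigram_freq[trigram], bigram_freq[(w1, w2)], unigram_freq[w3])

trigram = ("in", "other", "words")
print(tri_mi(trigram), tri_2_mi(trigram))
```

The same trigram gets a different score under each variant because the expected count depends on where the trigram is split.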

4. Written_AFL_Normed

According to the index description sheet, this is the number of words that are part of the Academic Formulas List, divided by the number of words in the text.
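
A rough sketch of a normed count of this kind. The formula list here is a two-item stand-in, and the sketch counts formula occurrences rather than the words inside them, which is an assumption; the index description sheet is the place to confirm how TAALES actually counts.

```python
# Hypothetical stand-in for the written Academic Formulas List.
AFL_WRITTEN = [("in", "terms", "of"), ("on", "the", "other", "hand")]

def afl_normed(tokens, formulas=AFL_WRITTEN):
    """Formula occurrences in the text divided by the number of words."""
    if not tokens:
        return 0.0
    hits = 0
    for formula in formulas:
        n = len(formula)
        # Slide a window of the formula's length across the text.
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == formula:
                hits += 1
    return hits / len(tokens)

tokens = "in terms of cost it works in terms of time".split()
afl_score = afl_normed(tokens)
print(afl_score)
```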

You can find this information in the index description sheet.

https://docs.google.com/spreadsheets/d/1axmeHlKE-aelPHX4L17WpHjC7Jn4yQlE/edit#gid=858394526

Or you may find these in the papers:

Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757–786. https://doi.org/10.1002/tesq.194

Kyle, K., Crossley, S. A., & Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods, 50(3), 1030–1046. https://doi.org/10.3758/s13428-017-0924-4

Best,

Masaki

Shireen

Mar 16, 2023, 3:24:05 PM

to Suite of automatic linguistic analysis tools

Thank you so much for your explanations. Everything is becoming a bit clearer. I have two follow-up questions.

First, I want to check if my understanding of #1 (COCA_Academic_Frequency_Log_CW) is correct. Here is what I understand:

1. The program checks to see which words in the text appear in COCA.
2. Each word in COCA appears X number of times -- that is its "frequency score." This frequency score is then log-transformed.
3. Each word in the text gets a frequency score based on the log-transformed frequency scores in COCA.
4. The frequency scores of all the words in the text are summed. This sum becomes the numerator.
5. The denominator is the number of words in the text that appear in COCA.
6. Interpretation: texts with higher index scores use more high frequency words.

Second, I want to check if my understanding of #2 (COCA_academic_bi_T) is correct. Here is what I understand:

1. The program checks to see which bigrams in the text appear in COCA.
2. The T-score for each bigram is calculated.
3. The T-scores for all the bigrams in the text that appear in COCA are summed. This becomes the numerator.
4. The denominator is the number of bigrams in the text that have T-scores in COCA.
5. Interpretation: texts with higher index scores contain more bigrams with high T-scores.

Please correct me if I am misunderstanding anything. I really appreciate your time and patience!

Best,
Shireen
