Type-Token Ratio (TTR) and Standardised Type-Token Ratio (STTR)

David King

Jun 1, 2013, 5:31:13 AM
to ant...@googlegroups.com
Hi

Like so many on here, I am indebted to Prof Anthony for AntConc. I wouldn't be able to undertake my corpus-based research without it! I am, however, having some difficulty representing linguistic diversity. Obviously, I can see the type and token counts, and the TTR is just the ratio of types to tokens expressed as a percentage (# of types / # of tokens x 100). The problem with the TTR, though, is that it is really only valid when texts are of more or less the same length (because the number of types has been found to be influenced by the number of tokens). The STTR is considered (by some) a way around this. It calculates the TTR for each successive section of N words (N is generally 1,000 but can be anything) and takes the mean, but I can't figure out how to get AntConc to do this.

If it's not possible in AntConc, maybe someone knows whether there is a range of variation in token count within which the TTR is still considered useful/valid? Baker (2006) suggests the TTR is fine to use for texts under 5,000 tokens, but I am unsure whether you could compare the TTR of a text of 1,000 tokens with that of a text of 5,000 tokens and draw any meaningful conclusions re: linguistic diversity. Any ideas? Many thanks in advance.

David

Laurence Anthony

Jun 1, 2013, 11:05:13 PM
to ant...@googlegroups.com
Hi David,

What you are asking for is quite a specialized tool. It would need to split the corpus into sections of N words, calculate the TTR for each section, and then perhaps work out the average TTR value.
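The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not an AntConc feature: the function names `tokenize`, `ttr`, and `sttr` are made up for this example, the tokenizer is a simple assumption (lowercased runs of letters, digits, and apostrophes), and dropping the final incomplete section is one common convention rather than the only possible choice.

```python
import re


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    # A different token definition will give different counts.
    return re.findall(r"[a-z0-9']+", text.lower())


def ttr(tokens):
    # Type-token ratio as a percentage: (# of types / # of tokens) x 100.
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def sttr(tokens, n=1000):
    # Mean TTR over consecutive, non-overlapping sections of n tokens.
    # Assumption: a trailing section shorter than n tokens is ignored.
    sections = [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]
    if not sections:
        return None  # the text is shorter than one full section
    return sum(ttr(section) for section in sections) / len(sections)


# Usage: a 12-token toy text, split into sections of 5 tokens.
tokens = tokenize("The cat sat on the mat. The dog sat on the log.")
print(len(tokens), round(ttr(tokens), 2), sttr(tokens, n=5))
```

Because every section has the same length, the section TTRs are comparable with one another, which is the property the plain TTR lacks when texts differ in size.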

At the moment, AntConc cannot do this. In fact it does not even calculate the TTR for the corpus as a whole (although as you say it is easy to just divide the types by the tokens). However, I do have two other tools that I put together that might perhaps be useful. One tool takes a file and divides it into equal size chunks of words, with the size of the chunk determined by the user. If your corpus was stored as a single file, this tool could be used to split it into sections. I have another tool that creates a vocabulary profile for individual files. If you put your section files into this tool, it would tell you the number of types and tokens in each, and from this you could create your STTR value.

Another alternative is to create a new dedicated tool to calculate STTR values. If you think many users would be interested in such a tool, I could quite easily put it together.

A final alternative would be for you to set up a programming environment such as Python or R and code the tool directly. The tool itself would be relatively simple to code, but if you don't have a programming background, getting to the stage of being able to code it might take quite a while.

Having said all that, I suspect that there is a much better measure of lexical diversity than STTR, which might be easier to calculate, too.

Does anybody else have any suggestions? Perhaps you could post the question on the Corpora List, where I'm sure you'd get some very detailed answers.

Laurence.

David King

Jun 4, 2013, 9:40:41 AM
to ant...@googlegroups.com
Dear Laurence

Thank you for your prompt and comprehensive reply - much appreciated!   My current research corpus consists of 2,000 .txt files arranged along 3 variables.  Not to bore you with too much of the nitty-gritty, but these 2,000 .txt files have been grouped into 4 primary sub-corpora, which in turn have been divided into 5 secondary sub-corpora, which in turn have been divided into 2 tertiary sub-corpora.  In total then, there are 40 groups of 50 individual .txt files.

Based on how I've organised my corpus, it would seem to me that the second tool you mentioned would be useful.  But can I clarify that it is meant to provide the type and token counts for multiple files at once; i.e. is it possible to run, for example, a group of 50 .txt files through it in one go and get a type and token count for each of the 50 files?  If I've misunderstood how that works, my apologies.  Otherwise, wouldn't I need to run each of my 2,000 .txt files individually through AntConc (I use AntConc 3.2.4w) and separately record the type and token counts as they appear in the wordlist function?
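For anyone taking the scripting route, batch counting of this kind might look roughly like the sketch below. It is not the tool mentioned in the reply above. The names `count_types_tokens` and `profile_folder` are hypothetical, the files are assumed to be UTF-8, and the tokenizer is an assumption, so the counts may differ from AntConc's depending on its token definition settings.

```python
import re
from pathlib import Path


def count_types_tokens(text):
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return len(set(tokens)), len(tokens)


def profile_folder(folder, encoding="utf-8"):
    # Return {file name: (types, tokens)} for every .txt file directly
    # inside the given folder (sub-folders are not searched).
    results = {}
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding=encoding)
        results[path.name] = count_types_tokens(text)
    return results


# Usage (the folder name is a placeholder for one group of 50 files):
# for name, (types, tokens) in profile_folder("group_01").items():
#     print(name, types, tokens, round(100.0 * types / tokens, 2))
```

Calling `profile_folder` once per group would cover all 40 groups without opening each file by hand.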

Many thanks again, both for AntConc and for your time.

Kind regards
David

nova lyna

Jun 8, 2013, 6:52:12 PM
to ant...@googlegroups.com
Dear Laurence,

I am having a little trouble with my current regex search: AntConc displays some unwanted characters, as can be seen in the first line of the attached screen capture. How can I avoid these characters?

Thanks in advance.


irrelevant words in complex terms 65.PNG

Warren Tang

Jun 8, 2013, 8:52:10 PM
to ant...@googlegroups.com
Hi Nova,
Open your text file.

Find the problem character.

Highlight it and copy it (Ctrl+C).

Open 'find and replace'.

Paste the character into the 'find' input field.

In the 'replace' field, type the character you want, using the English keyboard setting. Click 'replace all'.

Hope that helps


Warren

Laurence Anthony

Jun 8, 2013, 10:02:48 PM
to ant...@googlegroups.com
Hi Nova,

Warren's suggestion is a good one. It seems that your data has some tags that you probably don't want. If they are completely regular, you might also be able to filter them out using the tag settings. However, if you really don't need them, just delete them as Warren suggests.
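If there are many files to clean, the same find-and-replace can be scripted. The sketch below is only an illustration under stated assumptions: the actual problem characters and tags are visible only in the screenshot, so the helper names `replace_character` and `strip_angle_tags` are hypothetical, and the second function assumes the unwanted tags are enclosed in angle brackets, which may not match the real data.

```python
import re


def replace_character(text, problem_char, replacement=""):
    # Equivalent of 'replace all' for a single problem character.
    return text.replace(problem_char, replacement)


def strip_angle_tags(text):
    # Assumption: tags look like <...>; anything between angle brackets
    # is deleted, and the text between tags is kept.
    return re.sub(r"<[^<>]*>", "", text)


# Usage with made-up examples:
print(replace_character("caf\ufffd bar", "\ufffd", "e"))
print(strip_angle_tags("the <w NN>cat</w> sat"))
```

It is safest to run this on copies of the files and compare a few by eye before replacing the originals.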

Laurence.