t-score of collocates


Earl Brown

Feb 25, 2014, 9:58:52 PM
to corplin...@googlegroups.com
Corpus linguistics R-ists,

Hunston 2002:70 (Corpora in Applied Linguistics published by Cambridge) says: "The t-score is calculated by subtracting Expected from Observed and dividing the result by the standard deviation." I've googled around trying to find as clear a formula for t-score as Davies gives for Mutual Information here:

http://corpus.byu.edu/mutualInformation.asp

but have been unsuccessful. Hence, my post.

Can anyone supply me with the formula (preferably in R) for the t-score of a collocate, given:
  •     the freq of the node word
  •     the freq of the collocate
  •     the freq of the collocation
  •     the span around the node word
  •     the size of the corpus in number of words

Here's what I've got in R for MI (given Davies' formula for MI):

freq.node <- 1262
freq.collocate <- 115
freq.collocation <- 24
size.corpus <- 96263399
span <- 3 # words on either side of node

log10((freq.collocation * size.corpus) / (freq.node * freq.collocate * (span * 2))) / log10(2)
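For reference, the same MI value can be written with `log2()` directly, and one common formulation of the t-score (following Hunston's Observed/Expected description, with `sqrt(Observed)` standing in for the standard deviation, and the expected frequency computed from the same windowed counts — a sketch of my reading, not necessarily Davies' exact implementation) is:

```r
freq.node <- 1262
freq.collocate <- 115
freq.collocation <- 24
size.corpus <- 96263399
span <- 3  # words on either side of node

# expected frequency of the collocation under independence
expected <- (freq.node * freq.collocate * (span * 2)) / size.corpus

# MI, written with log2() directly (same value as the log10 ratio above)
mi <- log2(freq.collocation / expected)

# t-score: (Observed - Expected) / sqrt(Observed)
t.score <- (freq.collocation - expected) / sqrt(freq.collocation)
```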

Thanks in advance for any help. Best, Earl Brown

Stefan Th. Gries

Feb 25, 2014, 10:10:50 PM
to CorpLing with R

Hi Earl

If you go to my research website, to the "to appear" section, you will find a formula for t that is easy to paraphrase into R.

Cheers,
STG
--
Stefan Th. Gries
----------------------------------
Univ. of California, Santa Barbara
http://tinyurl.com/stgries
----------------------------------

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To post to this group, send email to corplin...@googlegroups.com.
Visit this group at http://groups.google.com/group/corpling-with-r.
For more options, visit https://groups.google.com/groups/opt_out.

Earl Brown

Feb 26, 2014, 1:36:29 AM
to corplin...@googlegroups.com
Thanks Stefan. I think I've implemented it correctly in R. I'll paste a script below that I wrote.

BTW, on page 6, in the paragraph above example 7, there's a typo of "100,00,000", which might be "99,966,209" anyway.

Thanks again. Earl Brown

##########
# Script to calculate raw freq, mutual information, and t-score for collocations

rm(list = ls(all = T))

# loads novel
novel <- tolower(scan("/Users/earlbrown/Corpora/Jane_Austen/Pride_and_Prejudice.txt", what = "char", sep = "\n"))

# gets words
words <- unlist(strsplit(novel, "[^-'a-zA-Z]+"))
words <- gsub("(^'|'$)", "", words)
words <- words[nchar(words) > 0]
freq.list.words <- sort(table(words), decreasing = T)

# gets bigrams
bigrams <- paste(words[-length(words)], words[-1])  # consecutive word pairs
freq.list.bigrams <- sort(table(bigrams), decreasing = T)

# loops over bigrams
all.output <- c()
for (i in 1:length(freq.list.bigrams)) {
  cur.bigram <- names(freq.list.bigrams[i])
  if (i %% 100 == 0) cat("Working on bigram ", i, " of ", length(freq.list.bigrams), ": ", cur.bigram, "\n", sep = "")
 
  # gets bigram freq and expected freq
  wd1 <- unlist(strsplit(cur.bigram, " "))[1]
  wd2 <- unlist(strsplit(cur.bigram, " "))[2]
  aa <- freq.list.bigrams[names(freq.list.bigrams) == cur.bigram]
  bb <- freq.list.words[names(freq.list.words) == wd1] - aa
  cc <- freq.list.words[names(freq.list.words) == wd2] - aa
  exclude1 <- grep(sprintf("\\b%s\\b", wd1), names(freq.list.bigrams))
  exclude2 <- grep(sprintf("\\b%s\\b", wd2), names(freq.list.bigrams))
  exclude <- union(exclude1, exclude2)
  dd <- length(bigrams) - length(exclude)    
  aa.expected <- (aa + bb) * (aa + cc) / dd
 
  # gets MI and t-score
  mi.score <- log(aa/aa.expected, base = 2)
  t.score <- (aa - aa.expected) / sqrt(aa)
 
  # saves current bigram to collector
  cur.output <- paste(cur.bigram, aa, mi.score, t.score, sep = "\t")
  all.output <- append(all.output, cur.output)
 
} # next bigram

cat("BIGRAM\tRAW_FREQ\tMI\tT_SCORE", all.output, file = "bigrams.csv", sep = "\n")
cat("\aAll done!\n")

Stefan Th. Gries

Feb 26, 2014, 1:40:06 AM
to CorpLing with R

Thanks, I'll keep my eyes open for that typo when the proofs arrive!

Stefan Th. Gries

Feb 28, 2014, 12:11:16 PM
to CorpLing with R
> BTW, on page 6, in the paragraph above example 7, there's a typo of "100,00,000", which might be "99,966,209" anyway.
you sure? The computation of a_expected requires division by N, not d?

Earl Brown

Feb 28, 2014, 8:56:41 PM
to corplin...@googlegroups.com
I don't know. That would mean my R code is incorrect, right?

Stefan Th. Gries

Feb 28, 2014, 9:00:09 PM
to CorpLing with R
It would :-) The expected frequency in the relevant 2x2 table is
computed as in a chi-squared test, so the denominator should be
N = a+b+c+d. In practice that will often not make much of a
difference, but it's of course still wrong.
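In R terms, applied to the script earlier in the thread, that correction would amount to something like this (a sketch, with `aa`, `bb`, `cc`, and `dd` as computed there):

```r
# N is the total of all four cells of the 2x2 table,
# not just the d cell
nn <- aa + bb + cc + dd
aa.expected <- (aa + bb) * (aa + cc) / nn
```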