Issue with permutation test

Christoph Ruehlemann

May 4, 2018, 1:15:04 AM
to corplin...@googlegroups.com
Dear All

I'm interested in linguistic units occurring in two different conditions: (i) as utterances (or 'speaking turns') and (ii) as constructed dialog (or 'direct speech'). Here's a sample dataframe:
ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70)
  )

To establish whether the units occur more frequently in either condition I want to perform a permutation test. Here's the code:

n_cd <- sum(ALL$FREQinCD)   # Total number of tokens in constructed dialog (1769 in the actual, much bigger df)
n_utt <- sum(ALL$FREQinUTT) # Total number of tokens in utterances (8064 in the actual df)
for (i in seq_along(ALL$Units)) {
  x_cd <- ALL[i, 2]   # Frequency of i-th unit in constructed dialog
  x_utt <- ALL[i, 3]  # Frequency of i-th unit in utterances
  Occurrence_cd <- c(rep(1, x_cd), rep(0, n_cd - x_cd))      # Occurrence vector for constructed dialog
  Occurrence_utt <- c(rep(1, x_utt), rep(0, n_utt - x_utt))  # Occurrence vector for utterances
  p <- perm.test(Occurrence_cd, Occurrence_utt, conf.level = 0.95, exact = TRUE, conf.int = TRUE)
  if (i == 1) print(c("Word", "Freq_cd", "Freq_utt", "CI_lower", "CI_upper", "P$perm"))
  print(c(ALL$Units[i], x_cd, x_utt, round(p$conf[1:2], 5), round(p$p.value, 8)))
}

The code, however, must be somewhat faulty: the execution takes ages and, what is more, confidence intervals are invariably NA and p-values are 0. Where's the mistake?
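One way to get a feel for the scale of the problem is to count how many ways the pooled tokens can be divided between the two conditions. The sketch below uses the corpus totals quoted above (1769 and 8064). Whether perm.test() actually enumerates these splits is an assumption, not something this thread confirms, and the count alone does not explain the NA confidence intervals.

```r
# Corpus totals as quoted in the post above
n_cd  <- 1769  # tokens in constructed dialog
n_utt <- 8064  # tokens in utterances

# Number of distinct ways to assign n_cd of the pooled tokens to the
# constructed-dialog group, computed on the log scale because the raw
# count is far too large for a double
log10_splits <- lchoose(n_cd + n_utt, n_cd) / log(10)

# Number of decimal digits in the count of possible splits
print(floor(log10_splits) + 1)
```

A count with this many digits suggests that any approach considering every reassignment of condition labels is likely to be slow at this corpus size.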

Chris

Christoph Ruehlemann

May 6, 2018, 6:57:44 AM
to corplin...@googlegroups.com
To make the sample reproducible, I should note that the perm.test() is part of the package exactRankTests. So here's the full code:

# load package 'exactRankTests' for perm.test (install it first if needed):

library(exactRankTests)

# data:

ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70)
)

# Total number of tokens:
n_cd <- sum(ALL$FREQinCD)   # Total number of tokens in constructed dialog (1769 in the actual, much bigger df)
n_utt <- sum(ALL$FREQinUTT) # Total number of tokens in utterances (8064 in the actual df)

# run perm.test:

ludovic de cuypere

May 10, 2018, 5:11:07 AM
to corplin...@googlegroups.com

Dear Chris


My guess would be to change exact=TRUE to exact=FALSE to obtain an approximation of the p-value. I believe exact=TRUE considers all permutations (which can be computationally expensive), while exact=FALSE uses randomly sampled permutations.
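The idea of randomly sampled permutations can be sketched in base R without relying on any particular package. The helper below, mc_perm_test(), is hypothetical (it is not part of exactRankTests); it shuffles the pooled occurrence vector a fixed number of times and compares the observed difference in relative frequency with the shuffled ones. The number of permutations and the seed are arbitrary choices.

```r
# Hypothetical helper, not part of exactRankTests: Monte Carlo permutation
# test for a difference in relative frequency between two conditions.
mc_perm_test <- function(x_cd, n_cd, x_utt, n_utt, n_perm = 2000) {
  x_total <- x_cd + x_utt
  n_total <- n_cd + n_utt
  # Observed difference in relative frequency (CD minus utterances)
  observed <- x_cd / n_cd - x_utt / n_utt
  # Pooled occurrence vector: 1 = token is the unit, 0 = any other token
  pooled <- c(rep(1, x_total), rep(0, n_total - x_total))
  perm_diffs <- replicate(n_perm, {
    shuffled <- sample(pooled)             # random reassignment of tokens
    k <- sum(shuffled[seq_len(n_cd)])      # occurrences landing in the CD group
    k / n_cd - (x_total - k) / n_utt
  })
  # Two-sided p-value; the +1 keeps the estimate strictly above zero
  p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
  list(observed = observed, p.value = p_value)
}

# Usage with the first row of the sample data ("yeah": 12 of 51 vs 352 of 815)
set.seed(1)
res <- mc_perm_test(x_cd = 12, n_cd = 51, x_utt = 352, n_utt = 815)
print(res)
```

Because the p-value is estimated from a finite number of shuffles, it can never be exactly 0; its smallest possible value is 1 / (n_perm + 1).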


Best

Ludovic








From: 'Christoph Ruehlemann' via CorpLing with R <corplin...@googlegroups.com>
Sent: Sunday, May 6, 2018 12:57
To: corplin...@googlegroups.com
Subject: [CorpLing with R] Re: Issue with permutation test
 