Issue with permutation test

Christoph Ruehlemann

May 4, 2018, 1:15:04 AM

to corplin...@googlegroups.com

Dear All

I'm interested in linguistic units occurring in two different conditions: (i) as utterances (or 'speaking turns') and (ii) as constructed dialog (or 'direct speech'). Here's a sample dataframe:

ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70)
)

To establish whether the units occur more frequently in either condition I want to perform a permutation test. Here's the code:

n_cd <- sum(ALL$FREQinCD)   # Total number of tokens in constructed dialog (in the actual, much bigger df: 1769)
n_utt <- sum(ALL$FREQinUTT) # Total number of tokens in utterances (in the actual df: 8064)
for (i in seq_along(ALL$Units)) {
  x_cd <- ALL[i, 2]   # Frequency of i-th unit in constructed dialog
  x_utt <- ALL[i, 3]  # Frequency of i-th unit in utterances
  Occurrence_cd <- c(rep(1, x_cd), rep(0, n_cd - x_cd))      # 0/1 occurrence vector for constructed dialog
  Occurrence_utt <- c(rep(1, x_utt), rep(0, n_utt - x_utt))  # 0/1 occurrence vector for utterances
  p <- perm.test(Occurrence_cd, Occurrence_utt, conf.level = 0.95, exact = TRUE, conf.int = TRUE)
  if (i == 1) print(c("Word", "Freq_cd", "Freq_utt", "CI_lower", "CI_upper", "P$perm"))
  print(c(as.character(ALL$Units[i]), x_cd, x_utt, round(p$conf.int[1:2], 5), round(p$p.value, 8)))
}

The code, however, must be somewhat faulty: the execution takes ages and, what is more, confidence intervals are invariably NA and p-values are 0. Where's the mistake?

Chris

--

https://www.uni-marburg.de/fb10/iaa/institut/personal/ruehlemann

ἰχθύς

Christoph Ruehlemann

May 6, 2018, 6:57:44 AM

to corplin...@googlegroups.com

To make the sample reproducible, I should note that perm.test() is part of the package exactRankTests. So here's the full code:

# load package 'exactRankTests' (install it first if necessary) for perm.test:

library(exactRankTests)

# data:

ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70)
)

# Total number of tokens:
n_cd <- sum(ALL$FREQinCD)   # Total number of tokens in constructed dialog (in the actual, much bigger df: 1769)
n_utt <- sum(ALL$FREQinUTT) # Total number of tokens in utterances (in the actual df: 8064)


# run perm.test:

--

https://www.uni-marburg.de/fb10/iaa/institut/personal/ruehlemann

ἰχθύς

ludovic de cuypere

May 10, 2018, 5:11:07 AM

to corplin...@googlegroups.com

Dear Chris

My guess would be to change exact=TRUE to exact=FALSE to obtain an approximation of the p-value. I believe exact=TRUE considers all permutations (which can be computationally expensive), while exact=FALSE randomly samples permutations.
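To illustrate the idea of randomly sampled permutations, here is a minimal base-R sketch using the sample data from this thread. It does not use exactRankTests, and I have not checked that it matches what perm.test() does internally with exact=FALSE. The helper mc_perm_test() and its argument B (number of random relabellings) are hypothetical names of my own. The first part only shows, on a log10 scale, how many distinct relabellings an exhaustive enumeration would have to cover.

```r
# Sample data from this thread
ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70),
  stringsAsFactors = FALSE
)
n_cd <- sum(ALL$FREQinCD)
n_utt <- sum(ALL$FREQinUTT)

# Number of ways to choose which n_cd of the n_cd + n_utt tokens are
# labelled "constructed dialog", on a log10 scale (the raw count overflows).
log10_arrangements <- lchoose(n_cd + n_utt, n_cd) / log(10)

# Hypothetical helper (not part of exactRankTests): Monte Carlo permutation
# test for the difference in relative frequency between the two conditions.
mc_perm_test <- function(x_cd, x_utt, n_cd, n_utt, B = 2000) {
  # Pool all tokens: 1 = token is the unit of interest, 0 = any other unit
  pooled <- c(rep(1, x_cd + x_utt), rep(0, n_cd + n_utt - x_cd - x_utt))
  observed <- x_cd / n_cd - x_utt / n_utt
  sim <- replicate(B, {
    # Randomly relabel n_cd tokens as "constructed dialog"
    cd_hits <- sum(sample(pooled, n_cd))
    cd_hits / n_cd - (sum(pooled) - cd_hits) / n_utt
  })
  # Two-sided p-value; the +1 correction keeps it from being exactly 0
  (sum(abs(sim) >= abs(observed)) + 1) / (B + 1)
}

set.seed(1)  # for reproducible sampling
p_values <- sapply(seq_len(nrow(ALL)), function(i) {
  mc_perm_test(ALL$FREQinCD[i], ALL$FREQinUTT[i], n_cd, n_utt)
})
names(p_values) <- ALL$Units
print(round(p_values, 4))
```

With B random relabellings the smallest p-value this can report is 1/(B + 1), so increase B if you need finer resolution.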

Best

Ludovic

From: 'Christoph Ruehlemann' via CorpLing with R <corplin...@googlegroups.com>
Sent: Sunday, May 6, 2018 12:57
To: corplin...@googlegroups.com
Subject: [CorpLing with R] Re: Issue with permutation test
