ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70)
)
To establish whether the units occur more frequently in one condition than in the other, I want to perform a permutation test. Here's the code:
# Total number of tokens in constructed dialog (in the actual, much bigger df: 1769)
n_cd <- sum(ALL$FREQinCD)
# Total number of tokens in utterances: 8064
n_utt <- sum(ALL$FREQinUTT)
for (i in seq_len(nrow(ALL))) {
  x_cd <- ALL$FREQinCD[i]   # Frequency of i-th unit in constructed dialog
  x_utt <- ALL$FREQinUTT[i] # Frequency of i-th unit in utterances
  # One entry per token: 1 if the token is the i-th unit, 0 otherwise
  Occurrence_cd <- c(rep(1, x_cd), rep(0, n_cd - x_cd))     # constructed dialog
  Occurrence_utt <- c(rep(1, x_utt), rep(0, n_utt - x_utt)) # utterances
  p <- perm.test(Occurrence_cd, Occurrence_utt, conf.level = 0.95, exact = TRUE, conf.int = TRUE)
  if (i == 1) print(c("Word", "Freq_cd", "Freq_utt", "CI_lower", "CI_upper", "P$perm"))
  print(c(as.character(ALL$Units[i]), x_cd, x_utt, round(p$conf.int[1:2], 5), round(p$p.value, 8)))
}
The code, however, must be somewhat faulty: the execution takes ages and, what is more, confidence intervals are invariably NA and p-values are 0. Where's the mistake?
# load package 'exactRankTests' (install it first), which provides perm.test:
library(exactRankTests)
Dear Chris
My guess would be to change exact=TRUE to exact=FALSE to obtain an approximation of the p-value. I believe exact=TRUE considers all permutations (which can be computationally expensive with this many observations), while exact=FALSE uses an approximation instead of enumerating them.
Best
Ludovic
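
As an illustration of the "approximate instead of enumerate" idea, here is a minimal base R sketch that does not use exactRankTests at all. It swaps in a Monte Carlo permutation test: the tokens of both conditions are pooled, reshuffled n_perm times, and the difference in relative frequency is recomputed each time. The helper perm_test_counts() and its n_perm argument are hypothetical names made up for this example. Because the data are just counts, stats::fisher.test() on the 2x2 table is shown next to it as a cross-check; its two-sided p-value is defined slightly differently, so the two columns should be close but need not match exactly.

```r
# Sketch only: Monte Carlo permutation test on count data, base R.
# perm_test_counts() is a hypothetical helper, not part of any package.
perm_test_counts <- function(x_cd, n_cd, x_utt, n_utt, n_perm = 2000) {
  # Observed difference in relative frequency between the two conditions
  obs <- x_cd / n_cd - x_utt / n_utt
  # Pooled tokens: 1 = token is the unit of interest, 0 = any other token
  pooled <- c(rep(1, x_cd + x_utt), rep(0, n_cd + n_utt - x_cd - x_utt))
  # Randomly reassign n_cd tokens to "constructed dialog" and recompute
  sim <- replicate(n_perm, {
    k <- sum(pooled[sample.int(length(pooled), n_cd)])
    k / n_cd - (x_cd + x_utt - k) / n_utt
  })
  # Two-sided p-value; the +1 keeps it from ever being exactly 0
  (sum(abs(sim) >= abs(obs) - 1e-12) + 1) / (n_perm + 1)
}

ALL <- data.frame(
  Units = c("yeah", "mm", "no", "oh", "yes"),
  FREQinCD = c(12, 1, 19, 13, 6),
  FREQinUTT = c(352, 199, 122, 72, 70)
)
n_cd <- sum(ALL$FREQinCD)
n_utt <- sum(ALL$FREQinUTT)

set.seed(1) # only to make the random permutations reproducible
ALL$p_perm <- mapply(perm_test_counts,
                     x_cd = ALL$FREQinCD, x_utt = ALL$FREQinUTT,
                     MoreArgs = list(n_cd = n_cd, n_utt = n_utt))

# Cross-check: Fisher's exact test on the corresponding 2x2 table
ALL$p_fisher <- mapply(function(x_cd, x_utt) {
  tab <- matrix(c(x_cd, n_cd - x_cd, x_utt, n_utt - x_utt), nrow = 2)
  fisher.test(tab)$p.value
}, ALL$FREQinCD, ALL$FREQinUTT)

print(ALL)
```

With the real totals (1769 and 8064 tokens), n_cd and n_utt would be set to those values rather than computed from the five-row sample. The smallest p-value this sketch can report is 1 / (n_perm + 1), so raise n_perm if finer resolution is needed.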