I'm working on the conversational subcorpus of the BNC, trying to determine how words differ in their ability to form a complete utterance by themselves. To that end I've computed two frequency lists:
F_standalone: how often a word stood alone in an utterance in the corpus.
F_overall: how often the respective word occurs in the corpus overall.
Initially, I assumed that a word's potential to form a complete utterance by itself could be read off the ratio of F_standalone divided by F_overall.
While this may make sense for words that have high frequencies in both conditions, it makes much less sense for words that are rare in the corpus overall: you get the maximum ratio of 1 if a word that occurs just once in the whole corpus happens to occur as a stand-alone word. And that sole occurrence as a stand-alone word could be due to chance.
A reproducible sample of the data I have is this:
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck", "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877),
  Ratio = c(1.000000000, 0.028571429, 0.500000000, 0.666666667, 0.1206896552, 0.18571429, 0.3802487, 0.5272843, 0.6659065)
)
mysample
         Word F_standalone F_overall      Ratio
1    vesuvius            1         1 1.00000000
2       cruel            1        35 0.02857143
3 pentonville            1         2 0.50000000
4    mortuary            2         3 0.66666667
5        yuck            7        58 0.12068966
6    bollocks           26       140 0.18571429
7        yeah        22875     60158 0.38024870
8          mm        11576     21954 0.52728430
9      pardon          584       877 0.66590650
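The Ratio column in the sample is typed in by hand. As a small sketch, it can instead be derived from the two frequency columns, which also reproduces the values shown above to the printed precision:

```r
# Frequencies copied from the sample above
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck",
           "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)
)

# Ratio = stand-alone frequency divided by overall frequency
mysample$Ratio <- mysample$F_standalone / mysample$F_overall
mysample
```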
As can be seen from the sample, vesuvius occurs just once in either condition (as a stand-alone utterance and in the corpus as a whole) and thus has a ratio of 1; pentonville occurs once as a stand-alone utterance but twice overall, yielding a ratio of 0.5. On the other hand, words such as yeah, mm, or pardon have high frequencies both as stand-alone items and overall, and get ratios between 0.3 and 0.7.
Given the much higher observed frequencies, yeah, mm, and pardon intuitively seem to have a much higher capability of forming an utterance by themselves than vesuvius and pentonville.
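One way to make this intuition concrete, offered only as a sketch and not as a recommended final method, is to attach an exact binomial confidence interval to each ratio using base R's binom.test(). The column names CI_lower, CI_upper and CI_width are introduced here for illustration; treating each occurrence of a word as an independent trial is an assumption.

```r
# Frequencies copied from the sample above
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck",
           "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)
)
mysample$Ratio <- mysample$F_standalone / mysample$F_overall

# Exact (Clopper-Pearson) 95% interval for each word's stand-alone proportion.
# ASSUMPTION: occurrences of a word are treated as independent binomial trials.
ci <- t(mapply(
  function(k, n) binom.test(k, n)$conf.int,
  mysample$F_standalone, mysample$F_overall
))

mysample$CI_lower <- ci[, 1]
mysample$CI_upper <- ci[, 2]
mysample$CI_width <- mysample$CI_upper - mysample$CI_lower
mysample
```

For vesuvius the interval runs from 0.025 to 1, so the single observation says almost nothing, whereas for yeah the interval should be less than one percentage point wide.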
So the ratio is surely an unreliable metric. How can an item's capability of forming a complete utterance be determined more reliably and with more statistical rigor? Is Fisher's exact test an appropriate method? Or is a more complex statistical method warranted?
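To show what the Fisher's exact test option might look like in practice, here is a minimal sketch using base R's fisher.test(). The helper fisher_for_word is hypothetical, and the totals are taken from the nine sample rows only, which is a placeholder assumption; a real analysis would need the stand-alone and overall totals of the whole subcorpus.

```r
# Frequencies copied from the sample above
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck",
           "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)
)

# ASSUMPTION: totals come from the nine sample rows only, as placeholders
# for the true totals of the conversational subcorpus.
total_standalone <- sum(mysample$F_standalone)
total_overall <- sum(mysample$F_overall)

# Hypothetical helper: two-sided Fisher's exact test on a 2x2 table
#   rows:    this word / all other words
#   columns: stand-alone uses / non-stand-alone uses
fisher_for_word <- function(k, n) {
  tab <- matrix(
    c(k, n - k,
      total_standalone - k, (total_overall - n) - (total_standalone - k)),
    nrow = 2, byrow = TRUE
  )
  fisher.test(tab)$p.value
}

mysample$p_fisher <- mapply(fisher_for_word,
                            mysample$F_standalone, mysample$F_overall)
mysample
```

Note that the p-value only indicates whether a word's stand-alone proportion differs from that of the other words; it does not by itself measure how strong the tendency is.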
Advice is greatly appreciated!
Chris
F_overall? It clearly has the desired effect: it does not much alter the probability of overall frequent words, but it lowers the probability of overall rare words quite a lot. But is this method acceptable?
--
You received this message because you are subscribed to the Google Groups "StatForLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to statforling-wit...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/statforling-with-r/CALFCMoXKYh-wJBBF%3DL8gwTDOrLj%2B%2BBPxq%3Di5d8tpTX5pw-5L_g%40mail.gmail.com.