Stand-alone potential of words


Christoph Ruehlemann

Jun 25, 2020, 9:31:07 AM
to statforli...@googlegroups.com
Hi all,

I'm working on the conversational subcorpus of the BNC trying to determine how words differ in terms of being able to form a complete utterance by themselves. To that end I've computed two frequency lists:

  1. F_standalone: the frequency of how often a word did stand alone in an utterance in the corpus.
  2. F_overall: the frequency of the respective words overall in the corpus.

Initially, I assumed that a word's potential for forming a complete utterance by itself could be read off the ratio of F_standalone to F_overall. While this may make sense for words that have high frequencies in both conditions, it makes much less sense for words that are rare in the corpus overall: you get the maximum ratio of 1 if a word that occurs just once in the whole corpus happens to occur as a stand-alone word. And that sole occurrence as a stand-alone could be due to chance.

A reproducible sample of the data I have is this:

mysample <- data.frame(
  Word = c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
  F_standalone = c(1,1,1,2,7,26,22875,11576,584),
  F_overall = c(1,35,2,3,58,140,60158,21954,877),
  Ratio = c(1.000000000,0.028571429,0.500000000,0.666666667,0.1206896552,0.18571429,0.3802487,0.5272843,0.6659065)
)

mysample
         Word F_standalone F_overall      Ratio
1    vesuvius            1         1 1.00000000
2       cruel            1        35 0.02857143
3 pentonville            1         2 0.50000000
4    mortuary            2         3 0.66666667
5        yuck            7        58 0.12068966
6    bollocks           26       140 0.18571429
7        yeah        22875     60158 0.38024870
8          mm        11576     21954 0.52728430
9      pardon          584       877 0.66590650

As can be seen from the sample, vesuvius occurs just once in either condition (as stand-alone and in the corpus as a whole) and thus has a ratio of 1; pentonville occurs once as a stand-alone utterance but twice overall, yielding a ratio of 0.5. On the other hand, words such as yeah, mm, or pardon have high frequencies both as stand-alone items and overall, and get ratios between 0.3 and 0.7.
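To illustrate how little a single occurrence pins the ratio down, here is a minimal sketch using exact binomial confidence intervals from base R's binom.test. Treating every occurrence of a word as an independent trial is an assumption made purely for illustration, and the variable names are made up for this example.

```r
# Counts taken from the sample above (three of the nine words)
f_standalone <- c(vesuvius = 1, pentonville = 1, pardon = 584)
f_overall    <- c(vesuvius = 1, pentonville = 2, pardon = 877)

# Exact (Clopper-Pearson) 95% interval for each word's stand-alone proportion
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int[1:2],
               f_standalone, f_overall))
colnames(ci) <- c("lower", "upper")

# Observed ratio next to its interval
round(cbind(ratio = f_standalone / f_overall, ci), 3)
```

The interval for vesuvius runs from 0.025 to 1, so its observed ratio of 1 is compatible with almost any underlying tendency, whereas the interval for pardon is narrow.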

Given the much higher observed frequencies, yeah, mm, and pardon intuitively seem to have a much higher capability of forming an utterance by themselves than vesuvius and pentonville. So the ratio is surely an unreliable metric. How can an item's capability of forming a complete utterance be determined more reliably and with more statistical rigor? Is Fisher's exact test an appropriate method? Or is a more complex statistical method warranted?

Advice is greatly appreciated!

Chris


--
Albert-Ludwigs-Universität Freiburg
Projekt-Leiter DFG-Projekt "Analyse multimodaler Interaktion im Geschichtenerzählen"
ἰχθύς

Christoph Ruehlemann

Jun 26, 2020, 10:50:06 AM
to statforli...@googlegroups.com
in pursuit of an answer to my inquiry ...

Is additive smoothing an option? See, for example, https://en.wikipedia.org/wiki/Additive_smoothing.

That is, in my case, adding, say, 20 or 50 to F_overall? It clearly has the desired effect: it barely alters the probability for words that are frequent overall, but it lowers the probability for words that are rare overall quite a lot. But is this method acceptable?
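A minimal sketch of what I mean, with the constant added to the denominator only. The value k = 20 is an arbitrary choice, and this differs from textbook additive smoothing, which also adds a constant to the numerator.

```r
# Sample counts from my first post
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck",
           "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)
)

k <- 20  # arbitrary smoothing constant (an assumption, not a recommended value)

# Raw ratio versus ratio with k added to the denominator only
mysample$Ratio <- mysample$F_standalone / mysample$F_overall
mysample$Ratio_smoothed <- mysample$F_standalone / (mysample$F_overall + k)

# Sort by the smoothed ratio, highest first
mysample[order(-mysample$Ratio_smoothed), ]
```

With k = 20, vesuvius drops from 1 to 1/21 (about 0.048), while pardon barely moves (from about 0.666 to about 0.651).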

Martin Schweinberger

Jun 26, 2020, 9:02:13 PM
to statforli...@googlegroups.com
Hi Chris,

I am not sure if this is what you are looking for but I wrote a short Notebook which should find elements that occur more frequently as stand-alone utterances than would be expected by chance. The Notebook showcases how you could do this (attached and below) but it comes without warranties (better check each step and see if it makes sense)...I was just dabbling ;)

Cheers,
Martin

Here's the code for the Notebook:

---
title: "Determining the tendency for elements to serve as stand-alone utterances"
author: "Anonymous"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  bookdown::html_document2: default
bibliography: bibliography.bib
link-citations: yes
---

This Notebook determines the tendency for elements to serve as stand-alone utterances.

```{r prep1, echo=T, eval = T, message=FALSE, warning=FALSE}
# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=10000)
# activate packages
library("dplyr")
```


Create sample data

```{r standalone_01, eval = T, echo = T, warning = F}

mysample <- data.frame(
  Word = c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
  F_standalone = c(1,1,1,2,7,26,22875,11576,584),
  F_overall = c(1,35,2,3,58,140,60158,21954,877),
  Ratio = c(1.000000000,0.028571429,0.500000000,0.666666667,0.1206896552,0.18571429,0.3802487,0.5272843,0.6659065)
)
# inspect data
head(mysample)
```

```{r standalone_03, eval = T, echo = T, warning = F}
newsample <- mysample %>%
  dplyr::select(-Ratio) %>%
  dplyr::rename(StandAlone = F_standalone,
                OverallFrequency = F_overall) %>%
  dplyr::mutate(OtherContexts = OverallFrequency-StandAlone) %>%
  dplyr::select(-OverallFrequency)
# inspect data
head(newsample)
```

In order to identify which words occur more frequently as stand-alone utterances than would be expected by chance, we have to determine the following frequencies:

* a = Number of times a term occurs as stand-alone utterance

* b = Number of times a term occurs in other contexts

* c = Number of times other terms occur as stand-alone utterance

* d = Number of times other terms occur in other contexts

In a first step, we create a table which holds these quantities.

```{r standalone_05, eval = T, echo = T, warning = F}
newsample <- newsample %>%
  dplyr::mutate(AllStandAlone = sum(StandAlone)) %>%
  dplyr::mutate(AllOtherContexts = sum(OtherContexts)) %>%
  dplyr::arrange(Word) %>%
  dplyr::mutate(a = StandAlone,
                b = OtherContexts,
                c = AllStandAlone - a,
                d = AllOtherContexts - b) %>%
  dplyr::mutate(NRows = nrow(newsample))
# inspect data
newsample
```
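As a rough illustration of what a, b, c, and d look like for a single word, the following sketch builds the 2x2 table for "pardon" by hand in base R. Note that in this toy sample "other terms" means only the other eight words, not the rest of the corpus; the object names are made up for this example.

```r
# Counts from the sample data
standalone <- c(vesuvius = 1, cruel = 1, pentonville = 1, mortuary = 2,
                yuck = 7, bollocks = 26, yeah = 22875, mm = 11576, pardon = 584)
overall <- c(vesuvius = 1, cruel = 35, pentonville = 2, mortuary = 3,
             yuck = 58, bollocks = 140, yeah = 60158, mm = 21954, pardon = 877)
other <- overall - standalone

word <- "pardon"
# Row 1: a, b (the word itself); row 2: c, d (all other words in the sample)
tab <- matrix(c(standalone[[word]], other[[word]],
                sum(standalone) - standalone[[word]],
                sum(other) - other[[word]]),
              ncol = 2, byrow = TRUE,
              dimnames = list(c(word, "other words"),
                              c("standalone", "other")))
tab

# Fisher's exact test on this single table
fisher.test(tab)$p.value
```

For "pardon" this gives a = 584, b = 293, c = 34489, and d = 47862.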

Perform the statz and inspect results

```{r standalone_07, eval = T, echo = T, warning = F}
results <- newsample %>%
  dplyr::rowwise() %>%
  # Fisher's exact test on each word's 2x2 table
  dplyr::mutate(p = fisher.test(matrix(c(a, b, c, d),
                                       ncol = 2, byrow = T))$p.value) %>%
  dplyr::mutate(p = round(p, 5)) %>%
  # chi-squared statistic, used below to compute phi as an effect size
  dplyr::mutate(x2 = as.vector(chisq.test(matrix(c(a, b, c, d),
                                                 ncol = 2, byrow = T))$statistic)) %>%
  dplyr::mutate(x2 = round(x2, 5)) %>%
  dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
  # expected stand-alone frequency of the word (cell a)
  dplyr::mutate(expected = chisq.test(matrix(c(a, b, c, d),
                                             ncol = 2, byrow = T))$expected[1]) %>%
  dplyr::mutate(expected = round(expected, 2)) %>%
  dplyr::mutate(phi = round(phi, 3)) %>%
  # Bonferroni correction: divide each threshold by the number of tests (NRows)
  # and check the strictest threshold first so that every level can be reached
  dplyr::mutate(BonferroniCorrectedSignificance = ifelse(p <= .001/NRows, "p<.001",
                               ifelse(p <= .01/NRows, "p<.01",
                               ifelse(p <= .05/NRows, "p<.05", "n.s.")))) %>%
  # determine type (Type = more freq. as stand-alone)
  dplyr::mutate(Type = ifelse(StandAlone > expected, "Type", "Antitype")) %>%
  # remove superfluous columns
  dplyr::select(-a, -b, -c, -d, -NRows, -AllStandAlone, -AllOtherContexts)
# inspect results
results
```

=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/



--
You received this message because you are subscribed to the Google Groups "StatForLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to statforling-wit...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/statforling-with-r/CALFCMoXKYh-wJBBF%3DL8gwTDOrLj%2B%2BBPxq%3Di5d8tpTX5pw-5L_g%40mail.gmail.com.
StandAloneNotebook.Rmd

Stefan Th. Gries

Jun 26, 2020, 9:53:56 PM
to StatForLing with R
Here's my decidedly more old-school take on this - let's see if the internet breaks if 'it sees' R code that doesn't have a single tidyverse thing ;-)

# generate data but as matrix
rm(list=ls(all=TRUE))
(x <- matrix(c(c(1,1,1,2,7,26,22875,11576,584),
               c(0,34,1,1,51,114,37283,10378,293)),
             ncol=2, dimnames=list(
                WORD=c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
                USE=c("standalone", "other"))))
# do stuff
p.values <- apply(x, 1, function (af) {
   temp <- matrix(c(af, colSums(x)-af), ncol=2, byrow=TRUE)
   round(fisher.test(temp)$p.value, 5) })
expecteds <- chisq.test(x, correct=FALSE)$expected
result <- data.frame(x, expecteds,
                     p.adjust(p.values),
                     x[,1]>expecteds[,1])
names(result) <- c(colnames(x), paste0(colnames(x), "_exp"), "Padj", "StandaloneIsPreferred")
result

However, the above doesn't address the smoothing issue; adding a small constant might help of course, as might Good-Turing.

Christoph Ruehlemann

Jun 27, 2020, 3:12:34 AM
to statforli...@googlegroups.com
Hi Martin,

Thank you so much for this incredible piece of work! Way more than I expected!  I'll take my time and try and wrap my head around it.

Best
Chris

Christoph Ruehlemann

Jun 27, 2020, 3:14:35 AM
to statforli...@googlegroups.com
Hi Stefan,

Thanks so much for this! It's far more than I expected!

Best
Chris

