# Stand-alone potential of words

### Christoph Ruehlemann

Jun 25, 2020, 9:31:07 AM
Hi all,

I'm working on the conversational subcorpus of the BNC, trying to determine how words differ in their ability to form a complete utterance by themselves. To that end, I've computed two frequency lists:

1. `F_standalone`: how often a word stands alone as a complete utterance in the corpus.
2. `F_overall`: how often the same word occurs in the corpus overall.

Initially, I assumed that a word's potential for forming a complete utterance by itself could be read off the ratio of `F_standalone` divided by `F_overall`. While this may make sense for words that are frequent in both conditions, it makes much less sense for words that are rare in the corpus overall: a word that occurs just once in the whole corpus and happens to occur as a stand-alone word gets the maximum ratio of 1. And that sole stand-alone occurrence could be due to chance.

A reproducible sample of the data I have is this:

```r
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck", "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877),
  Ratio = c(1.000000000, 0.028571429, 0.500000000, 0.666666667, 0.1206896552, 0.18571429, 0.3802487, 0.5272843, 0.6659065)
)

mysample
#          Word F_standalone F_overall      Ratio
# 1    vesuvius            1         1 1.00000000
# 2       cruel            1        35 0.02857143
# 3 pentonville            1         2 0.50000000
# 4    mortuary            2         3 0.66666667
# 5        yuck            7        58 0.12068966
# 6    bollocks           26       140 0.18571429
# 7        yeah        22875     60158 0.38024870
# 8          mm        11576     21954 0.52728430
# 9      pardon          584       877 0.66590650
```

As can be seen from the sample, `vesuvius` occurs just once in either condition (as a stand-alone utterance and in the corpus as a whole) and thus has a ratio of `1`; `pentonville` occurs once as a stand-alone utterance but twice overall, yielding a ratio of `0.5`. On the other hand, words such as `yeah`, `mm`, or `pardon` have high frequencies both as stand-alone items and overall, and get ratios between 0.3 and 0.7.
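One way to make the "could be by chance" point concrete is to put an interval around each ratio. The following is only a minimal sketch, assuming each occurrence of a word can be treated as an independent binomial trial (a simplification for corpus data); it uses base R's `binom.test()` for exact 95% confidence intervals on three of the sample words.

```r
# Counts copied from the sample above
standalone <- c(vesuvius = 1, pentonville = 1, pardon = 584)
overall    <- c(vesuvius = 1, pentonville = 2, pardon = 877)

# Point estimate: share of occurrences that are stand-alone utterances
ratio <- standalone / overall

# Exact (Clopper-Pearson) 95% confidence interval for each proportion
ci <- t(mapply(function(k, n) binom.test(k, n)$conf.int, standalone, overall))
colnames(ci) <- c("lower", "upper")

# A wide interval signals that the ratio is poorly supported by the data
round(cbind(ratio, ci, width = ci[, "upper"] - ci[, "lower"]), 3)
```

The interval for `vesuvius` spans almost the whole range from 0 to 1, whereas the one for `pardon` is narrow, which matches the intuition that the two ratios are not equally trustworthy.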

Given their much higher observed frequencies, `yeah`, `mm`, and `pardon` intuitively seem far more capable of forming an utterance by themselves than `vesuvius` and `pentonville`. So the ratio alone is surely an unreliable metric. How can an item's capability of forming a complete utterance be determined more reliably and with more statistical rigor? Is Fisher's exact test an appropriate method? Or is a more complex statistical method warranted?

Advice is greatly appreciated!

Chris

--
Albert-Ludwigs-Universität Freiburg
Project lead, DFG project "Analyse multimodaler Interaktion im Geschichtenerzählen"
ἰχθύς

### Christoph Ruehlemann

Jun 26, 2020, 10:50:06 AM
In pursuit of an answer to my inquiry ...

Is additive smoothing an option? See, for example, https://en.wikipedia.org/wiki/Additive_smoothing.

That is, in my case, adding, say, 20 or 50 to `F_overall`? It clearly has the desired effect: it barely alters the probability for words that are frequent overall, but it lowers the probability for words that are rare overall quite a lot. But is this method acceptable?
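As a rough sketch of the idea as described here (a constant added to the denominator only), the smoothed ratio can be computed as follows. The constant `k = 20` is just one of the values mentioned above, and `SmoothedRatio` is a hypothetical column name. Note that additive smoothing in the textbook sense also adds pseudocounts to the numerator, so this variant pulls rare words toward zero rather than toward an overall average.

```r
# Sample counts from the first post
mysample <- data.frame(
  Word = c("vesuvius", "cruel", "pentonville", "mortuary", "yuck",
           "bollocks", "yeah", "mm", "pardon"),
  F_standalone = c(1, 1, 1, 2, 7, 26, 22875, 11576, 584),
  F_overall = c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)
)

k <- 20  # smoothing constant (assumption: any value such as 20 or 50)

# Raw ratio and the ratio with k added to the denominator only
mysample$Ratio <- mysample$F_standalone / mysample$F_overall
mysample$SmoothedRatio <- mysample$F_standalone / (mysample$F_overall + k)

# Rank words by the smoothed ratio, highest first
mysample[order(-mysample$SmoothedRatio), ]
```

With `k = 20`, `vesuvius` drops from 1 to 1/21, while `pardon` only moves from 584/877 to 584/897.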

### Martin Schweinberger

Jun 26, 2020, 9:02:13 PM
Hi Chris,

I am not sure if this is what you are looking for, but I wrote a short Notebook that should find elements that occur more frequently as stand-alone utterances than would be expected by chance. The Notebook showcases how you could do this (attached and below), but it comes without warranties (better check each step and see if it makes sense)... I was just dabbling ;)

Cheers,
Martin

Here's the code for the Notebook:

---
title: "Determining the tendency for elements to serve as stand-alone utterances"
author: "Anonymous"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
bookdown::html_document2: default
bibliography: bibliography.bib
---

This Notebook determines the tendency for elements to serve as stand-alone utterances.

```{r prep1, echo=T, eval = T, message=FALSE, warning=FALSE}
# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=10000)
# activate packages
library("dplyr")
```

Create sample data

```{r standalone_01, eval = T, echo = T, warning = F}

mysample <- data.frame(
Word = c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
F_standalone = c(1,1,1,2,7,26,22875,11576,584),
F_overall = c(1,35,2,3,58,140,60158,21954,877),
Ratio = c(1.000000000,0.028571429,0.500000000,0.666666667,0.1206896552,0.18571429,0.3802487,0.5272843,0.6659065)
)
# inspect data
mysample
```

```{r standalone_03, eval = T, echo = T, warning = F}
newsample <- mysample %>%
dplyr::select(-Ratio) %>%
dplyr::rename(StandAlone = F_standalone,
OverallFrequency = F_overall) %>%
dplyr::mutate(OtherContexts = OverallFrequency-StandAlone) %>%
dplyr::select(-OverallFrequency)
# inspect data
newsample
```

In order to identify which words occur more frequently as stand-alone utterances than would be expected by chance, we have to determine the following frequencies:

* a = Number of times a term occurs as stand-alone utterance

* b = Number of times a term occurs in other contexts

* c = Number of times other terms occur as stand-alone utterance

* d = Number of times other terms occur in other contexts
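As a small worked illustration of these four quantities, here is the 2x2 table for a single word (`pardon`) built in base R. This is only a sketch: it assumes the nine sample words stand in for the whole corpus, and the names `cell_a` to `cell_d` and `expected_a` are made up for this example.

```r
# Counts for the nine sample words (assumption: they represent the whole corpus)
standalone <- c(1, 1, 1, 2, 7, 26, 22875, 11576, 584)
overall    <- c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)
others     <- overall - standalone

i <- 9  # "pardon" is the ninth word in the sample

cell_a <- standalone[i]                # pardon as stand-alone utterance
cell_b <- others[i]                    # pardon in other contexts
cell_c <- sum(standalone) - cell_a     # other words as stand-alone utterance
cell_d <- sum(others) - cell_b         # other words in other contexts

tab <- matrix(c(cell_a, cell_b, cell_c, cell_d), ncol = 2, byrow = TRUE,
              dimnames = list(Term = c("pardon", "other words"),
                              Context = c("standalone", "other")))
tab

# Expected stand-alone frequency of the word if word and context were independent
expected_a <- (cell_a + cell_b) * (cell_a + cell_c) / sum(tab)

# Fisher's exact test on the table, and the direction of the deviation
fisher.test(tab)$p.value
cell_a > expected_a
```

If `cell_a` exceeds `expected_a`, the word is used as a stand-alone utterance more often than independence would predict.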

In a first step, we create a table which holds these quantities.

```{r standalone_05, eval = T, echo = T, warning = F}
newsample <- newsample %>%
dplyr::mutate(AllStandAlone = sum(StandAlone)) %>%
dplyr::mutate(AllOtherContexts = sum(OtherContexts)) %>%
dplyr::arrange(Word) %>%
dplyr::mutate(a = StandAlone,
b = OtherContexts,
c = AllStandAlone - a,
d = AllOtherContexts - b) %>%
dplyr::mutate(NRows = nrow(newsample))
# inspect data
newsample
```

Perform the statz and inspect results

```{r standalone_07, eval = T, echo = T, warning = F}
results <- newsample %>%
  dplyr::rowwise() %>%
  # p-value of Fisher's exact test on each word's 2x2 table
  dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d),
                                                        ncol = 2, byrow = T))[1]))) %>%
  dplyr::mutate(p = round(p, 5)) %>%
  # chi-squared statistic on the same table
  dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d),
                                                        ncol = 2, byrow = T))[1]))) %>%
  dplyr::mutate(x2 = round(x2, 5)) %>%
  # effect size phi
  dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
  # expected stand-alone frequency (cell a) under independence
  dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d),
                                                              ncol = 2, byrow = T))$expected[1]))) %>%
  dplyr::mutate(expected = round(expected, 2)) %>%
  dplyr::mutate(phi = round(phi, 3)) %>%
  # Bonferroni correction: divide each threshold by the number of tests (NRows)
  # and check the strictest threshold first so that every level can be reached
  dplyr::mutate(BonferroniCorrectedSignificance = ifelse(p <= .001/NRows, "p<.001",
                                                  ifelse(p <= .01/NRows, "p<.01",
                                                  ifelse(p <= .05/NRows, "p<.05", "n.s.")))) %>%
  # determine type (Type = more freq. as stand-alone)
  dplyr::mutate(Type = ifelse(StandAlone > expected, "Type", "Antitype")) %>%
  # remove superfluous columns
  dplyr::select(-a, -b, -c, -d, -NRows, -AllStandAlone, -AllOtherContexts)
# inspect results
results
```

=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/

--
You received this message because you are subscribed to the Google Groups "StatForLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to statforling-wit...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/statforling-with-r/CALFCMoXKYh-wJBBF%3DL8gwTDOrLj%2B%2BBPxq%3Di5d8tpTX5pw-5L_g%40mail.gmail.com.
StandAloneNotebook.Rmd

### Stefan Th. Gries

Jun 26, 2020, 9:53:56 PM
to StatForLing with R
Here's my decidedly more old-school take on this - let's see if the internet breaks if 'it sees' R code that doesn't have a single tidyverse thing ;-)

```r
# generate data but as matrix
rm(list=ls(all=TRUE))
(x <- matrix(c(c(1,1,1,2,7,26,22875,11576,584),
               c(0,34,1,1,51,114,37283,10378,293)),
             ncol=2, dimnames=list(
                WORD=c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
                USE=c("standalone", "other"))))
# do stuff: one Fisher's exact test per word (this word vs. all other words)
p.values <- apply(x, 1, function (af) {
   temp <- matrix(c(af, colSums(x)-af), ncol=2, byrow=TRUE)
   round(fisher.test(temp)$p.value, 5) })
# expected frequencies under independence
expecteds <- chisq.test(x, correct=FALSE)$expected
# the p-values are added as the "Padj" column; the adjustment method is not
# stated in the post, so p.adjust's default (Holm) is an assumption here
result <- data.frame(x, expecteds,
   p.adjust(p.values),
   x[,1]>expecteds[,1])
names(result) <- c(colnames(x), paste0(colnames(x), "_exp"), "Padj", "StandaloneIsPreferred")
result
```

However, the above doesn't address the smoothing issue; adding a small constant might help of course, as might Good-Turing.

### Christoph Ruehlemann

Jun 27, 2020, 3:12:34 AM
Hi Martin,

Thank you so much for this incredible piece of work! Way more than I expected!  I'll take my time and try and wrap my head around it.

Best
Chris

### Christoph Ruehlemann

Jun 27, 2020, 3:14:35 AM
Hi Stefan,

Thanks so much for this! It's far more than I expected!

Best
Chris
