Jun 25, 2020, 9:31:07 AM

to statforli...@googlegroups.com

Hi all,

I'm working on the conversational subcorpus of the BNC trying to determine *how words differ in terms of being able to form a complete utterance by themselves*. To that end I've computed two frequency lists:

* `F_standalone`: how often a word stood alone in an utterance in the corpus.

* `F_overall`: how often the respective word occurred in the corpus overall.

Initially, I assumed that a word's potential of forming a complete utterance by itself could be read off the **ratio** of `F_standalone` divided by `F_overall`. While this may make sense for words that have high frequencies in both conditions, it makes much less sense for words that are rare in the corpus overall: you get the maximum ratio of 1 if a word that occurs just once in the whole corpus happens to occur as a stand-alone word. And that sole occurrence as a stand-alone could be by chance.

A reproducible sample of the data I have is this:

```
mysample <- data.frame(
  Word = c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
  F_standalone = c(1,1,1,2,7,26,22875,11576,584),
  F_overall = c(1,35,2,3,58,140,60158,21954,877),
  Ratio = c(1.000000000,0.028571429,0.500000000,0.666666667,0.1206896552,0.18571429,0.3802487,0.5272843,0.6659065)
)
mysample
         Word F_standalone F_overall      Ratio
1    vesuvius            1         1 1.00000000
2       cruel            1        35 0.02857143
3 pentonville            1         2 0.50000000
4    mortuary            2         3 0.66666667
5        yuck            7        58 0.12068966
6    bollocks           26       140 0.18571429
7        yeah        22875     60158 0.38024870
8          mm        11576     21954 0.52728430
9      pardon          584       877 0.66590650
```

As can be seen from the sample, `vesuvius` occurs just once in either condition (as stand-alone and in the corpus as a whole) and thus has a ratio of `1`; `pentonville` occurs once as a stand-alone utterance but twice overall, yielding a ratio of `0.50000000`. On the other hand, words such as `yeah`, `mm`, or `pardon` have high frequencies both as stand-alone items and overall, and get ratios between 0.3 and 0.7.

Given the much higher observed frequencies, `yeah`, `mm`, and `pardon` intuitively seem to have a much higher capability of forming an utterance by themselves than `vesuvius` and `pentonville`.

So the ratio surely is an unreliable metric. How can an item's capability of forming a complete utterance be determined more reliably and with more statistical rigor? Is Fisher's exact test an appropriate method? Or is a more complex statistical method warranted?
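One way to make the unreliability of the raw ratio visible is to attach a confidence interval to each ratio. The following is only a sketch, not a method proposed in this thread: it treats each stand-alone count as a binomial outcome out of `F_overall` trials (an assumption) and uses base R's `binom.test()` to get exact 95% intervals. The names `intervals`, `Lower`, `Upper`, and `Width` are introduced here purely for illustration.

```r
# Sample counts from the post above
word       <- c("vesuvius", "cruel", "pentonville", "mortuary", "yuck",
                "bollocks", "yeah", "mm", "pardon")
standalone <- c(1, 1, 1, 2, 7, 26, 22875, 11576, 584)
overall    <- c(1, 35, 2, 3, 58, 140, 60158, 21954, 877)

# Exact (Clopper-Pearson) 95% interval for each stand-alone proportion;
# mapply() returns a 2 x 9 matrix, which is transposed to one row per word
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int, standalone, overall))

intervals <- data.frame(Word  = word,
                        Ratio = standalone / overall,
                        Lower = ci[, 1],
                        Upper = ci[, 2])

# Width of the interval: large values flag ratios that rest on very little data
intervals$Width <- intervals$Upper - intervals$Lower
print(intervals, digits = 3)
```

Under these assumptions the ratio of 1 for `vesuvius` comes with an interval that spans almost the whole range from 0 to 1, whereas the interval for `yeah` is less than 0.01 wide.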

Advice is greatly appreciated!

Chris

--

Albert-Ludwigs-Universität Freiburg

Project lead, DFG project "Analyse multimodaler Interaktion im Geschichtenerzählen" (Analysis of Multimodal Interaction in Storytelling)

ἰχθύς

Jun 26, 2020, 10:50:06 AM

to statforli...@googlegroups.com

in pursuit of an answer to my inquiry ...

Is additive smoothing an option? See, for example, https://en.wikipedia.org/wiki/Additive_smoothing.

That is, in my case, adding, say, 20 or 50 to `F_overall`? It clearly has the desired effect: it does not alter the probability of overall frequent words much, but it lowers the probability of overall rare words quite a lot. But is this method acceptable?

Jun 26, 2020, 9:02:13 PM

to statforli...@googlegroups.com

Hi Chris,

I am not sure if this is what you are looking for but I wrote a short Notebook which should find elements that occur more frequently as stand-alone utterances than would be expected by chance. The Notebook showcases how you could do this (attached and below) but it comes without warranties (better check each step and see if it makes sense)...I was just dabbling ;)

Cheers,

Martin

Here's the code for the Notebook:

---
title: "Determining the tendency for elements to serve as stand-alone utterances"
author: "Anonymous"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  bookdown::html_document2: default
bibliography: bibliography.bib
link-citations: yes
---

This Notebook determines the tendency for elements to serve as stand-alone utterances.

```{r prep1, echo=T, eval = T, message=FALSE, warning=FALSE}
# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=10000)
# activate packages
library("dplyr")
```

Create sample data

```{r standalone_01, eval = T, echo = T, warning = F}

mysample <- data.frame(
  Word = c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
  F_standalone = c(1,1,1,2,7,26,22875,11576,584),
  F_overall = c(1,35,2,3,58,140,60158,21954,877),
  Ratio = c(1.000000000,0.028571429,0.500000000,0.666666667,0.1206896552,0.18571429,0.3802487,0.5272843,0.6659065)
)
# inspect data
head(mysample)
```

```{r standalone_03, eval = T, echo = T, warning = F}
newsample <- mysample %>%
  dplyr::select(-Ratio) %>%
  dplyr::rename(StandAlone = F_standalone,
                OverallFrequency = F_overall) %>%
  dplyr::mutate(OtherContexts = OverallFrequency-StandAlone) %>%
  dplyr::select(-OverallFrequency)
# inspect data
head(newsample)
```

In order to identify which words occur more frequently as stand-alone utterances than would be expected by chance, we have to determine the following frequencies:

* a = Number of times a term occurs as a stand-alone utterance

* b = Number of times a term occurs in other contexts

* c = Number of times other terms occur as stand-alone utterances

* d = Number of times other terms occur in other contexts

In a first step, we create a table which holds these quantities.

```{r standalone_05, eval = T, echo = T, warning = F}
newsample <- newsample %>%
  dplyr::mutate(AllStandAlone = sum(StandAlone)) %>%
  dplyr::mutate(AllOtherContexts = sum(OtherContexts)) %>%
  dplyr::arrange(Word) %>%
  dplyr::mutate(a = StandAlone,
                b = OtherContexts,
                c = AllStandAlone - a,
                d = AllOtherContexts - b) %>%
  dplyr::mutate(NRows = nrow(newsample))
# inspect data
newsample
```
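To make the four quantities concrete, here is a small base-R sketch that builds the 2x2 table for a single word from the sample counts in this thread. The variable names (`cell_a` to `cell_d`, `tab`, `expected_a`) and the choice of `pardon` are illustrative assumptions, not part of the Notebook. Note that, as in the Notebook, the "other terms" totals are computed from the nine sample words only.

```r
# Sample counts from the thread, as named vectors
standalone <- c(vesuvius = 1, cruel = 1, pentonville = 1, mortuary = 2, yuck = 7,
                bollocks = 26, yeah = 22875, mm = 11576, pardon = 584)
overall    <- c(vesuvius = 1, cruel = 35, pentonville = 2, mortuary = 3, yuck = 58,
                bollocks = 140, yeah = 60158, mm = 21954, pardon = 877)
other <- overall - standalone

word <- "pardon"                       # word chosen for illustration
cell_a <- standalone[[word]]           # a: the word as stand-alone utterance
cell_b <- other[[word]]                # b: the word in other contexts
cell_c <- sum(standalone) - cell_a     # c: other words as stand-alone utterances
cell_d <- sum(other) - cell_b          # d: other words in other contexts

tab <- matrix(c(cell_a, cell_b, cell_c, cell_d), ncol = 2, byrow = TRUE,
              dimnames = list(Term = c(word, "other words"),
                              Use  = c("standalone", "other")))
tab

# Expected stand-alone frequency under independence:
# row total * column total / grand total
expected_a <- sum(tab[1, ]) * sum(tab[, 1]) / sum(tab)
expected_a
```

For `pardon` the four cells are 584, 293, 34489, and 47862, and the expected stand-alone frequency is about 369.6, well below the observed 584.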

Perform the statistical tests and inspect the results

```{r standalone_07, eval = T, echo = T, warning = F}
results <- newsample %>%
  dplyr::rowwise() %>%
  # Fisher's exact test on the 2x2 table of each word
  dplyr::mutate(p = fisher.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$p.value) %>%
  dplyr::mutate(p = round(p, 5)) %>%
  # chi-squared statistic, used below to derive phi as an effect size
  dplyr::mutate(x2 = unname(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$statistic)) %>%
  dplyr::mutate(x2 = round(x2, 5)) %>%
  dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
  # expected frequency of cell a (the word as stand-alone utterance)
  dplyr::mutate(expected = chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]) %>%
  dplyr::mutate(expected = round(expected, 2)) %>%
  dplyr::mutate(phi = round(phi, 3)) %>%
  # Bonferroni correction: each alpha level is divided by the number of tests (NRows);
  # the strictest level is checked first so that every label can actually be reached
  dplyr::mutate(BonferroniCorrectedSignificance = ifelse(p <= .001/NRows, "p<.001",
                                                  ifelse(p <= .01/NRows, "p<.01",
                                                  ifelse(p <= .05/NRows, "p<.05", "n.s.")))) %>%
  # determine type (Type = more freq. as stand-alone)
  dplyr::mutate(Type = ifelse(StandAlone > expected, "Type", "Antitype")) %>%
  # remove superfluous columns
  dplyr::select(-a, -b, -c, -d, -NRows, -AllStandAlone, -AllOtherContexts)
# inspect results
results
```
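The column name `BonferroniCorrectedSignificance` in the chunk above refers to a Bonferroni correction for multiple testing. As a minimal sketch of what such a correction does, base R's `p.adjust()` can be applied to a vector of raw p-values. The three p-values below are hypothetical and serve only as illustration; `p_raw`, `p_bonf`, and `alpha` are names introduced here.

```r
# Hypothetical raw p-values from three tests (illustration only)
p_raw <- c(0.001, 0.02, 0.04)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf

# Equivalent decision rule: compare the raw p-values against alpha / number of tests
alpha <- 0.05
identical(p_bonf <= alpha, p_raw <= alpha / length(p_raw))
```

Either formulation gives the same decisions; adjusting the p-values has the advantage that the usual thresholds (.05, .01, .001) can then be applied unchanged.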


=====================================

Dr. Martin Schweinberger

St Lucia, QLD, 4067

--

You received this message because you are subscribed to the Google Groups "StatForLing with R" group.

To unsubscribe from this group and stop receiving emails from it, send an email to statforling-wit...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/statforling-with-r/CALFCMoXKYh-wJBBF%3DL8gwTDOrLj%2B%2BBPxq%3Di5d8tpTX5pw-5L_g%40mail.gmail.com.

Jun 26, 2020, 9:53:56 PM

to StatForLing with R

Here's my decidedly more old-school take on this - let's see if the internet breaks if 'it sees' R code that doesn't have a single tidyverse thing ;-)

# generate data but as matrix
rm(list=ls(all=TRUE))
(x <- matrix(c(c(1,1,1,2,7,26,22875,11576,584),
               c(0,34,1,1,51,114,37283,10378,293)),
             ncol=2, dimnames=list(
                WORD=c("vesuvius", "cruel","pentonville","mortuary","yuck","bollocks","yeah","mm","pardon"),
                USE=c("standalone", "other"))))
# do stuff
# for each word: Fisher's exact test on the 2x2 table "this word" vs. "all other words"
p.values <- apply(x, 1, function (af) {
   temp <- matrix(c(af, colSums(x)-af), ncol=2, byrow=TRUE)
   round(fisher.test(temp)$p.value, 5) })
# expected frequencies under independence
expecteds <- chisq.test(x, correct=FALSE)$expected
# collect observed and expected frequencies, adjusted p-values (p.adjust defaults to Holm),
# and whether standalone use is more frequent than expected
result <- data.frame(x, expecteds,
   p.adjust(p.values),
   x[,1]>expecteds[,1])
names(result) <- c(colnames(x), paste0(colnames(x), "_exp"), "Padj", "StandaloneIsPreferred")
result

However, the above doesn't address the smoothing issue; adding a small constant might help, of course, as might Good-Turing smoothing.
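To see what adding a constant does to the ratios, here is a small sketch using the sample counts from this thread. It compares the raw ratio with two variants: the one floated earlier in the thread (a constant added to `F_overall` only) and textbook additive smoothing for two outcomes (stand-alone vs. other), where a pseudocount is added to each outcome. The constant `k = 20` and the names `shifted` and `additive` are assumptions made for illustration.

```r
# Sample counts from the thread, as named vectors
standalone <- c(vesuvius = 1, cruel = 1, pentonville = 1, mortuary = 2, yuck = 7,
                bollocks = 26, yeah = 22875, mm = 11576, pardon = 584)
overall    <- c(vesuvius = 1, cruel = 35, pentonville = 2, mortuary = 3, yuck = 58,
                bollocks = 140, yeah = 60158, mm = 21954, pardon = 877)

k <- 20  # assumed constant; 20 and 50 are the values mentioned earlier in the thread

raw     <- standalone / overall
# variant from the thread: constant added to the denominator only
shifted <- standalone / (overall + k)
# additive smoothing with two outcomes: pseudocount alpha per outcome,
# so the denominator grows by 2 * alpha (= k)
alpha    <- k / 2
additive <- (standalone + alpha) / (overall + 2 * alpha)

round(cbind(raw, shifted, additive), 3)
```

Both variants leave frequent words such as `yeah` almost unchanged. They differ for rare words: adding to the denominator only pulls their ratios towards 0, whereas additive smoothing with two outcomes pulls them towards 0.5.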

Jun 27, 2020, 3:12:34 AM

to statforli...@googlegroups.com

Hi Martin,

Thank you so much for this incredible piece of work! Way more than I expected! I'll take my time and try and wrap my head around it.

Best

Chris

Jun 27, 2020, 3:14:35 AM

to statforli...@googlegroups.com

Hi Stefan,

Thanks so much for this! It's far more than I expected!

Best

Chris
