How to sum up durations if certain patterns are found across columns

Christoph Ruehlemann

unread,

Dec 9, 2020, 7:22:31 AM12/9/20

to corplin...@googlegroups.com

Hi all,

I have a dataframe with words and their durations in speech (reproducible data below):

test1
       d1    d2    d3    d4    d5    d6    d7    d8    d9   d10  w1       w2      w3  w4  w5    w6  w7  w8  w9  w10
10  0.103 0.168 0.198 0.188 0.359 0.343 0.064 0.075 0.095 0.367 And        I thought  oh  no Sarah  do n't  do   it
132 0.091 0.072 0.109 0.119 0.113 0.087 0.088 0.264 0.092 0.249   I       du       n  no you    ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098  Oh honestly       I did n't touch  it   I did  n't

The short form n't is treated as if it were a separate word. That's okay as long as the preceding word ends on a consonant such as did, but that's not okay if the preceding word ends on a vowel such as do or ca. Because that separation into different words is incorrect the separation into different durations is incorrect too.

What I'd like to do is sum up the durations of ca and n't as well as do and n't (but leave alone the separate durations for did and n't), move the remaining durations one column to the left and replace the last duration with NA.

I know how to select the rows where the changes need to be implemented:

test1[which(grepl("(?<=(ca|do)\\s)n't", apply(test1, 1, paste0, collapse = " "), perl = T)),]

but I'm stuck going forward.

The desired result would look like this:

       d1    d2    d3    d4    d5    d6    d7    d8    d9   d10  w1       w2      w3  w4  w5    w6  w7  w8  w9  w10
10  0.103 0.168 0.198 0.188 0.359 0.343 0.139 0.095 0.367    NA And        I thought  oh  no Sarah  do n't  do   it
132 0.091 0.072 0.109 0.119 0.113 0.175 0.264 0.092 0.249    NA   I       du       n  no you    ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098  Oh honestly       I did n't touch  it   I did  n't

How can this be done? Help is much appreciated.

Reproducible data:

test1 <- structure(list(d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072, 
                   0.341), d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108
                   ), d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263), 
                   d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095), 
                   d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098), 
                   w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"), 
                   w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"), w5 = c("no", 
                   "you", "n't"), w6 = c("Sarah", "ca", "touch"), w7 = c("do", 
                   "n't", "it"), w8 = c("n't", "see", "I"), w9 = c("do", "his", 
                   "did"), w10 = c("it", "head", "n't")), row.names = c(10L, 
                   132L, 784L), class = "data.frame")

Many thanks in advance!

Chris

--

Albert-Ludwigs-Universität Freiburg

Projekt-Leiter DFG-Projekt "Analyse multimodaler Interaktion im Geschichtenerzählen"

http://paul.igl.uni-freiburg.de/ruehlemann/?Home

ἰχθύς

Stefan Th. Gries

unread,

Dec 9, 2020, 8:10:39 AM12/9/20

to CorpLing with R

You do not also want to merge the instances of "do"/"ca" and "n't"
whose durations were summed up? (The current version of the desired
result has a mismatch between the d and the w columns, which doesn't
seem desirable to me.)

Martin Schweinberger

unread,

Dec 9, 2020, 8:51:51 AM12/9/20

to corplin...@googlegroups.com

Super ugly (sorry about that) but does the trick (I think)

# prepare session
library(tidyverse)

# generate data

test1 <- structure(list(d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072,
0.341), d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108
), d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263),
d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095),
d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098),
w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"),
w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"), w5 = c("no",
"you", "n't"), w6 = c("Sarah", "ca", "touch"), w7 = c("do",
"n't", "it"), w8 = c("n't", "see", "I"), w9 = c("do", "his",
"did"), w10 = c("it", "head", "n't")), row.names = c(10L,
132L, 784L), class = "data.frame")

# inspect data
test1

# convert to tidy format
test2 <- test1 %>%
dplyr::mutate(Sentence = rownames(.)) %>%
tidyr::gather(Position1, Duration, d1:d10) %>%
tidyr::gather(Position2, Word, w1:w10) %>%
dplyr::mutate(Position1 = stringr::str_remove_all(Position1, "[a-z]"),
Position2 = stringr::str_remove_all(Position2, "[a-z]")) %>%
dplyr::filter(Position1 == Position2) %>%
dplyr::rename(Position = Position1) %>%
dplyr::select(-Position2) %>%
unique() %>%
dplyr::mutate(Position = factor(Position, levels = 1:10, labels = 1:10)) %>%
dplyr::arrange(Sentence, Position)
# inspect data
test2

# identify contractions and calculate new Duration
test3 <- test2 %>%
dplyr::mutate(NextWord = c(Word[2:length(Word)], NA),
WordPlusNextWord = paste0(Word, NextWord),
BigramRep = c(NA, WordPlusNextWord[1:length(Word)-1]),
NewWord = dplyr::case_when(WordPlusNextWord == "don't" ~ "don't",
WordPlusNextWord == "can't" ~ "can't",
BigramRep == "don't" ~ "don't",
BigramRep == "can't"~ "can't",
TRUE ~ Word)) %>%
dplyr::group_by(Sentence, NewWord) %>%
dplyr::summarise(Duration = sum(Duration),
Position = Position) %>%
dplyr::arrange(Sentence, Position)
# inspect data
test3

# remove superfluous conractions and add new positions
test4 <- test3 %>%
dplyr::rename(Word = NewWord) %>%
dplyr::ungroup() %>%
dplyr::mutate(NextWord = c(Word[2:length(Word)], NA),
Bigram = paste0(Word, NextWord)) %>%
dplyr::filter(Bigram != "don'tdon't",
Bigram != "can'tcan't") %>%
dplyr::group_by(Sentence) %>%
dplyr::mutate(Position = 1:n()) %>%
dplyr::select(Sentence, Word, Duration, Position)
# inspect data
test4

# convert back to messy
test5 <- test4 %>%
dplyr::mutate(WordPosition = paste0("w", Position),
DurationPosition = paste0("d", Position)) %>%
tidyr::spread(WordPosition, Word) %>%
tidyr::spread(DurationPosition, Duration) %>%
dplyr::select(-Position) %>%
tidyr::fill(w1:d9) %>%
dplyr::group_by(Sentence) %>%
dplyr::mutate(Position = 1:n(),
MaxPosition = which(Position == max(Position))) %>%
dplyr::filter(Position == MaxPosition) %>%
dplyr::relocate(w1, w2, w3, w4, w4, w5, w6, w7, w8, w9, w10, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10) %>%
dplyr::select(-Position, -MaxPosition)
# add rownames
test6 <- data.frame(test5)
rownames(test6) <- test6$Sentence
test6$Sentence <- NULL
# inspect data
test6

Is this what you were looking for?

=====================================

Dr. Martin Schweinberger

5/221 Sir Fred Schonell Drive

St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226

Home: http://www.martinschweinberger.de/

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CALFCMoW0u6am93TDBqc_CPOFExr9kY5C1LzaZOxY90FxGvWm0w%40mail.gmail.com.

Christoph Ruehlemann

unread,

Dec 9, 2020, 9:18:06 AM12/9/20

to corplin...@googlegroups.com

Nice!

Thanks a lot, Martin.

Best

Chris

To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CANMdTKiRM8R%3DYi5niNAJ8p%3Dq08q9ViVpCeHSpYjLdG0v78DDRQ%40mail.gmail.com.

Martin Schweinberger

unread,

Dec 9, 2020, 9:20:56 AM12/9/20

to corplin...@googlegroups.com

No worries!

=====================================

Dr. Martin Schweinberger

5/221 Sir Fred Schonell Drive

St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226

Home: http://www.martinschweinberger.de/

To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CALFCMoUUKaKp_Nx12DixytOM3C2K9%2BXpi787ZxxLwUXfyvodCg%40mail.gmail.com.

Stefan Th. Gries

unread,

Dec 9, 2020, 10:00:54 AM12/9/20

to CorpLing with R

Unless I missed or messed up something, I think this is it:

# create data (slightly changed to test something)

x.old <- x <- structure(list(

d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072, 0.341),
d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108),
d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263),
d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095),
d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098),

w1 = c("And", "I", "Oh"), w2 = c("I", "do", "honestly"),
w3 = c("thought", "n't", "I"), w4 = c("oh", "no", "did"),

w5 = c("no", "you", "n't"), w6 = c("Sarah", "ca", "touch"),
w7 = c("do", "n't", "it"), w8 = c("n't", "see", "I"),
w9 = c("do", "his", "did"), w10 = c("it", "head", "n't")),
row.names = c(10L, 132L, 784L), class = "data.frame")

# process

for (i in seq(nrow(x))) {
find.1st <- grep("^(ca|do)$", x[i,11:20])
find.2nd <- grep("n't", x[i,11:20])-1
both <- intersect(find.1st, find.2nd)
if (length(both)==0) { next }
for (j in rev(both)) {
# summing up in the d-columns
x[i,j] <- x[i,j]+x[i,j+1]
x[i,min(j+2, 10):10] -> x[i,min(j+1, 9):9]
x[i,10] <- NA
# merging/shifting in the w columns
x[i,10+j] <- paste0(x[i,10+j], x[i,10+j+1])
x[i,min(j+12, 20):20] -> x[i,min(j+11, 19):19]
x[i,20] <- NA
}
}

# check:

x.old
x

Martin Schweinberger

unread,

Dec 9, 2020, 10:07:43 AM12/9/20

to corplin...@googlegroups.com

way more elegant and old school ;)

=====================================

Dr. Martin Schweinberger

5/221 Sir Fred Schonell Drive

St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226

Home: http://www.martinschweinberger.de/

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CAFrBz2mu%3D_53TBL1zhHZaOqT0yx6QdNVS0qohGUT03MtQ8OT9w%40mail.gmail.com.

Stefan Th. Gries

unread,

Dec 9, 2020, 10:10:12 AM12/9/20

to CorpLing with R

Old school is my middle name! Maybe I'll retire soon and then -
finally! - a new order can take over ;-))

Message has been deleted

Christoph Ruehlemann

unread,

Dec 9, 2020, 10:53:12 AM12/9/20

to corplin...@googlegroups.com

Beautifully clear!

Thanks a lot.

Best

Chris

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CAFrBz2mu%3D_53TBL1zhHZaOqT0yx6QdNVS0qohGUT03MtQ8OT9w%40mail.gmail.com.

Christoph Ruehlemann

unread,

Dec 18, 2020, 2:58:21 AM12/18/20

to corplin...@googlegroups.com

Sorry for this totally delayed answer! (only saw your post yesterday).

Yes, sure, I did want to also merge the relevant word pairs but did not ask for a solution for that too as I felt asking for the other solution was already demanding a lot. But no worries, I've found a solution that takes care of that as well. But thanks a lot!

Chirs

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CAFrBz2kzWoY6CCVsVnNRaMiJOphEQ6oagSM1F5zcp6PDC5t1uQ%40mail.gmail.com.

Stefan Th. Gries

unread,

Dec 18, 2020, 9:47:39 AM12/18/20

to CorpLing with R

> Sorry for this totally delayed answer! (only saw your post yesterday).
> Yes, sure, I did want to also merge the relevant word pairs but did not ask for a solution for that too as I felt asking for the other solution was already demanding a lot. But no worries, I've found a solution that takes care of that as well. But thanks a lot!

The solution I sent you 9 days ago did both ...

Earl Brown

unread,

Dec 23, 2020, 9:46:36 PM12/23/20

to CorpLing with R

I know I'm late to the party, but I wanted to contribute the solution below in case you need to fix sentences that aren't exactly ten words long.

### create test1 data frame with the code given by Chris in his original post ###

### load the workhorses ###

library(tidyverse)

### define function ###

fix_df <- function(df) {

n_rows <- nrow(df)

df_rownames <- rownames(df)

df_temp <- df %>%

pivot_longer(everything(), names_to = c(".value", "n"), names_pattern = "([dw])(\\d+)") %>%

mutate(n = as.integer(n))

n_wds <- df_temp %>% pull(n) %>% max()

df_fixed <- df_temp %>%

mutate(d = ifelse(str_detect(w, "^(ca|do)$") & str_detect(lead(w), "^n't$"), d + lead(d), d)) %>%

mutate(w = ifelse(str_detect(w, "^(ca|do)$") & str_detect(lead(w), "^n't$"), str_c(w, lead(w)), w)) %>%

mutate(keep = ifelse(!(str_detect(w, "^n't$") & str_detect(lag(w), "^(ca|do)")), TRUE, FALSE)) %>%

mutate(set = rep(1:n_rows, each = n_wds)) %>%

arrange(set, desc(keep), n) %>%

mutate(w = ifelse(keep, w, NA), d = ifelse(keep, d, NA)) %>%

group_by(set) %>%

mutate(n = seq(n_wds)) %>%

ungroup() %>%

select(-keep) %>%

pivot_wider(names_from = n, names_glue = "{.value}{n}", values_from = c(d, w)) %>%

select(-set) %>%

as.data.frame()

rownames(df_fixed) <- df_rownames

return(df_fixed)

} # end function definition

### test the function ###

fix_df(test1)

Reply all

Reply to author

Forward