How to sum up durations if certain patterns are found across columns

58 views
Skip to first unread message

Christoph Ruehlemann

unread,
Dec 9, 2020, 7:22:31 AM12/9/20
to corplin...@googlegroups.com
Hi all,

I have a dataframe with words and their durations in speech (reproducible data below):

test1
       d1    d2    d3    d4    d5    d6    d7    d8    d9   d10  w1       w2      w3  w4  w5    w6  w7  w8  w9  w10
10  0.103 0.168 0.198 0.188 0.359 0.343 0.064 0.075 0.095 0.367 And        I thought  oh  no Sarah  do n't  do   it
132 0.091 0.072 0.109 0.119 0.113 0.087 0.088 0.264 0.092 0.249   I       du       n  no you    ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098  Oh honestly       I did n't touch  it   I did  n't

The short form n't is treated as if it were a separate word. That's okay as long as the preceding word ends on a consonant such as did, but that's not okay if the preceding word ends on a vowel such as do or ca. Because that separation into different words is incorrect the separation into different durations is incorrect too.

What I'd like to do is sum up the durations of ca and n't as well as do and n't (but leave alone the separate durations for did and n't), move the remaining durations one column to the left and replace the last duration with NA.

I know how to select the rows where the changes need to be implemented:

test1[which(grepl("(?<=(ca|do)\\s)n't", apply(test1, 1, paste0, collapse = " "), perl = T)),]

but I'm stuck going forward.

The desired result would look like this:

       d1    d2    d3    d4    d5    d6    d7    d8    d9   d10  w1       w2      w3  w4  w5    w6  w7  w8  w9  w10
10  0.103 0.168 0.198 0.188 0.359 0.343 0.139 0.095 0.367    NA And        I thought  oh  no Sarah  do n't  do   it
132 0.091 0.072 0.109 0.119 0.113 0.175 0.264 0.092 0.249    NA   I       du       n  no you    ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098  Oh honestly       I did n't touch  it   I did  n't

How can this be done? Help is much appreciated.

Reproducible data:

test1 <- structure(list(d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072, 
                   0.341), d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108
                   ), d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263), 
                   d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095), 
                   d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098), 
                   w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"), 
                   w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"), w5 = c("no", 
                   "you", "n't"), w6 = c("Sarah", "ca", "touch"), w7 = c("do", 
                   "n't", "it"), w8 = c("n't", "see", "I"), w9 = c("do", "his", 
                   "did"), w10 = c("it", "head", "n't")), row.names = c(10L, 
                   132L, 784L), class = "data.frame")
Many thanks in advance!

Chris

--
Albert-Ludwigs-Universität Freiburg
Projekt-Leiter DFG-Projekt "Analyse multimodaler Interaktion im Geschichtenerzählen"
ἰχθύς

Stefan Th. Gries

unread,
Dec 9, 2020, 8:10:39 AM12/9/20
to CorpLing with R
You do not also want to merge the instances of "do"/"ca" and "n't"
whose durations were summed up? (The current version of the desired
result has a mismatch between the d and the w columns, which doesn't
seem desirable to me.)

Martin Schweinberger

unread,
Dec 9, 2020, 8:51:51 AM12/9/20
to corplin...@googlegroups.com
Super ugly (sorry about that) but does the trick (I think)

# prepare session
library(tidyverse)

# generate data

test1 <- structure(list(d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072,
                   0.341), d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108
                   ), d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263),
                   d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095),
                   d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098),
                   w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"),
                   w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"), w5 = c("no",
                   "you", "n't"), w6 = c("Sarah", "ca", "touch"), w7 = c("do",
                   "n't", "it"), w8 = c("n't", "see", "I"), w9 = c("do", "his",
                   "did"), w10 = c("it", "head", "n't")), row.names = c(10L,
                   132L, 784L), class = "data.frame")
# inspect data
test1

# convert to tidy format
test2 <- test1 %>%
  dplyr::mutate(Sentence = rownames(.)) %>%
  tidyr::gather(Position1, Duration, d1:d10) %>%
  tidyr::gather(Position2, Word, w1:w10) %>%
  dplyr::mutate(Position1 = stringr::str_remove_all(Position1, "[a-z]"),
                Position2 = stringr::str_remove_all(Position2, "[a-z]")) %>%
  dplyr::filter(Position1 == Position2) %>%
  dplyr::rename(Position = Position1) %>%
  dplyr::select(-Position2) %>%
  unique() %>%
  dplyr::mutate(Position = factor(Position, levels = 1:10, labels = 1:10)) %>%
  dplyr::arrange(Sentence, Position)
# inspect data
test2

# identify contractions and calculate new Duration
test3 <- test2 %>%
  dplyr::mutate(NextWord = c(Word[2:length(Word)], NA),
                WordPlusNextWord = paste0(Word, NextWord),
                BigramRep = c(NA, WordPlusNextWord[1:length(Word)-1]),
                NewWord = dplyr::case_when(WordPlusNextWord == "don't" ~ "don't",
                                           WordPlusNextWord == "can't" ~ "can't",
                                           BigramRep == "don't" ~ "don't",
                                           BigramRep == "can't"~ "can't",
                                           TRUE ~ Word)) %>%
  dplyr::group_by(Sentence, NewWord) %>%
  dplyr::summarise(Duration = sum(Duration),
                   Position = Position) %>%
  dplyr::arrange(Sentence, Position)
# inspect data
test3

# remove superfluous conractions and add new positions
test4 <- test3 %>%
  dplyr::rename(Word = NewWord) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(NextWord = c(Word[2:length(Word)], NA),
                Bigram = paste0(Word, NextWord)) %>%
  dplyr::filter(Bigram != "don'tdon't",
                Bigram != "can'tcan't") %>%
  dplyr::group_by(Sentence) %>%
  dplyr::mutate(Position = 1:n()) %>%
  dplyr::select(Sentence, Word, Duration, Position)
# inspect data
test4  

# convert back to messy
test5 <- test4 %>%
  dplyr::mutate(WordPosition = paste0("w", Position),
                DurationPosition = paste0("d", Position)) %>%
  tidyr::spread(WordPosition, Word) %>%
  tidyr::spread(DurationPosition, Duration) %>%
  dplyr::select(-Position) %>%
  tidyr::fill(w1:d9) %>%
  dplyr::group_by(Sentence) %>%
  dplyr::mutate(Position = 1:n(),
                MaxPosition = which(Position == max(Position))) %>%
  dplyr::filter(Position == MaxPosition) %>%
  dplyr::relocate(w1, w2, w3, w4, w4, w5, w6, w7, w8, w9, w10, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10) %>%
  dplyr::select(-Position, -MaxPosition)
# add rownames
test6 <- data.frame(test5)
rownames(test6) <- test6$Sentence
test6$Sentence <- NULL
# inspect data
test6

Is this what you were looking for?

=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/



--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CALFCMoW0u6am93TDBqc_CPOFExr9kY5C1LzaZOxY90FxGvWm0w%40mail.gmail.com.

Christoph Ruehlemann

unread,
Dec 9, 2020, 9:18:06 AM12/9/20
to corplin...@googlegroups.com
Nice!
Thanks a lot, Martin.

Best
Chris

Martin Schweinberger

unread,
Dec 9, 2020, 9:20:56 AM12/9/20
to corplin...@googlegroups.com
No worries!

=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/


Stefan Th. Gries

unread,
Dec 9, 2020, 10:00:54 AM12/9/20
to CorpLing with R
Unless I missed or messed up something, I think this is it:

# create data (slightly changed to test something)
x.old <- x <- structure(list(
   d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072, 0.341),
   d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108),
   d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263),
   d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095),
   d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098),
   w1 = c("And", "I", "Oh"), w2 = c("I", "do", "honestly"),
   w3 = c("thought", "n't", "I"), w4 = c("oh", "no", "did"),

   w5 = c("no", "you", "n't"), w6 = c("Sarah", "ca", "touch"),
   w7 = c("do", "n't", "it"), w8 = c("n't", "see", "I"),
   w9 = c("do", "his", "did"), w10 = c("it", "head", "n't")),
   row.names = c(10L, 132L, 784L), class = "data.frame")
# process
for (i in seq(nrow(x))) {
   find.1st <- grep("^(ca|do)$", x[i,11:20])
   find.2nd <- grep("n't", x[i,11:20])-1
   both <- intersect(find.1st, find.2nd)
   if (length(both)==0) { next }
   for (j in rev(both)) {
      # summing up in the d-columns
      x[i,j] <- x[i,j]+x[i,j+1]
      x[i,min(j+2, 10):10] -> x[i,min(j+1, 9):9]
      x[i,10] <- NA
      # merging/shifting in the w columns
      x[i,10+j] <- paste0(x[i,10+j], x[i,10+j+1])
      x[i,min(j+12, 20):20] -> x[i,min(j+11, 19):19]
      x[i,20] <- NA
   }
}
# check:
x.old
x

Martin Schweinberger

unread,
Dec 9, 2020, 10:07:43 AM12/9/20
to corplin...@googlegroups.com
way more elegant and old school ;)

=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Fon.: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/


--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

Stefan Th. Gries

unread,
Dec 9, 2020, 10:10:12 AM12/9/20
to CorpLing with R
Old school is my middle name! Maybe I'll retire soon and then -
finally! - a new order can take over ;-))
Message has been deleted

Christoph Ruehlemann

unread,
Dec 9, 2020, 10:53:12 AM12/9/20
to corplin...@googlegroups.com
Beautifully clear!

Thanks a lot.

Best

Chris

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

Christoph Ruehlemann

unread,
Dec 18, 2020, 2:58:21 AM12/18/20
to corplin...@googlegroups.com
Sorry for this totally delayed answer! (only saw your post yesterday).
Yes, sure, I did want to also merge the relevant word pairs but did not ask for a solution for that too as I felt asking for the other solution was already demanding a lot. But no worries, I've found a solution that takes care of that as well. But thanks a lot!

Chirs

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

Stefan Th. Gries

unread,
Dec 18, 2020, 9:47:39 AM12/18/20
to CorpLing with R
> Sorry for this totally delayed answer! (only saw your post yesterday).
> Yes, sure, I did want to also merge the relevant word pairs but did not ask for a solution for that too as I felt asking for the other solution was already demanding a lot. But no worries, I've found a solution that takes care of that as well. But thanks a lot!
The solution I sent you 9 days ago did both ...

Earl Brown

unread,
Dec 23, 2020, 9:46:36 PM12/23/20
to CorpLing with R
I know I'm late to the party, but I wanted to contribute the solution below in case you need to fix sentences that aren't exactly ten words long.

### create test1 data frame with the code given by Chris in his original post ###

### load the workhorses ###
library(tidyverse)

### define function ###
fix_df <- function(df) {
  n_rows <- nrow(df)
  df_rownames <- rownames(df)
  df_temp <- df %>% 
    pivot_longer(everything(), names_to = c(".value", "n"), names_pattern = "([dw])(\\d+)") %>% 
    mutate(n = as.integer(n))
  n_wds <- df_temp %>% pull(n) %>% max()
  df_fixed <- df_temp %>% 
    mutate(d = ifelse(str_detect(w, "^(ca|do)$") & str_detect(lead(w), "^n't$"), d + lead(d), d)) %>% 
    mutate(w = ifelse(str_detect(w, "^(ca|do)$") & str_detect(lead(w), "^n't$"), str_c(w, lead(w)), w))  %>% 
    mutate(keep = ifelse(!(str_detect(w, "^n't$") & str_detect(lag(w), "^(ca|do)")), TRUE, FALSE)) %>% 
    mutate(set = rep(1:n_rows, each = n_wds)) %>% 
    arrange(set, desc(keep), n) %>% 
    mutate(w = ifelse(keep, w, NA), d = ifelse(keep, d, NA)) %>% 
    group_by(set) %>% 
    mutate(n = seq(n_wds)) %>% 
    ungroup() %>% 
    select(-keep) %>% 
    pivot_wider(names_from = n, names_glue = "{.value}{n}", values_from = c(d, w))  %>% 
    select(-set) %>% 
    as.data.frame()
  rownames(df_fixed) <- df_rownames
  return(df_fixed)
}  # end function definition

### test the function ###
fix_df(test1)
Reply all
Reply to author
Forward
0 new messages