If speed matters, str_replace_all() from the stringr package is faster than base R's gsub() for this replacement. The benchmark below compares the two:
# Matches a run of any repeated capital letter except L and R, e.g. "EE" or "JJJ"
dupPattern <- paste0("(", paste(setdiff(LETTERS, c("L", "R")), collapse = "|"), ")\\1+")

# Base R: collapse each run to a single letter
gsubWay <- function(charvec) {
  gsub(dupPattern, "\\1", charvec)
}
library("stringr")
# stringr: same pattern, same replacement
stringrWay <- function(charvec) {
  str_replace_all(charvec, dupPattern, "\\1")
}
charvec <- c("EARL", "EEARL", "ELLIOT", "JULIE", "JJJULIE", "CARRIE")
charvec <- rep(charvec, times = 10000)
library("microbenchmark")
microbenchmark(
  gsubWay(charvec),
  stringrWay(charvec),
  times = 10
)
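Before trusting the timings, it is worth checking that the replacement does what is intended. The following is a minimal base-R sketch, not part of the benchmark above: the helper name collapseRepeats and the variable sampleNames are hypothetical, and the character class [A-KM-QS-Z] is an assumed shorthand for "every capital letter except L and R" that should behave like the alternation built with paste0().

```r
# Hypothetical helper: collapse runs of a repeated capital letter,
# leaving doubled L and R untouched.
collapseRepeats <- function(charvec) {
  # [A-KM-QS-Z] covers A-Z without L and R; \\1+ matches further copies
  # of whichever letter the group captured.
  gsub("([A-KM-QS-Z])\\1+", "\\1", charvec)
}

sampleNames <- c("EARL", "EEARL", "ELLIOT", "JULIE", "JJJULIE", "CARRIE")
cleaned <- collapseRepeats(sampleNames)
print(cleaned)
# expected values: EARL, EARL, ELLIOT, JULIE, JULIE, CARRIE
```

"EEARL" and "JJJULIE" are collapsed, while the "LL" in "ELLIOT" and the "RR" in "CARRIE" are kept because L and R are excluded from the pattern.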
Although string manipulation is not what dplyr is for, the same replacement can be placed in a dplyr pipeline if needed:
charvec <- c("EARL", "EEARL", "ELLIOT", "JULIE", "JJJULIE", "CARRIE")
toy_df <- data.frame(
  id = 1:6,
  name = charvec,
  age = 35:40  # seq(35:40) would return 1:6, not the ages 35 to 40
)
library("dplyr")
toy_df %>%
  mutate(
    # Refer to the column directly; .$name is unnecessary inside mutate()
    nameCleaned = str_replace_all(name, paste0("(", paste(LETTERS[!(LETTERS %in% c("L", "R"))], collapse = "|"), ")\\1+"), "\\1")
  )
String manipulation is the bread and butter of corpus linguists: