Cleaning character repetitions

DrunkenPhD

Apr 10, 2015, 5:44:13 AM
to manip...@googlegroups.com
Dear All,

I have a long list of names containing repetitions of characters like:

AAANTON

BBERT

etc

I would like a script that removes these repetitions, leaving only one copy of each letter. For example, AAA should become a single A.
There is one complication: for repeated R or L, I need to keep two of them (RR or LL), because a doubled R or L can legitimately occur in a name.

Please help

Regards

Brandon Hurr

Apr 10, 2015, 12:33:10 PM
to DrunkenPhD, manipulatr
I'm not very skilled at grep, but I was curious and I think I've figured out how to find them and substitute all capital letters which occur in pairs or more (except R and L). 

charvec <- c("AARDVARK", "ELLIOT", "JULIE", "AAARDVARK")

gsub('(A|B|C|D|E|F|G|H|I|J|K|M|N|O|P|Q|S|T|U|V|W|X|Y|Z){1,}\\1', '\\1', charvec)

I'm not sure this is a manipulatr question so much though. AFAIK, character manipulation isn't built into plyr or dplyr. 

HTH, 
B



Karl Ove Hufthammer

Apr 10, 2015, 3:04:56 PM
to Brandon Hurr, DrunkenPhD, manipulatr
On 10. april 2015 18:32, Brandon Hurr wrote:
> I'm not very skilled at grep, but I was curious and I think I've
> figured out how to find them and substitute all capital letters which
> occur in pairs or more (except R and L).
>
> charvec <- c("AARDVARK", "ELLIOT", "JULIE", "AAARDVARK")
>
> gsub('(A|B|C|D|E|F|G|H|I|J|K|M|N|O|P|Q|S|T|U|V|W|X|Y|Z){1,}\\1',
> '\\1', charvec)

This doesn’t work if you have

charvec="ELLIOTT"

(it returns ELLT). You want to reverse the {1,} (which is the same as +)
and the \\1:

gsub('(A|B|C|D|E|F|G|H|I|J|K|M|N|O|P|Q|S|T|U|V|W|X|Y|Z)\\1+', '\\1',
charvec)

Or more simply

gsub('([^LR])\\1+', '\\1', charvec)

if non-letter repetition (e.g. 222) also can be removed.
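
To see the difference side by side, here is a small sketch that runs both patterns over the thread's test vector plus ELLIOTT. The variable names (alt, original, fixed) are only illustrative, and the alternation is rebuilt from LETTERS instead of being typed out:

```r
# Test vector from the thread, plus the ELLIOTT case
charvec <- c("AARDVARK", "ELLIOT", "ELLIOTT", "JULIE", "AAARDVARK")

# All capital letters except L and R, joined as A|B|C|...
alt <- paste(setdiff(LETTERS, c("L", "R")), collapse = "|")

# Original: {1,} repeats the group, so it can run over several different letters
original <- gsub(paste0("(", alt, "){1,}\\1"), "\\1", charvec)

# Corrected: one captured letter followed by one or more copies of itself
fixed <- gsub(paste0("(", alt, ")\\1+"), "\\1", charvec)

print(data.frame(input = charvec, original = original, fixed = fixed))
```

With the corrected pattern, ELLIOTT comes out as ELLIOT and the LL is left alone.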

--
Karl Ove Hufthammer

Karl Ove Hufthammer

Apr 10, 2015, 3:18:01 PM
to Brandon Hurr, DrunkenPhD, manipulatr
On 10. april 2015 21:04, Karl Ove Hufthammer wrote:
> On 10. april 2015 18:32, Brandon Hurr wrote:
>> I'm not very skilled at grep, but I was curious and I think I've
>> figured out how to find them and substitute all capital letters which
>> occur in pairs or more (except R and L).
>>
>> charvec <- c("AARDVARK", "ELLIOT", "JULIE", "AAARDVARK")
>
> Or more simply
>
> gsub('([^LR])\\1+', '\\1', charvec)
>
> if non-letter repetition (e.g. 222) also can be removed.

Just for fun, here’s an alternative that doesn’t use regular
expressions, but *does* use my favourite little-known R function, rle():

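# Collapse every run of identical characters to one, except runs of L or R
compact <- function(chars) {
  r <- rle(chars)                                # run-length encode the characters
  r$lengths[!(r$values %in% c("L", "R"))] <- 1   # shrink all non-L/R runs to length 1
  paste(inverse.rle(r), collapse = "")           # rebuild the string
}
sapply(strsplit(charvec, ""), compact)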

This approach can be useful if you have more complicated exceptions than
the L/R rule.
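
As one example of a more complicated exception, here is a sketch that also caps runs of L and R at two, so that LLL becomes LL. That reading of the original request is an assumption, and compact_capped with its keep and max_run arguments is a hypothetical helper, not something from a package:

```r
# Hypothetical helper: collapse runs to one letter, but allow up to
# `max_run` copies of the letters listed in `keep`.
compact_capped <- function(chars, keep = c("L", "R"), max_run = 2) {
  r <- rle(chars)                                          # run-length encode
  is_kept <- r$values %in% keep
  r$lengths[!is_kept] <- 1                                 # other letters: one copy
  r$lengths[is_kept] <- pmin(r$lengths[is_kept], max_run)  # L/R: at most max_run
  paste(inverse.rle(r), collapse = "")
}

charvec <- c("AAANTON", "BBERT", "ELLLIOTT", "CARRRIE")
cleaned <- vapply(strsplit(charvec, ""), compact_capped, character(1))
print(cleaned)
# expected: ANTON, BERT, ELLIOT, CARRIE
```

Only the two lines that adjust r$lengths change when the rule changes, which is why this style scales better than a single regular expression.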

BTW, I would have preferred to use str_split() from the stringr package,
but this introduces an extra empty string (element) at the start of
each result, for some reason:

str_split("hello","")

[[1]]
[1] "" "h" "e" "l" "l" "o"

--
Karl Ove Hufthammer

Brandon Hurr

Apr 10, 2015, 3:26:49 PM
to Karl Ove Hufthammer, DrunkenPhD, manipulatr
Karl, 

Out of curiosity, am I breaking this down right?
A|B... is the characters that it tries to match (^LR is not L or R, but is less explicit because it includes numbers or other repeated characters). 
() tells it to remember
//1 is pattern 1
+ is greedy matching of pattern 1 
//1 again is its replacement with pattern 1 (a single capital letter in this instance)

+ and {1,} are the same thing functionally

Thanks, 

Brandon

Karl Ove Hufthammer

Apr 10, 2015, 3:40:02 PM
to Brandon Hurr, DrunkenPhD, manipulatr
On 10. april 2015 21:26, Brandon Hurr wrote:
> Karl,
>
> Out of curiosity, am I breaking this down right?
> A|B... is the characters that it tries to match (^LR is not L or R,
> but is less explicit because it includes numbers or other repeated
> characters).
> () tells it to remember
> //1 is pattern 1
> + is greedy matching of pattern 1

Quantifiers in regular expressions are greedy by default. The + matches ‘one or
more’ of the previous character/group/token.

> //1 again is its replacement with pattern 1 (a single capital letter
> in this instance)
>
> + and {1,} are the same thing functionally

Yes – if you change // to \\. :)
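
If it helps, here is a small sketch that makes each piece visible. The test string and variable names are made up for illustration; note that the regex \1 has to be typed as "\\1" inside an R string:

```r
x <- "JJJULIETT"
pattern <- "([^LR])\\1+"

# Each match is a whole run: the captured letter plus all of its repeats
runs <- regmatches(x, gregexpr(pattern, x))[[1]]
print(runs)      # "JJJ" "TT"

# "\\1" in the replacement puts back a single copy of the captured letter
cleaned <- gsub(pattern, "\\1", x)
print(cleaned)   # "JULIET"

# + and {1,} are the same quantifier written two ways
same <- identical(gsub("A+", "A", "AAANTON"), gsub("A{1,}", "A", "AAANTON"))

# Greedy (default) versus lazy matching, using perl = TRUE for the +? syntax
greedy <- regmatches("AAAB", regexpr("A+", "AAAB"))
lazy <- regmatches("AAAB", regexpr("A+?", "AAAB", perl = TRUE))
print(c(greedy = greedy, lazy = lazy))
```

The greedy match takes all three As, while the lazy one stops after the first.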

--
Karl Ove Hufthammer

Earl Brown

Apr 11, 2015, 12:27:01 AM
to manip...@googlegroups.com
If speed is important, using str_replace_all() from the stringr package is faster:

gsubWay <- function(charvec) {
  gsub(paste0("(", paste(LETTERS[!(LETTERS %in% c("L", "R"))], collapse = "|"), ")\\1+"), "\\1", charvec)  
}

library("stringr")
stringrWay <- function(charvec) {
  str_replace_all(charvec, paste0("(", paste(LETTERS[!(LETTERS %in% c("L", "R"))], collapse = "|"), ")\\1+"), "\\1")
}

charvec <- c("EARL", "EEARL", "ELLIOT", "JULIE", "JJJULIE", "CARRIE")
charvec <- rep(charvec, times = 10000)

library("microbenchmark")
microbenchmark(
  gsubWay(charvec),
  stringrWay(charvec),
  times = 10
)

While not the point of dplyr, string manipulation can be placed in a dplyr pipeline if needed:

charvec <- c("EARL", "EEARL", "ELLIOT", "JULIE", "JJJULIE", "CARRIE")
toy_df <- data.frame(
  id = 1:6,
  name = charvec,
  age = 35:40
)

library("dplyr")
toy_df %>%
  mutate(
    nameCleaned = str_replace_all(name, paste0("(", paste(LETTERS[!(LETTERS %in% c("L", "R"))], collapse = "|"), ")\\1+"), "\\1")
  )

String manipulation is the bread and butter of corpus linguists.
