Reg Ex Transform Help

268 views
Skip to first unread message

Grace Kelch

unread,
Aug 2, 2016, 4:49:20 PM8/2/16
to OpenRefine
Hi All,

I'm hoping to transform a particular column using a regular expression (which are new to me) but have been struggling to do so.

Each cell in the column contains a series that always starts with a group of one or more letters, followed by one or more numbers, another letter followed by one or more numbers, then either a letter followed by one or more numbers or a series of numbers. 

In short, it typically looks something like this: AB 1234 C56 2000 or AB 1234 C56 D78.

I need to transform the column so that third chunk of each entry will begin with a period: e.g. AB 1234 .C56 2000 - I haven't been able to accomplish this without getting rid of the following text.

Any suggestions?

John Little

unread,
Aug 2, 2016, 5:26:36 PM8/2/16
to openr...@googlegroups.com
Try this:

Edit Cells > Transform...
Expression = value.replace(/(^\w+\s\d+)\s(.*)/,"$1.$2")


Documentation on  

My Explanation:
  • the function "replace" can use either regular expressions: value.replace(/<regex>/,"new text")
    or regular text:  value.replace("this","that")  
  • You need two different patterns
    • always starts with a group of of one or more letters
      • \w  - a character that is part of a word
      • + - the multiplier meaning "one or more"
      • ^ - means must begin
        • ^\w+ - must begin a word character and be one or more
    • \s - a whitespace character
    • \d - a digit
      • \d+ - one or more digits
    • () - parenthesis inside the two slashes // are capture groups, used later and referred to by the order of each capture group:  e.g. first is $1, second is $2
    • . - a wild card - matches anything
    • * - a multiplier - zero or more

  • So, first capture group $1 = (^\w+\s\d+)
    • begins with one ore more letters, followed by a space, followed by one or more digits
  • followed by a space:  \s
  • And followed by the second capture group $2 = (.*)
    • followed by anything
  • Then replace it using the reference IDs to the capture groups
    • $1.$2 
      • note the period (full stop) between dollar-one and dollar-two is a literal punctuation mark which replaces a space with a period.

Note:  There can be a lot of reasons to explain why patterns don't match all the data.  You may have to share more data if this regular expression doesn't work.


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

olivia solis

unread,
Aug 2, 2016, 5:28:46 PM8/2/16
to OpenRefine
Maybe

value.replace(/(^[a-zA-Z]+ \d+) ([a-zA-Z]\d+)/,"$1 .$2")

That's if there are spaces between the first 3 parts — e.g. one in between "AB" and "1234" and one in between "1234" and "C56" in your example. There might be a cleaner way to do it.

Grace Kelch

unread,
Aug 3, 2016, 11:45:49 AM8/3/16
to OpenRefine
Thanks so much, this was really helpful!
Reply all
Reply to author
Forward
0 new messages