Simple Regex problem

1,253 views
Skip to first unread message

Casey

unread,
Sep 16, 2011, 4:35:01 PM9/16/11
to Google Refine
I'm a beginner with Refine and I have a data set with a 6 digit ID
like 555555 embedded in random parts of strings that contain letters
and numbers like (_ABC-073_555555_). I would like to extract the 6
digit IDs.
I expected value.match(\d{6}) to work but I get the following error
"Parsing error at offset 12: missing number, string, identifier,
regex, or parenthesized expression".

Any ideas on how to do this? Thank you.

Shawn Simister

unread,
Sep 16, 2011, 4:50:04 PM9/16/11
to google...@googlegroups.com
Try this instead:

value.match(/.*(\d{6}).*/)[0]
--
Shawn Simister

Developer Programs Engineer
Google, San Francisco

Thad Guidry

unread,
Sep 16, 2011, 5:19:52 PM9/16/11
to google...@googlegroups.com
Casey, did you consult our Refine wiki at all on the function of match() for the correct syntax ?  or did you just look at the Help tab within Refine's expression editor ?  I'm curious.
--
-Thad
http://www.freebase.com/view/en/thad_guidry

Casey

unread,
Sep 16, 2011, 7:07:55 PM9/16/11
to Google Refine
Thank you Shawn, that worked perfectly.

Thad, I did consult the wiki (but not the help tab). I also searched
for regex questions in this Group. I think the issue is that I'm also
learning regex and the expressions that were working in other programs
were not working in Refine. Is there a specific type or syntax of
regex that needs to be used? Thank you for helping me with this.





On Sep 16, 2:19 pm, Thad Guidry <thadgui...@gmail.com> wrote:
> Casey, did you consult our Refine wiki at all on the function of match() for
> the correct syntax ?  or did you just look at the Help tab within Refine's
> expression editor ?  I'm curious.
>

Thad Guidry

unread,
Sep 16, 2011, 8:15:09 PM9/16/11
to google...@googlegroups.com
You just forgot to wrap any Regex with / / like the match() examples on the wiki


Sent from my iPad

Faizal Reza

unread,
Aug 30, 2012, 8:49:25 AM8/30/12
to google...@googlegroups.com
Hi Thad,

I am newbee here in using google refine. I am now having war of words with my database team that google refine is better.

I am having problem in using regex say if zip codes of U.S.A are of always 5 digits and say i have zip codes in these format  and want to do make all zip codes of 5 digits and remove any inconsistency .


1234 -  01234
, 1234? ,- 01234 ( it will remove question mark too)
1234-bgh, ( it will remove text after dash and put 0 as a prefix)

1234-568 , ( it will remove the number after dash and put 0 as a prefix

ashgs, (totally remove these text and make them blank )
 
I have attached a file for that . please take a look at that.

Thanks
Faizal Reza
zip codes.xlsx

Jonathon Paarlberg

unread,
Jul 14, 2015, 5:07:47 PM7/14/15
to openr...@googlegroups.com, google...@googlegroups.com
Thanks for the pointers. Now, how would you match everything except the numbers? I want to do the same thing but exclude the numbers from the original field.

Owen Stephens

unread,
Jul 15, 2015, 5:14:18 AM7/15/15
to openr...@googlegroups.com, google...@googlegroups.com
The 'match' function in OpenRefine requires your regular expression to match the whole of the value in the cell. So:

value.match(/\d{6}/)

will only find a match in cells where the cell contains only 6 digits in a row and nothing else. To allow the six digits to be surrounded by other characters you can use the 'wildcard' pattern of '.*' before and after the digits:

value.match(/.*\d{6}.*/)

This matches the cell if the value in the cell contains six digits in a row, and any other characters (or none) before and after the 6 digits.

The 'match' function output is an array of captured values from the regular expression. To capture part of your regular expression you need to surround that part of your regular expression with brackets ( ). If you don't do this, then a successful 'match' will give you an empty array - which may not be what you want.

Example: Cell contains value:

hello 123456 goodbye

value.match(/\d{6}/) = null
value.match(/.*\d{6}.*/) = [ ] (empty array)
value.match(/.*(\d{6}).*/) = [ "123456" ] (array with one captured value)
value.match(/(.*)(\d{6})(.*)/) = [ "hello ", "123456", " goodbye" ] (array with three captured values)

So for the question 'everything except the numbers' - you only need to capture the non-number bits of your regular expression

Owen

Owen Stephens

unread,
Jul 15, 2015, 5:26:50 AM7/15/15
to openr...@googlegroups.com, google...@googlegroups.com
Faizal,

I would say that OpenRefine and Databases are for different tasks, so I don't think one is better than the other, but that they both have things they are good at. The type of task you are describing here I'd definitely say OpenRefine can help.

Generally with data like this I'd suggest you try to incrementally improve the data one step at a time. So in your examples I might do the following:

Remove question marks, commas, periods, and spaces (and possibly other punctuation but not hyphens '-') by using 'replace' 
Find all the patterns of 4 digits followed by a hyphen followed by three other characters, extract the 4 digits and add a leading zero - using 'match' (see other postings in this thread)
Find all the patterns of 5 digits followed by a hyphen followed by 5 more digits and extract the bits you want - using 'match' again
Then use a text filter to find all cells which don't just contain 5 digits (you can use a regular expression in the text filter). This will let you see what you are are left with and if these are all not needed you can just remove these (use a transformation with the expression "" or null)

Hope this is enough to get you started

Owen

Jonathon Paarlberg

unread,
Jul 15, 2015, 10:48:53 AM7/15/15
to openr...@googlegroups.com, google...@googlegroups.com
value.match(/(.*)(\d{6})(.*)/) = [ "hello ", "123456", " goodbye" ] (array with three captured values)

So for the question 'everything except the numbers' - you only need to capture the non-number bits of your regular expression

So I'd need to specify the first element of the array output by that last regex? I think I can figure that out.

However, it might be simpler (for my purposes) to just capture all letters and exclude all non-letters. (The numbers are inside square brackets at the end of the line. Alternatively, I suppose I could exclude brackets and everything between them, but I don't really understand how to exclude using GREL; is there a specific command, like NOT, that would do it?)


On Friday, September 16, 2011 at 4:35:01 PM UTC-4, Casey wrote:

Thad Guidry

unread,
Jul 15, 2015, 11:31:42 AM7/15/15
to openrefine, refine
Use replace()
and replace your non-letters with nothing, empty.

Such as:
value.replace(/\W\d+/,"")


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Owen Stephens

unread,
Jul 15, 2015, 12:03:48 PM7/15/15
to openr...@googlegroups.com, google...@googlegroups.com
The issue with 'match' is that you have to match the whole line with a pattern and then capture 'groups' of the match pattern - so 'capturing just non-numbers' as a problem isn't really well enough defined for designing a strategy for this using the 'match' option.

Thad's suggestion works where you've got unknown distribution of non-letters and you aren't interested in any of the non-letters. If you want to preserve the original value as well you can combine this with the 'add new column...' option to put your extracted string in a new column.

However if you have a more regular pattern (and it sounds like you have if the numbers are always in square brackets at the end of the line) you can write a match to pick out the other stuff more precisely:

value.match(/(.*)\[\d*\]/)[0]

would do the trick

Owen
Reply all
Reply to author
Forward
0 new messages