Trying to extract and clean UK postcode from address listing using java-style regexp

1,977 views
Skip to first unread message

Jeremy Kidwell

unread,
Mar 28, 2016, 3:07:28 PM3/28/16
to OpenRefine
I'm trying to tidy up some data obtained from a web scrape which contains site names and addresses. The problem is that the address format is very inconsistent, making a CSV-style filter impossible to do reliably. I know that UK postcodes are consistent enough to run against a regular expression (as per here: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions), and have found a wide range of regex examples to draw on, but I can't seem to make any of these work in openrefine. What I imagine doing is running a "create column based on this column" on the data scraped (each row represents a separate entry and name / address are stored in two separate fields) and then run a regex like the following:

^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}
"(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})"


I realise that a regular expression in GREL needs to conform to javascript regexp style format (as per here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html and here: https://regex101.com/r/eX5uW6/1#javascript) and I'm pretty sure the above conforms, but I can't for the life of me get this to work.

Some sample data:

Liddesdale Square Milton Glasgow G22 7BT
Ladybank Drive Glasgow G52 1EZ

Which I'd like to convert into two columns like the following (using value.replace etc.):
Liddesdale Square Milton Glasgow | G22 7BT
Ladybank Drive Glasgow | G52 1EZ

I suspect that if I can get it running, this would be a very useful recipe for others doing similar work (web scraping addresses in order to generate geocoded databases). Anyone with experience able to help here? I'd be most grateful for any insights offered!

Thad Guidry

unread,
Mar 28, 2016, 3:42:28 PM3/28/16
to openrefine
Matches just fine ... if you remove those double quotes...


^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})

See and play with it here: http://regexr.com/3d3o7

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Thad Guidry

unread,
Mar 28, 2016, 3:52:15 PM3/28/16
to openrefine
There's also this one...which is semi-official coming from the UK's Geoplace public sector partnership...

https://www.geoplace.co.uk/addressbase/schema/addressbasebasictypes.xsd

which I found in regexr.com by searching the Community for "UK postcode".

Thad Guidry

unread,
Mar 28, 2016, 3:54:05 PM3/28/16
to openrefine
Also, it needs to be in Java RegEx, not Javascript RegEx.

Jeremy Kidwell

unread,
Mar 29, 2016, 4:52:14 AM3/29/16
to OpenRefine
Thanks Thad - I hadn't noted the differences between Java RegEx and Javascript RegEx so I've stored that bit of knowledge away for future use. AND thanks for noting my erroneous quote marks. Now, I must be missing something obvious, but why does:

value.match(/^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})/)

return null values in an "Add column based on..."?

Really grateful for your help on this.

Thad Guidry

unread,
Mar 29, 2016, 11:41:41 AM3/29/16
to openrefine
Match() outputs an array of the captured groups.  An Array is not the same as a String, in Java terms, an Array is a Collection or List of strings or integers like [apple,banana,carrot] and ["L1W XYZ", "M1Z 8GY]

OpenRefine is programmed in Java, so your asking OpenRefine to store an Array, but OpenRefine Cells cannot store Arrays, and so a null is stored.  In fact, if you opt to click the radio dot for the Store Error, then you might actually see what is going on  You forgot to tell OpenRefine in your GREL statement that your going to need to store your Array as a String. :)


What would happen if you added to the end of your existing match() GREL expression...

.join("")

Asking OpenRefine to Join your capture groups that are output from the match() function and join them with... nothing, double quotes together.

or join them with something...a simple single whitespace

.join(" ")

Jeremy Kidwell

unread,
Mar 30, 2016, 6:50:27 AM3/30/16
to OpenRefine
Ah - that makes perfect sense. Thanks for the explanation re: match() storing data in array format which seems a bit obvious now that you've pointed it out. 

So I've rewritten to take that into account using a join statement as you suggest, but get a blank output:

join([value.match(/^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})/)], " ")

It gives no results or error codes (I did "store error" and tried a few possible separators). Here's some sample data so you can see what I'm working with:

St John's Church Ponderlaw Street Arbroath DD11 1EU
Strathmore Centre for Youth Development (SCYD) Wellmeadow House, Gas Brae, Blairgowrie PH10 6AY
20 West Marketgait Dundee DD1 1QR

I suspect I'm just one step away from sorting this out...

Thanks again,
Jeremy

Thad Guidry

unread,
Mar 30, 2016, 10:22:39 AM3/30/16
to openrefine
You can just continue to use dot notation syntax in forming up your expression to make things easier sometimes.

Taking the output of match  ->  DOT -> into some other function, just like you did with value.match()

value.match(?).join(?).split(?).join(?)  could go on forever if you wanted to. I don't think we ever put hard limits into expression editor buffer but we probably should have :)

value.match(/^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})/).join(" ")

There was no need to wrap with square brackets [ ]  because .match() will output an array.  A lot of programming languages and references use square brackets just to show that they mean that your working with a list, collection or array.  But .match() always needs a RegEx expression in the form of /myexpression/ , and saying [/myexpression/] probably threw a silent error that we should probably mark as a bug that needs to be fixed.  OpenRefine ideally should have told you in Red color that your expression is not valid, instead of being silent about it.

Jeremy Kidwell

unread,
Mar 31, 2016, 2:42:00 AM3/31/16
to OpenRefine
Brilliant - that makes perfect sense. However, this statement (below) still gives me an "Error: join expects an array and a string" which is what had left me adding those brackets in the first place. Is this not replicated on your side?

value.match(/^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})/).join(" ")


Thad Guidry

unread,
Mar 31, 2016, 10:27:22 AM3/31/16
to openrefine

Thad Guidry

unread,
Mar 31, 2016, 11:28:28 AM3/31/16
to openrefine
Ah, for your particular strings ... you need to wrap as one big capture group ... and also assume the possibility of some characters preceding the zipcodes.

.* assumes possible chars preceding the pattern that your looking for.
( ) creates a capture group(s)
? is an optional pattern to find

This is a working GREL expression based off the official UK postcode regex:

value.match(/.*([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])/).join(" ")

And it works for me.

Thad Guidry

unread,
Mar 31, 2016, 11:40:05 AM3/31/16
to openrefine
Also, adding a facet or new column that has true or false to know if the zipcode is at the end of your strings:

value.endsWith(get(value.match(/.*([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])/),0))


And also you might create an additional column that asks if there might be more than 1 regex match... an easy way to do this is look for that 3rd index [2] , since match() only outputs on the 1st found match....but your string might have duplicate zipcodes in them, and that might be useful for you to know about ?

value.partition(/([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])/)[2]

Jeremy Gillies

unread,
Apr 26, 2017, 5:52:33 AM4/26/17
to OpenRefine
Hi Jeremy, I am having the same issue as you ("Error: join expects an array and a string"). Did you get this working? 

Thanks,

Jeremy 

Ettore Rizza

unread,
Apr 26, 2017, 6:59:36 AM4/26/17
to OpenRefine
This error simply means that value.match() returned "null", which is not an array and, thus, can't be joined.

To extract a UK Postal Code in the beginnig or the end of a string, you can tweak the last regular expressions that Thad suggested:

if(isNull(value.match(/([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]).+/)),


value
.match(/.*([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])/),


value
.match(/([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]).*/)

).join(" ")





Jeremy Gillies

unread,
Apr 26, 2017, 10:04:08 AM4/26/17
to OpenRefine
I'm not sure why it's not working for me. Could it be that the original data was from a csv which wrapped the address in double quotes? 
Screen Shot 2017-04-26 at 15.02.47.png

Ettore RIZZA

unread,
Apr 26, 2017, 10:09:53 AM4/26/17
to openr...@googlegroups.com
Strange...




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.

Owen Stephens

unread,
Apr 27, 2017, 11:24:00 AM4/27/17
to OpenRefine
It looks to me like you might have newline characters in your original string - which means they don't work in the match expression (which doesn't work across lines).

Try amending to:

with(value.replace(/\n/," "),v,if(isNull(v.match(/([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]).+/)),


v
.match(/.*([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])/),


v
.match(/([A-IK-PR-UWYZ]?[A-H,K-Y][0-9]?[0-9A-HJKMNP-Y][ ][0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]).*/)


).join(" "))

I think that will work

Owen

Ettore Rizza

unread,
Apr 30, 2017, 3:46:43 AM4/30/17
to OpenRefine
Arf, obviously... Newlines are invisible in the Open Refine display, yet they have their importance in a regular expression.



Jeremy Gillies

unread,
Jul 18, 2017, 5:54:58 AM7/18/17
to OpenRefine
Thanks all for your help on this. Sorry for the late reply but I'm not getting notifications on this thread. Owen, that worked for me. Just a few exceptions where the original postcode didn't have a space in but I can correct them manually. Your help is really appreciated :)
Reply all
Reply to author
Forward
0 new messages