I'm trying to tidy up some data obtained from a web scrape, which contains site names and addresses. The problem is that the address format is very inconsistent, which makes a CSV-style filter impossible to apply reliably. I know that UK postcodes are consistent enough to match with a regular expression (as per https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions), and I have found a wide range of regex examples to draw on, but I can't seem to make any of them work in OpenRefine. Each row represents a separate entry, with the name and address stored in two separate fields. What I imagine doing is running "create column based on this column" on the address field and then applying a regex like one of the following:
^[A-Z]{1,2}[0-9]{1,2} ?[0-9][A-Z]{2}
"(GIR 0AA)|((([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][0-9]?)|(([ABCDEFGHIJKLMNOPRSTUWYZ][0-9][ABCDEFGHJKSTUW])|([ABCDEFGHIJKLMNOPRSTUWYZ][ABCDEFGHKLMNOPQRSTUVWXY][0-9][ABEHMNPRVWXY])))) [0-9][ABDEFGHJLNPQRSTUWXYZ]{2})"
Some sample data:
Liddesdale Square Milton Glasgow G22 7BT
Ladybank Drive Glasgow G52 1EZ
I'd like to convert this into two columns like the following (using value.replace etc.):
Liddesdale Square Milton Glasgow | G22 7BT
Ladybank Drive Glasgow | G52 1EZ
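To make the intended result concrete, here is a minimal Python sketch of the split, written outside OpenRefine. It assumes the postcode always comes last in the address, so the pattern is anchored at the end of the string rather than at the start. The pattern is a simplified one of my own, not the full validation regex above, and `split_address` is just a hypothetical helper name. I have not tested this inside OpenRefine itself.

```python
import re

# Simplified UK postcode pattern (an assumption, not a full validator):
# outward code (1-2 letters, a digit, optional digit or letter),
# optional space, inward code (a digit and 2 letters), anchored at the END.
POSTCODE_AT_END = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?) ?([0-9][A-Z]{2})\s*$")


def split_address(address):
    """Return (address_without_postcode, postcode); postcode is '' if none found."""
    match = POSTCODE_AT_END.search(address)
    if match is None:
        # No trailing postcode: leave the address untouched.
        return address.strip(), ""
    street = address[:match.start()].strip()
    # Normalise to a single space between outward and inward codes.
    postcode = match.group(1) + " " + match.group(2)
    return street, postcode


for row in ["Liddesdale Square Milton Glasgow G22 7BT",
            "Ladybank Drive Glasgow G52 1EZ"]:
    street, postcode = split_address(row)
    print(street + " | " + postcode)
```

Run against the two sample rows, this prints the two lines shown above. What I'm after is the equivalent of this inside an OpenRefine expression.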
I suspect that if I can get it running, this would be a very useful recipe for others doing similar work (web scraping addresses in order to generate geocoded databases). Anyone with experience able to help here? I'd be most grateful for any insights offered!