Breaking addresses into capture groups is driving me nuts


The Frog

May 11, 2013, 10:04:02 AM
to re...@googlegroups.com
Hi Everyone,

I would like to ask for some help with what appears to be a simple problem, but I am having a hell of a time getting this regex to function properly. Any help would be greatly appreciated.

I have a list of addresses to scan through and break into appropriate parts. They take the following form(s):

Level 1, 100 SomeStreet St, Some Suburb State 1234
100 SomeStreet St, Some Suburb State 1234
Some Suburb State 1234

The 'Some Suburb' part can also be a single-word name, e.g. 'Somesuburb State 1234', and that can happen in any of the forms above. The states are all capitalised and might contain a '.' character in some instances. Postcodes are all digits and are always at the end of the string.

I need to be able to separate the street address (if one is given), the suburb, the state and the postcode into separate pieces. I am using the Windows scripting regex engine but will later port this to something else, probably Java.

Can anyone help me to get this right? It's driving me nuts. Here is where I got up to, but it doesn't work when the address is truncated from the simple '123 Some Street,......' form, or extended as in the 'Level 1, ......' version.

(.*)?\,?\s?([\w|\W]*)\,?\s?(VIC|SA|WA|TAS|NSW|ACT|QLD|NT)\,?\s?(\d*)
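One way to see where this attempt goes wrong is to run it and print the numbered groups. The sketch below uses Python's `re` module purely as a test bench (an assumption; the post targets the Windows scripting engine), and `show_groups` is a hypothetical helper name. Under Python, the greedy first group swallows the street and the suburb together, and the second group is left empty.

```python
import re

# The pattern from the post, unchanged.
# Note: inside [...] the '|' is a literal pipe, so [\w|\W] simply means "any character".
ORIGINAL = re.compile(
    r"(.*)?\,?\s?([\w|\W]*)\,?\s?(VIC|SA|WA|TAS|NSW|ACT|QLD|NT)\,?\s?(\d*)"
)


def show_groups(address):
    """Return the four numbered groups as a tuple, or None when nothing matches."""
    m = ORIGINAL.match(address)
    return m.groups() if m else None


if __name__ == "__main__":
    for line in (
        "Level 1, 100 SomeStreet St, Some Suburb NSW 1234",
        "100 SomeStreet St, Some Suburb WA 1234",
        "Some Suburb VIC 1234",
    ):
        print(repr(line), "->", show_groups(line))
```

Because `(.*)` is greedy and everything after it is optional until the state alternation, the engine only gives back as much text as it needs to find the state, which is why nothing ends up in the suburb group.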

(I'm in Australia - so the states are listed in 'raw' form at the moment but it would need to be changed to accommodate the possibility of the '.' character).
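As a rough sketch of how the '.' character could be accommodated, the state can be matched as two or three capital letters, each optionally followed by a dot, and then normalised against the known list. This is in Python only for illustration; `STATE_TOKEN` and `normalise_state` are hypothetical names, and the assumption that dots only ever follow a letter is mine, not the post's.

```python
import re

# The state list from the post.
STATES = {"VIC", "SA", "WA", "TAS", "NSW", "ACT", "QLD", "NT"}

# Two or three capitals, each optionally followed by a dot: NT, N.T., N.S.W. and so on.
STATE_TOKEN = re.compile(r"(?:[A-Z]\.?){2,3}")


def normalise_state(token):
    """Return the dot-free state code, or None if the token is not a known state."""
    if STATE_TOKEN.fullmatch(token) is None:
        return None
    bare = token.replace(".", "")
    return bare if bare in STATES else None


if __name__ == "__main__":
    for token in ("N.T.", "NSW", "N.S.W.", "XYZ", "nsw"):
        print(token, "->", normalise_state(token))
```

Checking the stripped token against the list keeps the regex short while still rejecting capitalised strings that are not states.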

Any help or guidance would be greatly appreciated.

Cheers

The Frog

Prashant Patole

May 12, 2013, 7:44:43 AM
to re...@googlegroups.com
Hi The Frog,

Greetings. 

I tried building it from the back (right to left) and got it in 15 minutes.

I have a .NET-based editor, so it works on .NET. I guess you may need to tune it a bit.
Please put ? on any group you consider optional. I have only made Level 1 optional.

Please note the (?:,| ) construct I used to consume the comma and space between fields. This is what made it easy.
Also, please note that if you use named captures instead of plain numbered captures, things get much easier.


(?<lev1>(?:[\w]+[ ]*){1,})?(?:,| )*(?<streat>(?:[\w]+[ ]*){1,})(?:,| )+(?<suburb>(?:[\w]+[ ]*){1,})(?<state>VIC|SA|WA|TAS|NSW|ACT|QLD|NT)(?:,| )+(?<post>\d*)
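For anyone wanting to try the named-capture approach outside .NET, here is a minimal sketch in Python. Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, so the pattern is rewritten on that point only (the group names, including `streat`, are kept as posted). `split_address` is a hypothetical helper, and the whitespace stripping is my addition because the word-plus-spaces groups can capture a trailing space.

```python
import re

# The pattern above with Python-style named groups; otherwise unchanged.
PATTERN = re.compile(
    r"(?P<lev1>(?:[\w]+[ ]*){1,})?(?:,| )*"
    r"(?P<streat>(?:[\w]+[ ]*){1,})(?:,| )+"
    r"(?P<suburb>(?:[\w]+[ ]*){1,})"
    r"(?P<state>VIC|SA|WA|TAS|NSW|ACT|QLD|NT)(?:,| )+"
    r"(?P<post>\d*)"
)


def split_address(address):
    """Return a dict of named groups (whitespace-trimmed), or None on no match."""
    m = PATTERN.match(address)
    if m is None:
        return None
    return {
        name: (value.strip() if value is not None else None)
        for name, value in m.groupdict().items()
    }


if __name__ == "__main__":
    for line in (
        "Level 1, 100 SomeStreet St, Some Suburb NSW 1234",
        "100 SomeStreet St, Some Suburb WA 1234",
        "Some Suburb VIC 1234",
    ):
        print(repr(line), "->", split_address(line))
```

The full three-part form splits as intended. The shorter forms are worth printing too, since only the first group is optional and the remaining groups still have to find text to match.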


Prashant





The Frog

May 13, 2013, 12:45:32 AM
to re...@googlegroups.com
Hi Prashant,

Thank you for having a go at this. I appreciate your input and effort.

I have tried to implement this in my code and have had to make a few minor modifications to it. I am still stuck on one of the sections and was hoping you might be able to help with it. Your approach to the regex (working from the end of the string) is a great idea. What works so far is this:

(?:[\s\,]+)(?<suburb>[\w+\s]+)(?:\s+)(?<state>[A-Z|.]{2,4}?)(?:,| )+(?<post>\d{4}$)

Against this set of sample data it works nicely:
Level 1, 100 SomeStreet St, Some Suburb N.T. 1234
100 SomeStreet St, Some Suburb WA 1234
Some Suburb VIC 1234
Suite 16, 123 Whatever Rd, suburbia NSW 5678
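A quick way to check the tail pattern against the sample lines is sketched below in Python (an assumption on my part; named groups are rewritten to Python's `(?P<name>...)` form, and `tail_parts` is a hypothetical helper). Under Python's `re.search`, the three lines containing a comma split as expected, but the suburb-only line keeps just the last word of the suburb, because the leading separator group must consume at least one space or comma. `TAIL_OR_START` is a hypothetical variant that also lets the match begin at the start of the string.

```python
import re

# The pattern above with Python-style named groups; otherwise unchanged.
TAIL = re.compile(
    r"(?:[\s\,]+)(?P<suburb>[\w+\s]+)(?:\s+)"
    r"(?P<state>[A-Z|.]{2,4}?)(?:,| )+(?P<post>\d{4}$)"
)

# Hypothetical variant: the leading separator may instead be the start of the string.
TAIL_OR_START = re.compile(
    r"(?:^|[\s\,]+)(?P<suburb>[\w+\s]+)(?:\s+)"
    r"(?P<state>[A-Z|.]{2,4}?)(?:,| )+(?P<post>\d{4}$)"
)


def tail_parts(pattern, address):
    """Return (suburb, state, post) found by pattern.search, or None."""
    m = pattern.search(address)
    if m is None:
        return None
    return (m.group("suburb"), m.group("state"), m.group("post"))


if __name__ == "__main__":
    samples = (
        "Level 1, 100 SomeStreet St, Some Suburb N.T. 1234",
        "100 SomeStreet St, Some Suburb WA 1234",
        "Some Suburb VIC 1234",
        "Suite 16, 123 Whatever Rd, suburbia NSW 5678",
    )
    for line in samples:
        print(tail_parts(TAIL, line), "|", tail_parts(TAIL_OR_START, line))
```

Other engines and test tools may report the suburb-only case differently, so it is worth re-running the same four lines wherever the pattern ends up.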

Where it falls short, and where I perhaps did not explain myself well enough, is in capturing the street-level address as a single chunk. For example:

Suite 16, 123 Whatever Rd
|_____________________|
               Street

or this

100 SomeStreet
|____________|
       Street

or this

Whatever Wherever Rd
|__________________|
            Street

These <street> capture groups always end with a ',' and a space, but of course the first address type also has a comma in the middle of the street address itself. At the moment the regex above uses a non-capturing group to skip the comma and space that come before the suburb, whether or not a street address precedes it. What I am trying to achieve is a single <street> capture group, as in the examples above, that holds all the parts together.

Do you have any thoughts on how to achieve this?
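One possible way to get a single street group, sketched here in Python under my own assumptions, is to let a greedy `.*` run up to the last comma before the suburb and make that whole prefix optional. The pattern, the name `parse_address`, and the choice to anchor with `^` and `$` are illustrative and not something either poster proposed. Python's `(?P<name>...)` syntax is used, and the state is taken to be two or three capitals with optional dots.

```python
import re

# Hypothetical full-address pattern:
#   street : everything up to the last comma (optional, so it may be absent)
#   suburb : shortest run of word characters and spaces that lets the rest match
#   state  : two or three capitals, each optionally followed by a dot
#   post   : exactly four digits at the end of the string
ADDRESS = re.compile(
    r"^(?:(?P<street>.*),\s*)?"
    r"(?P<suburb>[\w\s]+?)\s+"
    r"(?P<state>(?:[A-Z]\.?){2,3})[,\s]+"
    r"(?P<post>\d{4})$"
)


def parse_address(line):
    """Return a dict with street, suburb, state and post, or None on no match."""
    m = ADDRESS.match(line.strip())
    return m.groupdict() if m else None


if __name__ == "__main__":
    for line in (
        "Suite 16, 123 Whatever Rd, suburbia NSW 5678",
        "Level 1, 100 SomeStreet St, Some Suburb N.T. 1234",
        "100 SomeStreet St, Some Suburb WA 1234",
        "Some Suburb VIC 1234",
    ):
        print(parse_address(line))
```

The greedy `.*` backtracks only as far as the last comma, so an internal comma such as the one in 'Suite 16, 123 Whatever Rd' stays inside the street group, and a line with no comma leaves `street` as `None`.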

The Frog

Prashant Patole

May 13, 2013, 1:33:42 AM
to re...@googlegroups.com
Hi Frog,

This is just a quick reply, and it does not handle the optional Level 1 part very well.
It is based on whatever I could understand from a quick look at your response.

(?<lev1>[\w+\s]+)(?:[\s\,]+)(?<street>(?:[\w+\s]+,?)+)(?:[\s\,]+)(?<suburb>[\w+\s]+)(?:\s+)(?<state>[A-Z|.]{2,4}?)(?:,| )+(?<post>\d{4}$)

I hope this helps you think it through a bit further. I have highlighted the capturing portion. You have to capture each word in the street, with an optional comma after it.
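The "each word with an optional comma" idea can be tried in isolation first. The sketch below is in Python and is only an illustration: `STREET_WORDS` and `street_chunk` are hypothetical names, and the input is assumed to be just the street part with its trailing comma and space.

```python
import re

# One or more words, each optionally followed by a comma and any amount of whitespace.
STREET_WORDS = re.compile(r"(?:\w+,?\s*)+")


def street_chunk(text):
    """Return the leading run of words and commas, minus trailing separators."""
    m = STREET_WORDS.match(text)
    return m.group(0).rstrip(", ") if m else None


if __name__ == "__main__":
    for text in ("Suite 16, 123 Whatever Rd, ", "100 SomeStreet, ", "Whatever Wherever Rd, "):
        print(repr(text), "->", repr(street_chunk(text)))
```

On its own this group is greedy enough to run on through the suburb, state and postcode of a full address line, so it only behaves when the rest of the pattern pins down where the tail begins.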

I will try to look at this in detail this evening :)


Regards,


The Frog
