Stuck on a three word street name regex

Brian D

Jan 27, 2010, 7:28:35 PM
I've tackled this kind of problem before by looping through a patterns
dictionary, but there must be a smarter approach.

Two addresses. Note that the first has incorrectly transposed the
direction and street name. The second has an extra space in it before
the street type. Clearly done by someone who didn't know how to
concatenate properly -- or didn't care.

1000 RAMPART S ST

100 JOHN CHURCHILL CHASE  ST

I want to parse the elements into an array of values that can be
inserted into new database fields.

Anyone who loves solving these kinds of puzzles care to relieve my
frazzled brain?

The pattern I'm using doesn't keep the "CHASE" with the "JOHN
CHURCHILL":

>>> p = re.compile(r'(?P<streetnum>\d+)\s(?P<streetname>[A-Z\s]*)\s(?P<streetdir>\w*)\s(?P<streettype>\w{2})$')
>>> s = '1405 RAMPART S ST'
>>> m = re.search(p, s)
>>> m
<_sre.SRE_Match object at 0x011A4440>
>>> print m.groups()
('1405', 'RAMPART', 'S', 'ST')
>>> s = '45 JOHN CHURCHILL CHASE ST'
>>> m = re.search(p, s)
>>> m
<_sre.SRE_Match object at 0x011A43E8>
>>> print m.groups()
('45', 'JOHN CHURCHILL', 'CHASE', 'ST')

Paul Rubin

Jan 27, 2010, 7:35:27 PM
Brian D <brian...@gmail.com> writes:
> I've tackled this kind of problem before by looping through a patterns
> dictionary, but there must be a smarter approach.
>
> Two addresses. Note that the first has incorrectly transposed the
> direction and street name. ....

If you're really serious about it (e.g. you are the post office trying
to program automatic mail sorting machines) there is no simple regex
trick anything like what you want. A lot of addresses will be
ambiguous. You have to use all the info you have about your entire address
corpus (e.g. you need a complete street directory of the whole US) and
do a bunch of Bayesian inference. As a very simple example, for an
address like "1000 RAMPART S ST" you'd use the zip code to identify the
address's geographic neighborhood, and then use your street directory to
find candidate correct addresses within that zip code. The USPS does
an amazing job of delivering mail to completely mangled addresses
based on methods like that.

MRAB

Jan 27, 2010, 8:27:42 PM
to pytho...@python.org
Brian D wrote:
> I've tackled this kind of problem before by looping through a patterns
> dictionary, but there must be a smarter approach.
>
> Two addresses. Note that the first has incorrectly transposed the
> direction and street name. The second has an extra space in it before
> the street type. Clearly done by someone who didn't know how to
> concatenate properly -- or didn't care.
>
> 1000 RAMPART S ST
>
> 100 JOHN CHURCHILL CHASE  ST
>
> I want to parse the elements into an array of values that can be
> inserted into new database fields.
>
> Anyone who loves solving these kinds of puzzles care to relieve my
> frazzled brain?
>
> The pattern I'm using doesn't keep the "CHASE" with the "JOHN
> CHURCHILL":
>
[snip]
Regex doesn't gain you much. I'd split the string and then fix the parts
as necessary:

>>> def parse_address(address):
...     parts = address.split()
...     if parts[-2] == "S":
...         parts[1 : -1] = [parts[-2]] + parts[1 : -2]
...     parts[1 : -1] = [" ".join(parts[1 : -1])]
...     return parts
...
>>> print parse_address("1000 RAMPART S ST")
['1000', 'S RAMPART', 'ST']
>>> print parse_address("100 JOHN CHURCHILL CHASE  ST")
['100', 'JOHN CHURCHILL CHASE', 'ST']

Brian D

Jan 27, 2010, 10:39:44 PM
On Jan 27, 6:35 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

Paul,

That's a sound methodology. I actually have a routine that will
compare an address to a list of all streets in the city using a Short
Distance function. I have used that in circumstances when there are a
lot of problems with addresses. In this case, however, the streets are
actually structured very well -- except for the transposed street
directions. I was really hoping to see if there's a solution that
handles one, two, and three word strings, followed by an occasional
single character, and then a two character suffix. I'm still hoping
for that kind of a solution if it exists. The reason? It's actually a
very small number of addresses that aren't being captured with the
current regex. I don't see the need for overkill, and I'm always
stretching to learn something I haven't already succeeded at
accomplishing. I may just make a second pass at the data with a
different regex.
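
For readers following along, the street-list comparison mentioned above might look roughly like the sketch below. It is not the routine described in this post: it swaps in the standard library's `difflib.get_close_matches` as the similarity measure, and `KNOWN_STREETS`, `closest_street`, and the `cutoff` value are hypothetical names and settings chosen for illustration.

```python
import difflib

# Hypothetical street directory; a real one would come from the city's data.
KNOWN_STREETS = ["RAMPART", "JOHN CHURCHILL CHASE", "CANAL", "MAGAZINE"]


def closest_street(name, streets=KNOWN_STREETS, cutoff=0.8):
    """Return the best fuzzy match for name, or None if nothing is close enough."""
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio
    # and drops anything scoring below the cutoff.
    matches = difflib.get_close_matches(name.upper(), streets, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_street("RAMPRAT"))  # a transposition typo of RAMPART
print(closest_street("ZZZZ"))     # nothing similar in the directory
```

With this directory, the misspelled "RAMPRAT" resolves to "RAMPART", while an unrelated string returns None. The cutoff would need tuning against real data.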

Brian D

Jan 27, 2010, 10:57:25 PM

This is a nice approach I wouldn't have thought to pursue. I've never
seen this referencing of list elements in reverse order with negative
values, so that certainly expands my knowledge of Python. Of course,
I'd want to check for other directionals -- probably with a list
check, e.g.,

if parts[-2] in ('E', 'W', 'N', 'S'):

Thanks for sharing your approach.

Brian D

Jan 28, 2010, 8:40:09 AM
> > [snip]
> > Regex doesn't gain you much. I'd split the string and then fix the parts
> > as necessary:
>
> >  >>> def parse_address(address):
> > ...     parts = address.split()
> > ...     if parts[-2] == "S":
> > ...         parts[1 : -1] = [parts[-2]] + parts[1 : -2]
> > ...     parts[1 : -1] = [" ".join(parts[1 : -1])]
> > ...     return parts
> > ...
> >  >>> print parse_address("1000 RAMPART S ST")
> > ['1000', 'S RAMPART', 'ST']
> >  >>> print parse_address("100 JOHN CHURCHILL CHASE  ST")
> > ['100', 'JOHN CHURCHILL CHASE', 'ST']
>
> This is a nice approach I wouldn't have thought to pursue. I've never
> seen this referencing of list elements in reverse order with negative
> values, so that certainly expands my knowledge of Python. Of course,
> I'd want to check for other directionals -- probably with a list
> check, e.g.,
>
> if parts[-2] in ('E', 'W', 'N', 'S'):
>
> Thanks for sharing your approach.

After studying this again today, I realized the ingeniousness of
reverse slicing the list (or perhaps right slicing) -- that one
doesn't have to worry about the number of words in the string.

To translate for those who may follow, the expression "parts[1 : -1]"
means gather list items from position one in the list (index position
2) to one index position before the end of the list. The value in this
is that we already know the first list element after a split() will be
the street number. The last element will be the street type.
Everything in between, no matter how many words, will be the street
name -- excepting, of course, the instances where there's a street
direction added in, as captured in the example above.

A very nice solution. Thanks again!

Brian D

Jan 28, 2010, 8:48:48 AM

Correction:

[snip] the expression "parts[1 : -1]" means gather list items from the
second element in the list (index value 1) to one index position
before the end of the list. [snip]
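
A small sketch of the slicing described in the correction, written for Python 3. The variable names are illustrative only and do not come from the posts above.

```python
# split() with no arguments collapses runs of whitespace, so the extra
# space before the street type disappears at this step.
parts = "100 JOHN CHURCHILL  CHASE  ST".split()

street_number = parts[0]    # first element
street_type = parts[-1]     # last element, counted from the end
middle = parts[1:-1]        # second element up to, but not including, the last

# However many words the name has, they all land in the middle slice.
street_name = " ".join(middle)

print(parts)
print(street_number, "|", street_name, "|", street_type)
```

Because the slice end is exclusive, `parts[1:-1]` stops just before the final element, which is why the street type never ends up in the name.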

Lie Ryan

Jan 28, 2010, 9:27:59 AM
On 01/28/10 11:28, Brian D wrote:
> I've tackled this kind of problem before by looping through a patterns
> dictionary, but there must be a smarter approach.
>
> Two addresses. Note that the first has incorrectly transposed the
> direction and street name. The second has an extra space in it before
> the street type. Clearly done by someone who didn't know how to
> concatenate properly -- or didn't care.
>
> 1000 RAMPART S ST
>
> 100 JOHN CHURCHILL CHASE  ST
>
> I want to parse the elements into an array of values that can be
> inserted into new database fields.
>
> Anyone who loves solving these kinds of puzzles care to relieve my
> frazzled brain?
>
> The pattern I'm using doesn't keep the "CHASE" with the "JOHN
> CHURCHILL":


How does the following perform?

pat = re.compile(r'(?P<streetnum>\d+)\s+(?P<streetname>[A-Z\s]+)\s+(?P<streetdir>N|S|W|E|)\s+(?P<streettype>ST|RD|AVE?|)$')

or more legibly:

pat = re.compile(
    r'''
    (?P<streetnum>  \d+ )                   #M series of digits
    \s+
    (?P<streetname> [A-Z\s]+ )              #M one-or-more word
    \s+
    (?P<streetdir>  S?E|SW?|N?W|NE?| )      #O direction or nothing
    \s+
    (?P<streettype> ST|RD|AVE? )            #M street type
    $                                       #M END
    ''', re.VERBOSE)
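
A Python 3 sketch for trying the verbose pattern above. One thing worth checking: when the direction is empty, the two `\s+` around it each still need at least one whitespace character, so an address without a direction seems to match only when it has two spaces before the street type. The `variant` pattern and the `fields` helper are hypothetical additions, not part of the post. The variant makes the direction and its leading whitespace optional as a unit and makes the street name lazy.

```python
import re

# The verbose pattern as posted.
posted = re.compile(r'''
    (?P<streetnum>  \d+ )                   # series of digits
    \s+
    (?P<streetname> [A-Z\s]+ )              # one or more words
    \s+
    (?P<streetdir>  S?E|SW?|N?W|NE?| )      # direction or nothing
    \s+
    (?P<streettype> ST|RD|AVE? )            # street type
    $
''', re.VERBOSE)

# Hypothetical variant: optional "whitespace + direction" group, and a lazy
# street name so the name does not swallow a trailing direction.
variant = re.compile(r'''
    (?P<streetnum>  \d+ )
    \s+
    (?P<streetname> [A-Z\s]+? )
    (?: \s+ (?P<streetdir> S?E|SW?|N?W|NE? ) )?
    \s+
    (?P<streettype> ST|RD|AVE? )
    $
''', re.VERBOSE)


def fields(pattern, address):
    """Return (num, name, dir, type), or None when the pattern does not match."""
    m = pattern.search(address)
    if m is None:
        return None
    groups = m.group('streetnum', 'streetname', 'streetdir', 'streettype')
    # An unmatched optional group is None; normalise it to an empty string.
    return tuple(g or '' for g in groups)


for address in ("1000 RAMPART S ST",
                "45 JOHN CHURCHILL CHASE  ST",
                "45 JOHN CHURCHILL CHASE ST"):
    print(repr(address), fields(posted, address), fields(variant, address))
```

In this sketch the posted pattern returns None for the single-space address, while the variant yields the same four fields for both spacings.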

Brian D

Jan 28, 2010, 12:36:52 PM

> Correction:
>
> [snip] the expression "parts[1 : -1]" means gather list items from the
> second element in the list (index value 1) to one index position
> before the end of the list. [snip]

MRAB's solution deserved a more complete version:

>>> def parse_address(address):
        # Handles poorly-formatted addresses:
        #   100 RAMPART S ST -- direction in wrong position
        #   45 JOHN CHURCHILL CHASE  ST -- two spaces before type
        # addresslist = ['num', 'dir', 'name', 'type']
        addresslist = ['', '', '', '']
        parts = address.split()
        if parts[-2] in ('E', 'W', 'N', 'S'):
            addresslist[1] = parts[-2]
            addresslist[2] = ' '.join(parts[1 : -2])
        else:
            addresslist[2] = ' '.join(parts[1 : -1])
        addresslist[0] = parts[0]
        addresslist[3] = parts[-1]
        return addresslist

>>> parse_address('45 John Churchill Chase N St')
['45', 'N', 'John Churchill Chase', 'St']
>>> parse_address('45 John Churchill Chase St')
['45', '', 'John Churchill Chase', 'St']

Brian D

Jan 28, 2010, 12:50:37 PM

Is that all? That little empty space after the "|" OR metacharacter?
Wow.

As a test, to create a failure, if I remove that last "|"
metacharacter from the "N|S|W|E|" string (i.e., "N|S|W|E"), the match
fails on addresses that do not have that malformed direction after the
street name (e.g., '45 JOHN CHURCHILL CHASE ST')

Very clever. I don't think I've ever seen documentation showing that
little trick.

Thanks for enlightening me!
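
A minimal Python 3 sketch of the empty-alternative behaviour discussed here, isolated from the rest of the address pattern. The three pattern names are illustrative only. It also shows the more commonly documented spelling, a trailing `?` on the group, which differs in what the group reports when nothing matched.

```python
import re

with_empty = re.compile(r'(?P<streetdir>N|S|W|E|)')    # trailing "|" adds an empty alternative
without_empty = re.compile(r'(?P<streetdir>N|S|W|E)')  # one letter is required
optional = re.compile(r'(?P<streetdir>N|S|W|E)?')      # the whole group may be skipped

# An empty alternative lets the group match the empty string.
print(repr(with_empty.fullmatch('').group('streetdir')))

# Without it, an empty input cannot match at all.
print(without_empty.fullmatch(''))

# A skipped optional group matches too, but reports None instead of ''.
print(repr(optional.fullmatch('').group('streetdir')))

# All three still accept a real direction.
print(with_empty.fullmatch('S').group('streetdir'))
```

The practical difference is in the returned value: the empty alternative gives an empty string that can go straight into a database field, whereas the `?` form gives None and needs a default.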
