Two addresses. Note that the first has incorrectly transposed the
direction and street name. The second has an extra space in it before
the street type. Clearly done by someone who didn't know how to
concatenate properly -- or didn't care.
1000 RAMPART S ST
100 JOHN CHURCHILL CHASE  ST
I want to parse the elements into an array of values that can be
inserted into new database fields.
Anyone who loves solving these kinds of puzzles care to relieve my
frazzled brain?
The pattern I'm using doesn't keep the "CHASE" with the "JOHN
CHURCHILL":
>>> p = re.compile(r'(?P<streetnum>\d+)\s(?P<streetname>[A-Z\s]*)\s(?P<streetdir>\w*)\s(?P<streettype>\w{2})$')
>>> s = '1405 RAMPART S ST'
>>> m = re.search(p, s)
>>> m
<_sre.SRE_Match object at 0x011A4440>
>>> print m.groups()
('1405', 'RAMPART', 'S', 'ST')
>>> s = '45 JOHN CHURCHILL CHASE ST'
>>> m = re.search(p, s)
>>> m
<_sre.SRE_Match object at 0x011A43E8>
>>> print m.groups()
('45', 'JOHN CHURCHILL', 'CHASE', 'ST')
If you're really serious about it (e.g. you are the post office trying
to program automatic mail sorting machines) there is no simple regex
trick anything like what you want. A lot of addresses will be
ambiguous. You have to use all the info you have about your entire address
corpus (e.g. you need a complete street directory of the whole US) and
do a bunch of Bayesian inference. As a very simple example, for an
address like "1000 RAMPART S ST" you'd use the zip code to identify the
address's geographic neighborhood, and then use your street directory to
find candidate correct addresses within that zip code. The USPS does
an amazing job of delivering mail to completely mangled addresses
based on methods like that.
>>> def parse_address(address):
...     parts = address.split()
...     if parts[-2] == "S":
...         parts[1 : -1] = [parts[-2]] + parts[1 : -2]
...     parts[1 : -1] = [" ".join(parts[1 : -1])]
...     return parts
...
>>> print parse_address("1000 RAMPART S ST")
['1000', 'S RAMPART', 'ST']
>>> print parse_address("100 JOHN CHURCHILL CHASE ST")
['100', 'JOHN CHURCHILL CHASE', 'ST']
Paul,
That's a sound methodology. I actually have a routine that will
compare an address to a list of all streets in the city using a Short
Distance function. I have used that in circumstances when there are a
lot of problems with addresses. In this case, however, the streets are
actually structured very well -- except for the transposed street
directions. I was really hoping to see if there's a solution that
handles one, two, and three word strings, followed by an occasional
single character, and then a two character suffix. I'm still hoping
for that kind of a solution if it exists. The reason? It's actually a
very small number of addresses that aren't being captured with the
current regex. I don't see the need for overkill, and I'm always
stretching to learn something I haven't already succeeded at
accomplishing. I may just make a second pass at the data with a
different regex.
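For anyone following along, the street-list comparison mentioned above could be sketched roughly like this with the standard library's difflib. The street list and the helper name closest_street are made up for illustration, and difflib's similarity ratio is only a stand-in for whatever distance function the real routine uses.

```python
import difflib

# Hypothetical street directory; a real one would come from the city's data.
STREETS = ["S RAMPART", "JOHN CHURCHILL CHASE", "CANAL"]

def closest_street(name, streets=STREETS, cutoff=0.6):
    """Return the most similar known street name, or None if nothing is close."""
    matches = difflib.get_close_matches(name, streets, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this list, closest_street("RAMPART S") returns "S RAMPART", and a name that resembles nothing in the list returns None.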
This is a nice approach I wouldn't have thought to pursue. I've never
seen this referencing of list elements in reverse order with negative
values, so that certainly expands my knowledge of Python. Of course,
I'd want to check for other directionals -- probably with a list
check, e.g.,
if parts[-2] in ('E', 'W', 'N', 'S'):
Thanks for sharing your approach.
After studying this again today, I realized the ingeniousness of
reverse slicing the list (or perhaps right slicing) -- that one
doesn't have to worry about the number of words in the string.
To translate for those who may follow, the expression "parts[1 : -1]"
means gather list items from position one in the list (index position
2) to one index position before the end of the list. The value in this
is that we already know the first list element after a split() will be
the street number. The last element will be the street type.
Everything in between, no matter how many words, will be the street
name -- excepting, of course, the instances where there's a street
direction added in, as captured in the example above.
A very nice solution. Thanks again!
Correction:
[snip] the expression "parts[1 : -1]" means gather list items from the
second element in the list (index value 1) to one index position
before the end of the list. [snip]
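A quick way to check the slice is to run it on addresses with different word counts. The helper name name_words is only for illustration.

```python
def name_words(address):
    """Return everything between the street number and the street type."""
    parts = address.split()   # split() with no argument also absorbs repeated spaces
    return parts[1:-1]        # second element up to, but not including, the last

for address in ("12 CANAL ST", "1000 RAMPART S ST", "100 JOHN CHURCHILL CHASE ST"):
    print(name_words(address))
```

This prints ['CANAL'], then ['RAMPART', 'S'], then ['JOHN', 'CHURCHILL', 'CHASE']: the same slice works whether the name is one word or three.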
How does the following perform?
pat = re.compile(r'(?P<streetnum>\d+)\s+(?P<streetname>[A-Z\s]+)\s+(?P<streetdir>N|S|W|E|)\s+(?P<streettype>ST|RD|AVE?|)$')
or more legibly:
pat = re.compile(
    r'''
    (?P<streetnum>   \d+ )                  #M series of digits
    \s+
    (?P<streetname>  [A-Z\s]+ )             #M one-or-more word
    \s+
    (?P<streetdir>   S?E|SW?|N?W|NE?| )     #O direction or nothing
    \s+
    (?P<streettype>  ST|RD|AVE? )           #M street type
    $                                       #M END
    ''', re.VERBOSE)
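To see what the verbose pattern captures, here is a small sketch that runs it and collects the named groups with groupdict(), which maps naturally onto separate database fields. The helper name parse_with_regex is an assumption for illustration, and the second sample address is built with the two spaces described in the original question.

```python
import re

pat = re.compile(r'''
    (?P<streetnum>   \d+ )                 # series of digits
    \s+
    (?P<streetname>  [A-Z\s]+ )            # one or more words
    \s+
    (?P<streetdir>   S?E|SW?|N?W|NE?| )    # direction or nothing
    \s+
    (?P<streettype>  ST|RD|AVE? )          # street type
    $
    ''', re.VERBOSE)

def parse_with_regex(address):
    """Return a dict of the named groups, or None when the pattern does not match."""
    m = pat.match(address)
    return m.groupdict() if m else None

print(parse_with_regex('1000 RAMPART S ST'))
print(parse_with_regex('100 JOHN CHURCHILL CHASE' + 2 * ' ' + 'ST'))
```

The first call yields streetname 'RAMPART' with streetdir 'S'; the second yields streetname 'JOHN CHURCHILL CHASE' with an empty streetdir.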
MRAB's solution deserved a more complete version:
>>> def parse_address(address):
        # Handles poorly-formatted addresses:
        #  100 RAMPART S ST -- direction in wrong position
        #  45 JOHN CHURCHILL CHASE  ST -- two spaces before type
        #addresslist = ['num', 'dir', 'name', 'type']
        addresslist = ['', '', '', '']
        parts = address.split()
        if parts[-2] in ('E', 'W', 'N', 'S'):
            addresslist[1] = parts[-2]
            addresslist[2] = ' '.join(parts[1 : -2])
        else:
            addresslist[2] = ' '.join(parts[1 : -1])
        addresslist[0] = parts[0]
        addresslist[3] = parts[-1]
        return addresslist
>>> parse_address('45 John Churchill Chase N St')
['45', 'N', 'John Churchill Chase', 'St']
>>> parse_address('45 John Churchill Chase St')
['45', '', 'John Churchill Chase', 'St']
Is that all? That little empty space after the "|" OR metacharacter?
Wow.
As a test, to create a failure, if I remove that last "|"
metacharacter from the "N|S|W|E|" string (i.e., "N|S|W|E"), the match
fails on addresses that do not have that malformed direction after the
street name (e.g., '45 JOHN CHURCHILL CHASE ST')
Very clever. I don't think I've ever seen documentation showing that
little trick.
Thanks for enlightening me!
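For those who may follow, the test described above can be reproduced with the sketch below. One thing worth noting: because the pattern requires `\s+` on both sides of the (possibly empty) direction, an address without a direction only matches when it has at least two whitespace characters before the street type, which is exactly the "extra space" case from the original question. The last pattern, optional_dir, is a variant that does not appear in this thread; it makes the direction and its trailing whitespace optional as a unit and matches the street name lazily, so single-space addresses also match. All variable names here are for illustration only.

```python
import re

TEMPLATE = (r'(?P<streetnum>\d+)\s+(?P<streetname>[A-Z\s]+)\s+'
            r'(?P<streetdir>%s)\s+(?P<streettype>ST|RD|AVE?|)$')
with_empty = re.compile(TEMPLATE % 'N|S|W|E|')    # trailing "|" adds an empty alternative
without_empty = re.compile(TEMPLATE % 'N|S|W|E')  # direction becomes mandatory

two_spaces = '45 JOHN CHURCHILL CHASE' + 2 * ' ' + 'ST'
one_space = '45 JOHN CHURCHILL CHASE ST'

# Variant (an assumption, not from the thread): optional direction group,
# lazy street name so the direction is not swallowed by the name.
optional_dir = re.compile(
    r'(?P<streetnum>\d+)\s+(?P<streetname>[A-Z\s]+?)\s+'
    r'(?:(?P<streetdir>N|S|W|E)\s+)?(?P<streettype>ST|RD|AVE?)$')

print(with_empty.match(two_spaces).groups())
print(without_empty.match(two_spaces))
print(with_empty.match(one_space))
print(optional_dir.match(one_space).groups())
```

The first line prints ('45', 'JOHN CHURCHILL CHASE', '', 'ST'); the next two print None; the last prints ('45', 'JOHN CHURCHILL CHASE', None, 'ST'). Note that an unmatched optional group comes back as None rather than an empty string.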