Use the address_standardizer extension, particularly for North American addressing:
https://postgis.net/docs/postgis_installation.html#installing_pagc_address_standardizer
Or use an ML-trained standardizer, like this one:
https://github.com/pramsey/pgsql-postal
Or hand off to an external geocoding service using a web service call:
https://docs.google.com/presentation/d/1Fgc_2dzWAzT--HdMEiWj2fFLJNnpxPXmnYXx9Js3xjE/edit
To handball some fuzzy matching yourself, use the functions in the PostgreSQL fuzzystrmatch contrib module:
CREATE EXTENSION fuzzystrmatch;
The Python utility is really just using different ratios of string length and Levenshtein distance; it ain't rocket science.
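For illustration, a length-normalized Levenshtein score in that spirit can be sketched in a few lines of Python. The exact formula is a hypothetical stand-in, not what any particular utility computes:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: edit distance scaled by the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

So `ratio("main rd", "main road")` lands somewhere between 0 and 1, and identical strings score 1.0.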
P.
> _______________________________________________
> postgis-users mailing list
> postgi...@lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
https://github.com/woodbri/address-standardizer
If you are trying to write a geocoder, I have open-sourced the imaptools
geocoder, which you can find here:
https://github.com/woodbri/imaptools.com
Start by reading these:
https://github.com/woodbri/imaptools.com/blob/master/README-geocoder-design.md
https://github.com/woodbri/imaptools.com/blob/master/README-geocoder.md
Sorry things are a little chaotic because I just dumped all my code up
here, but I have been documenting things in README files and trying to
reorganize them to make more sense.
-Steve
At the simplest, you're trying to catch common lexicographic differences, like off-by-one addresses or alternate spellings. This is the realm of trigrams and Levenshtein distances.
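A minimal sketch of trigram similarity, roughly in the spirit of PostgreSQL's pg_trgm extension (the padding and scoring details here are simplified assumptions, not pg_trgm's exact algorithm):

```python
def trigrams(s: str) -> set:
    """Break a lowercased, space-padded string into 3-character windows."""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trgm_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

With this, "103 main st" and "104 main st" share most of their trigrams and score high, while unrelated strings score near zero; off-by-one house numbers and small misspellings degrade the score gracefully instead of failing an exact match.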
Then you start dealing with different forms of abbreviations (rd, road, r) and formats (unit 4, #4, apt 4, 4). This is the realm of data-less standardizers.
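A data-less standardizer at this level is essentially a token-rewrite table. A toy sketch (the abbreviation table here is invented for illustration; real standardizers ship much larger rule sets):

```python
import re

# Toy lexicon mapping common abbreviations to canonical tokens.
ABBREVIATIONS = {
    "rd": "road", "r": "road",
    "str": "street",
    "apt": "unit", "#": "unit",
}

def normalize(addr: str) -> str:
    """Lowercase, split off '#' from a following number, rewrite tokens."""
    addr = re.sub(r"#(\d)", r"# \1", addr.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in addr.split()]
    return " ".join(tokens)
```

After normalization, "123 Main Rd #4" and "123 Main Road Apt 4" collapse to the same string, so a plain equality or trigram comparison can match them.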
Then you start dealing with larger forms of aliasing: standardizing city and state names, recognizing the major components of addresses. This is the realm of dictionary-backed standardizers.
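A dictionary-backed step can be sketched as a small gazetteer lookup that recognizes a trailing state token and canonicalizes it (the alias table here is a toy stand-in for real reference data):

```python
# Toy alias dictionary; a real standardizer loads large gazetteer tables.
STATE_ALIASES = {
    "calif": "CA", "california": "CA", "ca": "CA",
    "ore": "OR", "oregon": "OR", "or": "OR",
}

def split_city_state(raw: str):
    """Split 'City, State' and canonicalize the state if we recognize it."""
    city, _, state = raw.rpartition(",")
    canon = STATE_ALIASES.get(state.strip().lower())
    if canon:
        return city.strip(), canon
    return raw.strip(), None  # no recognizable state component
```

Unlike the abbreviation rewrites above, this stage depends on reference data: it only works for names the dictionary already knows.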
Then you start dealing with all forms of aliasing. This is actually getting down into geocoding, and address standardizers backed by complete address and road databases.
What people mean when they say "can you do address standardization" can vary massively. It can also be frustrating, because the performance of something like Google's geocoder algorithm, backed by the largest and most up-to-date geographic database in the world, gets casually compared to a simple format- and dictionary-backed standardizer by folks with no understanding of the complexity or the amount of data that lives under the covers of this "simple" task. I think people are far more understanding of something like machine translation and "get" that it's a really hard problem, because learning a new language is hard. But geocoding is "easy": anyone can look at an address and then look that address up in a map book (ha ha, well, anyone over 40 at least).
P.