_______________________________________________
postgis-users mailing list
postgi...@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users
-Steve W
On 4/24/2020 9:54 PM, Imre Samu wrote:
> > handle addresses in postgresql
>
> maybe you can use the https://github.com/openvenues/libpostal library
> with your favorite language bindings ( Python / Ruby / Go / PHP / Node
> / R / Java ...)
>
> or as a Postgres database extension:
> https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal
> https://github.com/pramsey/pgsql-postal
>
> Regards,
> Imre
>
>
>
>
> Shaozhong SHI <shisha...@gmail.com <mailto:shisha...@gmail.com>>
> ezt írta (időpont: 2020. ápr. 25., Szo, 2:49):
>
> I find this is a simple, but important question.
>
> How best to split numbers and the rest of address?
>
> For instance, one tricky one is as follows:
>
> 21-1 Great Avenue, a city, a country, this planet
>
> How to turn this into the following:
>
> column 1, column 2
>
> 21-1 Great Avenue, a city, a country, this planet
>
> Note: there is a hyphen in 21-1
>
> Any clue?
>
> Regards,
>
> Shao
> _______________________________________________
> postgis-users mailing list
> postgi...@lists.osgeo.org <mailto:postgi...@lists.osgeo.org>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-Steve W
'^( *Building *[0-9]+)?[- 0-9]*'
or something like that should do it. But I think you will find that a
more robust solution is to use parse_address() and/or
standardize_address() as they will recognize a lot of other address
constructs, like "apt 3a" for for example.
parse_address() that a text field and breaks it into "house number
street name" and "city state zip", but only works well in North America.
standardize_address() that comes with postGIS, breaks the address down
into its components and can separate out things like buildings, and
apartment/unit specifiers so you can then take the fields you are
interested in and recombine just them in a new string. Again, this works
best in North America.
My github address-standardizer is built to recognize address for most
counties, but it can also be configured to recognize address standards
for any county without too much effort. It compiles and installs as
postgresql extension.
Addresses are generally very messy and unless your addresses are vary
simple you will be constantly fighting with this or that exception.
-Steve
If you already have postGIS installed then
create extension address_standardizer;
# \df parse_address
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
public | parse_address | record | text, OUT num text, OUT
street text, OUT street2 text, OUT address1 text, OUT city text, OUT
state text, OUT zip text, OUT zipplus text, OUT country text | normal
(1 row)
# select * from parse_address('123-2 main street city ny');
num | street | street2 | address1 | city | state | zip
| zipplus | country
-------+-------------+---------+-------------------+------+-------+-----+---------+---------
123-2 | main street | | 123-2 main street | city | NY |
| | US
(1 row)
# \df standardize*
List of functions
Schema | Name | Result data type |
Argument data types | Type
--------+---------------------+------------------+---------------------------------------------------------------+--------
public | standardize_address | stdaddr | lextab text, gaztab
text, rultab text, address text | normal
public | standardize_address | stdaddr | lextab text, gaztab
text, rultab text, micro text, macro text | normal
You need tables for the lexicon, gazetteer, and rules, which should be
included in the extension but I'm not seeing them. So you can grab these
from:
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
and load them like:
psql mydb -f us-gaz.sql
psql mydb -f us-lex.sql
psql mydb -f us-rules.sql
# select * from standardize_address('lex', 'gaz', 'rules', '123-2 main
street city ny');
building | house_num | predir | qual | pretype | name | suftype |
sufdir | ruralroute | extra | city | state | country | postcode | box
| unit
----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
| 123 | | | | 2 MAIN | STREET
| | | | CITY | NEW YORK | USA | | |
(1 row)
This is a good example of why parsing addresses is so difficult. The
rules for standardize_address do not account for a house number like
"123-2", but the regexp in parse_address do handle it. It is easy to get
the 80% right and very hard to get it much above that.
-Steve
>
> I need to find these first before test-running.
>
> Regards,
>
> Shao
>
> On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
> <stephenwo...@gmail.com <mailto:stephenwo...@gmail.com>>
> > <mailto:shisha...@gmail.com <mailto:shisha...@gmail.com>>>
> wrote:
> >
> > I find this is a simple, but important question.
> >
> > How best to split numbers and the rest of address?
> >
> > For instance, one tricky one is as follows:
> >
> > 21-1 Great Avenue, a city, a country, this planet
> >
> > How to turn this into the following:
> >
> > column 1, column 2
> >
> > 21-1 Great Avenue, a city, a country, this planet
> >
> > Note: there is a hyphen in 21-1
> >
> > Any clue?
> >
> > Regards,
> >
> > Shao
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgi...@lists.osgeo.org <mailto:postgi...@lists.osgeo.org>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> postgi...@lists.osgeo.org <mailto:postgi...@lists.osgeo.org>
'^( *(Buildings?|Bldg|Room) *[0-9]+)?[- 0-9]*[a-zA-Z]? '
> 2. Once extensions created, can these functions be adapted? Are
> codes available ? I will see whether to put it into a project, so
> that our programmers can have work to do.
The code for these in in the PostGIS repository, but if you are serious
about making changes to them, then I recommend working with
https://github.com/woodbri/address-standardizer because I rewrote these
function to be easier to modify. The existing code is very hard to
understand and very difficult to make changes to without breaking
things. My new code is hopefully well documented in the repository and
should be straight forward to work with. FYI, I have never built/tested
with PostgreSQL 12+ so it might need some changes to support that, but
it should be good for 9, 10 and 11 versions.
-Steve
>
> Regards,
>
> Shao
>
> On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> <stephenwo...@gmail.com <mailto:stephenwo...@gmail.com>>
> <mailto:stephenwo...@gmail.com
> <mailto:postgi...@lists.osgeo.org
> <mailto:postgi...@lists.osgeo.org>>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgi...@lists.osgeo.org
> <mailto:postgi...@lists.osgeo.org>
> <mailto:postgi...@lists.osgeo.org
I just remembered the lex, gaz, and rules data is in a separate
extension. The correct way to install it is with:
create extension address_standardizer_data_us;
-Steve
On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> Hi, Steve,
>
> Thanks.
>
> 2 questions.
>
> 1. How can we remove things like Room 2a, Buildings 2-6b and etc with
> regexp replace?
> 2. Once extensions created, can these functions be adapted? Are
> codes available ? I will see whether to put it into a project, so
> that our programmers can have work to do.
>
> Regards,
>
> Shao
>
> On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> <stephenwo...@gmail.com <mailto:stephenwo...@gmail.com>>
> <mailto:stephenwo...@gmail.com
> <mailto:postgi...@lists.osgeo.org
> <mailto:postgi...@lists.osgeo.org>>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgi...@lists.osgeo.org
> <mailto:postgi...@lists.osgeo.org>
> <mailto:postgi...@lists.osgeo.org
As I tried to indicate, the postGIS address_standardizer extension is
extremely hard to understand and nearly impossible to customize the
files. I migrated the PAGC code into this extension but I barely
understood the code. Conceptually, the ideas are straight forward, but
the code is very hard to follow and understand so I portted it as a
black box.
If the exisiting address_standardizer_data_us does not work for you or
you need to support other countries, you will need to build and install
the one here:
https://github.com/woodbri/address-standardizer
This was written as a replacement
This one has sample files for 25 countries here:
https://github.com/woodbri/address-standardizer/tree/develop/data/sample
that you can customize and has (hopefully) good documentation here:
https://github.com/woodbri/address-standardizer/blob/develop/README.md
https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
https://github.com/woodbri/address-standardizer/tree/develop/data
Also, this code is written in C++ and I hope it is written to be easier
to understand and review.
Not sure I can help much more than this unless you have questions on
https://github.com/woodbri/address-standardizer which we should probably
take off the PostGIS list as this is not part of PostGIS.
-Steve
>
> Regards,
>
> Shao
>
> On Sun, 26 Apr 2020 at 17:23, Stephen Woodbridge
> <stephenwo...@gmail.com <mailto:stephenwo...@gmail.com>>
> wrote:
>
> Shao,
>
> I just remembered the lex, gaz, and rules data is in a separate
> extension. The correct way to install it is with:
>
> create extension address_standardizer_data_us;
>
> -Steve
>
> On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> > Hi, Steve,
> >
> > Thanks.
> >
> > 2 questions.
> >
> > 1. How can we remove things like Room 2a, Buildings 2-6b and
> etc with
> > regexp replace?
> > 2. Once extensions created, can these functions be adapted? Are
> > codes available ? I will see whether to put it into a
> project, so
> > that our programmers can have work to do.
> >
> > Regards,
> >
> > Shao
> >
> > On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> > <stephenwo...@gmail.com
> <mailto:stephenwo...@gmail.com>
On Jan 9, 2021, at 4:42 AM, Darafei Komяpa Praliaskouski <m...@komzpa.net> wrote:
On Jan 9, 2021, at 10:00 AM, Stephen Woodbridge <stephenwo...@gmail.com> wrote:
David,
This is the link to the address standardizer:
https://github.com/woodbri/address-standardizer
This is a link to all my code that I developed consulting. It includes a
few SQL geocoders based on the code above. And has some README files
discussing how to build a geocoder which is the basis for how the
geocoders work.
https://github.com/woodbri/imaptools.com
this is the geocoder for Tiger data, but the code is essentially the
same for every country because the when you load country specific data
into the database it goes into its own table and then you standardize
that data into stdstreets table and all queries are done against the
stdstreets table and you only have to tweak the address range
interpolation function which needs to access the source streets table
for the geometry and house number ranges.
https://github.com/woodbri/imaptools.com/blob/master/sql-scripts/geocoder/prep-tiger-geo-new.sql
I would approach this by:
1. get the address standardizer compiled and installed. I can help if
you run into problems or have questions.
2. load your UK street data into rawdata schema, ideally it would be
best if we can create a table/view that presents this data as a single
table where each record represents one side of the street and one
jurisdiction this may mean that a single record in your source data will
generate multiple records in this table/view (this greatly simplifies
the coding and performance later)
3. look at the prep-tiger-geo-new.sql file
4. create a stdstreets table and standardize your table/view data into it
5. look at standardization failures and adjust lexicon and grammar as needed
6. loop back to 4 until good enough
7. load functions from prep-tiger-geo-new.sql file and adjust any for
your data
8. try it out!
-Steve
On 1/9/2021 10:22 AM, Shaozhong SHI wrote:
> Hi, Stephen,
>
> Many thanks. We are interested in it is working with the UK addresses.
>
> Please send me the link to this.
>
> Regards,
>
> David
>
> On Sat, 9 Jan 2021 at 15:00, Stephen Woodbridge
> <stephenwo...@gmail.com <mailto:stephenwo...@gmail.com>>
> wrote:
>
> David,
>
> Yup and this is just one a dozens of cases that you have to deal
> with. You are dealing with a natural language processing problem.
> And you have to deal with human input that has typos and
> abbreviations.
>
> These issues are what the address standardizer fixes. It tokenized
> the address and uses the gazette to standardize the terms and then
> classifies each term and assigns it to part of the address based
> on a grammar.
>
> So there is a simple solution, use my address standardizer, it is
> free, MIT license, it has a sample lexicon/ gazette and grammar
> for the UK, it is easy to modify these to fit your needs, and it
> just works. Oh if you want to do another county it also has sample
> files for 25 countries.
>
> Sent from my iPhone
>
>> On Jan 9, 2021, at 4:42 AM, Darafei Komяpa Praliaskouski
>> <m...@komzpa.net <mailto:m...@komzpa.net>> wrote:
>>
>>
>> Hello,
>>
>> People make neural networks for this kind of task:
>>
>> https://github.com/openvenues/libpostal
>> <https://github.com/openvenues/libpostal>
>>
>> сб, 9 сту 2021, 12:40 карыстальнік Shaozhong SHI
>> <shisha...@gmail.com <mailto:shisha...@gmail.com>> напісаў:
>>
>> Hi, Steve W,
>>
>> it is easy to parse addresses as tokens. But it is difficult
>> to put tokens in right columns, due to that the same address
>> could be expressed with partial address or full address.
>>
>> The same address can be written like, Flat 1 122 Great Avenue
>> London UK, or Flat 1 122 Greet Avenue Central London London
>> United Kingdom.
>>
>> When this happens, each address has different number of
>> tokens, so different numbers of tokens. Is there a way to
>> deal with this issue so that each token can get into right
>> column?
>>
>> Please enlighten me.
>>
>> Regards,
>>
>> David
>>
>> On Sat, 25 Apr 2020 at 05:09, Stephen Woodbridge
>> <stephenwo...@gmail.com
>> <mailto:stephenwo...@gmail.com>> wrote:
>>
>> And I have create an address-standardizer project here
>> https://github.com/woodbri/address-standardizer
>> <https://github.com/woodbri/address-standardizer> which
>> is user
>> configurable. I might be over kill is you just want to
>> strip off the
>> number, in which case you might just use a SQL regexp
>> replace to remove it.
>>
>> -Steve W
>>
>> On 4/25/2020 12:04 AM, Stephen Woodbridge wrote:
>> > PostGIS has address_standardizer extension that includes
>> > parse_address() and standardize_address() functions.
>> >
>> > -Steve W
>> >
>> > On 4/24/2020 9:54 PM, Imre Samu wrote:
>> >> > handle addresses in postgresql
>> >>
>> >> maybe you can use the
>> https://github.com/openvenues/libpostal
>> <https://github.com/openvenues/libpostal> library
>> >> with your favorite language bindings ( Python / Ruby /
>> Go / PHP /
>> >> Node / R / Java ...)
>> >>
>> >> or as a Postgres database extension:
>> >>
>> https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal
>> <https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal>
>>
>> >>
>> >> https://github.com/pramsey/pgsql-postal
>> <https://github.com/pramsey/pgsql-postal>
>> >>
>> >> Regards,
>> >> Imre
>> >>
>> >>
>> >>
>> >>
>> >> Shaozhong SHI <shisha...@gmail.com
>> <mailto:shisha...@gmail.com>
>> >> <mailto:shisha...@gmail.com
>> <mailto:shisha...@gmail.com>>> ezt írta (időpont:
>> 2020. ápr. 25.,
>> >> Szo, 2:49):
>> >>
>> >> I find this is a simple, but important question.
>> >>
>> >> How best to split numbers and the rest of address?
>> >>
>> >> For instance, one tricky one is as follows:
>> >>
>> >> 21-1 Great Avenue, a city, a country, this planet
>> >>
>> >> How to turn this into the following:
>> >>
>> >> column 1, column 2
>> >>
>> >> 21-1 Great Avenue, a city, a
>> country, this planet
>> >>
>> >> Note: there is a hyphen in 21-1
>> >>
>> >> Any clue?
>> >>
>> >> Regards,
>> >>
>> >> Shao
>> >> _______________________________________________
>> >> postgis-users mailing list
>> >> postgi...@lists.osgeo.org
>> <mailto:postgi...@lists.osgeo.org>
>> <mailto:postgi...@lists.osgeo.org
>> <mailto:postgi...@lists.osgeo.org>>
>> >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >>
>> >>
>> >> _______________________________________________
>> >> postgis-users mailing list
>> >> postgi...@lists.osgeo.org
>> <mailto:postgi...@lists.osgeo.org>
>> >> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>> >
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgi...@lists.osgeo.org
>> <mailto:postgi...@lists.osgeo.org>
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgi...@lists.osgeo.org
>> <mailto:postgi...@lists.osgeo.org>
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgi...@lists.osgeo.org <mailto:postgi...@lists.osgeo.org>
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>> <https://lists.osgeo.org/mailman/listinfo/postgis-users>
> _______________________________________________
> postgis-users mailing list
> postgi...@lists.osgeo.org <mailto:postgi...@lists.osgeo.org>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
David,
This is the link to the address standardizer:
https://github.com/woodbri/address-standardizer
This is a link to all my code that I developed consulting. It includes a few SQL geocoders based on the code above. And has some README files discussing how to build a geocoder which is the basis for how the geocoders work.
https://github.com/woodbri/imaptools.com
this is the geocoder for Tiger data, but the code is essentially the same for every country because the when you load country specific data into the database it goes into its own table and then you standardize that data into stdstreets table and all queries are done against the stdstreets table and you only have to tweak the address range interpolation function which needs to access the source streets table for the geometry and house number ranges.
https://github.com/woodbri/imaptools.com/blob/master/sql-scripts/geocoder/prep-tiger-geo-new.sql
I would approach this by:
1. get the address standardizer compiled and installed. I can help if you run into problems or have questions.
2. load your UK street data into rawdata schema, ideally it would be best if we can create a table/view that presents this data as a single table where each record represents one side of the street and one jurisdiction this may mean that a single record in your source data will generate multiple records in this table/view (this greatly simplifies the coding and performance later)
3. look at the prep-tiger-geo-new.sql file
4. create a stdstreets table and standardize your table/view data into it
5. look at standardization failures and adjust lexicon and grammar as needed
6. loop back to 4 until good enough
7. load functions from prep-tiger-geo-new.sql file and adjust any for your data
8. try it out!
-Steve
On 1/9/2021 10:22 AM, Shaozhong SHI wrote:
Hi, Stephen,
Many thanks. We are interested in it is working with the UK addresses.
Please send me the link to this.
Regards,
David
@building @house @street @city @prov @country @postal
@house @street @city @prov @country @postal
@house @street @city @prov @postal
Here is an example of setting up 3 variations of an address definition
and the tokens will get compared to all of them with the best match
sorting to the top of the list.
"@name" is a meta definition that can point to an explicit rules set or
another meta definition
If @build is not important then you can assign the related tokens to
standard field that you ignore in the queries. If it is important then
you assign it to an appropriate field in the standardized address table
for this query.
Here is the sample grammar for great britian (ignore the fact it say
Germany, copy and past error)
https://github.com/woodbri/address-standardizer/blob/develop/data/sample/greatbritain.gmr
And here is the sample lexicon used to classify and standardize tokens
for great britian.
https://github.com/woodbri/address-standardizer/blob/develop/data/sample/greatbritain.lex
These can be edited as required.
Be aware that by default most tokens are classified as WORD, which is
ok, but the more specific classification the better it does to
accurately assign tokens to the correct grammar terms.
Also, this thread is a little off topic for postgis, so unless others
are interested in following this we should not continue on the list. So
if you want to use it, then start with the list of steps I posted
previously and read my docs. Geocoding is not easy as I think you have
already seen. I'm willing to help you but you need to get this built on
your system so we can talk about concrete issues and steps you are
having in implementing it rather than potential problems around
geocoding which are many, and many(most?) of them have been dealt in the
code.
Also be aware that geocoding will never be 100% because it is language
processing problem, but I have been able to get 95+% of a reference data
set to be recognized and matched correctly with multiple different data
sets.
-Steve
On 1/10/2021 9:57 AM, Shaozhong SHI wrote:
> Hi, Steve,
>
> Another solution appeals me most is as follows:
>
> Given a space delimited full address line, we can parse it to correct
> BS7666 format.
>
> Something like house number, street, area, city, postcode
>
> E.g., Nest 2 my nesting place 1B Great Avenue Forest Park London WS22 5TT
> Can you enlighten me about that?
>
> Regards
> ,
> David
>
> On Saturday, 9 January 2021, Stephen Woodbridge
> <stephenwo...@gmail.com <mailto:stephenwo...@gmail.com>>
> <mailto:stephenwo...@gmail.com>
> <mailto:stephenwo...@gmail.com
> <mailto:m...@komzpa.net <mailto:m...@komzpa.net>>> wrote:
>
>
> Hello,
>
> People make neural networks for this kind of task:
>
> https://github.com/openvenues/libpostal
> <https://github.com/openvenues/libpostal>
> <https://github.com/openvenues/libpostal
> <https://github.com/openvenues/libpostal>>
>
> сб, 9 сту 2021, 12:40 карыстальнік Shaozhong SHI
> <shisha...@gmail.com
> <mailto:shisha...@gmail.com>
> <mailto:shisha...@gmail.com
> <mailto:stephenwo...@gmail.com>
> <mailto:stephenwo...@gmail.com