[postgis-users] Geocoders - General thoughts and questions


Stephen Woodbridge

Dec 9, 2012, 10:08:26 AM
to PostGIS Users Discussion
Hi Regina, and others interested in geocoders,

There has been a more general OSGeo Labs discussion that has included
OpenGeocoder. OpenGeocoder was a list I started a while back that never
grew into a project. I want to have a more focused discussion here.

Some background:

Ten years ago I wrote a geocoder that used the old Tiger/Line data as
input. Back then I converted Tiger/Line to shapefiles and wrote a
cgi-bin geocoder in C that searched the shapefiles. It worked very well
and typically returned results in 0.5-2.0 sec, which was pretty good
considering the hardware back then. The problem with it was that a lot
of the code I wrote implemented indexing and searching that a database
does far better. I abandoned that product when I decided it was too
hard to morph the Tiger shapefiles into the data structures I used back
then, and the code was too tightly structured around those old data
structures.

In the interim, I supported and enhanced a proprietary geocoder written
in Perl and eventually helped migrate these legacy tools over to Oracle
Spatial so I got some experience with that also.

Fast forward to the last few years, when I was looking for an
alternative to my old code and started looking at the postgis geocoder,
the PAGC geocoder, and a few others. At that time the Tiger geocoder
looked like what I wanted but was a little too oriented to Tiger data
and a little rough around the edges. I think Regina and others have
made significant progress in smoothing that out and improving
performance, but I have not tried it recently. PAGC had some very good
technology internally for fuzzy-match address standardization, result
scoring, etc., but when I tried to scale it up from a few counties or a
state to a national data set it ran into serious performance problems.
I have been working over the past year with the developer to fix these
issues, and it has become apparent that the code needs to be refactored
and modularized. Similar to the geocoder that I wrote, much of the PAGC
code is dedicated to doing database-like things.

So what does this have to do with postgis geocoding?

If we break down geocoding into a series of process steps, it looks
something like this:

Loading data:
1. prep and load data for a reference set
2. standardize the reference data
3. build indexes and supporting infrastructure as required

Geocoding a request:
1. parse request and standardize it
2. generate a query plan to match the standardized request to the
standardized reference set; this might include fuzzy searching
3. execute the query plan
4. score the results and order them best to worst
5. compute the location for each result

So from this there appear to be the following subprocesses that are not
handled directly by the database:

a. parse the request, for example a complete address in a single field
b. standardize an address or reference record
c. dynamically build a query plan based on the standardized request
d. score the results
e. compute the resultant location
f. prep and load the data initially
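
To make the split concrete, here is a minimal sketch of what the
request path (steps 1-5 above) could look like as one plpgsql function.
The helpers standardize_request(), match_score() and
interpolate_location(), and the std_reference table, are hypothetical
stand-ins for the pieces listed in a-f:

    create or replace function geocode_sketch(addr text)
    returns table(cand_score float8, cand_loc geometry) as $$
    declare
        std record;
    begin
        -- steps a/b: parse and standardize the request (C wrapper)
        select * into std from standardize_request(addr);
        -- steps c/d/e: match against the standardized reference set,
        -- score each candidate, and interpolate its location
        return query
            select match_score(r.name, r.city, std.name, std.city),
                   interpolate_location(r.gid, std.housenum)
            from std_reference r
            where r.name = std.name and r.city = std.city
            order by 1 desc
            limit 10;
    end;
    $$ language plpgsql;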

Using this model I was able to write a new geocoder very quickly:

1. I extracted the address standardizer code from PAGC and created a
simple API and library. This code is checked into a branch in PAGC svn.
2. I wrote a simple postgresql wrapper for the standardizer, not yet
checked into svn.
3. I wrote a similar postgresql wrapper for the single line parser.
4. I wrote the rest of the geocoder (c, d, and e above) in plpgsql.
5. For step f, I reused a process that I developed for loading Tiger
shapefiles into PAGC, only loading the records into a postgis table
instead.

Most of my effort was in extracting the standardizer code and creating
a library. This is still pretty rough code, as it was more of a
prototype to see if I could do it. Step 2 still has a problem that I
need some help on, but I have a temporary workaround. Step 3 was a
trivial afterthought. And step 4 took a few days to work out, code, and
test.

This prototype was able to geocode a table of about 216K addresses in
St Paul, MN at an average of 44 ms each. Obviously there was a lot of
caching happening because all the queries were focused in one fairly
small area, but the Tiger data is represented in 50+ million records
and cold cache hits are typically 100-200 ms. This varies based on
whether or not fuzzy searches are done. This is on a 3-4 year old
server with 4GB of memory and SATA1 disks in software RAID, running
PostgreSQL 8.3.

I also spent a day adding intersection query support, and have spent
some time writing a PHP wrapper to expose it as a webservice. I plan to
add support for geocoding to parcel or site addresses, i.e. where you
are not interpolating an address along a segment.

So my thoughts related to the Tiger geocoder are that some of this
might be usable there. Also, there is nothing Tiger specific other than
the initial loading of the data and any preprocessing that might need
to be done. The standardizer is table driven: it loads a lexicon,
gazetteer, and the address parsing rules from tables. So if the data
changes or the locale changes, in theory you only need to change these
tables. In reality, more work needs to be done on the address parser so
that it will support UTF8 for internationalization, but I have ideas
for that.
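
As a flavor of what drives it, the lexicon and gazetteer rows look
something like this (illustrative values only; the real token codes are
PAGC-internal):

    -- lexicon: input word -> standardized word, plus a token type
    insert into lex (id, seq, word, stdword, token)
        values (1, 1, 'AVE', 'AVENUE', 2);
    -- gazetteer: place and state names work the same way
    insert into gaz (id, seq, word, stdword, token)
        values (1, 1, 'MASS', 'MASSACHUSETTS', 0);

Swapping in a different lexicon, gazetteer, and rule set is, in
principle, all it takes to retarget another vendor or locale.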

So if this is of interest, I would be happy to answer questions, compare
results with the Tiger geocoder, etc.

Does the Tiger geocoder have a set of test addresses that could be used
for comparison? Or do any of the users have one? I would be happy to
invest some effort into loading a test set of addresses, especially if
they have geocodes already so I can compare results.

Thoughts?
-Steve W

Paragon Corporation

Dec 9, 2012, 1:59:03 PM
to PostGIS Users Discussion
Steve,


> Does the Tiger geocoder have a set of test addresses that could be
> used for comparison?

As far as test addresses, in the PostGIS code base trunk there is an
extras\tiger_geocoder\tiger_2011\regress folder that has some
normalizer, geocode, and reverse_geocode tests we run. These are mostly
cases from tickets that people complained about and that we fixed.



> Loading data:
> 1. prep and load data for a reference set

The PostGIS tiger geocoder does this for the Tiger set; 2.1 can handle
the Tiger 2011/2012 load as well via a table-driven pseudo-rule
approach, doing the wget download, the unzip, and the shp2pgsql load.

It also abstracts away Linux vs. Windows, all in SQL. Basically it uses
SQL to generate a Linux/Unix sh script or a Windows batch script.
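
For example, generating the load script for one state looks roughly
like this (2.1-era usage; exact arguments may vary by version):

    SELECT Loader_Generate_Script(ARRAY['MA'], 'sh');       -- sh script
    SELECT Loader_Generate_Script(ARRAY['MA'], 'windows');  -- batch script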

Does yours currently do that? Just wondering. When we took on
maintenance of the tiger geocoder, the main stickler was that Windows
users were in essence forgotten, so it was no good for them; and since
our main client supporting our work was on Windows, it was REALLY
important to make it easy for Windows users to load data without much
thinking. But we also needed to make sure the process was easy for
Linux users.



> 2. standardize the reference data

I think George Silva was interested in this too. He and, I think, David
Bitner were working on something similar.
I am a bit bothered that postgis tiger is so tied to the TIGER
structure and hope to change that, but speed and normalizing are the
highest priorities at the moment.



> Geocoding a request:
> 1. parse request and standardize it

This I think is one area where we can probably share a good chunk of
effort. I think it's the hardest part and consumes most of our time.
What we have for normalizing is not modular and, as Leo would say, is a
huge mess and eyesore that he badly wants to modularize.

When we took it over, I discovered that a good chunk of the time was
being wasted normalizing: as much as 200 ms, and sometimes a minute.
For most cases I think we reduced this to 4-20 ms by getting rid of
some of the regex checks in the WHERE conditions so that normalizing
could use indexes.
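
A reduced example of the kind of change that made the difference
(hypothetical lookup; the real queries are more involved):

    -- a regex in the WHERE clause forces a sequential scan:
    SELECT stdword FROM lex WHERE word ~* '^(ave|av|avenue)$';

    -- an exact match on an indexed expression can use the index:
    CREATE INDEX idx_lex_word ON lex (upper(word));
    SELECT stdword FROM lex WHERE upper(word) = 'AVE';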

I would be interested in seeing what the performance is like if we can
swap that out with your C extension, and maybe we can package that
along with the postgis geocoder.



> 2. generate a query plan to match the standardized request to the
> standardized reference set; this might include fuzzy searching

I think the inheritance model we have, plus building an ad hoc query,
works fairly well. I haven't loaded the whole US in a while since I'm
usually only interested in one state like MA or PA, but I suspect it
scales well. For 1-8 states loaded (with CA, TX, and AZ among them),
even on a low-end Windows desktop we were getting around 30-100 ms warm
and 200 ms cold. The cloud offerings we usually run on have varying
disk speeds, so they are hard to benchmark.

I suspect that since we do a state hop first, performance should be
even better on 9.2 and 9.3, since improvements to inheritance planning
have been made in those.
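
The shape of it, reduced to a sketch (the real tiger tables have many
more columns):

    CREATE TABLE featnames (gid serial, statefp char(2), name text);
    CREATE TABLE featnames_ma (CHECK (statefp = '25'))
        INHERITS (featnames);
    CREATE TABLE featnames_pa (CHECK (statefp = '42'))
        INHERITS (featnames);

    -- with constraint_exclusion on, the state hop lets the planner
    -- skip every child table except the state being searched
    SELECT * FROM featnames
    WHERE statefp = '25' AND name = 'COMMONWEALTH';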



> d. score the results

The Tiger geocoder currently does hard-coded scoring. I forget the
order of preference. I think one thing to consider is to have a
flexible scoring system. This is something several people have asked
for, e.g. if I know I'm dealing with LA data, penalize anything not in
a specific region.




> Most of my effort was in extracting the standardizer code and
> creating a library. This is still pretty rough code, as it was more
> of a prototype to see if I could do it.

Yep, we'd love to see this piece and see if we can reuse it.

> This prototype was able to geocode a table of about 216K addresses in
> St Paul, MN at an average of 44 ms each. Obviously there was a lot of
> caching happening because all the queries were focused in one fairly
> small area, but the Tiger data is represented in 50+ million records
> and cold cache hits are typically 100-200 ms. This varies based on
> whether or not fuzzy searches are done. This is on a 3-4 year old
> server with 4GB of memory and SATA1 disks in software RAID, running
> PostgreSQL 8.3.

Again, we would love to see benchmarks. We haven't been benchmarking
much on Linux since we service mostly Windows PostGIS users. From the
short tests we've done, I think Linux is generally a bit faster (but
not by much) with the same number of cores, but we weren't able to do
an apples-to-apples test on the exact server config, disks, etc. to say
one way or the other.

I think the timings you are getting sound similar to ours, but given
that we haven't loaded the whole Tiger set, and that we often run on
cloud servers which generally have worse speed than physical disks, we
can't be sure.



> So my thoughts related to the Tiger geocoder are that some of this
> might be usable there. Also, there is nothing Tiger specific other
> than the initial loading of the data and any preprocessing that might
> need to be done. The standardizer is table driven: it loads a
> lexicon, gazetteer, and the address parsing rules from tables. So if
> the data changes or the locale changes, in theory you only need to
> change these tables. In reality, more work needs to be done on the
> address parser so that it will support UTF8 for internationalization,
> but I have ideas for that.

Definitely, and again we would like to focus on building something that
is more internationally useful. We like the basic philosophy of the
postgis tiger geocoder, in that it really only requires a machine
running PostGIS (no external dependencies like PHP etc. needed) and you
can do geocoding without leaving the database.


Thanks,
Regina and Leo
http://www.postgis.us

Stephen Woodbridge

Dec 9, 2012, 9:23:57 PM
to postgi...@lists.osgeo.org
On 12/9/2012 1:59 PM, Paragon Corporation wrote:
> Steve,
>
>
>> Does the Tiger geocoder have a set of test addresses that could be
>> used for comparison?
>
> As far as test addresses, in the PostGIS code base trunk there is an
> extras\tiger_geocoder\tiger_2011\regress folder that has some
> normalizer, geocode, and reverse_geocode tests we run. These are
> mostly cases from tickets that people complained about and that we
> fixed.

If you want a large set of geocoded addresses like the one I ran, I can
check whether we can check it into the project and use it for testing.

In the past I have just used a large set of addresses that I geocode
and save; then I repeat that after code changes and compare the
results, including the distances from the previous run. This can be
used as a sanity check for performance and for changes from a baseline.
Obviously you have to account for the fact that the reference set
changes over time, but spot-checking differences is usually adequate to
determine if something is really wrong.
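
A sketch of that comparison, assuming each run is saved to a table of
(id, geom) pairs:

    -- addresses that moved a long way since the baseline are suspects
    SELECT n.id, ST_Distance(b.geom, n.geom) AS moved
    FROM baseline_run b
    JOIN new_run n USING (id)
    WHERE ST_Distance(b.geom, n.geom) > 100  -- units of the projection
    ORDER BY moved DESC;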

>> Loading data:
>> 1. prep and load data for a reference set
>
> The PostGIS tiger geocoder does this for the Tiger set; 2.1 can
> handle the Tiger 2011/2012 load as well via a table-driven
> pseudo-rule approach, doing the wget download, the unzip, and the
> shp2pgsql load.
>
> It also abstracts away Linux vs. Windows, all in SQL. Basically it
> uses SQL to generate a Linux/Unix sh script or a Windows batch
> script.
>
> Does yours currently do that? Just wondering.

No, I have Perl and C code that I use to process the Tiger shapefiles
into a standard table structure. I can then use that standard table
structure for loading PAGC, my reverse geocoder, or this new geocoder.
I have also written another variant of this that maps Navteq data into
the same table structure. The C code actually uses a SQLite database
and a lot of SQL during the process. So it reads the raw vendor data
files and creates new shapefiles using my standard table structure,
which looks like this:

Filename:       /u/srcdata/tiger2012-pagc/01/001/Streets.dbf
Version:        0x03 (ver. 3)
Num of records: 11837
Header length:  737
Record length:  547
Last change:    1995/7/26
Num fields:     22
Field info:
 Num  Name        Type  Len  Decimal
  1.  TLID        N     10   0    -- Tiger TLID
  2.  REFADDR     N     10   0
  3.  NREFADDR    N     10   0
  4.  SQLNUMF     C     12   0    -- house num format string
  5.  SQLFMT      C     16   0    -- house num format string
  6.  CFORMAT     C     16   0    -- house num format string
  7.  NAME        C     80   0
  8.  PREDIRABRV  C     15   0
  9.  PRETYPABRV  C     50   0
 10.  PREQUALABR  C     15   0
 11.  SUFDIRABRV  C     15   0
 12.  SUFTYPABRV  C     50   0
 13.  SUFQUALABR  C     15   0
 14.  SIDE        N      1   0    -- 1=right, 2=left
 15.  TFID        N     10   0    -- Tiger TFID
 16.  USPS        C     35   0    -- USPS preferred city name
 17.  AC5         C     35   0    -- place || cousub
 18.  AC4         C     35   0    -- cousub
 19.  AC3         C     35   0    -- county
 20.  AC2         C     35   0    -- state
 21.  AC1         C     35   0    -- country
 22.  POSTCODE    C     11   0    -- Tiger Zipcode


> When we took on maintenance of the tiger geocoder, the main stickler
> was that Windows users were in essence forgotten, so it was no good
> for them; and since our main client supporting our work was on
> Windows, it was REALLY important to make it easy for Windows users to
> load data without much thinking. But we also needed to make sure the
> process was easy for Linux users.

Right, this makes perfect sense. For most of my clients I load and prep
the data and create a dump for them, and that is trivial to load on
Windows or Linux.

>> 2. standardize the reference data
>
> I think George Silva was interested in this too. He and, I think,
> David Bitner were working on something similar.
> I am a bit bothered that postgis tiger is so tied to the TIGER
> structure and hope to change that, but speed and normalizing are the
> highest priorities at the moment.

One of the things that I did in this most recent geocoder was to load
my shapefiles into a table, then standardize them into a separate table
linked by the gid field. This table tends to be smaller and does not
include the geometry column. This separate standardized table is where
all the queries are done, so it has the structure of the standardizer
output and not of the source data. You only access the source data when
you score the results and compute the location.
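
In outline, with made-up column names, the split looks like:

    -- the raw table "streets" keeps the geometry:
    --   gid, name, predirabrv, ..., the_geom
    -- the standardized side table carries only what the queries need
    CREATE TABLE std_streets (
        gid    integer PRIMARY KEY REFERENCES streets (gid),
        name   text,
        predir text,
        suftyp text,
        city   text,
        state  text
    );
    CREATE INDEX idx_std_streets_name ON std_streets (name, city);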

>> Geocoding a request:
>> 1. parse request and standardize it
>
> This I think is one area where we can probably share a good chunk of
> effort. I think it's the hardest part and consumes most of our time.
> What we have for normalizing is not modular and, as Leo would say, is
> a huge mess and eyesore that he badly wants to modularize.
>
> When we took it over, I discovered that a good chunk of the time was
> being wasted normalizing: as much as 200 ms, and sometimes a minute.
> For most cases I think we reduced this to 4-20 ms by getting rid of
> some of the regex checks in the WHERE conditions so that normalizing
> could use indexes.
>
> I would be interested in seeing what the performance is like if we
> can swap that out with your C extension, and maybe we can package
> that along with the postgis geocoder.

This was my epiphany, which led to building the standardizer API using
the PAGC code. This code has some issues, but I can probably get it
checked into svn so you can play with it. It currently has the
following signature:

select * from standardize_address(
    'select seq, word::text, stdword::text, token from lex order by id',
    'select seq, word::text, stdword::text, token from gaz order by id',
    'select * from rules order by id',
    'select 0::int4 as id,
        ''116A commonwealth ave''::text as micro,
        ''west concord, ma 01742''::text as macro');

The first three arguments are SQL queries used to load the lexicon,
gazetteer, and parser rules. The 4th argument is a query to retrieve
the addresses you want to standardize, in three fields: id, micro,
macro. We do not want to use the preparsed fields from Tiger because we
want the standardizer code to decide how to break the fields into
components. So even if it parses them incorrectly, we want it to parse
consistently on both the reference set and the geocode request.

One problem I have with this wrapper is that it uses up all the memory
if I try to standardize 50M records. It is not a leak; I just have not
figured out how to handle something like a cursor over successive calls
in a SRF, or how to properly chunk the rows. So I currently call it via
Perl and standardize 1M rows at a whack, selecting them with gid
between A and B.
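
The driver loop is trivial; the same chunking could be done in the
database with something like this sketch (DO-block syntax is 9.0+, and
the lex/gaz/rules arguments are the queries shown above):

    DO $$
    DECLARE
        lo integer := 1;
        hi integer;
    BEGIN
        SELECT max(gid) INTO hi FROM streets_raw;
        WHILE lo <= hi LOOP
            INSERT INTO std_streets
            SELECT * FROM standardize_address(
                'select seq, word::text, stdword::text, token
                 from lex order by id',
                'select seq, word::text, stdword::text, token
                 from gaz order by id',
                'select * from rules order by id',
                'select gid as id, micro, macro from streets_raw '
                    || 'where gid between ' || lo
                    || ' and ' || (lo + 999999));
            lo := lo + 1000000;
        END LOOP;
    END $$;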

Also, there might be room for improvement if we could compile the lex,
gaz, and rules into a blob and retrieve that, rather than
reconstructing them on each call.

>> 2. generate a query plan to match the standardized request to the
>> standardized reference set; this might include fuzzy searching
>
> I think the inheritance model we have, plus building an ad hoc query,
> works fairly well. I haven't loaded the whole US in a while since I'm
> usually only interested in one state like MA or PA, but I suspect it
> scales well. For 1-8 states loaded (with CA, TX, and AZ among them),
> even on a low-end Windows desktop we were getting around 30-100 ms
> warm and 200 ms cold. The cloud offerings we usually run on have
> varying disk speeds, so they are hard to benchmark.
>
> I suspect that since we do a state hop first, performance should be
> even better on 9.2 and 9.3, since improvements to inheritance
> planning have been made in those.

Yeah, my queries and table structure do not have this optimization
built into them, and it would probably help for some queries.

>> d. score the results
>
> The Tiger geocoder currently does hard-coded scoring. I forget the
> order of preference. I think one thing to consider is to have a
> flexible scoring system. This is something several people have asked
> for, e.g. if I know I'm dealing with LA data, penalize anything not
> in a specific region.

I have thought about creating a metadata table for my geocoder that
would act like a config file. It would be trivial to store key-value
pairs that could be used to control how scoring is done.
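
Something as small as this would do for a start (sketch):

    -- key-value knobs read by the scoring step instead of constants
    CREATE TABLE geocode_config (key text PRIMARY KEY, value text);
    INSERT INTO geocode_config VALUES
        ('penalty.wrong_state', '0.5'),
        ('penalty.wrong_city',  '0.2'),
        ('preferred.region',    'LA');

    -- e.g. in the scorer:
    SELECT value::float8 FROM geocode_config
    WHERE key = 'penalty.wrong_state';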

>> Most of my effort was in extracting the standardizer code and
>> creating a library. This is still pretty rough code, as it was more
>> of a prototype to see if I could do it.
>
> Yep, we'd love to see this piece and see if we can reuse it.

Ok, I will see about getting this code checked in, including a simple
test case to demonstrate how to use it. Next week is going to get busy
for me, so that may slow things down a little. PAGC does compile under
Windows if I'm not mistaken, but I only have tools for Linux.

The standardizer code that I built into a library is here:
http://pagc.svn.sourceforge.net/viewvc/pagc/branches/sew-refactor/pagclib/api/

This is everything you need to get started on Linux. Start by looking
at test_main.c, which implements a simple CLI for testing. It was made
to mirror a similar command, pagc_stand, so that I could verify that my
extracted library works identically to the original PAGC code. The
postgresql binding is a little cleaner and easier to understand. I will
get that checked in.

>> This prototype was able to geocode a table of about 216K addresses
>> in St Paul, MN at an average of 44 ms each. Obviously there was a
>> lot of caching happening because all the queries were focused in one
>> fairly small area, but the Tiger data is represented in 50+ million
>> records and cold cache hits are typically 100-200 ms. This varies
>> based on whether or not fuzzy searches are done. This is on a 3-4
>> year old server with 4GB of memory and SATA1 disks in software RAID,
>> running PostgreSQL 8.3.
>
> Again, we would love to see benchmarks. We haven't been benchmarking
> much on Linux since we service mostly Windows PostGIS users. From the
> short tests we've done, I think Linux is generally a bit faster (but
> not by much) with the same number of cores, but we weren't able to do
> an apples-to-apples test on the exact server config, disks, etc. to
> say one way or the other.
>
> I think the timings you are getting sound similar to ours, but given
> that we haven't loaded the whole Tiger set, and that we often run on
> cloud servers which generally have worse speed than physical disks,
> we can't be sure.

Yeah, getting apples-to-apples comparisons can be tough. At some point,
if we need to do it, maybe someone can set up a couple of VMs, one with
each OS, but I would assume they should be reasonably comparable
assuming the OSes and postgresql are optimized appropriately.

I think the real value of having performance tests is to track changes
from change to change and release to release.

>> So my thoughts related to the Tiger geocoder are that some of this
>> might be usable there. Also, there is nothing Tiger specific other
>> than the initial loading of the data and any preprocessing that
>> might need to be done. The standardizer is table driven: it loads a
>> lexicon, gazetteer, and the address parsing rules from tables. So if
>> the data changes or the locale changes, in theory you only need to
>> change these tables. In reality, more work needs to be done on the
>> address parser so that it will support UTF8 for
>> internationalization, but I have ideas for that.
>
> Definitely, and again we would like to focus on building something
> that is more internationally useful. We like the basic philosophy of
> the postgis tiger geocoder,

I would love to get some funding so we could get the author of PAGC to
extend the standardizer along international lines, or do that
ourselves. I already have a lot of data on what is needed for most of
the countries in Europe that could be the starting point for doing some
work along these lines.

> in that it really only requires a machine running PostGIS (no
> external dependencies like PHP etc. needed) and you can do geocoding
> without leaving the database.

This is what I like also. It is trivial to access the geocoder via PHP,
.NET, Perl, Java, etc. if you can connect to the database and issue a
simple query.
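
For example, with the Tiger geocoder the whole client-side story is a
query along these lines (geocode() returns a rating, a normalized
address, and a point):

    SELECT g.rating,
           pprint_addy(g.addy) AS address,
           ST_X(g.geomout) AS lon,
           ST_Y(g.geomout) AS lat
    FROM geocode('75 State Street, Boston, MA 02109') AS g;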

Thanks,
-Steve

Stephen Woodbridge

Dec 9, 2012, 10:11:05 PM
to PostGIS Users Discussion
Regina,

I just checked in the wrapper code and some README files to get you
started. The wrappers are in the .../api/psql/ folder. See the link to
svn below.

Thanks,
-Steve

White, Richard

Dec 10, 2012, 4:01:45 AM
to PostGIS Users Discussion
Hi,

I am having a problem installing PostGIS 1.5.8 or 2.0.2 on RHEL 6.1 by
compiling from source.

Compiling with the configuration "./configure
--with-pgsql=/usr/local/pgsql/bin/pg_config" works fine.

The problem is the post-compile PostGIS configuration.

"/usr/local/pgsql/bin/createlang plpgsql test" works ok.

"/usr/local/pgsql/bin/psql -d test -f
/usr/local/pgsql/share/postgresql/contrib/postgis.sql" fails with the
following error message:

"psql:/usr/share/pgsql/contrib/postgis-2.0/postgis.sql:3368: ERROR:
current transaction is aborted, commands ignored until end of
transaction block"

The above has worked on Ubuntu and SUSE in the past.

Any ideas or advice?

Thanks.

R.


Devrim GÜNDÜZ

Dec 10, 2012, 4:19:43 AM
to PostGIS Users Discussion

Hi,

On Mon, 2012-12-10 at 09:01 +0000, White, Richard wrote:
> I am having a problem installing PostGIS 1.5.8 or 2.0.2 on RHEL 6.1
> by compiling from source.

FWIW, PostGIS 2 is available in RPM format, along with PostgreSQL.
Anyway:

> "/usr/local/pgsql/bin/psql -d test -f
> /usr/local/pgsql/share/postgresql/contrib/postgis.sql" This fails with
> the following error message;
>
> "psql:/usr/share/pgsql/contrib/postgis-2.0/postgis.sql:3368: ERROR:
> current transaction is aborted, commands ignored until end of
> transaction block"

There should be a message before this one. Could you please send us
that?
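
Also, running the script with psql's ON_ERROR_STOP variable set will
stop at the first failing statement, which should expose the real
error:

    /usr/local/pgsql/bin/psql -v ON_ERROR_STOP=1 -d test \
        -f /usr/local/pgsql/share/postgresql/contrib/postgis.sql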

Regards,
--
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz

Tim-Hinnerk Heuer

Dec 10, 2012, 4:25:22 AM
to PostGIS Users Discussion
You could try
http://postgis.refractions.net/download/postgis-2.0.3SVN.tar.gz or
http://postgis.refractions.net/download/postgis-2.1.0SVN.tar.gz .
The first one worked for me, while I had problems with the "stable"
release.

Cheers,
Tim

Tim-Hinnerk Heuer

Twitter: @geekdenz


