[postgis-users] Tigerdata for AZ, AS and VI

1 view
Skip to first unread message

Ravi ada

unread,
Dec 12, 2011, 9:18:22 AM12/12/11
to postgi...@postgis.refractions.net

Hello All,

Has anyone experienced loading tigerdata into postgis database for Arizona, American Samoa and Virgin Islands. I getting “*addr.dbf” cannot find errors. All the other states are loaded fine. I tried to download the shape files again thinking that they might have been corrupted during the transmission, but even after that I am getting the same error.

 

Any ideas?

 

Thanks

Ravi Ada

 

Stephen Woodbridge

unread,
Dec 12, 2011, 9:50:29 AM12/12/11
to postgi...@postgis.refractions.net

My download of Tiger has all the *addr* files for Arizona and I believe
I have accessed them all without a problem.


In general, the *addr* files are optional, and there are none for Guam,

American Samoa and Virgin Islands.

Typically if the county or county equivalent does not have roads with
address ranges in it, then it will not have any *addr* files. So it is
possible that a county in Arizona in say the desert might not have any
address ranges and therefore not have that file, but looking at the list
of counties in Arizona it looks like they all have those files.

-Steve W
_______________________________________________
postgis-users mailing list
postgi...@postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users

Dan Putler

unread,
Dec 12, 2011, 1:18:42 PM12/12/11
to postgi...@postgis.refractions.net
To complete Steve's list, the Northern Mariana Islands (abbreviation MP,
FIPS 69) also do not have "addr" files.

Dan

Ravi ada

unread,
Dec 12, 2011, 8:04:38 PM12/12/11
to PostGIS Users Discussion
Thank you Steve. I downloaded AZ files again and loaded fine but others are
still the same problem. According to your explanation that should be ok.

I got the postgis database loaded for all states now. I have about 3 mil
addresses, may not all be normalized, which I am trying to batch geocode
them. I am using the example mentioned in this link.
http://www.postgresonline.com/journal/archives/181-pgscript_intro.html

I am even using 100 as a batch, my update query is too slow. Its updating at
1500 per hour. That's too slow, I will never be able to finish them.

I have 16GB RAM, and 7200 rpm disk partitioned to hold the postgres table
spaces. I am not sure what makes it run faster. Anybody has done so many
addresses before? What makes the performance go faster? I am attaching the
query and query plan here. www.pastie.org/3008194

Any help is appreciated.

Thanks
Ravi Ada

Ravi ada

unread,
Dec 12, 2011, 9:50:40 PM12/12/11
to PostGIS Users Discussion
In these examples, they used only 2GB memory and 3GHz machine but still
achieved a blazing fast results. The same queries mentioned in the link
taking 10 and even 100 times more time to query a particular address. I am
using a 16GB, 6 Core AMD machine, dedicated to this process. I did the
tuning on postgresql config file based on the recommendations. I am
attaching my file here.. Please let me know if the tuning parameters look
good.
http://postgis.refractions.net/documentation/manual-svn/Geocode.html

This query is supposed to take only (61ms) but on my machine is was (734ms).
SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat,
(addy).address As stno, (addy).streetname As street,
(addy).streettypeabbrev As styp, (addy).location As city,
(addy).stateabbrev As st,(addy).zip
FROM geocode('75 State Street, Boston MA 02109') As g;

postgresql.conf

Stephen Woodbridge

unread,
Dec 12, 2011, 10:04:19 PM12/12/11
to postgi...@postgis.refractions.net
Hi Ravi,

I do not have this setup on my machine, but I am willing to hazard a
guess that you are missing an index, but then I have no idea which
one(s) that might be. Leo and Regina are probably the experts in this,
so I would look over their past posts on the geocoder. You might also
look at the load and prep scripts in svn and see if there is an index
there that you do not have on your tables.

Regards,
-Steve

Ravi ada

unread,
Dec 12, 2011, 11:25:26 PM12/12/11
to PostGIS Users Discussion
Thanks Steve, That's what I thought too, I ran the 'install_missing_indexes"
function, it ran for a few minutes and returned 't'. I am assuming it ran
successfully. The performance is still same. I increased the work_mem to 4GB
in postgresql.conf. It is still not acceptable.

Leo/Regina, anything specific that you want me to verify on my system?
Performance is terrible, I can never finish geocoding 3million addresses
with this performance.

Any help is highly appreciated.

Greg Williamson

unread,
Dec 13, 2011, 1:37:16 PM12/13/11
to PostGIS Users Discussion
Ravi --

Could you run this with "EXPLAIN ANALYZE ..." and post the results; that might give something of a clue as to what issues the planner is encountering.

Greg W.

----- Original Message -----
> From: Ravi ada <rav...@dobeyond.com>
> To: 'PostGIS Users Discussion' <postgi...@postgis.refractions.net>
> Cc:
> Sent: Monday, December 12, 2011 8:25 PM
> Subject: Re: [postgis-users] Tigerdata for AZ, AS and VI
>

>T hanks Steve, That's what I thought too, I ran the

Andy Colson

unread,
Dec 13, 2011, 3:31:34 PM12/13/11
to Ravi ada, PostGIS Users Discussion
And instead of running the update, try running:

explain analyze
SELECT ag.id,
(geocode(ag.address1||','||ag.city||','||ag.state||','||ag.zip)) As geo
FROM qliq.geo_biz_addr As ag
WHERE ag.rating IS NULL
ORDER BY zip
LIMIT 100


Also, the order by zip, combined with the limit, means it has to pull
every record, then sort by zip, then pull the first 100. If you can
drop one or the other it would run faster.

-Andy

Ravi Ada

unread,
Dec 13, 2011, 6:28:39 PM12/13/11
to Andy Colson, PostGIS Users Discussion
Andy,
Here is the explain analyze output.
"Limit  (cost=0.00..14.10 rows=100 width=73) (actual time=4824.392..98929.180
rows=100 loops=1)"
"  ->  Index Scan using geo_biz_addr_zip_idx on geo_biz_addr ag
(cost=0.00..219048.99 rows=1553779 width=73) (actual time=4824.381..98925.304
rows=100 loops=1)"
"        Filter: (rating IS NULL)"
"Total runtime: 98930.371 ms"


Here is the output for the query without ORDER BY zip.
"Limit  (cost=0.00..7.06 rows=100 width=73) (actual time=63022.583..279475.286
rows=100 loops=1)"
"  ->  Seq Scan on geo_biz_addr ag  (cost=0.00..109741.62 rows=1553779
width=73) (actual time=63022.571..279474.529 rows=100 loops=1)"
"        Filter: (rating IS NULL)"
"Total runtime: 279475.678 ms"

Surprisingly it took longer without the where clause, that may be because the addresses are scattered around all the states or cities. but in any case, 100 to 300 secs to geocode 100 addresses is too long. I got the work_mem to set to 4GB in postgresql.conf.


Thanks
Ravi Ada

On Tue, 13 Dec 2011 14:31:34 -0600, Andy Colson wrote
Thanks,
Ravi Ada
918-630-7381

Andy Colson

unread,
Dec 13, 2011, 7:41:30 PM12/13/11
to rav...@dobeyond.com, PostGIS Users Discussion
Oops, my bad, the index on zip made it work well.

The explain analyze is kinda useless. You're probably not gonna like my next request.

How would you feel about pulling one address out of geo_biz_addr and then copy/pasting each select from the function geocode() and running it by hand?

Said another way, the function might look like:

create function geocode(...) as $$
begin
select a from ... where;
select b from somethingElse ...;
end; $$

Get an address from geo_biz_addr, then run:

explain analyze select a from ... where;

Then:
explain analyze select b from somethingElse ...;

You'll have to substitute the arguments by hand.

OH.. Or, there is a plugin to auto-explain queries. I have no idea if it'll do each select inside a function separately or not.

Where would one find this geocode function? Is it part of postgis? Or a different package you downloaded?

-Andy

> > >>> attaching the query and query plan here. www.pastie.org <http://www.pastie.org/>/3008194

_______________________________________________

Greg Williamson

unread,
Dec 13, 2011, 8:33:55 PM12/13/11
to rav...@dobeyond.com, Andy Colson, PostGIS Users Discussion
Have you run "analyze" recently on this table ? (since the last index build or the last major change in data)

The work_mem setting is fairly meaningless for this -- it applies when building indexes and the like; sort_mem controls how much RAM the system will try to use before it starts using disk; you might try tinkering with that unless it is already large (but remember that each sort in a query uses this much RAM so too aggressive a setting is bad).

HTH,

Greg W.


>________________________________
> From: Ravi Ada <rav...@dobeyond.com>
>To: Andy Colson <an...@squeakycode.net>; PostGIS Users Discussion <postgi...@postgis.refractions.net>
>Sent: Tuesday, December 13, 2011 3:28 PM

time=4824.392..98929.180
>rows=100

time=4824.381..98925.304
>rows=100

loops=1)"
>"        Filter: (rating IS

NULL)"
>"Total runtime: 98930.371

time=63022.583..279475.286
>rows=100

loops=1)"
>"        Filter: (rating IS

NULL)"
>"Total runtime: 279475.678

Stephen Woodbridge

unread,
Dec 13, 2011, 9:25:54 PM12/13/11
to postgi...@postgis.refractions.net
On 12/13/2011 8:33 PM, Greg Williamson wrote:
> Have you run "analyze" recently on this table ? (since the last index
> build or the last major change in data)
>
> The work_mem setting is fairly meaningless for this -- it applies
> when building indexes and the like; sort_mem controls how much RAM
> the system will try to use before it starts using disk; you might try
> tinkering with that unless it is already large (but remember that
> each sort in a query uses this much RAM so too aggressive a setting
> is bad).

What is:
shared_buffers = ...
set to? this is the one that needs to be set. You should google:
"postgresql tuning" and read how to set the postgresql.conf file. You
may also need to change your kernel parameters as you increase the
shared_buffers.

setting work_mem to a large value will not help much. You need to read
what each of the parameters do and then set them appropriately.

-Steve

_______________________________________________

Ravi Ada

unread,
Dec 13, 2011, 11:02:49 PM12/13/11
to PostGIS Users Discussion
Thanks Steve.

Here are the values from the postgresql.conf

max_connections = 140
shared_buffers = 2GB
temp_buffers = 8MB
work_mem = 4GB
maintenance_work_mem = 1GB
wal_buffers = 8MB
checkpoint_segments = 128
effective_cache_size = 6GB
cpu_tuple_cost = 0.0030
cpu_index_tuple_cost = 0.0010
cpu_operator_cost = 0.0005
fsync = off
checkpoint_timeout = 1h

I just don't understand why the geocode function takes so long to return the
coordinates. I am sure some of you on this list might have done the batch
geocoding millions of addresses. I may be missing just a simple configuration
which might make a whole lot of difference in the speed. I don't know what it
is. I am following the examples exactly from this link
(http://postgis.refractions.net/documentation/manual-svn/Geocode.html)

If someone is familiar with the problem willing to help me using GoTo Meeting
connection to my machine, I can arrange that too. I just have to move along
with my project and meet the deadlines. I am already delayed, everybody in my
team asking me for this everyday.


Thanks,
Ravi Ada

On Tue, 13 Dec 2011 21:25:54 -0500, Stephen Woodbridge wrote


Thanks,
Ravi Ada
918-630-7381

Stephen Woodbridge

unread,
Dec 13, 2011, 11:54:38 PM12/13/11
to rav...@dobeyond.com, PostGIS Users Discussion
You have 16 GB of mem so set your shared_buffer=12GB
This will probably fail to start because your kernel may not have SHMMAX
set high enough, so you will need to set it higher:

You can see the current settings with:
ipcs -l

You can change it by editing:
/etc/sysctl.conf

then reload the new paramters with:
sysctl -p

and check it again with:
ipcs -l

and finally restart postgres
/etc/init.d/postgresql-<version> restart

and then try it again. You might need to process a few queries for it to
load up the page cache after you restart before you start seeing any
performance increases.

Also you should sort you addresses by zipcode before you geocode them so
that the data you need is in the page cache from the last record you
geocoded.

-Steve

Michael Papet

unread,
Dec 14, 2011, 1:00:12 AM12/14/11
to postgi...@postgis.refractions.net

> Message: 5
> Date: Mon, 12 Dec 2011 19:04:38 -0600
> From: "Ravi ada" <rav...@dobeyond.com>

> Subject: Re: [postgis-users] Tigerdata for AZ, AS and VI
> To: "'PostGIS Users Discussion'"
>     <postgi...@postgis.refractions.net>
> Message-ID:
> <009401ccb933$31d5fd40$9581f7c0$@dobeyond.com>
> Content-Type: text/plain;   
> charset="us-ascii"

>
> Thank you Steve. I downloaded AZ files again and loaded
> fine but others are
> still the same problem. According to your explanation that
> should be ok.
>
> I got the postgis database loaded for all states now. I
> have about 3 mil
> addresses, may not all be normalized, which I am trying to
> batch geocode
> them. I am using the example mentioned in this link.
> http://www.postgresonline.com/journal/archives/181-pgscript_intro.html
>
> I am even using 100 as a batch, my update query is too
> slow. Its updating at
> 1500 per hour. That's too slow, I will never be able to
> finish them.
>
> I have 16GB RAM, and 7200 rpm disk partitioned to hold the
> postgres table
> spaces. I am not sure what makes it run faster. Anybody has
> done so many
> addresses before? What makes the performance go faster? I
> am attaching the
> query and query plan here. www.pastie.org/3008194
>
> Any help is appreciated.
>
> Thanks
> Ravi Ada
>
Is the computer a desktop configuration or a real server?
Is the database tables on their own disk?
What's the swap config? What's memory consumption like?
Debian has an application to make modifying the postgresql config a whole lot easier. Maybe your distro does too?
I use perl to do the geocoding queries and write results to file. I use another script to do the inserts. I found this faster than doing it all in PostgreSQL's scripting. My employer's dual single-core DL380 geocodes about 15000 addresses an hour.
This may be a dumb question, but have the indexes been rebuilt on the tables recently?

I hope that helps.
mpapet

Paragon Corporation

unread,
Dec 14, 2011, 4:18:27 AM12/14/11
to PostGIS Users Discussion

> I just don't understand why the geocode function takes so
> long to return the coordinates. I am sure some of you on this
> list might have done the batch geocoding millions of
> addresses. I may be missing just a simple configuration which
> might make a whole lot of difference in the speed. I don't
> know what it is. I am following the examples exactly from this link
> (http://postgis.refractions.net/documentation/manual-svn/Geocode.html)
>
> If someone is familiar with the problem willing to help me
> using GoTo Meeting connection to my machine, I can arrange
> that too. I just have to move along with my project and meet
> the deadlines. I am already delayed, everybody in my team
> asking me for this everyday.
>
>
> Thanks,
> Ravi Ada
>
>
Ravi,

Sorry been busy with raster stuff so haven't been tuned into this
discussion.

1) The indexes the loader generates are not the only ones needed. Initially
I was constantly changing the loader script,
but since we were changing decisions as we changed code and optimal indexes
needed with aeach change required changing indexes, which indexes would be
best, I created a function that would put them in
rather than bothering with the loader (since a lot of people would already
have their data loaded)

Have you tried running that. I suspect you are just missing indexes as the
timings you are getting are what I used to get earlier on.

If you haven't run the update script (which runs this routine anyway) or run
this to get generated script for indexes you are missing
you should.

http://www.postgis.org/documentation/manual-svn/Missing_Indexes_Generate_Scr
ipt.html

2) There are a couple of other things to note: First address you do around
an area can take a lot more time because of the data caching effects
in postgresql. So for the example in the docs you describe.

I can do a geocode of 75 State Street,Boston, MA -- and if I haven't done
any geocoding in a while that takes like 1-3 seconds

Then if I do 80 State Street, Boston, MA -- that subsequent takes anywhere
from 60 ms - 150 ms.
I also don't have all the states loaded since I only needed it for about 6
states. thought that should just increase the planner time rather than
later times.

3) For debugging performance there is a variable in the geocode_address
function called var_debug. Its false by default, change it to true.
That spits out the sql being run and is a better sql to pass to the planner
to check.

We were hoping to make these debugging features more publically exposed e.g
via a config table, but haven't had the time to do that.

Hope this all helps,
Regina
http://www.postgis.us

Ravi Ada

unread,
Dec 14, 2011, 9:06:44 AM12/14/11
to PostGIS Users Discussion
Regina,

Thanks so much for the reply. I ran the missing_indexes_generate_script(),
actually it did not return anything, I am assuming all the indexes are in
place. That may be because I ran install_missing_indexes() earlier. I changed
the debug flag in geocode_address and it produced a very long query that it
runs to geocode the address. I tried to cut and paste the query to run the
plan, I am getting errors, I will figure that out.

My question is, do we use gecode or geocode_address for faster querying? I
noticed that geocode_address takes the normalized address where as geocode
takes address as string parameter. By adding additional normalize_address
function when doing the geocode_address akes it run any faster?


Thanks
Ravi Ada


On Wed, 14 Dec 2011 04:18:27 -0500, Paragon Corporation wrote


Thanks,
Ravi Ada
918-630-7381

Stephen Woodbridge

unread,
Dec 14, 2011, 9:14:25 AM12/14/11
to postgi...@postgis.refractions.net
Ravi,

The process for geocoding follows this:

Load the data:
1. get a reference set of streets (ie: the Tiger data)
2. normalize the names as you load the data
3. build the indexes you need for the queries

Query for an address:
1. normalize the address on input
2. query the normalized reference

Ok, so you know most of this because you have already done it, but the
important part here is the you normalize BOTH the reference data and the
input to a query. This resolves things like:

main street != main st

because the normalize parses the addresses and converts them into a
normalized standard form so that you can match. Yes it takes time to
normalize the request, but if you don't normalize it, then there is a
good change that you will not match an appropriate street in the
reference set.

-Steve

Ravi ada

unread,
Dec 14, 2011, 10:12:32 AM12/14/11
to PostGIS Users Discussion
Thanks Steve.
I did not get when you said '2. normalize the names as you load the data',
is this the step we need to do manually or load scripts automatically do
this step?

I am normalizing my addresses before querying the reference data (tiger
data). Please clarify.

Thanks
Ravi Ada

-----Original Message-----
From: postgis-us...@postgis.refractions.net
[mailto:postgis-us...@postgis.refractions.net] On Behalf Of Stephen
Woodbridge

Sent: Wednesday, December 14, 2011 8:14 AM
To: postgi...@postgis.refractions.net
Subject: Re: [postgis-users] Tigerdata for AZ, AS and VI

Stephen Woodbridge

unread,
Dec 14, 2011, 2:30:22 PM12/14/11
to postgi...@postgis.refractions.net
On 12/14/2011 10:12 AM, Ravi ada wrote:
> Thanks Steve.
> I did not get when you said '2. normalize the names as you load the data',
> is this the step we need to do manually or load scripts automatically do
> this step?

the load scripts do this automatically.

> I am normalizing my addresses before querying the reference data (tiger
> data). Please clarify.

You should use geocode() not geocode_address(), because geocode() will
normalize the address in the same way the the reference addresses are
normalized. If you split the address into fields and call
geocode_address() this is NOT the same as normalizing the address. If
you want to be able to get good matches, you have to use the same
normalize function for both the reference and the input addresses.

These functions are not about performance they are about using the tool
the correct way.

-Steve

Ravi ada

unread,
Dec 15, 2011, 12:34:53 PM12/15/11
to PostGIS Users Discussion
Thank you. I am able to speed up a bit by launching the query for each state
and doing 6 states at a time. I am able to see all processors are being used
with a 80-85% memory utilization. However I noticed that geocode function is
taking forever to return for some addresses that are like '100 29TH EAST
ST.' or 'I-35 HIGHWAY'. Some states have the convention of using numbers for
the street names, it takes forever to geocode these addresses. Numbered
streets and Highway service roads are the one taking the most time. I wonder
why? Any ideas what we need to speed these up?

Thanks,

Paragon Corporation

unread,
Dec 15, 2011, 8:52:00 PM12/15/11
to PostGIS Users Discussion
Ravi,

I thought I optimized those, but I may have missed something. The best way
to see what is going on is to use

normalize_address to see how its reading the values.

Anyrate -- please put in a bug ticket for this with some examples you are
trying and how long its taking.

http://trac.osgeo.org/postgis/newticket

Make sure to set the component to "tiger geocoder" in the ticket so it gets
assigned to me.

Thanks,
Regina
http://www.postgis.us


> -----Original Message-----
> From: postgis-us...@postgis.refractions.net
> [mailto:postgis-us...@postgis.refractions.net] On

> Behalf Of Ravi ada
> Sent: Thursday, December 15, 2011 12:35 PM
> To: 'PostGIS Users Discussion'

Reply all
Reply to author
Forward
0 new messages