Hi
PostGISers,
sorry
this is a long one ;)
I am
trying to use the ST_DWithin and ST_Distance functions to get a list of
nearest neighboring roads in relation to a point location. The dataset is big
its the tiger shape files which I loaded into PostGIS and clustered on the
gist_index, SRID is 9102003 (or EPSG code 102003) an Albers-Equal-Area
projection which has units in meters. I researched
earlier posts and it seemed that several recommendations pointed out that the
use of ST_DWithin and ST_Distance would do exactly
what I wanted to do.
I
created an spatial index on the roads layer and afterwards ran vacuum
analyze on the roads layer.
CREATE
INDEX roads_gidx ON roads USING gist (the_geom); ALTER TABLE roads CLUSTER ON
roads_gidx;
and a
second index on the transformed geom:
CREATE
INDEX roads_tranform_9102003_gidx ON roads USING gist (st_transform(the_geom,
9102003)) WHERE the_geom IS NOT NULL;
My "dream" SQL query which I wanted to run is
"Find all road segments within 50 meter from a point"" e.g. like
this
SELECT
* FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003),
GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50)
ORDER
BY ST_Distance(transform(roads.the_geom,9102003),
GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003)) limit
1;
However this query was so slow that I never managed yet
to wait it out yet :( .. I believe it would be a couple
of hours ...
Thus I
tried to run it in parts to fnd out what is =going on
Since
no index was used I tried to convince the query planner to use a bounding box
(which ST_DWithin should use anyway) by adding one manually:
SELECT
* FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003),
GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003), 50) and
transform(roads.the_geom,9102003) &&
GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003) LIMIT
1;
The
query returns one row and
Explain Analyze give me:
"Limit (cost=0.00..2450234.77 rows=1 width=879)
(actual time=1277669.052..1277669.053 rows=1 loops=1)"
" ->
Seq Scan on roads (cost=0.00..2450234.77 rows=1 width=879) (actual
time=1277669.051..1277669.051 rows=1
loops=1)"
" Filter:
((transform(the_geom, 9102003) &&
'0103000020B3E28A000100000005000000000000C01B3F41C100000040B0BB0E41000000C01B3F41C100000060D0BE0E41000000A0E93E41C100000060D0BE0E41000000A0E93E41C100000040B0BB0E41000000C01B3F41C100000040B0BB0E41'::geometry)
AND _st_dwithin(transform(the_geom, 9102003),
'0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry, 50::double
precision) AND (transform(the_geom, 9102003) &&
'0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry) AND
('0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry &&
st_expand(transform(the_geom, 9102003), 50::double precision)))"
"Total
runtime: 1277669.134 ms"
woaahh!
I am wondering why the spatial index was
not used which - at least in theory - should be obvious to the query
planner ....
This discussion forum thread from August 2009 discusses similar
issues:
Namely
that PostGIS performs an sequential scan instead of an index scan.
From what I understand in the above thread it seems that the query
planner for large/huge tables (like my tables of all US roads loaded from TIGER)
does somehow not get the correct cost inputs to make the decision to use the
spatial index and this might get solved in PostGIS 1.1 or 1.5 possibly (I'm
using PostGIS 1.3.6RC1)?
I also tried
to change the cost for the functions used to convince PostGIS to do the bounding
box operation first but with no success. I also temporarily to "disabling"
or better "discouraging" seq scans by: set
enable_seqscan=false
and after the query
setting it back to "normal": set enable_seqscan=true; - still PostGIS does the seq scan first.
So
would anybody have suggestions how I can work around this behavior and still get
the closets roads from a point location or will that just not work right now on
huge data sets?
Is
there any way to really force PostGIS to use the gist index
?
I am
using the following software versions:
PostgreSQL 8.3.7 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real
(Ubuntu 4.3.3-5ubuntu4) 4.3.3
POSTGIS="1.3.6RC1" GEOS="3.1.0-CAPI-1.5.0" PROJ="Rel. 4.6.1, 21 August
2008" USE_STATS
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
More
details:
Strangely the above query also works
faster when using SRID 4326 in degress which doesn't seem to make sense
(radius = 50 degrees - no way I don't get it)
SELECT
* FROM roads WHERE ST_DWithin(roads.the_geom,
transform(GeomFromText('POINT(-2260485.2563774 251816.03973006)',
9102003),4326), 50) and
roads.the_geom &&
transform(GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003),4326)
LIMIT 1;
This
returns one record in 4607 ms.
Explain of this query spits out:
"Limit (cost=0.00..9.24 rows=1 width=879) (actual
time=437.187..437.188 rows=1 loops=1)"
" -> Index Scan using
roads_gidx on roads (cost=0.00..9.24 rows=1 width=879) (actual
time=437.184..437.184 rows=1
loops=1)"
" Index Cond: ((the_geom
&&
'0103000020E6100000010000000500000000000060837E65C000000080970C2AC000000060837E65C0000000206DBE5540000000A006FD51C0000000206DBE5540000000A006FD51C000000080970C2AC000000060837E65C000000080970C2AC0'::geometry)
AND (the_geom &&
'0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry))"
"
Filter: ((the_geom &&
'0103000020E6100000010000000500000000000060837E65C000000080970C2AC000000060837E65C0000000206DBE5540000000A006FD51C0000000206DBE5540000000A006FD51C000000080970C2AC000000060837E65C000000080970C2AC0'::geometry)
AND _st_dwithin(the_geom,
'0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry, 50::double
precision) AND (the_geom &&
'0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry) AND
('0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry &&
st_expand(the_geom, 50::double precision)))"
"Total runtime: 437.500
ms"
I
also tried to use a bounding box with a manually created polygon but I am
running into the same issues - its running slow doing a seq scan: 1519138
ms:
like:
SELECT
* FROM roads WHERE transform(roads.the_geom,9102003) &&
ST_Polygon(ST_GeomFromText('LINESTRING(-2026098.7705385
1354954.1944117,-2026979.4410859 1354954.1944117,-2026979.4410859
1356078.528747,-2026098.7705385 1356078.528747,-2026098.7705385
1354954.1944117)'), 9102003) and
ST_intersects(transform(roads.the_geom,9102003),
ST_Polygon(ST_GeomFromText('LINESTRING(-2026098.7705385
1354954.1944117,-2026979.4410859 1354954.1944117,-2026979.4410859
1356078.528747,-2026098.7705385 1356078.528747,-2026098.7705385
1354954.1944117)'), 9102003));
Explain Analyze of the query above gives me:
"Seq
Scan on roads (cost=0.00..2144881.90 rows=1 width=879) (actual
time=1491788.995..1519081.606 rows=45 loops=1)"
" Filter:
((transform(the_geom, 9102003) &&
'0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry)
AND (transform(the_geom, 9102003) &&
'0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry)
AND _st_intersects(transform(the_geom, 9102003),
'0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry))"
"Total
runtime: 1519138.528 ms"