[postgis-users] trouble getting nearest neighbor with ST_DWithin and ST_Distance

1 view
Skip to first unread message

karsten vennemann

unread,
Sep 16, 2009, 3:45:18 AM9/16/09
to postgi...@postgis.refractions.net
Hi PostGISers,
sorry this is a long one ;)
 
I am trying to use the ST_DWithin and ST_Distance functions to get a list of nearest neighboring roads in relation to a point location. The dataset is big its the tiger shape files which I loaded into PostGIS and clustered on the gist_index, SRID is 9102003 (or EPSG code 102003) an Albers-Equal-Area projection which has units in meters. I researched earlier posts and it seemed that several recommendations pointed out that the use of ST_DWithin and ST_Distance would do exactly what I wanted to do.
 
I created an spatial index on the roads layer and afterwards ran vacuum analyze on the roads layer.
CREATE INDEX roads_gidx ON roads USING gist (the_geom); ALTER TABLE roads CLUSTER ON roads_gidx;
 
and a second index on the transformed geom:
CREATE INDEX roads_tranform_9102003_gidx ON roads USING gist (st_transform(the_geom, 9102003)) WHERE the_geom IS NOT NULL;
 
My "dream" SQL query which I wanted to run is "Find all road segments within 50 meter from a point"" e.g. like this
SELECT * FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003), GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50)
ORDER BY ST_Distance(transform(roads.the_geom,9102003), GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003)) limit 1;
 
However this query was so slow that I never managed yet to wait it out yet :(  .. I believe it would be a couple of hours  ...
 
Thus I tried to run it in parts to fnd out what is =going on
Since no index was used I tried to convince the query planner to use a bounding box (which ST_DWithin should use anyway) by adding one manually:
 
SELECT * FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003),  GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003), 50) and 
transform(roads.the_geom,9102003) && GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003) LIMIT 1;
The query returns one row and
Explain Analyze give me:
 
"Limit  (cost=0.00..2450234.77 rows=1 width=879) (actual time=1277669.052..1277669.053 rows=1 loops=1)"
"  ->  Seq Scan on roads  (cost=0.00..2450234.77 rows=1 width=879) (actual time=1277669.051..1277669.051 rows=1 loops=1)"
"        Filter: ((transform(the_geom, 9102003) && '0103000020B3E28A000100000005000000000000C01B3F41C100000040B0BB0E41000000C01B3F41C100000060D0BE0E41000000A0E93E41C100000060D0BE0E41000000A0E93E41C100000040B0BB0E41000000C01B3F41C100000040B0BB0E41'::geometry) AND _st_dwithin(transform(the_geom, 9102003), '0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry, 50::double precision) AND (transform(the_geom, 9102003) && '0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry) AND ('0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry && st_expand(transform(the_geom, 9102003), 50::double precision)))"
"Total runtime: 1277669.134 ms"
 
woaahh!
I am wondering why the spatial index was not used which - at least in theory - should be obvious to the query planner ....
 
This discussion forum thread from August 2009 discusses similar issues:
 
Namely that PostGIS performs an sequential scan instead of an index scan. From what I understand in the above thread it seems that the query planner for large/huge tables (like my tables of all US roads loaded from TIGER) does somehow not get the correct cost inputs to make the decision to use the spatial index and this might get solved in PostGIS 1.1 or 1.5 possibly (I'm using PostGIS 1.3.6RC1)?
 
I also tried to change the cost for the functions used to convince PostGIS to do the bounding box operation first but with no success. I also temporarily to "disabling" or better "discouraging" seq scans by: set enable_seqscan=false and after the query setting it back to "normal": set enable_seqscan=true;  - still PostGIS does the seq scan first.
 
So would anybody have suggestions how I can work around this behavior and still get the closets roads from a point location or will that just not work right now on huge data sets?
Is there any way to really force PostGIS to use the gist index ?
 
I am using the following software versions:
PostgreSQL 8.3.7 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu 4.3.3-5ubuntu4) 4.3.3
POSTGIS="1.3.6RC1" GEOS="3.1.0-CAPI-1.5.0" PROJ="Rel. 4.6.1, 21 August 2008" USE_STATS
 
Cheers
Karsten

Karsten Vennemann
Terra GIS LTD
SeattleWA, USA 
www.terragis.net
 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
More details:
 
Strangely the above query also works faster when using SRID 4326 in degress which doesn't seem to make sense (radius = 50 degrees - no way I don't get it)
SELECT * FROM roads WHERE ST_DWithin(roads.the_geom, transform(GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003),4326), 50) and 
roads.the_geom && transform(GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003),4326) LIMIT 1;
This returns one record in 4607 ms.
 
Explain of this query spits out:
"Limit  (cost=0.00..9.24 rows=1 width=879) (actual time=437.187..437.188 rows=1 loops=1)"
"  ->  Index Scan using roads_gidx on roads  (cost=0.00..9.24 rows=1 width=879) (actual time=437.184..437.184 rows=1 loops=1)"
"        Index Cond: ((the_geom && '0103000020E6100000010000000500000000000060837E65C000000080970C2AC000000060837E65C0000000206DBE5540000000A006FD51C0000000206DBE5540000000A006FD51C000000080970C2AC000000060837E65C000000080970C2AC0'::geometry) AND (the_geom && '0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry))"
"        Filter: ((the_geom && '0103000020E6100000010000000500000000000060837E65C000000080970C2AC000000060837E65C0000000206DBE5540000000A006FD51C0000000206DBE5540000000A006FD51C000000080970C2AC000000060837E65C000000080970C2AC0'::geometry) AND _st_dwithin(the_geom, '0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry, 50::double precision) AND (the_geom && '0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry) AND ('0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry && st_expand(the_geom, 50::double precision)))"
"Total runtime: 437.500 ms"

I also tried to use a bounding box with a manually created polygon but I am running into the same issues - its running slow doing a seq scan: 1519138 ms:
 
like:
SELECT * FROM roads WHERE transform(roads.the_geom,9102003) && ST_Polygon(ST_GeomFromText('LINESTRING(-2026098.7705385 1354954.1944117,-2026979.4410859 1354954.1944117,-2026979.4410859 1356078.528747,-2026098.7705385 1356078.528747,-2026098.7705385 1354954.1944117)'), 9102003) and ST_intersects(transform(roads.the_geom,9102003), ST_Polygon(ST_GeomFromText('LINESTRING(-2026098.7705385 1354954.1944117,-2026979.4410859 1354954.1944117,-2026979.4410859 1356078.528747,-2026098.7705385 1356078.528747,-2026098.7705385 1354954.1944117)'), 9102003));
 
Explain Analyze of the query above gives me:
"Seq Scan on roads  (cost=0.00..2144881.90 rows=1 width=879) (actual time=1491788.995..1519081.606 rows=45 loops=1)"
"  Filter: ((transform(the_geom, 9102003) && '0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry) AND (transform(the_geom, 9102003) && '0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry) AND _st_intersects(transform(the_geom, 9102003), '0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry))"
"Total runtime: 1519138.528 ms"
 

Mark Cave-Ayland

unread,
Sep 16, 2009, 4:39:39 AM9/16/09
to PostGIS Users Discussion
karsten vennemann wrote:

> "Total runtime: *1277669*.134 *ms*"


Hi Karsten,

What is the SRID of the geometry column in your roads table? The reason
I ask is that in your query you're forcing PostGIS to reproject every
single row on the roads table which is why it seq scans.

Assuming that all the geometries in your roads table are SRID 9102003,
then as a starting point, the following query should be quick:

SELECT * FROM roads WHERE ST_DWithin(roads.the_geom,

GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50)


HTH,

Mark.

--
Mark Cave-Ayland - Senior Technical Architect
PostgreSQL - PostGIS
Sirius Corporation plc - control through freedom
http://www.siriusit.co.uk
t: +44 870 608 0063

Sirius Labs: http://www.siriusit.co.uk/labs
_______________________________________________
postgis-users mailing list
postgi...@postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users

Karsten Vennemann

unread,
Sep 17, 2009, 3:55:05 AM9/17/09
to postgi...@postgis.refractions.net

>>> What is the SRID of the geometry column in your roads table? >>>
The SRID is 4326. That' why I used the transform to 9102003.

>>> The reason I ask is that in your query you're forcing PostGIS to
>>> reproject every single row on the >>> roads table which is why it seq
>>> scans.

Oh ok. I was assuming that step would only do the transform on original
geometries found within the bounding box...

>>> Assuming that all the geometries in your roads table are SRID 9102003,
>>> then as a starting point, the following query should be quick:
>>> SELECT * FROM roads WHERE ST_DWithin(roads.the_geom,
>>> GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50)

I added a second geometry column the_geom2 with SRID 9102003, and created an
gist index on it (took a total of 6 hours or so for both) and voila the
whole query is pretty fast and retrieves one row in 310ms:
SELECT * FROM roads WHERE ST_DWithin(roads.the_geom2,
GeomFromText('POINT(-2260366.9053948 253261.48812865)', 9102003), 150)
ORDER BY ST_Distance(roads.the_geom2, GeomFromText('POINT(-2260366.9053948
253261.48812865)', 9102003))
limit 1;
Thanks
Karsten


--
View this message in context: http://www.nabble.com/trouble-getting-nearest-neighbor-with-ST_DWithin-and-ST_Distance-tp25467509p25486772.html
Sent from the PostGIS - User mailing list archive at Nabble.com.

Reply all
Reply to author
Forward
0 new messages