[postgis-users] Adaptive k-nearest neigbourh in PostGIS

Vilem Ded

unread,

Dec 20, 2017, 6:30:49 AM12/20/17

to postgi...@lists.osgeo.org

Hi,

I have big table of spatial points in 2D (or 3D) (PostGIS) and distance defined between them (let say just real spatial distance). When I create GIST index on these points, I can efficiently get k-nearest neighbors for each point using `<->` operator in ORDER BY statement with constant k (e.g. 20) specified in `LIMIT` statement([Postgis-official][1]).

    SELECT id, st_distance(geom,'SRID=32633;POINT(404419.4 5599450.3)'::geometry ) as dist, geom
    FROM positions
    ORDER BY geom <-> 'SRID=32633;POINT(404419.4 5599450.3)'::geometry limit 20;

Explain analyze for that:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
     Limit (cost=0.28..2.64 rows=10 width=36) (actual time=0.243..0.313 rows=10 loops=1)
       -> Index Scan using positons2_gix on positions2 (cost=0.28..12059.82 rows=51146 width=36) (actual time=0.243..0.310 rows=10 loops=1)
             Order By: (up_geom <-> '0101000020797F00009A9999990DAF184133333393365C5541'::geometry)
     Planning time: 0.109 ms
     Execution time: 0.368 ms

The thing is, that I would like to get different (dynamic) number k of neighbors for each point such that cumulative sum ([How to compute cumulative sum][2]) of distances to these points is less than parameter a (let say 500m) of course maximizing k. So called adaptive k-nearest neighbor algorithm. I could compute distance to all positions and filter it by WHERE clause:

    select * from
    (select *, sum(dist) over(order by dist) distcum from
        (select id, st_distance(geom,'SRID=32633;POINT(404419.4 5599450.3)'::geometry ) as dist, geom from
        positions ORDER BY geom <-> 'SRID=32633;POINT(404419.4 5599450.3)'::geometry) a
    ) b where distcum < 500;

explain analyze for that:

    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
     Subquery Scan on b (cost=14073.70..15608.08 rows=17049 width=52) (actual time=155.427..186.656 rows=26 loops=1)
       Filter: (b.distcum < '500'::double precision)
       Rows Removed by Filter: 51120
       -> WindowAgg (cost=14073.70..14968.76 rows=51146 width=44) (actual time=155.421..181.934 rows=51146 loops=1)
             -> Sort (cost=14073.70..14201.57 rows=51146 width=44) (actual time=155.410..157.985 rows=51146 loops=1)
                   Sort Key: a.dist
                   Sort Method: quicksort Memory: 5532kB
                   -> Subquery Scan on a (cost=9434.16..10073.49 rows=51146 width=44) (actual time=130.931..145.989 rows=51146 loops=1)
                         -> Sort (cost=9434.16..9562.03 rows=51146 width=36) (actual time=130.930..136.711 rows=51146 loops=1)
                               Sort Key: ((positions.geom <-> '0101000020797F00009A9999990DAF184133333393365C5541'::geometry))
                               Sort Method: quicksort Memory: 8729kB
                               -> Seq Scan on positions (cost=0.00..5433.95 rows=51146 width=36) (actual time=0.023..95.801 rows=51146 loops=1)
     Planning time: 0.146 ms
     Execution time: 186.705 ms

but if I am not mistaken, this is sorting all points ( Sort Key: ((positions.geom <->....) using GIST index, runs window function on all points to get cumulative sum and then uses filter.

I would like to apply this adaptive knn for all points in my table with about 2 000 000 rows. My proposed query is too slow for that.

Is there any way how to implement adaptive knn in PotgreSQL/PostGIS so the computation is faster than my solution? (probably by avoiding sorting of all points for each point)

Thank you
Vil

[1]: https://postgis.net/docs/geometry_distance_knn.html
[2]: https://stackoverflow.com/questions/22841206/calculating-cumulative-sum-in-postgresql

Darafei "Komяpa" Praliaskouski

unread,

Dec 20, 2017, 6:43:48 AM12/20/17

to PostGIS Users Discussion

Hi,

it seems to me that cumulative distance is guaranteed to always be more or equal to straight line distance, thus transforming your KNN query to essentially KNN-in-box query.

You can add `where ST_DWithin(geom, 'SRID=32633;POINT(404419.4 5599450.3)'::geometry, distcum)` to your select clause, and you'll get a box-based index scan for your non-index-using query. Depending on your data distribution it might give enough performance already.

ср, 20 дек. 2017 г. в 14:30, Vilem Ded <De...@seznam.cz>:

_______________________________________________
postgis-users mailing list
postgi...@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

Vilem Ded

unread,

Jan 8, 2018, 10:26:31 AM1/8/18

to PostGIS Users Discussion

Thanks,
very nice indeed! I will definitely give that a try.
But this is really very dependent on data distribution. My dataset is clustered so it will be better, but probably not good enough:/
As I understood KNN in Postgis is implemented with priority queue?
If so, there must be condition of type while i < k {next} , where k is our K in KNN. I was hoping for some way how to alter this part of code - "just" change those few lines..
Thank you for any other hints
Vil

Giuseppe Broccolo

unread,

Jan 8, 2018, 10:48:24 AM1/8/18

to PostGIS Users Discussion

Hi Vilem,

2018-01-08 16:26 GMT+01:00 Vilem Ded <De...@seznam.cz>:

Thanks,
very nice indeed! I will definitely give that a try.
But this is really very dependent on data distribution. My dataset is clustered so it will be better, but probably not good enough:/
As I understood KNN in Postgis is implemented with priority queue?
If so, there must be condition of type while i < k {next} , where k is our K in KNN. I was hoping for some way how to alter this part of code - "just" change those few lines..

Yes, kNN in PostgreSQL is based on pairing heap algorithm, that provides a priority queue of index nodes at some point. Consider that kNN is supported in GiST indexing more in general - i.e., it is not just a part of PostGIS.