[SpatiaLite Tips] Find nearest objects from two tables

692 views
Skip to first unread message

romain

unread,
Feb 15, 2011, 4:32:30 AM2/15/11
to SpatiaLite Users
Hi,
Here is a little tip for SpatiaLite 2.4 that I've found recently

Goal: For each object from a table (table1), find the nearest object
of another table (table2), within 1Km (1000m)

Data:
- table1 (id,name,geometry)
- table2 (id,name,geometry) + Index Spatial R*Tree

Easy...

SELECT
table1.name as name1,
table2.name as name2,
min(st_distance(table1.geometry,table2.geometry)) as distance
FROM table 1
JOIN table2 ON ( ....dist...)
GROUP BY 1

But this query doesn't give the right result. For each object from
table1 we have the min distance with table2's objects. However, the
field "name2" will not contain the object corresponding to this min
distance, but to the first object encountered.

In order to have the right results, we have to proceed like this:

1/ Create a first VIEW containing every couples from the two tables:
(distances)

CREATE VIEW distances AS
SELECT
table1.name as name1,
table2.name as name2,
st_distance( table1.geometry,table2.geometry ) as distance
FROM table1
JOIN table2 ON (
st_distance( table1.geometry,table2.geometry )<=1000
AND table2.rowid IN (
SELECT pkid FROM idx_table2_geometry
WHERE pkid MATCH RTreeintersects(
MBRminX(table1.geometry)-1000,
MBRminY(table1.geometry)-1000,
MBRmaxX(table1.geometry)+1000,
MBRmaxY(table1.geometry)+1000
)))
GROUP BY 1

Please note that we use the intersection between the R*Tree Spatial
Index from table 2 and the MBR of table1's objects extended by 1000 m.

2/ Create another VIEW countaining the min distance for every single
table1's objects: (distances_min)

CREATE VIEW distances_min AS
SELECT
name1 as name1,
min( distance ) as "distance min"
FROM distances
GROUP BY 1

3/ To conclude, use those two VIEWS in order to have the correct
result

SELECT
distances.name1 as name1,
distances.name2 as name2,
distances.distance as distance
FROM distances
JOIN distances_min ON (
distances.nom1=distances_min.nom1
AND
distances.distance=distances_min."distance min"
)

I've tested it and it works fine. Perhaps someone know a way to go
faster ?

By

markb

unread,
Feb 15, 2011, 11:27:50 PM2/15/11
to spatiali...@googlegroups.com
A million ways to do this, here is my one step approach, updated & tweaked towards your use case/example (if I'm following your logic?).  I've used this on pretty large datasets with good speed.  One thing to note, one can't guarantee a single geom being returned for a particular minimum distance.  Since ties will and do happen; usually at zero meters (or whatever the units).  I've had this happen plenty, with certain datasets in particular.  Of course issues of tolerance come to mind, which can easily be handled in the outer most Where clause.  Enjoy!  -Mark 

Select t1.id as T1Id, t1.name as T1Name, t1.geometry as T1Geom,
  t2.id as T2Id, t2.name as T2Name, t2.geometry as T2Geom,
  ST_Distance(t1.geometry,t2.geometry) as PairDist
From t1,
  t2 Inner Join
  idx_t2_geometry As t2_idx
  ON t2.PK_UID = t2_idx.pkid
Where t2_idx.xmin <= (MbrMaxX(t1.geometry) + 1000)
  AND t2_idx.xmax >= (MbrMinX(t1.geometry) - 1000)
  AND t2_idx.ymin <= (MbrMaxY(t1.geometry) + 1000)
  AND t2_idx.ymax >= (MbrMinY(t1.geometry) - 1000)
  AND ST_Distance(t1.geometry, t2.geometry) =
  (
  Select Min(ST_Distance(t1_sub.geometry, t2_sub.geometry))
  From t1 As t1_sub,
    t2 As t2_sub
  Where t1_sub.id = t1.id
    AND t2_sub.PK_UID IN
    (
    Select pkid From idx_t2_geometry
    Where pkid MATCH RTreeIntersects(
      MbrMinX(t1_sub.geometry) - 1000,
      MbrMinY(t1_sub.geometry) - 1000,
      MbrMaxX(t1_sub.geometry) + 1000,
      MbrMaxY(t1_sub.geometry) + 1000)
    )
  AND ST_Distance(t1_sub.geometry, t2_sub.geometry) <= 1000
  );

romain

unread,
Feb 16, 2011, 12:21:10 AM2/16/11
to SpatiaLite Users
Hi markb.

Perhaps I've missed something in your query, but it seems to be very
slow:
On my dataset, my query takes 4s to be achieved, whereas the above
query takes more than 4 min ...

Bye

Richard Males

unread,
Nov 13, 2013, 10:31:48 AM11/13/13
to spatiali...@googlegroups.com
I tried Romain's approach, it did not always correctly identify the closest point from table 2.  When I removed the Group By 1 clause, it did.   Perhaps that is why your query runs faster than markb's query (which I have not tested).

Dick

Richard Males

unread,
Nov 13, 2013, 11:29:42 AM11/13/13
to spatiali...@googlegroups.com
Also, I am interested in updating this technique for use with Spatialite 4, where RTreeIntersects is no longer available.  My problem is finding the closest port (from table MajorPorts) given a vessel 'ping', i.e. a location identifier (in Legs200k).   I am working in WGS84, lat/lon.  The best I could think of to bound the spatial search was to use 0.5 degree lat and lon on the bounding box.   Any suggestions much appreciated.

Dick


a.fu...@lqt.it

unread,
Nov 13, 2013, 2:45:50 PM11/13/13
to spatiali...@googlegroups.com
Hi Dick,

I've simply used the Cities1000 worldwide dataset downloaded form
here: http://download.geonames.org/export/dump/cities1000.zip

I've populated a "cities" table containing about 140K Points
(srid=4326), and then I've used the following query in order
to identify which is the nearest city to another one (I've
restricted my search to Italy only so to simplify the
problem: anyway there are about 8K rows corresponding to
Italy)

SELECT country_1, name_1, country_2, name_2,
Min(ST_Distance(geom_1, geom_2)) AS dist
FROM (
SELECT a.country AS country_1, a.name AS name_1,
b.country AS country_2, b.name AS name_2,
a.geom AS geom_1, b.geom AS geom_2
FROM cities AS a, cities AS b
WHERE a.ROWID <> b.ROWID AND b.ROWID IN (
SELECT ROWID FROM SpatialIndex
WHERE f_table_name = 'cities' AND search_frame =
BuildCircleMbr(X(a.geom), Y(a.geom), 0.5))
)
WHERE country_1 = 'IT'
GROUP BY name_1;

few points worth to be highlighted:
- this is an example of how to use the most recent
SpatialIndex instead of the obsolete RTreeIntersects

- why I've used an inner sub-query ?
it's basically simple to explain: when SQLite processes
a Min() or Max() aggregate function, you can safely expect
that any other column value will correspond to the same row
exposing the min (or max) value.
this is surely true if a single table is involved; so
by declaring an inner sub-query I've forced SQLite to
generate first all candidate pairs; the outer query will
then simply process just a single table.

bye Sandro

Richard Males

unread,
Nov 17, 2013, 2:15:34 PM11/17/13
to spatiali...@googlegroups.com
Thank you.  I have tried this approach, but I am afraid that I have failed to properly create the spatial index.   Do I understand correctly that the table to be indexed should have an integer autonumber primary key, be named in lower case, with the geometry column also named in lower case?   Also, I am unclear as to how to examine the spatial index once it has been created, as the notes indicate that SpatialIndex is not directly queryable.   The examples require a parameter for search_frame, and I am not clear as to what that should be.

As well, I am interested in limiting the search to those points that are within 100km, i.e. the closest point in table2 to a point in table1, as long as that point is within 100k geodetic distance.  In my test situation, table1 has 200,000 rows, but the real table has 9 million rows.   table2 has 5000 rows, and is the table that I am trying to spatially index.   I am perfectly happy to process table1 in smaller sets (i.e. 100,000 rows at a time), but again, not sure how best to do that without a lot of manual work.

Sorry for not grasping the best way to do this, all suggestions appreciated.

Dick


a.fu...@lqt.it

unread,
Nov 17, 2013, 3:33:28 PM11/17/13
to spatiali...@googlegroups.com
Hi Dick,

> I have tried this approach, but I am afraid that I have
> failed to properly create the spatial index. Do I understand
> correctly
> that the table to be indexed should have an integer autonumber
> primary
> key, be named in lower case, with the geometry column also named in
> lower case?
>

none of the above is strictly necessary:
- using upper- or lowercase names for tables and columns is not at all
relevant; SQL functions such as AddGeometryColumn() or
CreateSpatialIndex()
will then silently apply any case conversion when and if required.
- using an integer primary key is surely a good idea; otherwise the
spatial index could eventually become corrupted/damaged immediately
after performing a VACUUM; but in this case you can actually recover
a properly working spatial index by invoking CheckSpatialIndex and
RecoverSpatialIndex.
- using an Integer AutoIncrement Primary Key is a requirement for the
QGIS data provider, but there is nothing compelling to absolutely
declare an AutoIncrement clause from pure SQL perspective.


> Also, I am unclear as to how to examine the spatial index
> once it has been created, as the notes indicate that SpatialIndex is
> not directly queryable. The examples require a parameter for
> search_frame, and I am not clear as to what that should be.
>

the spatial index simply is a spatial filter; only the features/rows
declaring an MBR (aka BBOX) intersecting the "search_frame" will
pass such filter.
more precisely: the ROWID corresponding to any row whose Geometry
will intersect the "search frame" will be returned from a Spatial
Index query.

so "search_frame" simply is any generic Geometry intended to set
up a filter MBR (aka BBOX); it could eventually be a Point or
Linestring, but more often it's some Polygon (i.e. a "frame").

if I intend well you are mainly interested in some kind of
filtering based on a distance (i.e. a radius): so the the best
way to take profit from the Spatial Index is to set a square
"search_frame" circumscribing you target circle, then using
ST_Distance() or PtDistWithin() in order to get a very
precise selection.


> As well, I am interested in limiting the search to those points that
> are within 100km, i.e. the closest point in table2 to a point in
> table1, as long as that point is within 100k geodetic distance. In my
> test situation, table1 has 200,000 rows, but the real table has 9
> million rows. table2 has 5000 rows, and is the table that I am trying
> to spatially index.
>

a spatial index can effectively optimize the speed of your query
only if it really is a "very strict/selective" filter.
it's not at all relevant if the target table contains few rows or
100 million rows: the unique relevant factor is how effective is
the filter you are attempting to impose.

a very effective spatial index will extract let say only 1% (or
even less than this) of the whole table while discarding the other
99% because it's clearly not interesting.
on the other hand a spatial index returning about 50% of the
whole table will be practically useless; and a spatial index
returning about 25% of the total rows will probably perform
very poorly.

if your data are reasonably well structured and if your search_frame
is enough restricted/selective, don't be afraid.
a spatial index query will be damn fast even when accessing a
table containing many million rows: it's purposely designed for
a task like this.

bye Sandro
Reply all
Reply to author
Forward
0 new messages