Nearest Neighbour Problem: Customer nearest Water Pipes

NiceDogE

unread,

Apr 17, 2009, 1:24:15 AM4/17/09

to SpatiaLite Users

Firstly, new to SQLite/SpatiaLite but well done Alessandro on a great
extension! Seems a perfect fit for some of our custom spatial data
analysis tools.

My problem is a variation on Nearest Neighbour.

I have ~300,000 customers (Polygons) and ~160,000 water pipes
(Polylines) across a City. For each customer, I need to find the
nearest water pipe.

I'm wondering what SQL will give the fastest results. I tried a
sample below, but I'm not sure this is the appropriate way of trying
to use a spatial index.

I realise the sample isn't the full SQL req'd, in this case I'm just
trying to solve the main headache in first getting an approximate (ie.
using MBRs?) 1<->Many table of Customer <->Nearest Pipes within a user-
defined search radius (for example below, 30 metres)

CREATE VIEW NearestPipes AS
SELECT Customers.Service_ID,WaterPipes.PipeID
FROM Customers,WaterPipes WHERE
WaterPipes.RowID IN(SELECT pkid FROM idx_WaterPipes_Geometry WHERE
MbrOverLaps(BuildMBR(xmin-30,ymin-30,xmax+30,ymax
+30,28356),Customers.Geometry))

NiceDogE

unread,

Apr 20, 2009, 3:27:10 AM4/20/09

to SpatiaLite Users

To add to this, managed to do a bit of trial and error on different
methods.

Following observations:
1. SpatiaLite seems to HATE having a geometric operator in the SQL
accessing the spatial index virtual table. (eg. Using the MBROverlaps
() operator in above example) Performance seems to be slower than if
chose to not use the spatial index at all. Best performance yielded
if use simple

2. Don't know whether this is from SQLite or the SpatiaLite
extensions, but the nested SQL statement pulling pkIDs from the
spatial index table HATES "OR" operators.

For example, the below sample is what I considered to have the most
efficient syntax:

SELECT Customers.Service_ID, Pipes.Label, X(Customers.Geometry)-30 As
CustXmin, Y(Customers.Geometry)-30 As CustYmin, X(Customers.Geometry)
+30 As CustXmax, Y(Customers.Geometry)+30 As CustYmax
FROM Customers, Pipes
WHERE Pipes.RowID IN(Select pkID From idx_Pipes_Geometry WHERE
((xmin>CustXmin AND xmin<CustXmax) OR
(xmax>CustXmin AND xmax<CustXmax) OR
(xmin<CustXmin AND xmax>CustXmax)) AND
((ymin>CustYMin AND ymin<CustYmax) OR
(ymax>CustYMin AND ymax<CustYmax) OR
(ymin<CustYMin AND ymax>CustYmax)))
AND Crosses(BuildMBR(CustXmin,CustYmin,CustXmax,CustYmax,
28356),Pipes.Geometry)

However, this had very slow performance.

Weirdly (to me at least, but SQL newbie), using an expansion method +
using UNION SELECT yields a much faster query. Doing this in this
solution has 3 x 3 permutations. So re-writing the above I get this:

SELECT Customers.Service_ID, Pipes.Label, X(Customers.Geometry)-30 As
CustXmin, Y(Customers.Geometry)-30 As CustYmin, X(Customers.Geometry)
+30 As CustXmax, Y(Customers.Geometry)+30 As CustYmax
FROM Customers, Pipes
WHERE Pipes.RowID IN(Select pkID From idx_Pipes_Geometry WHERE
xmin>CustXmin AND xmin<CustXmax AND ymin>CustYMin AND ymin<CustYmax
UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmin>CustXmin AND xmin<CustXmax AND ymax>CustYMin AND ymax<CustYmax
UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmin>CustXmin AND xmin<CustXmax AND ymin<CustYMin AND ymax>CustYmax

UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmax>CustXmin AND xmax<CustXmax AND ymin>CustYMin AND ymin<CustYmax
UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmax>CustXmin AND xmax<CustXmax AND ymax>CustYMin AND ymax<CustYmax
UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmax>CustXmin AND xmax<CustXmax AND ymin<CustYMin AND ymax>CustYmax

UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmin<CustXmin AND xmax>CustXmax AND ymin>CustYMin AND ymin<CustYmax
UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmin<CustXmin AND xmax>CustXmax AND ymax>CustYMin AND ymax<CustYmax
UNION SELECT pkID From idx_Pipes_Geometry WHERE
xmin<CustXmin AND xmax>CustXmax AND ymin<CustYMin AND ymax>CustYmax)
AND Crosses(BuildMBR(CustXmin,CustYmin,CustXmax,CustYmax,
28356),Pipes.Geometry)

As I said, seems to be much faster than SQL statment above it, but
have no idea why(?). Still interested if anyone has a better
performing solution!

Alessandro Furieri

unread,

Apr 21, 2009, 9:27:21 AM4/21/09

to spatiali...@googlegroups.com

I tried to reproduce the problem you pose.
So I loaded a DB using:
- PopulatedPlaces: about 5,000 polygons
[footprints for Tuscany's towns and villages]
- Streets: about 17,000 linestrings
[Tuscany's highways and first class roads]

The problem I actually tried to solve was:
"Which ones streets may be associated to each populated place
within a 100 meters range ?"

More or less it seems quite the same of your original problem
involving pipes and customers.

===============================================================

First I tried a "raw" brutal force approach;
not using any Spatial Index at all

SELECT pp.Name, st.Name, Distance(pp.Geometry, st.Geometry)
FROM PopulatedPlaces AS pp, Streets AS st
WHERE Distance(pp.Geometry, st.Geometry) < 100.0

After waiting some 20 minutes I decided to kill SpatiaLite,
because it was quite obvious such a solution requires a long
time [and obviously, lots and lots of CPU cycles] to be computed.

--------------------

So I tried a "sophisticated" approach ...
I built an R*Tree Spatial Index for Streets.Geometry,
and then I executed the following SQL query:

SELECT pp.Name, st.Name, Distance(pp.Geometry, st.Geometry)
FROM PopulatedPlaces AS pp, Streets AS st
WHERE Distance(pp.Geometry, st.Geometry) < 100.0 AND st.ROWID IN
(
SELECT pkid FROM idx_Streets_Geometry
WHERE xmin < (MbrMaxX(pp.Geometry) + 100) AND
xmax > (MbrMinX(pp.Geometry) - 100) AND
ymin < (MbrMaxY(pp.Geometry) + 100) AND
ymax > (MbrMinY(pp.Geometry) - 100)
)

this time I was able to get a result set in about 1 minute;
and this is not at all surprising, because in this case I was
taking full profit from the R*Tree Spatial Index.

bye, Sandro

NiceDogE

unread,

Apr 23, 2009, 3:25:30 AM4/23/09

to SpatiaLite Users

Thankyou so much Alessandro!

I'm actually pleased that my approach was very close to your
approach. Both my syntax and your syntax was looking in the Spatial
Index table for overlaps between the Pipe/Streets indexed MBRs, and
extended MBRs from Customers/PopulatedPlaces.

However, your syntax is a LOT more efficient than mine for checking
overlaps in the X and Y dimensions. Also, performance-wise, your
syntax was about twice as fast as executing.

As a sidenote, there is another technique I've used previously on the
Nearest Neighbour problem that can yield even more speed.

In my real case, the maximum search radius from customer to pipe will
be around 500 metres, but in most cases the distance is less than 6
metres. Programatically, the best method I've found to get good speed
is NOT to search out for pipes in a 500 metre radius in the first
pass. The best method seems to be to iteratively run SQL statements
that gradually expand the "search" MBR bounding box from a very small
distance to larger distances.

Then use the returned results to UPDATE an "Customers.IsProcessed"
Field. Then run the next SELECT query with an expanded search MBR
bounding box but ignoring Customers that have already been previously
processed.

Nominally I have previously done this in about 10 iterations with
progressing larger search bounding boxes.

But you've helped me solve the hardest part of the problem, thanks
again!

Reply all

Reply to author

Forward