speed to do spatial selection using derived geometries and spatial index

58 views
Skip to first unread message

David Anderson

unread,
Feb 2, 2016, 1:47:11 PM2/2/16
to SpatiaLite Users
I've noticed a significant difference in the time to do a spatial query using a geometry derived via spatial operations

The query that I want to do is to split a polygon by a line, use the left hand portion of the split polygon as the area to select from a larger table that has about 1.7 million records.

select v.*
from vmap4simpplle v,
(select splitleft(g.shape,c.shape) as shape
 
from geoarea_fullvmap g,ContinentalDivideNatlTrail c
 
where g.GEOGRAPHIC_AREA='Divide' and st_intersects(g.shape,c.shape)
) as dw
where instr(species,'PF')
and instr(fmz,'divide')
and st_intersects(dw.shape,v.shape)
and v.rowid in
(select rowid from spatialindex where f_table_name='VMap4SIMPPLLE' and search_frame=dw.shape)
limit
10


I put the limit statement in to try to get the query to return in  a reasonable time.  The limited query takes 36 seconds.  Without the limit it takes longer than I feel like waiting.

After some thought about the results I realized that an exact answer is not that important so I recast the query in this form, removing the st_intersects
select v.*
from vmap4simpplle v,
(select splitleft(g.shape,c.shape) as shape
 
from geoarea_fullvmap g,ContinentalDivideNatlTrail c
 
where g.GEOGRAPHIC_AREA='Divide' and st_intersects(g.shape,c.shape)
) as dw
where instr(species,'PF')
and instr(fmz,'divide')
and v.rowid in
(select rowid from spatialindex where f_table_name='VMap4SIMPPLLE' and search_frame=dw.shape)


This query takes about 3 seconds.


To test things I decided to remove the derived geometry to see the speed of using a stored geometry.
select v.*
from vmap4simpplle v,
(select g.shape as shape
 
from geoarea_fullvmap g,ContinentalDivideNatlTrail c
 
where g.GEOGRAPHIC_AREA='Divide' and st_intersects(g.shape,c.shape)
) as dw
where instr(species,'PF')
and instr(fmz,'divide')
and st_intersects(dw.shape,v.shape)
and v.rowid in
(select rowid from spatialindex where f_table_name='VMap4SIMPPLLE' and search_frame=dw.shape)

This takes less than a second to run.

My question is why using a geometry coming from a table runs far more efficiently than using a geometry derived from an operation and presumably stored in memory?  I wonder how the query optimizer handles having the derived geometry in the st_intersects portion of the query?  Does it have to be recreated numerous times?


I am using the 4.4.0-RC0 version.

David Anderson













a.fu...@lqt.it

unread,
Feb 2, 2016, 6:56:02 PM2/2/16
to spatiali...@googlegroups.com
On Tue, 2 Feb 2016 10:47:11 -0800 (PST), David Anderson wrote:
> I've noticed a significant difference in the time to do a spatial
> query using a geometry derived via spatial operations
>
> The query that I want to do is to split a polygon by a line, use the
> left hand portion of the split polygon as the area to select from a
> larger table that has about 1.7 million records.
>
> ---------------- <snip> ---------------
>
> My question is why using a geometry coming from a table runs far more
> efficiently than using a geometry derived from an operation and
> presumably stored in memory?  I wonder how the query optimizer
> handles having the derived geometry in the st_intersects portion of
> the query?  Does it have to be recreated numerous times?
>
> I am using the 4.4.0-RC0 version.
>

Hi David,

your question is a really interesting one, and requires some
degree of insight about the sqlite's own query planner.

I've tested a table of about 1.2 million polygons (buildings)
using an arbitrary administrative boundary (a single Municipality)
as a spatial filter, and here are my practical findings:

1. directly executing any spatial function like ST_Split() within
the context of a subquery definitely isn't a good idea.
SQLite will then call ST_Split() for every row found into the main
table, and repeatedly executing a computationally heavy function
many and many times will certainly cause a noticeable slowness.
as you discovered by yourself storing somewhere (may be into a
temporary table) the pre-computed result of ST_Split() seems
to be a practical necessity imposed by sqlite itself.

2. there is a second bad new: pretending to filter the same table
by the spatial index _and_ by some other column value at the
same time easily can be another performance killer.
recall: in sqlite/spatialite an R*Tree spatial index isn't
at all an _index_, its imply is another table as any other.
under such circumstances the sqlite's optimizer will
always decide to build an autoindex on the most promising
column, and will then use that column index in order to
scan the table; the spatial index sub-query will be then
evaluated only in a second time, but this will negate
any benefit from using the spatial index itself.
(it really depends on the actual distribution of key
values, but in the average case you'll almost certainly
get very bad performances).

SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE d_ty_edi = 'campanile' AND
ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape);
... about 2 seconds

SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE d_ty_edi = 'chiesa/basilica' AND
ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape);
... about 12 seconds

SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE d_ty_edi = 'edificio industriale' AND
ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape);
... about 1 minute

it's interesting to note that in my test dataset there
are 1.500 'campanile' (bell tower), 7.500 'chiesa' (church)
and 65.000 'edificio industriale' (industrial building);
the query time grows as the number of the candidates to
be evaluated by ST_Intersects() grows; the spatial index
is practically ignored by the query planner and plays no
role at all.

in this case we can luckily apply an useful SQL trick:
we'll simply add an effectless GROUP BY clause on the
Primary Key, then filtering any other column value in the
HAVING clause, and not in the WHERE clause as before
(so they'll be evaluated only _after_ creating the
resultset, and will have no negative impact on the query
planner decisions):

SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape)
GROUP BY e.pk_uid
HAVING d_ty_edi = 'campanile';
... about 6 seconds

SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape)
GROUP BY e.pk_uid
HAVING d_ty_edi = 'chiesa/basilica';
... about 6 seconds

SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape)
GROUP BY e.pk_uid
HAVING d_ty_edi = 'edificio industriale';
... about 6 seconds

now we effectively have nicely _flat_ timings;
we are really exploiting the someway cumbersome
(but very fast) R*Tree spatial index at the best
of its possibilities.

SQLite usually confirms to be a fast performer;
anyway writing efficient SQL queries on SQLite is
not always so easy and simple as you can expect on
other "heavy-weight" DBMSes.
It's more an art than a science; and it usually is
a very crude "bare metal" activity.

bye Sandro

David Anderson

unread,
Feb 3, 2016, 4:12:16 PM2/3/16
to SpatiaLite Users
Sandro,
Thanks for the thorough and insightful response.
For your point 1, your response was what I had suspected was going on.  I was hoping that someone more knowledgeable than me about SQLite knew a method that did not involve a temporary table.  Doing the logic in one SQL statement is to me more elegant than a two step solution with a temporary table and SQL statement.  Sometimes speed wins over elegance.

Your solution to point 2 is both elegant and quick.  I haven't run across that use of the group by and having clause.  A useful trick to be sure.

I spent some time tinkering with the query because I couldn't figure out why it was running so slow.  I thought a few seconds to do the split function, then a few seconds to do the search would be about right.  A cup of coffee later the original query is still running.

Anyhow, once again thanks for the help.

David
Reply all
Reply to author
Forward
0 new messages