On Tue, 2 Feb 2016 10:47:11 -0800 (PST), David Anderson wrote:
> I've noticed a significant difference in the time to do a spatial
> query using a geometry derived via spatial operations
>
> The query that I want to do is to split a polygon by a line, use the
> left hand portion of the split polygon as the area to select from a
> larger table that has about 1.7 million records.
>
> ---------------- <snip> ---------------
>
> My question is why using a geometry coming from a table runs far more
> efficiently than using a geometry derived from an operation and
> presumably stored in memory? I wonder how the query optimizer
> handles having the derived geometry in the st_intersects portion of
> the query? Does it have to be recreated numerous times?
>
> I am using the 4.4.0-RC0 version.
>
Hi David,
your question is a really interesting one, and requires some
degree of insight about the sqlite's own query planner.
I've tested a table of about 1.2 million polygons (buildings)
using an arbitrary administrative boundary (a single Municipality)
as a spatial filter, and here are my practical findings:
1. directly executing any spatial function like ST_Split() within
the context of a subquery definitely isn't a good idea.
SQLite will then call ST_Split() for every row found into the main
table, and repeatedly executing a computationally heavy function
many and many times will certainly cause a noticeable slowness.
as you discovered by yourself storing somewhere (may be into a
temporary table) the pre-computed result of ST_Split() seems
to be a practical necessity imposed by sqlite itself.
2. there is a second bad new: pretending to filter the same table
by the spatial index _and_ by some other column value at the
same time easily can be another performance killer.
recall: in sqlite/spatialite an R*Tree spatial index isn't
at all an _index_, its imply is another table as any other.
under such circumstances the sqlite's optimizer will
always decide to build an autoindex on the most promising
column, and will then use that column index in order to
scan the table; the spatial index sub-query will be then
evaluated only in a second time, but this will negate
any benefit from using the spatial index itself.
(it really depends on the actual distribution of key
values, but in the average case you'll almost certainly
get very bad performances).
SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE d_ty_edi = 'campanile' AND
ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape);
... about 2 seconds
SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE d_ty_edi = 'chiesa/basilica' AND
ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape);
... about 12 seconds
SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE d_ty_edi = 'edificio industriale' AND
ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape);
... about 1 minute
it's interesting to note that in my test dataset there
are 1.500 'campanile' (bell tower), 7.500 'chiesa' (church)
and 65.000 'edificio industriale' (industrial building);
the query time grows as the number of the candidates to
be evaluated by ST_Intersects() grows; the spatial index
is practically ignored by the query planner and plays no
role at all.
in this case we can luckily apply an useful SQL trick:
we'll simply add an effectless GROUP BY clause on the
Primary Key, then filtering any other column value in the
HAVING clause, and not in the WHERE clause as before
(so they'll be evaluated only _after_ creating the
resultset, and will have no negative impact on the query
planner decisions):
SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape)
GROUP BY e.pk_uid
HAVING d_ty_edi = 'campanile';
... about 6 seconds
SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape)
GROUP BY e.pk_uid
HAVING d_ty_edi = 'chiesa/basilica';
... about 6 seconds
SELECT e.* FROM edifici AS e,
(SELECT geometry AS shape FROM com) AS dw
WHERE ST_Intersects(dw.shape, e.geom) = 1
AND e.rowid in (
SELECT rowid
FROM spatialindex
WHERE f_table_name='edifici' AND
search_frame = dw.shape)
GROUP BY e.pk_uid
HAVING d_ty_edi = 'edificio industriale';
... about 6 seconds
now we effectively have nicely _flat_ timings;
we are really exploiting the someway cumbersome
(but very fast) R*Tree spatial index at the best
of its possibilities.
SQLite usually confirms to be a fast performer;
anyway writing efficient SQL queries on SQLite is
not always so easy and simple as you can expect on
other "heavy-weight" DBMSes.
It's more an art than a science; and it usually is
a very crude "bare metal" activity.
bye Sandro