A way to use ST_Intersects and RTreeIntersects with a lot of results

453 views
Skip to first unread message

da be

unread,
Sep 8, 2022, 8:07:44 PM9/8/22
to SpatiaLite Users
Hi,
I'm using spatialite to check intersection of two tables (using your guide's code). This works great for most of the use cases.
But when one layer contains a lot of features that intersects all of the features in the second layer the query can take a lot of time, and sometime the process must be killed.

For example, suppose I have one layer contain 10K buildings, and a second layer contains 10K of large rectangles, that covers the entire area where the buildings exists.

My question:
Is there more efficient way to make such queries?
If not, Is there a way to identify ahead of time and prevent or limit such catastrophic queries?

PS. I don't need row for each intersection, it's okay to only know the ids (distinct) of the features that have (any) intersection from each table.

Thank a lot

Brad Hards

unread,
Sep 8, 2022, 9:03:51 PM9/8/22
to spatiali...@googlegroups.com
On Friday, 9 September 2022 10:07:44 AM AEST da be wrote:

> Is there more efficient way to make such queries?

I would not expect the performance to be that bad.
It might help if you can show your query and explain more about your data. If
the rectangles are basically everything, merge them first.

However I'd also try it without the RTree. Does that help?


> If not, Is there a way to identify ahead of time and prevent or limit such
> catastrophic queries?

You'll always be able to construct a query that has bad performance. Avoiding
it is probably domain / use-case specific.

> PS. I don't need row for each intersection, it's okay to only know the ids
> (distinct) of the features that have (any) intersection from each table.

Is the RTree intersection close enough?

Brad


da be

unread,
Sep 8, 2022, 9:52:37 PM9/8/22
to SpatiaLite Users
Thanks for your reply.

> It might help if you can show your query and explain more about your data. If
the rectangles are basically everything, merge them first.

I'm using this in a service where I have no control on which kind of data users may upload, therefore I can't know if features can be merged.
I can limit users input feature count (10k per layer in my example above) and can limit maximum number of results (in my example above, 20k) but the time to get the maximum possible result is varying largely based on the shape of the input features.

CODE:
```
SELECT DISTINCT t1.src_id from 
     table1 AS t1, 
     table2 as t2
WHERE ST_Intersects(t1.geom, t2.geom)
AND t2.id IN (
  SELECT pkid FROM idx_table2_geom
     WHERE pkid MATCH RTreeIntersects(
        MbrMinX(t1.geom),
        MbrMinY(t1.geom),
       MbrMaxX(t1.geom),
       MbrMaxY(t1.geom)))
UNION
<the same but with DISTINCT t2.src_id>
```

> However I'd also try it without the RTree. Does that help?
This does help with this extreme example but badly hurts performance with most of the use cases.

> Is the RTree intersection close enough?
Unfortunately no, but would it help if it was? 

a.fu...@lqt.it

unread,
Sep 9, 2022, 1:40:01 AM9/9/22
to spatiali...@googlegroups.com
On Thu, 8 Sep 2022 18:52:37 -0700 (PDT), da be wrote:
>
> SELECT DISTINCT t1.src_id from
> table1 AS t1,
> table2 as t2
>
> WHERE ST_Intersects(t1.geom, t2.geom)
>
> AND t2.id IN (
> SELECT pkid FROM idx_table2_geom
> WHERE pkid MATCH RTreeIntersects(
> MbrMinX(t1.geom),
> MbrMinY(t1.geom),
> MbrMaxX(t1.geom),
> MbrMaxY(t1.geom)))
> UNION
>
> ```

just few quick comments:

1. using DISTICT, GROUP BY or ORDER BY in a
Spatial Query using the RTree Spatial
Index could easily cause slowness.
This could mislead the SQLite Query Optimizer
by inducing it to use other indexes with higher
priority than the Spatial Index which thus ends
up being completely ignored.

2. the same is for UNION

short conclusion: a query strongly based on
the RTree Spatial Index should always be the
most simple as possibile to be really efficient.

a basically simple and really usefull trick:

a) always avoid to write very complex queries
based on the RTree Spatial Index

b) write instead more simpler queries each one
of them storing its partial results into a
temporary table

c) only in a very final step collect alltogether
the above temporary tables so to refine the
preliminary results in a definitive form.
put any aventual DISTINCT / GROUP BY / ORDER
BY or UNION in this final pass.

you'll easily find that running many simpler queries
instead of one complex query very often takes a
significantly lesser cumulative execution time.

===================================================

last remark:

> WHERE ST_Intersects(t1.geom, t2.geom)
>
> AND t2.id IN (
> SELECT pkid FROM idx_table2_geom
> WHERE pkid MATCH RTreeIntersects(
> MbrMinX(t1.geom),
> MbrMinY(t1.geom),
> MbrMaxX(t1.geom),
> MbrMaxY(t1.geom))
>

note that the above is a very old and obsolete
syntax for querying the Spatial Index.
since many years the modern syntax is:

WHERE ST_Intesects(t1.geom, t2.geom)
AND t2.id IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name = 'table2'
AND serch_frame = t1.geom)

bye Sandro



mj10777

unread,
Sep 9, 2022, 2:10:55 AM9/9/22
to SpatiaLite Users
Whereby the (more expensive/time consuming) ST_Intesects should come after the SpatialIndex portion, so that ST_Intesects will only be executed against likely candidates (the SpatialIndex having filtered out those that are not) and speed up the result (having less to do).

WHERE 
 t2.id IN 
 (
  SELECT rowid FROM SpatialIndex
  WHERE 
  (
   (f_table_name = 'table2') AND
   (search_frame = t1.geom)
  ) 
 ) AND
 ST_Intesects(t1.geom, t2.geom)

Mark 


bye Sandro



da be

unread,
Sep 9, 2022, 5:13:59 AM9/9/22
to SpatiaLite Users
Thanks you very much sandro and Mark.
I will use your comments to rewrite my code but first I want to understand it completely.

> 1. using DISTICT, GROUP BY or ORDER BY in a
Spatial Query using the RTree Spatial
Index could easily cause slowness.
This could mislead the SQLite Query Optimizer
by inducing it to use other indexes with higher
priority than the Spatial Index which thus ends
up being completely ignored.

Probably I'm missing something but if I don't use the DISTINCT in the inner  intersection query I could end up with 100M (10k X 10k) rows with the extreme use case mentioned in the example above. Is there a way to achieve the effect of the DISTINCT in the intersection query without using it? 

> b) write instead more simpler queries each one
of them storing its partial results into a
temporary table

My code actually use CTE for the inner intersection query (I simplified the code above). Is this OK or should I create actual table?

Thank you very much!

mj10777

unread,
Sep 9, 2022, 5:43:05 AM9/9/22
to SpatiaLite Users
On Friday, 9 September 2022 at 11:13:59 UTC+2 ddd01...@gmail.com wrote:
Thanks you very much sandro and Mark.
I will use your comments to rewrite my code but first I want to understand it completely.

> 1. using DISTICT, GROUP BY or ORDER BY in a
Spatial Query using the RTree Spatial
Index could easily cause slowness.
This could mislead the SQLite Query Optimizer
by inducing it to use other indexes with higher
priority than the Spatial Index which thus ends
up being completely ignored.

Probably I'm missing something but if I don't use the DISTINCT in the inner  intersection query I could end up with 100M (10k X 10k) rows with the extreme use case mentioned in the example above. Is there a way to achieve the effect of the DISTINCT in the intersection query without using it? 

> b) write instead more simpler queries each one
of them storing its partial results into a
temporary table

My code actually use CTE for the inner intersection query (I simplified the code above). Is this OK or should I create actual table?
Try this (not tested)

CREATE TEMPORARY TABLE temp_t1_t2_scr_id AS 
SELECT 
 t1.src_id,
 t2.src_id 
FROM
 table1 AS t1, 
 table2 AS t2
WHERE 
 t2.id IN 
 (
  SELECT rowid FROM SpatialIndex
  WHERE 
  (
   (f_table_name = 'table2') AND
   -- return only results within BoundingBox of search_frame
   (search_frame = t1.geom)
  ) 
 ) AND
 ( -- execute only when within BoundingBox
  ST_Intesects(t1.geom, t2.geom)
 )

You could then query temp_t1_t2_scr_id, retrievingthe DISTINCT values without further spatial (expensive) activity.

The temp table should exist untill the connection is closed.

da be

unread,
Sep 9, 2022, 8:30:50 AM9/9/22
to SpatiaLite Users
Mark thanks. 
This seems to work but unless I did something completely wrong, there is still  the problem of 100M rows in the temporary table (in which I only need maximum of 20k distinct rows). This prevents me of doing things in in-memory db, and it causes the queries to run in an unpredictable (sometimes very long) time. It happens only in extremely rare queries (such as this I mentioned above) but unfortunately such queries can crash my server.

So I would ask again (and I'm sorry if this is a stupid question):
Is there is a way to efficiently retrieve and store only distinct (by id) intersected features?
Or if not, how do you recommend to identify such queries before running them?

mj10777

unread,
Sep 9, 2022, 9:14:06 AM9/9/22
to SpatiaLite Users
On Friday, 9 September 2022 at 14:30:50 UTC+2 ddd01...@gmail.com wrote:
Mark thanks. 
This seems to work but unless I did something completely wrong, there is still  the problem of 100M rows in the temporary table (in which I only need maximum of 20k distinct rows). This prevents me of doing things in in-memory db, and it causes the queries to run in an unpredictable (sometimes very long) time. It happens only in extremely rare queries (such as this I mentioned above) but unfortunately such queries can crash my server.
Unless you buy some very specific equipment from Diagon Alley, you will have to accept the fact that in between a large subset (with large memory/hard disk consumption) will exist.

As for the identification of such queries before running them, again only specialists from Diagon Alley can help you. But they can only see into the future what the results (that must nevertheless run at some point) will bring.

We can only assist you in making these queries as efficient as possible when running against your existing sources.

Here, I'm afraid, it is a matter on how your existing sources are organized which doesn't to be optional for your needs.
 

So I would ask again (and I'm sorry if this is a stupid question):
Is there is a way to efficiently retrieve and store only distinct (by id) intersected features?
You may have to create seperate tabels with these results, that are re-created in a needed timeframe (hourly, daily, weekly etc) as needed..
Or if not, how do you recommend to identify such queries before running them?
See above.. 

Pieter Roggemans

unread,
Sep 9, 2022, 12:46:08 PM9/9/22
to SpatiaLite Users
I don't know if it can help you in this specific case, and I didn't test it, but when using subselects sqlite tends to rewrite the queries by "flattening" them... possibly with the union's you get a similar behaviour

You can "nudge" the query optimizer not to flatten subqueries by using LIMIT -1 OFFSET 0. As an example, a query where this construct is used can be found here:

Op vrijdag 9 september 2022 om 15:14:06 UTC+2 schreef mj10777:

a.fu...@lqt.it

unread,
Sep 9, 2022, 2:35:40 PM9/9/22
to spatiali...@googlegroups.com
On Fri, 9 Sep 2022 09:46:08 -0700 (PDT), Pieter Roggemans wrote:
> I don't know if it can help you in this specific case, and I didn't
> test it, but when using subselects sqlite tends to rewrite the
> queries
> by "flattening" them... possibly with the union's you get a similar
> behaviour
>
> You can "nudge" the query optimizer not to flatten subqueries by
> using
> LIMIT -1 OFFSET 0. As an example, a query where this construct is
> used
> can be found here:
>
> https://github.com/geofileops/geofileops/blob/master/geofileops/util/_geoops_sql.py#L1080
>

Pieter,

this is very interesting to know.

going in further depth: SQLite isn't really a DBMS (and this
is pretty obvious, just consider how microscopic is its
binary code).
in effects it's a virtual machine interpreting a special
set of op-codes specifically intended to support very
low-level data access primitives.

The Query Optimizer module is kind of a compiler:
it parses SQL statements checking for correctness,
and then translating it in a small program based
on op-codes that finally will be executed.

this is an astonishing architecture usually ensuring
very fast performances.
but if the Query Planner becomes confused you'll surely
get a deadly slow query; and one of the surest methods
for confusing it is writing too much complex
SQL expressions.
and this explains why breaking some huge query into
many smaller and simples queries usually ensures
an impressing performance boost.

-------------------------------

that's not all: one the first actions of the Query
Planer is to _REWRITE_ your SQL, so to flatten and
simplify the syntax as much as possible.
short conclusione: what you've carefully coded
and whar is actually executed not necessarily
is the same.

final critical details.
first of all the Query Optimizer can decide to
automatically create some temporaty index whenever
it thinks to be usefull; but this could be a very
dangerous option because can completely distort
your real intentions.

second critical point: the RTree Spatial Index
is often paramount for fast Spatial Queries.
but for the Query Planner an RTree isn't at all
an _INDEX_, it's just another _TABLE_ as any other.
and this explains why too much complex SQL statements
could easily ignore the Spatial Index, thus leading
to deadly slowliness.

---------------------------

your best friends.

regularly use EXPLAIN and EXPLAIN QUERY PLAN
in order to check how your SQL will be actually
translated into a sequece of op-codes before
going under execution.

it could be boring and difficult, but the
secret of your success with SQLite is here.

never forget: SQLite isn't a DBMS as any other,
it's something of radically different.
writing a SQL query in a form or another, although
syntactically equivalent, could easily have a
dramatic impact on performances.

bye Sandro

Pieter Roggemans

unread,
Sep 9, 2022, 5:30:59 PM9/9/22
to SpatiaLite Users
Hey Sandro,

I agree with everything you wrote except one detail :-): I think you overestimate "real" RDMS's with the following sentence: "never forget: SQLite isn't a DBMS as any other, it's something of radically different. Writing a SQL query in a form or another, although syntactically equivalent, could easily have a dramatic impact on performances."

Obviously on a lot of fronts Sqlite isn't a full blown RDMS at all and is very different, but specifically the concepts you explained very well in your message (as far as I know) apply just as well to Oracle.
Some time in the past I've done quite some development on Oracle (spatial) and obviously the oracle optimizer is (a lot) more advanced and has a lot more features (both to optimize + to be influenced by the developer by using hit texts). But, also on oracle queries that return the same data but are witten differently will/can get a very different explain plan (eg. using "in" vs. "exists" vs. "join" vs. "subselect" vs. "with",...) I've encountered many queries on oracle where some rewriting or in some cases using hint texts improved performance 100x or more... 
Also on oracle explain plan is indeed a really useful tool to optimize slow queries...

Anyway: I would only add 2 things:
1) The following page is an interesting read. It is also the page where I found the "hint" that you can use LIMIT + OFFSET to avoid subquery flattening/unnesting: https://www.sqlite.org/optoverview.html 
2) Premature optimization is the root of all evil: it is interesting to read the page I linked to, but for >90% of queries you won't need it and should't think too hard about it. 

Regards,
Pieter

Op vrijdag 9 september 2022 om 20:35:40 UTC+2 schreef sandro furieri:

Pieter Roggemans

unread,
Sep 9, 2022, 5:46:10 PM9/9/22
to SpatiaLite Users
PS: thank you for developing spatialite: I'm a big fan!

Op vrijdag 9 september 2022 om 23:30:59 UTC+2 schreef Pieter Roggemans:
Reply all
Reply to author
Forward
0 new messages