Really slow query

760 views
Skip to first unread message

wilddom

unread,
Feb 17, 2010, 3:05:16 AM2/17/10
to SpatiaLite Users
I do have the following query which results in the number of bridges
(rail tracks crossing water).

SELECT count(*) FROM edges_500k e, areawater a WHERE e.railflg = 'Y'
AND Intersects(e.Geometry, a.Geometry) AND e.ROWID IN (SELECT pkid
FROM idx_edges_500k_Geometry WHERE xmin < MbrMaxX(a.Geometry) AND xmax
> MbrMinX(a.Geometry) AND ymin < MbrMaxY(a.Geometry) AND ymax >
MbrMinY(a.Geometry));

The result is correct but the query is running for about 10 hours
while the same query on postgis takes 3 seconds.
Well, the datasets might be rather large for spatialite (edges_500k
has 500'000 linestrings and areawater has 380'000 polygons), but am I
doing something wrong or is there any possibility to improve this
query?

a.furieri

unread,
Feb 18, 2010, 12:31:10 PM2/18/10
to SpatiaLite Users
Hi,

I've tried reproducing your query, actually using
a much more simpler dataset (8,000 points / 100
polygons).
I wasn't able to identify any problem: your SQL
seems to be absolutely valid (and it actually
uses the SpatialIndex).

Let me know any further advancement on this topic.

bye
Sandro

Stefan

unread,
Feb 23, 2010, 5:09:32 AM2/23/10
to SpatiaLite Users
Thank you for the tests.
Let me summarize what you answered elsewhere:
- The main goal of SpatiaLite is a simple, lightweight engine to be
used as an efficient storage medium for desktop (GUI) GIS apps.
- SpatiaLite - and SQLite as well - are very fast when a single table
is queried. When two or more tables are joined together this isn't any
longer assured.
- In SQLite an R*Tree Spatial Index simply is an ordinary table. The
query optimizer has no idea about it. Writing queries based on the
R*Tree is completely under the user responsibility.

Given these facts I think there's no solution to enhance performance -
except when someone implements a "real" spatial index in C!?

If you or anyone want(s) to reproduce the query with the same data
look the "HSR Texas Geo Database Benchmark" we are developing:
http://gis.hsr.ch/wiki/HSR_Texas_Geo_Database_Benchmark

Regards, S.

a.furieri

unread,
Feb 23, 2010, 2:19:30 PM2/23/10
to SpatiaLite Users
Hi Stefan,

**
** this reply is quite long: excuse me
** keep it as a 'real world' case study explained step by step ...
** until the final HAPPY END
**

I've just loaded a SpatiaLite DB using your Tiger-Texas
sample dataset.
Pentium4 2.8GHz, 1GB RAM, 80GB sata HDD, Ubuntu 8.04 32 bit
I constantly monitored the CPU usage [using top] during
the whole process.

A few preliminary impressions:

a) the dataset is really huge: about 3GB
sqlite/spatialite is a 'personal', not an 'enterprise'
DBMS: yes, it works anyway, but we are stressing it to
the extreme limit it can afford.
b) sqlite adopts no special memory caching at all: it
fully relies on standard system management for I/O
buffers handling. As a rule-of-the-thumb, we can
expect sqlite to perform optimally only when the
DB size is comparable to RAM size.
But in my case [3GB DB, only 1 GB RAM] I noticed
several sluggish stalls caused by memory swapping:
and this obviously severely impairs performance.
My 5 cents bet: using a >= 4GB RAM system we can
circumvent this bottleneck.

Performing the test:
==============
Query 3. Spatial Selection I:
Intersect Point
---------------
SELECT count(*)
FROM gnis_names09 AS p
WHERE ST_Intersects(BuildMbr(-103.208,27.435,-96.891,33.460),
p.geometry);

please note: there is no real need compelling us to use the R*Tree:
this
table contains only 10^5 rows (the infamous < 1million rows rule)
Accordingly to this, we can notice a strong 'hot cache' effect:
performing
twice this query will run quite instantaneously the second time
[showing no HDD traffic at all].

anyway:

SELECT count(*)
FROM gnis_names09 AS p
WHERE ST_Intersects(BuildMbr(-103.208,27.435,-96.891,33.460),
p.geometry)
AND p.ROWID IN (SELECT pkid FROM idx_gnis_names09_geometry
WHERE xmin <= -96.891 AND ymin <= 33.460
AND xmax >= -103.208 AND ymax >= 27.435);

Intersect Line
---------------
SELECT count(*)
FROM edges_merge AS l
WHERE ST_Intersects(BuildMbr(-102.169,29.279,-97.930,30.515),
l.geometry);

this one is an huge table (5.8 million rows): slow running query
[strong
memory swapping], and no 'hot cache' effect at all, because the
table's size
exceeds the available RAM.

SELECT count(*)
FROM edges_merge AS l
WHERE ST_Intersects(BuildMbr(-102.169,29.279,-97.930,30.515),
l.geometry)
AND l.ROWID IN (SELECT pkid FROM idx_edges_merge_geometry
WHERE xmin <= -97.930 AND ymin <= 30.515
AND xmax >= -102.169 AND ymax >= 29.279);

all right, the R*Tree works as expected [and I notice a really strong
'hot cache' effect]

Intersect Polygons
---------------
SELECT count(*)
FROM areawater_merge AS pg
WHERE ST_Intersects(BuildMbr(-102.789,28.775,-97.3102,33.174),
pg.geometry);

SELECT count(*)
FROM areawater_merge AS pg
WHERE ST_Intersects(BuildMbr(-102.789,28.775,-97.3102,33.174),
pg.geometry)
AND pg.ROWID IN (SELECT pkid FROM idx_areawater_merge_geometry
WHERE xmin <= -97.3102 AND ymin <= 33.174
AND xmax >= -102.789 AND ymax >= 28.775);

really, no big difference in using the R*Tree or not: this table
only contains < half-million rows

---
conclusion:
the R*Tree-based spatial index is actually useful in order to speed-up
spatially based queries: most notably on huge tables.

please note: during the above tests I noticed that 'top' always
reported
a 80%/100% spatialite activity: quite obviously, the Linux IO
handling
works at its best, and the disk traffic isn't a bottleneck.

*********************************************************************

let's now go deep into the *problem*
Query 5. Spatial Selection III: Intersect/Join Lines and Polygons
-----
SELECT count(*)
FROM edges_merge AS l, areawater_merge AS pg
WHERE l.railflg = 'Y'
AND ST_Intersects(l.Geometry, pg.Geometry)
AND l.ROWID IN (SELECT pkid
FROM idx_edges_merge_Geometry
WHERE xmin <= MbrMaxX(pg.Geometry) AND xmax >= MbrMinX(pg.Geometry)
AND ymin <= MbrMaxY(pg.Geometry) AND ymax >= MbrMinY(pg.Geometry));

YES: confirmed. this query runs in an awful sluggish mode.
it's not at all surprising you've measured a 10 hours time.
[I killed the process after just a few minutes]

But the cause accounting for this is really easy to be
identified: 'top' reports 0%/10% spatialite's CPU activity.
and I notice an endless disk traffic.
Obviously, we are experiencing a catastrophic bottleneck
at the physical I/O level.

After some further testing 'at random' I discovered something
really interesting: the huge disk traffic (poor CPU usage) is
caused by the WHERE l.railflg = 'Y' clause.
Omitting this, I was finally able to see spatialite using the
CPU at full power (100%). Simply dropping the BTree index
on edges_merge.railflg has an identical effect.
It looks like if SQLite get confused in accessing both the
BTree and the RTree simultaneously.
I strongly suspect this one is a memory-related issue: may
well be, if I had 4 GB RAM this will not happen at all.

Anyway, this doesn't seems to be resolutive at all: this query
is still performing very badly.

*********************************************************************

Serendipity :-)
(finding the solution just by accident)

your original query is malformed; you are trying to:
1) extract the 'few' areawater first
2) then querying the 'many' edges using the R*Tree

Simply reverting the order will resolve this issue :-)

SELECT count(*)
FROM edges_merge AS l, areawater_merge AS pg
WHERE l.railflg = 'Y'
AND ST_Intersects(l.Geometry, pg.Geometry)
AND pg.ROWID IN (SELECT pkid
FROM idx_areawater_merge_Geometry
WHERE xmin <= MbrMaxX(l.Geometry) AND xmax >= MbrMinX(l.Geometry)
AND ymin <= MbrMaxY(l.Geometry) AND ymax >= MbrMinY(l.Geometry));

please note; now we are trying to:
1) extract the 'many' edges' first
2) then querying the 'few' areawater using the R*Tree

may well be we aren't setting a new world speed record,
but now this query completes in a really reasonable time.

oddly enough, in this case dropping the edges_merge.railflg
BTree index seems to have no (or very small) effect.

lesson to learn: SQLite has lots and lots of idiosyncrasies
Today I've discovered several aspects I've never suspected before

Honestly, I never hoped SpatiaLite could be able to handle
a such heavy-weighted dataset: but after all, this elementary
simple 2MB basecode can get us really very, very far away :-) ;)

Bye
Sandro

wilddom

unread,
Feb 24, 2010, 7:05:53 AM2/24/10
to SpatiaLite Users
Hi Sandro

From 10 hours to 9 seconds with one long post, awesome :-)
This performance is surprisingly impressive. Thanks a lot for the
insights.
Extending the tutorial with such an example would probably be helpful
for others too.

Regards
Dominik

markb

unread,
Apr 21, 2010, 8:17:14 PM4/21/10
to SpatiaLite Users
I can confirm that b-tree usage does commonly interfere with r-tree
index speed gains. I'd say that the r-tree isn't used at times, but
EXPLAIN QUERY PLAN says otherwise. I've re-written several patience-
challenging queries over a dozen different ways at times, to see
wildly different results. I have several Group By queries that never
seem to optimize correctly (particularly if expecting the significant
performance gain that Sandro extracted from this much more massive
dataset). Though, I'm aware that some not unreasonably complex SQL
queries seem to find the chink in the SQLite armor. I don't know how
prevalent r-tree usage is in the SQLite community, but us spatial
people have got to be nearing the top of the list (I'm not aware of
the existence of other SQLite projects similar to SpatiaLite
though).

I wonder if we all could figure out a way to do some benchmarking /
bug checking in order to get to the bottom of some of these
irregularities, and if necessary submit a few well defined fixes to
the SQLite devs? I'd be more than happy create to create some sample
data and queries to facilitate a little r-tree/b-tree problem-solving,
if anyone is interested that is?

People were tweeting about this tread, and I think we should keep it
alive. Thanks -Mark
--
You received this message because you are subscribed to the Google Groups "SpatiaLite Users" group.
To post to this group, send email to spatiali...@googlegroups.com.
To unsubscribe from this group, send email to spatialite-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/spatialite-users?hl=en.

ctl101

unread,
Apr 14, 2017, 9:19:16 AM4/14/17
to SpatiaLite Users
A very interesting thread. I am using the updated virtualspatialindex method apparent in spatialite versions of more recent years. Do you know if your outcomes still apply, and with a modern pc spec? Might you have any up to date advice regarding bringing down the time it takes to carry out the intersection with large datasets? Regards, Chris
Reply all
Reply to author
Forward
0 new messages