SELECT count(*) FROM edges_500k e, areawater a WHERE e.railflg = 'Y'
AND Intersects(e.Geometry, a.Geometry) AND e.ROWID IN (SELECT pkid
FROM idx_edges_500k_Geometry WHERE xmin < MbrMaxX(a.Geometry) AND xmax
> MbrMinX(a.Geometry) AND ymin < MbrMaxY(a.Geometry) AND ymax >
MbrMinY(a.Geometry));
The result is correct but the query is running for about 10 hours
while the same query on postgis takes 3 seconds.
Well, the datasets might be rather large for spatialite (edges_500k
has 500'000 linestrings and areawater has 380'000 polygons), but am I
doing something wrong or is there any possibility to improve this
query?
I've tried reproducing your query, actually using
a much more simpler dataset (8,000 points / 100
polygons).
I wasn't able to identify any problem: your SQL
seems to be absolutely valid (and it actually
uses the SpatialIndex).
Let me know any further advancement on this topic.
bye
Sandro
Given these facts I think there's no solution to enhance performance -
except when someone implements a "real" spatial index in C!?
If you or anyone want(s) to reproduce the query with the same data
look the "HSR Texas Geo Database Benchmark" we are developing:
http://gis.hsr.ch/wiki/HSR_Texas_Geo_Database_Benchmark
Regards, S.
**
** this reply is quite long: excuse me
** keep it as a 'real world' case study explained step by step ...
** until the final HAPPY END
**
I've just loaded a SpatiaLite DB using your Tiger-Texas
sample dataset.
Pentium4 2.8GHz, 1GB RAM, 80GB sata HDD, Ubuntu 8.04 32 bit
I constantly monitored the CPU usage [using top] during
the whole process.
A few preliminary impressions:
a) the dataset is really huge: about 3GB
sqlite/spatialite is a 'personal', not an 'enterprise'
DBMS: yes, it works anyway, but we are stressing it to
the extreme limit it can afford.
b) sqlite adopts no special memory caching at all: it
fully relies on standard system management for I/O
buffers handling. As a rule-of-the-thumb, we can
expect sqlite to perform optimally only when the
DB size is comparable to RAM size.
But in my case [3GB DB, only 1 GB RAM] I noticed
several sluggish stalls caused by memory swapping:
and this obviously severely impairs performance.
My 5 cents bet: using a >= 4GB RAM system we can
circumvent this bottleneck.
Performing the test:
==============
Query 3. Spatial Selection I:
Intersect Point
---------------
SELECT count(*)
FROM gnis_names09 AS p
WHERE ST_Intersects(BuildMbr(-103.208,27.435,-96.891,33.460),
p.geometry);
please note: there is no real need compelling us to use the R*Tree:
this
table contains only 10^5 rows (the infamous < 1million rows rule)
Accordingly to this, we can notice a strong 'hot cache' effect:
performing
twice this query will run quite instantaneously the second time
[showing no HDD traffic at all].
anyway:
SELECT count(*)
FROM gnis_names09 AS p
WHERE ST_Intersects(BuildMbr(-103.208,27.435,-96.891,33.460),
p.geometry)
AND p.ROWID IN (SELECT pkid FROM idx_gnis_names09_geometry
WHERE xmin <= -96.891 AND ymin <= 33.460
AND xmax >= -103.208 AND ymax >= 27.435);
Intersect Line
---------------
SELECT count(*)
FROM edges_merge AS l
WHERE ST_Intersects(BuildMbr(-102.169,29.279,-97.930,30.515),
l.geometry);
this one is an huge table (5.8 million rows): slow running query
[strong
memory swapping], and no 'hot cache' effect at all, because the
table's size
exceeds the available RAM.
SELECT count(*)
FROM edges_merge AS l
WHERE ST_Intersects(BuildMbr(-102.169,29.279,-97.930,30.515),
l.geometry)
AND l.ROWID IN (SELECT pkid FROM idx_edges_merge_geometry
WHERE xmin <= -97.930 AND ymin <= 30.515
AND xmax >= -102.169 AND ymax >= 29.279);
all right, the R*Tree works as expected [and I notice a really strong
'hot cache' effect]
Intersect Polygons
---------------
SELECT count(*)
FROM areawater_merge AS pg
WHERE ST_Intersects(BuildMbr(-102.789,28.775,-97.3102,33.174),
pg.geometry);
SELECT count(*)
FROM areawater_merge AS pg
WHERE ST_Intersects(BuildMbr(-102.789,28.775,-97.3102,33.174),
pg.geometry)
AND pg.ROWID IN (SELECT pkid FROM idx_areawater_merge_geometry
WHERE xmin <= -97.3102 AND ymin <= 33.174
AND xmax >= -102.789 AND ymax >= 28.775);
really, no big difference in using the R*Tree or not: this table
only contains < half-million rows
---
conclusion:
the R*Tree-based spatial index is actually useful in order to speed-up
spatially based queries: most notably on huge tables.
please note: during the above tests I noticed that 'top' always
reported
a 80%/100% spatialite activity: quite obviously, the Linux IO
handling
works at its best, and the disk traffic isn't a bottleneck.
*********************************************************************
let's now go deep into the *problem*
Query 5. Spatial Selection III: Intersect/Join Lines and Polygons
-----
SELECT count(*)
FROM edges_merge AS l, areawater_merge AS pg
WHERE l.railflg = 'Y'
AND ST_Intersects(l.Geometry, pg.Geometry)
AND l.ROWID IN (SELECT pkid
FROM idx_edges_merge_Geometry
WHERE xmin <= MbrMaxX(pg.Geometry) AND xmax >= MbrMinX(pg.Geometry)
AND ymin <= MbrMaxY(pg.Geometry) AND ymax >= MbrMinY(pg.Geometry));
YES: confirmed. this query runs in an awful sluggish mode.
it's not at all surprising you've measured a 10 hours time.
[I killed the process after just a few minutes]
But the cause accounting for this is really easy to be
identified: 'top' reports 0%/10% spatialite's CPU activity.
and I notice an endless disk traffic.
Obviously, we are experiencing a catastrophic bottleneck
at the physical I/O level.
After some further testing 'at random' I discovered something
really interesting: the huge disk traffic (poor CPU usage) is
caused by the WHERE l.railflg = 'Y' clause.
Omitting this, I was finally able to see spatialite using the
CPU at full power (100%). Simply dropping the BTree index
on edges_merge.railflg has an identical effect.
It looks like if SQLite get confused in accessing both the
BTree and the RTree simultaneously.
I strongly suspect this one is a memory-related issue: may
well be, if I had 4 GB RAM this will not happen at all.
Anyway, this doesn't seems to be resolutive at all: this query
is still performing very badly.
*********************************************************************
Serendipity :-)
(finding the solution just by accident)
your original query is malformed; you are trying to:
1) extract the 'few' areawater first
2) then querying the 'many' edges using the R*Tree
Simply reverting the order will resolve this issue :-)
SELECT count(*)
FROM edges_merge AS l, areawater_merge AS pg
WHERE l.railflg = 'Y'
AND ST_Intersects(l.Geometry, pg.Geometry)
AND pg.ROWID IN (SELECT pkid
FROM idx_areawater_merge_Geometry
WHERE xmin <= MbrMaxX(l.Geometry) AND xmax >= MbrMinX(l.Geometry)
AND ymin <= MbrMaxY(l.Geometry) AND ymax >= MbrMinY(l.Geometry));
please note; now we are trying to:
1) extract the 'many' edges' first
2) then querying the 'few' areawater using the R*Tree
may well be we aren't setting a new world speed record,
but now this query completes in a really reasonable time.
oddly enough, in this case dropping the edges_merge.railflg
BTree index seems to have no (or very small) effect.
lesson to learn: SQLite has lots and lots of idiosyncrasies
Today I've discovered several aspects I've never suspected before
Honestly, I never hoped SpatiaLite could be able to handle
a such heavy-weighted dataset: but after all, this elementary
simple 2MB basecode can get us really very, very far away :-) ;)
Bye
Sandro
From 10 hours to 9 seconds with one long post, awesome :-)
This performance is surprisingly impressive. Thanks a lot for the
insights.
Extending the tutorial with such an example would probably be helpful
for others too.
Regards
Dominik