SQLITE_INNOCUOUS vs SQLITE_VTAB_INNOCUOUS

84 צפיות
מעבר להודעה הראשונה שלא נקראה

Even Rouault

לא נקראה,
12 באוק׳ 2023, 12:29:0812.10.2023
עד SpatiaLite Users

Hi Sandro,

I see in https://www.gaia-gis.it/fossil/libspatialite/info/023f3323443fffa1 that you dealt with the issues related to trusted_schema = 0. But I see you used sqlite3_vtab_config(db, SQLITE_INNOCUOUS) in the spatialite virtual tables xConnect method, whereas my understanding of https://www.sqlite.org/c3ref/c_vtab_constraint_support.html is that it should be the SQLITE_VTAB_INNOCUOUS constant (with a VTAB_).

Another thing I noticed when fixing the same set of issues in the OGR Geopackage driver is that we are a bit stuck with the RTree extension itself which is *not* tagged as SQLITE_VTAB_INNOCUOUS, and thus the usual triggers to update the RTree when inserting/updating/deleting row in a feature table don't work on a trusted_schema = 0 database. I've reported that issue in https://sqlite.org/forum/forumpost/f7b7dd72f6 but that didn't get answers from a SQLite developer

Even

-- 
http://www.spatialys.com
My software is free, but my time generally not.

a.fu...@lqt.it

לא נקראה,
12 באוק׳ 2023, 13:02:4212.10.2023
עד spatiali...@googlegroups.com
On Thu, 12 Oct 2023 18:29:05 +0200, Even Rouault wrote:
> Hi Sandro,
>
> I see in
> https://www.gaia-gis.it/fossil/libspatialite/info/023f3323443fffa1
> [1]
> that you dealt with the issues related to trusted_schema = 0. But I
> see you used sqlite3_vtab_config(db, SQLITE_INNOCUOUS) in the
> spatialite virtual tables xConnect method, whereas my understanding
> of
> https://www.sqlite.org/c3ref/c_vtab_constraint_support.html [2] is
> that it should be the SQLITE_VTAB_INNOCUOUS constant (with a VTAB_).
>

Hi Even,

this was in my very first commit; in a second time I discovered from
it doesn't worked for Virtual Tables so I fixed the issue by declaring
SQLITE_VTAB_INNOCUOUS in the next COMMIT.
see:

https://www.gaia-gis.it/fossil/libspatialite/info/fed6fee65742fe84


> Another thing I noticed when fixing the same set of issues in the OGR
> Geopackage driver is that we are a bit stuck with the RTree extension
> itself which is *not* tagged as SQLITE_VTAB_INNOCUOUS, and thus the
> usual triggers to update the RTree when inserting/updating/deleting
> row in a feature table don't work on a trusted_schema = 0 database.
> I've reported that issue in
> https://sqlite.org/forum/forumpost/f7b7dd72f6 [3] but that didn't get
> answers from a SQLite developer
>

yes, I confirm the problem on R*Tree, which effectively makes
SpatiaLite completely unusable when trusted_schema=0

in the meantime I did some tests.
simply compiling the SQLite sources adding a SQLITE_VTAB_INNOCUOUS
directive on the R*Tree module solves everything, but obviously
it's something that SQLite developers should do upstream.


> I've reported that issue in
> https://sqlite.org/forum/forumpost/f7b7dd72f6 [3] but that didn't get
> answers from a SQLite developer
>

I'll try to support your request; let's see if the two of us can be
taken into more consideration.

bye Sandro

Even Rouault

לא נקראה,
12 באוק׳ 2023, 15:00:5512.10.2023
עד spatiali...@googlegroups.com,a.fu...@lqt.it
Sandro,

Your lobbying helped a lot! The SQLite3 devs have just checked in a
change to declare the RTree virtual table as SQLITE_VTAB_INNOCUOUS:
https://sqlite.org/src/info/f34c533b . I guess this will be for sqlite 3.44

Even

Pieter Roggemans

לא נקראה,
18 באוק׳ 2023, 3:51:4218.10.2023
עד SpatiaLite Users
Hmm... Sandro, if you would be in a lobbying mood still...

I've started a sqlite forum post a while ago asking if it would be possible to improve performance of the creation of the rtree index, because I noticed it takes a large part of the time needed to create a Geopackage (or Spatialite) file. There have already been implemented some performance improvements in general based on the post that should land in the next version of sqlite. However, the biggest gains are to be made by using a specific rtree bulk-filling algorithm when the rtree is created/filled initially.

I've implemented a proof-of-concept that fills up similar (but not the binary ones) tables to the ones used by the sqlite rtree index using just sql statements, and this seems to be already 5 to 10 times faster, depending on what you take as baseline measurement.

If you feel like it, check it out ;-): https://sqlite.org/forum/forumpost/7605e7e8fe


Op donderdag 12 oktober 2023 om 21:00:55 UTC+2 schreef Even Rouault:

a.fu...@lqt.it

לא נקראה,
19 באוק׳ 2023, 5:16:0919.10.2023
עד spatiali...@googlegroups.com
On Wed, 18 Oct 2023 00:51:41 -0700 (PDT), Pieter Roggemans wrote:
> I've started a sqlite forum post a while ago asking if it would be
> possible to improve performance of the creation of the rtree index,
> because I noticed it takes a large part of the time needed to create
> a
> Geopackage (or Spatialite) file. There have already been implemented
> some performance improvements in general based on the post that
> should
> land in the next version of sqlite. However, the biggest gains are to
> be made by using a specific rtree bulk-filling algorithm when the
> rtree is created/filled initially.
>

Hi Pieter,

I congratulate you on being able to convince Richard Hipp of SQLite
to introduce significant optimizations to the R*Tree module.
Well done :-D

Coming to the question of bulk loading of R*Tree I did some tests:
here are my results and my conclusions.



dataset and methodology:
---------------------------------

dataset #1: all the House Numbers of Tuscany; 1,522,752 Points

dataset #2: all the Buildings of Tuscany: 2,053,985 Polygons

Both are real-world datasets and are therefore assumed not
to be affected by significant statistical anomalies.

First I derived for each dataset a DB-table containing only
the value of the Primary Key and the coordinates of the MBR
of each geometry (minx, miny, maxx, maxx).

I then applied different row ordering criteria to these tables
so to measure the impact on the bulk loading of the R*Tree.

Note: I measured the times in seconds; the process is slow
enough that greater precision is not required.

I applied three different row sorting criteria when loading
the R*Tree:

A) natural order (the same one I found in the input shapefiles)

B) random order: ... ORDER BY Random()

C) ordered by Hilbert Code (as computed by the HilbertCode()
sql function of SpatiaLite)

Note: I created derived tables already sorted so to exclude
the cost of sorting from the time measurement, which therefore
depend exclusively on the loading of the R*Tree.



results (measured times)
---------------------------------

- Points, not ordered: 18 secs
- Points, random: 41 secs
- Points, Hilbert: 15 secs

- Polygs, not ordered: 19 secs
- Polygs, random: 32 secs
- Polygs, Hilbert: 20 secs


Conclusions: at least from my tests it seems confirmed that
a completely random insertion is decidedly slower (it takes
almost double the time).
However, it does not seem to me that data sorted according
to Hilbert offer significant performance gains.
Considering that the sorting phase according to Hilbert
required between 3 and 5 seconds, the total time of the
two phases does not seem to improve at all.

It should be considered that the initial datasets were
ordered by Province and Municipality, which in some way
is still a spatial ordering criterion, however crude.

Summing up: the level of statistical dispersion of the
input data certainly has an impact on the loading speed
of the R*Tree.
But it seems highly unlikely that implementing a specific
bulk loading feature would have a truly significant
impact when working with "natural" datasets.
objective differences risk being imperceptible.

bye Sandro

Pieter Roggemans

לא נקראה,
19 באוק׳ 2023, 14:10:0019.10.2023
עד spatiali...@googlegroups.com
Yeah, the order the elements are inserted using a row-by-row insert strategy isn't super interesting. There are some performance differences that can be seen, but really big gains (~order of magnitude) can only be expected using a specific "initial bulk insert algorithm".

For your random inserts being slower, did you do the tests on Windows? If so, did you try increasing the sqlite cache memory?

Did you have a look at my SQL script/statements in the last post? It illustrates clearly why a specific initial bulk insert algorithm is so much faster. If you find the time it might be interesting to test it also on your real-life data to compare with the timings you got above.
 
With the currently available algorithm in Sqlite, to fill up the index one by one, for each insert the tree needs to be searched for a good spot to insert it, the buckets need to be split when full, the tree needs to be rebalanced,... all quite expensive operations.
With a good "initial bulk insert algorithm" all this searching, splitting and rebalancing can be avoided, making this a lot faster.

It must be noted that the "standard" rtree algorithm stays important, as inserts/updates/deletes after the initial index creation still need to be handled with the "relatively slow" one-by-one algorithm, so I'm happy some optimizations landed there already. However, in the use cases I use GIS files I don't need insert/update/deletes that often: most of the time I just need to create an index on an existing "fixed" dataset... 


Op do 19 okt 2023 om 11:16 schreef <a.fu...@lqt.it>:
--
You received this message because you are subscribed to the Google Groups "SpatiaLite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spatialite-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/spatialite-users/2a4cc010368f57e27539a95a77831cbc%40lqt.it.

Even Rouault

לא נקראה,
19 באוק׳ 2023, 14:23:2319.10.2023
עד spatiali...@googlegroups.com

If we knew the exact content of the rtree_  _node, _parent and _rowid  tables, we could possibly directly populate them and this should be a massive speed-up if we can generate their content outside of sqlite in a much faster way than sqlite (the doc at https://www.sqlite.org/rtree.html doesn't go into the specifics of the exact content/format of those tables)

I've implemented some time ago in ogr2ogr an optimization that fills the main table in a thread,  while creating the RTree in another one (using classic INSERT INTO the RTree virtual table) in a in-memory database, and at the end it does a "INSERT INTO rtree_XXX_node/parent/rowid FROM SELECT * FROM tempdb.rtree_XXX_node/parent/rowid". The later copy is super fast (like 3 seconds for 3 million records)

a.fu...@lqt.it

לא נקראה,
19 באוק׳ 2023, 14:43:2919.10.2023
עד spatiali...@googlegroups.com
On Thu, 19 Oct 2023 20:09:44 +0200, Pieter Roggemans wrote:
> For your random inserts being slower, did you do the tests on
> Windows?
>

no, I did my tests on a Linux Fedora Virtual Machine
hosted on Windows 10.

> If so, did you try increasing the sqlite cache memory?
>

I performed all tests twice, once using a file-system
based DB (with a default cache size), the other
using a memory based database.
the differences between the one and the other were
very slight.

I performed all tests twice, once using a file-system
based DB (with a default cache size), the other
using a memory based database.
the differences between the one and the other were
very slight.

Generally speaking, I always avoid working with
SpatiaLite directly on Windows.
The performances obtained on a Linux VM are
definitely on another level when the IO
becomes a crtical bottleneck.

I suppose it's largely due to the fact that the
virtualization software allocates very large blocks
of memory on the native system, after which Linux's
superior efficiency wins hands down.

bye Sandro

Pieter Roggemans

לא נקראה,
19 באוק׳ 2023, 15:44:3119.10.2023
עד SpatiaLite Users
It would obviously be better if it would be implemented in sqlite, but an alternative is indeed a separate implementation.

The structure is relatively straightforward. At the start of the rtree.c file is some more info about the content on the rtree tables. It is mainly the node table where some extra binary data is stored. In essence a list of all ids and bounding boxes of the elements in this node:



Op donderdag 19 oktober 2023 om 20:23:23 UTC+2 schreef Even Rouault:

Even Rouault

לא נקראה,
21 באוק׳ 2023, 13:44:1721.10.2023
עד spatiali...@googlegroups.com,Pieter Roggemans

Le 19/10/2023 à 21:44, Pieter Roggemans a écrit :
> It would obviously be better if it would be implemented in sqlite, but
> an alternative is indeed a separate implementation.
>
> The structure is relatively straightforward. At the start of the
> rtree.c file is some more info about the content on the rtree tables.
> It is mainly the node table where some extra binary data is stored. In
> essence a list of all ids and bounding boxes of the elements in this node:
> https://github.com/sqlite/sqlite/blob/ee3c55471ca02af8090f893b4deb461fab53dd35/ext/rtree/rtree.c#L115

Thanks for the hint!

I've 2 good news:

- first: SQLite master has a faster RTree construction. On my favorite
benchmark (https://data.linz.govt.nz/layer/101290-nz-building-outlines/)
with 3.2 million polygons,this decreases the R*Tree generation from 30
seconds to 20 seconds. This is due to commit
https://github.com/sqlite/sqlite/commit/7de8ae22f7f895d5564f2f006d22e2dfddfdf5a7
which disables the force reinsert step of the R*Tree (paragraph 4.3 of
http://www.sai.msu.su/~megera/postgres/gist/papers/Rstar3.pdf). Faster,
at the expense of a slightly less compact / optimal R*Tree

- the other good news is that using the documentation you mentioned
above of the internal content of SQLite R*Tree tables and the
implementation of https://github.com/tidwall/rtree.c which I've modified
with the R*Tree specifics of the SQLite implementation (but that's not
strictly needed. anything that builds a structurally valid RTree is
fine. R*Tree and other variations are just optimizations), I've a
prototype where I can populate manually the SQLite RTree in a way that
makes the SELECT rtreecheck() function happy and works as efficiently as
the one generated by SQLite ! Still on my favorite benchmark, my PoC
creates the RTree in 5 seconds, instead of 20. As far as I can see, this
is mostly a constant factor due to skipping a number of things that
SQLite needs to deal with to work with a limited number of nodes in RAM,
being able to do updates, etc.. The algorithic complexity is the same.
There's still a bit of work to turn my PoC into production code, but I
ultimately anticipate to integrate it into the OGR GeoPackage driver.

Even

Pieter Roggemans

לא נקראה,
21 באוק׳ 2023, 19:25:0421.10.2023
עד SpatiaLite Users
That's indeed great news :-)...

Op zaterdag 21 oktober 2023 om 19:44:17 UTC+2 schreef Even Rouault:

a.fu...@lqt.it

לא נקראה,
22 באוק׳ 2023, 3:36:0322.10.2023
עד spatiali...@googlegroups.com
On Sat, 21 Oct 2023 19:44:12 +0200, Even Rouault wrote:
> I've a prototype where I can populate manually the SQLite RTree in a
> way that makes the SELECT rtreecheck() function happy and works as
> efficiently as the one generated by SQLite ! Still on my favorite
> benchmark, my PoC creates the RTree in 5 seconds, instead of 20. As
> far as I can see, this is mostly a constant factor due to skipping a
> number of things that SQLite needs to deal with to work with a
> limited
> number of nodes in RAM, being able to do updates, etc.. The
> algorithic
> complexity is the same. There's still a bit of work to turn my PoC
> into production code, but I ultimately anticipate to integrate it
> into
> the OGR GeoPackage driver.
>

Hi Even,

very good and very interesting news.

I just have a small suggestion to make: what do you think about
implementing the new functionality for ultra fast loading of an
R*Tree in the form of a dynamic extension to SQLite?

This extension should support just a single SQL function,
something like this:

rtree_bulkload (
rtree_name, -- name of the R*Tree to be created
input_table_name, -- name of the input table containing MBRs
rowid_colname, -- name of the column containing ROWIDs,
minx_colname, -- name if the column in the abova table
-- containining the MinX coord of the MBR
miny_colname, -- MinY
maxx_colname, -- MaxX
maxy_colname); -- MaxY

Where "input_table_name" could be either the name of a real
table or the name of a view.

The main advantage of such an implementation is that the
new functionality would be universally usable across all
platforms (even on Android or iOS) without requiring any
dependencies on other libraries such as GDAL/OGR which
may easily not be available on mobile devices.

Nothing prevents this new extension from being part
of GDAL/OGR, but it would be important if it could
be used on its own independently of the rest of the
library.

bye Sandro

Pieter Roggemans

לא נקראה,
22 באוק׳ 2023, 3:36:4722.10.2023
עד SpatiaLite Users
Something I didn't think of before... but actually there is an RTREE bulk loading algorithm implemented in geos: https://libgeos.org/doxygen/classgeos_1_1index_1_1strtree_1_1STRtree.html#details

Could also be an option...

Op zondag 22 oktober 2023 om 01:25:04 UTC+2 schreef Pieter Roggemans:

Pieter Roggemans

לא נקראה,
22 באוק׳ 2023, 4:01:5422.10.2023
עד SpatiaLite Users
@Even: I now see that you've even made a code change to the STRtree implementation, so you might even know the code (a bit): https://github.com/libgeos/geos/blob/28d70a2e4582edcd3625659f70e31d1853423875/src/index/strtree/SimpleSTRtree.cpp

Op zondag 22 oktober 2023 om 09:36:47 UTC+2 schreef Pieter Roggemans:

Pieter Roggemans

לא נקראה,
22 באוק׳ 2023, 4:32:2322.10.2023
עד SpatiaLite Users
Having "overloads" of such a function to read the bounding boxes directly from the gpkg/spatilite geometry column of the table to index could be interesting as well...

Op zondag 22 oktober 2023 om 09:36:03 UTC+2 schreef a.fu...@lqt.it:

Even Rouault

לא נקראה,
22 באוק׳ 2023, 17:39:2322.10.2023
עד spatiali...@googlegroups.com,a.fu...@lqt.it
Hi,

I've created a library (aimed at being vendored) in
https://github.com/rouault/sqlite_rtree_bulk_load. The
sqlite_rtree_bl_from_feature_table() function is what you specified with
extra needed parameters.

GDAL integration in https://github.com/OSGeo/gdal/pull/8596

Even
השב לכולם
השב למחבר
העבר לנמענים
0 הודעות חדשות