speeding up a gunion query

92 views
Skip to first unread message

David Anderson

unread,
Aug 18, 2015, 6:46:01 PM8/18/15
to SpatiaLite Users
I am using gunion to build some GIS data from modeling results.
Data set 1 is a table with a process for each polygon for each time period. 
Data set 2 is the table with the geometry data.
Here is the query
Process is a lookup table.

insert into table3 timestep,process,shape)

select a.timestep,p.process,gunion(g.shape)

from data_set2 a, data_set2 g, simpplle.process p

where cast(a.slink as integer)=g.simple_id and p.PROCESS like '%FIRE%' and p.ID=a.PROCESS_ID

group by a.timestep,p.process

                      

The query works just fine.  Data is generated like I expect.  The issue is that for a data set of a hundred thousand polygons, 7 processes and 50 timestepd, it takes a very long time to generate the data set.


I should mention that I am using the geopackage geometry format.


Is there a way to speed things up?  Can the spatial index be used?  Should the data be put in to native Spatialite format for the gunion operation?





a.fu...@lqt.it

unread,
Aug 19, 2015, 5:13:28 AM8/19/15
to spatiali...@googlegroups.com
On Tue, 18 Aug 2015 15:46:01 -0700 (PDT), David Anderson wrote:
> I am using gunion to build some GIS data from modeling results.
> Data set 1 is a table with a process for each polygon for each time
> period. 
> Data set 2 is the table with the geometry data.
> Here is the query
> Process is a lookup table.
>
> insert into table3 timestep,process,shape)
> select a.timestep,p.process,gunion(g.shape)
> from data_set2 a, data_set2 g, simpplle.process p
> where cast(a.slink as integer)=g.simple_id and p.PROCESS like
> '%FIRE%' and p.ID=a.PROCESS_ID
> group by a.timestep,p.process
>
> The query works just fine.  Data is generated like I expect.  The
> issue is that for a data set of a hundred thousand polygons, 7
> processes and 50 timestepd, it takes a very long time to generate the
> data set.
>

Hi David,

your query doesn't seems to contain any obvious pitfall and leaves
practically no margin for further optimization.
ST_Union intrinsically is a very costly (and slow) operation, and
"a hundred thousand polygons" is a rather demanding dataset: all
this considered a long execution time can be easily forecasted.


> I should mention that I am using the geopackage geometry format.
> Is there a way to speed things up? Can the spatial index be used?
> Should the data be put in to native Spatialite format for the gunion
> operation?
>

a. the Spatial Index can't help at all in a problem like this; there
is no spatial selection to be performed, and the Spatial Index
simply is a spatial filter.
b. using the native binary format will possibly help, but only in
a very marginal way.

you could anyway attempt to deploy an alternative approach by
splitting your problem in two distinct steps: this will surely
help to understand better what really happens behind the scenes,
and will possibly introduce some performance benefit
(the SQLite's own query optimizer sometimes does crazy choices)

step #1
--------------
CREATE TEMP TABLE aux_table AS
SELECT a.timestep AS timestep, p.process AS process,
ST_Collect(g.shape) AS shape
FROM data_set2 AS a, data_set2 AS g, simpplle.process AS p
WHERE CastToInteger(a.slink) = g.simple_id
AND p.process LIKE '%FIRE%' AND p.id = a.process_id
GROUP BY a.timestp, p.process;

this first step will simply create a temporary (intermediate)
table; ST_Collect() is a reasonably fast operation (polygons
will not really be dissolved: just a MultiPolygon collection
will be aggregated), so this first step will be expected to
complete in a reasonably short time.


step #2
-------------
INSERT INTO table3 (timestep, process, shape)
SELECT timestep, process, ST_UnaryUnion(shape)
FROM aux_table;

this final step will actually perform the hard work by
dissolving all polygons.
the time required by this second query will be obviously
"non-compressible", and will give you a direct measure of
the intrinsic computational weight of your problem.

bye Sandro

David Anderson

unread,
Aug 21, 2015, 4:10:35 PM8/21/15
to SpatiaLite Users
Sandro,

My testing indicates that is the collecting that is the expensive step.  The st_collect statement took 691 seconds.  The union of the collected data took 119 seconds. Which is the opposite of what was expected.  For comparison just doing the union was 817 seconds.  It seems the 80% of the time is spent collecting the data into a multipolygon.


David

 

a.fu...@lqt.it

unread,
Aug 21, 2015, 4:40:04 PM8/21/15
to spatiali...@googlegroups.com
Hi David,

all right, now we've positively identified the main slowness cause:

select a.timestep,p.process,gunion(g.shape)
from data_set2 a, data_set2 g, simpplle.process p
WHERE cast(a.slink as integer)=g.simple_id AND p.PROCESS LIKE '%FIRE%'
AND p.ID=a.PROCESS_ID
group by a.timestep,p.process;

the WHERE clause of your original query is clearly suffering from
very poor optimization, and will probably end up in performing many
useless full table scans on behalf of the biggest table (the one
containing hundredth thousand rows).

the most obvious care in such a situation is by creating some
appropriate index so to allow the query planner to choose a more
efficient data access strategy.

I completely ignore if your tables are already supported by any
Primary Key and/or index, anyway a.process_id seems to be a
very promising candidate to be indexed.

bye Sandro

David Anderson

unread,
Aug 21, 2015, 6:01:35 PM8/21/15
to SpatiaLite Users
Sandro,

I too thought the where class might be an issue. There are indexes on the columns in this clause  cast(a.slink as integer)=g.simple_id
Doing this statement to build the data table from the joins is quick, 10 seconds or so to generate 247,000 rows.
create table temp_gis
as
select a.timestep,p.process,.shape

from data_set2 a, data_set2 g, simpplle.process p
WHERE cast(a.slink as integer)=g.simple_id AND p.PROCESS LIKE '%FIRE%'
       AND p.ID=a.PROCESS_ID


Then running this statement to get the data grouped together
create table temp_collect

as

select timestep,process,st_collect(castautomagic(shape)) as shape

from temp_gis

group by timestep,process


takes 588 seconds.

Then this query to do the union
create table temp_union

as

select timestep,process,st_unaryunion(shape) as shape

from temp_collect

takes 113 seconds.


This corroborates that the st_collect part of gathering the data together is the time consuming part of the operation.


The union decreases the geometry count from anywhere from 0% to 90%.


David


Reply all
Reply to author
Forward
0 new messages