[postgis-users] Slow construction of GiST index, but better with smaller # of big rows

531 views
Skip to first unread message

Wenbo Tao

unread,
Jan 12, 2019, 11:29:25 AM1/12/19
to postgi...@lists.osgeo.org
Hello,

    I was trying to build a GiST index on a geometry column in a table with 1 billion rows. It took an entire week to finish. 

    Then I reduced the number of rows by grouping closer objects into one clump (using some clustering algorithm), and then compressed the clump as one row (the geometry column becomes the bounding box of all objects in that clump). The construction then went way faster -- down to 12 hours. I did this because the query I need to answer is finding all objects whose bbox intersects with a given rectangle. I can now query all clumps whose bbox intersects with the rectangle. 

   So essentially, the index construction is slow for too many rows, but much faster for a smaller # of bigger rows. Any intuition why this is the case would be greatly appreciated!

Thank you,
Wenbo Tao

Bo Guo

unread,
Jan 13, 2019, 12:00:49 PM1/13/19
to postgi...@lists.osgeo.org

Wenbo,

The law of physics is in play here.  I think your approach is creative and valid.  I wish Postgis offered grid-base spatial index which grouped geometry BBOXs in grids with user defined levels /sizes.  A couple of other thoughts:

1. You may also look in to table partitioning to physically breakup the large table.

2. In addition, depend on how your data is used and whether or not the data is static, vector-tiling/cacheing the geometry on disk (out-side of database) may help.

Bo

_______________________________________________
postgis-users mailing list
postgi...@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

Giuseppe Broccolo

unread,
Jan 13, 2019, 12:38:43 PM1/13/19
to PostGIS Users Discussion
Hi Wembo,

Well, building GiST indexes requires an execution time that grows linearly with the size of the dataset (~O(N)). Of course, also hardware (CPU, storage, ...) impacts the build. So long execution times for one billion rows sound reasonable.

Your solution could be fine: you cluster close objects and index the obtained rows, than you can retrieve the clusters themselves and finally find the exact match. Of course, it is not an "elegant" solution.

You already had the suggestion to partition your table, and then index the single partitions, that could be completely fine.

A second suggestion I would like to give you, is to consider BRIN indexing, thought specifically for large datasets:

https://postgis.net/docs/using_postgis_dbmanagement.html#brin_indexes

Of course, there are some limitations with this index, so I invite you to read the linked documentation and consider your specific use case. But for intersections between bbox (and your case looks to be the case), BRINs could be a really good solution.

Hope this can help,
Giuseppe.

Paul Norman

unread,
Jan 13, 2019, 8:59:32 PM1/13/19
to postgi...@lists.osgeo.org
On 2019-01-12 8:28 a.m., Wenbo Tao wrote:
>  So essentially, the index construction is slow for too many rows, but
> much faster for a smaller # of bigger rows. Any intuition why this is
> the case would be greatly appreciated!


GiST indexes work on bounding boxes only, so the portion of the row that
matters to the index is the same size per row. I would expect the two
factors that matter most for index size and generation speed to be
number of rows and number of pages. You've reduced the number of rows
and possibly the number of pages, so it's expected the index will be
smaller and faster to generate.

Martin Davis

unread,
Jan 14, 2019, 4:43:10 PM1/14/19
to PostGIS Users Discussion
As others said, indexing time is proportional to table size (actually more like O( N logN )).   So if the number of rows is reduced the index build time will decrease.

In a recent blog post Paul listed some ideas about how index build performance could be improved via parallelization [1].  But that will require some changes in Postgres.  (Although give the recent flurry of parallelization enhancements, maybe it won't be long now).

It would be nifty if GIST trees could be packed/bulk-loaded using one of the Rtree packing approaches (STR-tree or Hilbert) [2].  That should be faster than one-by-one insertion.  Although might not be as amenable to parallelization.



Felix Kunde

unread,
Jan 15, 2019, 4:45:17 PM1/15/19
to postgi...@lists.osgeo.org, PostGIS Users Discussion
As already mentioned by others, the more rows the longer it takes to build the index.
 
In preparation for a conference talk, I'm currently testing the performance of different spatial indexes that PostGIS offers: GiST, sp-GiST and BRIN. I'm also testing with 1 billion randomly generated points, lines and polygons and it took me around 6 hours to build the GiST index on my laptop. sp-GIST was usually twice as fast.
 
So far, I can say that GiST tends to slightly faster for large data sets than sp-GIST and often twice as fast as BRIN. The big advantage of BRIN imo is that it's fast to build (6hrs vs. 3,5 min) and requires hardly any disk space (50 GB vs. 3,6 MB) due to its simplicity. Just give it a try and see if the executions times are fine for you. Note, that it can be necessary to execute SET enable_seqscan = false before. In my case, only then did the query planner recognize my BRIN index.
 
For optimal performance your spatial can be clustered on disk. For this, I created a functional btree index transforming the geometries with ST_GeoHash. Btree indexes can be created in parallel with the latest Postgres version (took me 30min). Unfortunately, the CLUSTER command then takes a long time (8hrs on my machine). In the end my test query (a simple containment test) took around 250ms with BRIN (and 220ms with GIST and sp-GIST).
 
lg Felix
 
 
Gesendet: Sonntag, 13. Januar 2019 um 18:38 Uhr
Von: "Giuseppe Broccolo" <g.broc...@gmail.com>
An: "PostGIS Users Discussion" <postgi...@lists.osgeo.org>
Betreff: Re: [postgis-users] Slow construction of GiST index, but better with smaller # of big rows

Paul Norman

unread,
Jan 15, 2019, 5:07:22 PM1/15/19
to postgi...@lists.osgeo.org
On 2019-01-15 1:45 p.m., Felix Kunde wrote:
> For optimal performance your spatial can be clustered on disk. For
> this, I created a functional btree index transforming the geometries
> with ST_GeoHash. Btree indexes can be created in parallel with the
> latest Postgres version (took me 30min). Unfortunately, the CLUSTER
> command then takes a long time (8hrs on my machine).


If you're loading up the data a method faster than CLUSTER is to create
a new table with CREATE TABLE foo AS SELECT * FROM foo_tmp ORDER BY
ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C"

With CLUSTER you have to create a temporary index, rewrite the table,
and then recreate that temporary index, only to drop it again.

Stephen V. Mather

unread,
Jan 15, 2019, 5:08:57 PM1/15/19
to PostGIS Users Discussion
In addition to the aforementioned solutions and recommendations (this is a fun thread), if your geometry are points than you could leverage the pointcloud extension which would do a bit of the clustering for you by storing the points in patches.

http://sig.cmparks.net/cmp-ms-90x122.pngStephen V. Mather
GIS Manager
(216) 635-3243 (Work) 
--sent from phone--


Felix Kunde

unread,
Jan 16, 2019, 4:47:37 AM1/16/19
to postgi...@lists.osgeo.org
That''s smarter, indeed. Requires more disk space though, but comes without table locks.
In my case, I wanted to track the execution times for CLUSTER operations, too - in order to tell up to which amount of rows it's a viable option.
 
So then, creating an GeoHash (Z-Curve) ordered table and BRIN-index the geom column is the fastest option
... in case your data is static. Otherwise performance of the index will degrade sooner or later. 
 
Gesendet: Dienstag, 15. Januar 2019 um 23:07 Uhr
Von: "Paul Norman" <peno...@mac.com>
An: postgi...@lists.osgeo.org

Betreff: Re: [postgis-users] Slow construction of GiST index, but better with smaller # of big rows

Martin Davis

unread,
Jan 17, 2019, 5:14:56 PM1/17/19
to PostGIS Users Discussion
Possibly relevant - a presentation on how BRIN indexes can provide better performance and reduce storage for very large point datasets:


On Sat, Jan 12, 2019 at 8:29 AM Wenbo Tao <taowen...@gmail.com> wrote:

Simon Greener

unread,
Jan 21, 2019, 12:23:19 AM1/21/19
to PostGIS Users Discussion
Folks,

I'm in a situation where I need to ST_Union or ST_Collect some osm_county Polygon (not MultiPolygon) data for Ireland.

Now, because ST_Union or ST_Collect do not support geography, I cast to geometry before calling.

select min(a.osm_id) as osm_id,
       a.name,
       count(*) as parts, 
       ST_Union(a.geog4326::geometry)::geography as geog4326 -- or ST_Collect
from data.osm_county as a
group by a.name;

Whence I get this:

ERROR: lwgeom_area_spher(oid) returned area < 0.0

Investigating I get results like this:

select distinct st_isvalidreason(a.geog4326::geometry) from data.osm_county as a;

"Hole lies outside shell[-10.2589459 53.9746452]"
etc

I guess this is expected because geodetic lines in the source geography are being treated as straight in the cast'd geometry.

If I use ST_Transform to project a 4326 poly to a 3857 and then call the ST_Union aggregate, or identify a single geography that has the invalidity and execute a self-union, I get the following in both situations.

ERROR: GEOSUnaryUnion: TopologyException: Input geom 0 is invalid: Hole lies outside shell at or near point -1148162.9982628345 7095296.1166736316 at -1148162.9982628345 7095296.1166736316

I can't for the life of me work out how to complete the aggregated ST_Union on the 4326 geography data.

Anyone point out what I am doing wrong or give me a pointer to what I can do to achieve the aggregated union?
 
Regards
Simon
--------------------------------------------------------------------------------------------------------
Spatial Advice & Solutions Architecture
Database Spatial Stored Procedure Designer

Oracle Spatial, SQL Server, PostGIS, MySQL, ArcSDE FME
Awarded "2011 Oracle Spatial Excellence Award for Education and Research"
A: 39 Cliff View Drive, Allens Rivulet, 7150, Tas, Aust
W: www.spdba.com.au
E: si...@spdba.com.au
V: +61 362 396 397
M: +61 418 396 391
GITC Supplier: T1005
Skype: sggreener
Long: 147.20515 (147° 12' 18" E)
Lat: -43.01530 (43° 00' 55" S)
GeoHash: r22em9r98wg
NAC:W80CK 7SWP3

Birgit Laggner

unread,
Jan 21, 2019, 7:27:29 AM1/21/19
to PostGIS Users Discussion
Hi Simon,

I assume, you are trying to get this all done in one query, right? Did you try to throw ST_MakeValid into the mix? Like: ST_Transform(ST_Union(ST_MakeValid(ST_Transform(a.geog4326::geometry, 3857))), 4326)::geography

Regards,
Birgit




Von: "Simon Greener" <si...@spatialdbadvisor.com>

An: "PostGIS Users Discussion" <postgi...@lists.osgeo.org>
Gesendet: Montag, 21. Januar 2019 06:23:10
Betreff: [postgis-users] Geography/geometry

Paul Ramsey

unread,
Jan 21, 2019, 11:41:28 AM1/21/19
to PostGIS Users Discussion
If it’s just Ireland, transform to a good Irish projection after casting to geometry. For maximum awsomeness, transform to gnomonic, in which great circles are represented as straight lines… then even all the introduced points will be at the correct geographic locations. Works for smaller areas, like Ireland.

P

Martin Davis

unread,
Jan 21, 2019, 1:01:00 PM1/21/19
to PostGIS Users Discussion
Simon, can you post one (or more) of the geographies which are being reported as invalid when you run   select distinct st_isvalidreason(a.geog4326::geometry) ?

It's certainly possible for reprojection to introduce invalidities, due to geodesic/straight line issue.  I'm actually surprised this doesn't show up more often as an issue.

Simon Greener

unread,
Jan 21, 2019, 4:00:59 PM1/21/19
to PostGIS Users Discussion, Martin Davis
The shapefile (from OSM) can be accessed here:


The command used to load is:

ogr2ogr -overwrite -progress -skipfailures -a_srs "EPSG:4326" ^
  -f "PostgreSQL" PG:"dbname='gisdb' host='localhost' port='5432' user='postgres'  ^
  gis_osm_county_a_free_1.shp -nln data.osm_county -nlt POLYGON ^
-lco LAUNDER=YES -lco GEOMETRY_NAME=geog4326 -lco GEOM_TYPE=geography -lco FID=ID -lco DIM=2 -lco SPATIAL_INDEX=YES

Am I missing something with the ogr2ogr command for the load?

When a geography is casted to geometry for use in the aggregate form of ST_Union (or other commands such as ST_Within) is the processing done assuming the data is projected even when geodetic?

regards
Simon
--

Paul Ramsey

unread,
Jan 21, 2019, 4:13:42 PM1/21/19
to PostGIS Users Discussion

> On Jan 21, 2019, at 1:00 PM, Simon Greener <si...@spatialdbadvisor.com> wrote:
>
> When a geography is casted to geometry for use in the aggregate form of ST_Union (or other commands such as ST_Within) is the processing done assuming the data is projected even when geodetic?

All processing of geometry make cartesian assumptions. Processing of geography using geography native functions (no casting required) makes spherical assumptions. When you case from geography::geometry you are telling the database “please process this data using cartesian assumptions”. Do we support spherical processing for all (or even many) kinds of operations? No we do not. That’s hard. Could we in the future? Sure, it might happen.

ATB,
P

Darafei "Komяpa" Praliaskouski

unread,
Jan 21, 2019, 4:20:52 PM1/21/19
to PostGIS Users Discussion
If someone is willing, there is an old Geography Overlays patch that needs lots of love: https://github.com/postgis/postgis/pull/191
--
Darafei Praliaskouski

Paul Ramsey

unread,
Jan 21, 2019, 4:21:46 PM1/21/19
to PostGIS Users Discussion
Not really sure that’s the answer… I’d much rather see work on geodetic edges in GEOS

P

Simon Greener

unread,
Jan 21, 2019, 4:28:55 PM1/21/19
to PostGIS Users Discussion, Paul Ramsey
Paul,

I can't find a suitable gnomic srid for PostGIS.

I have tried this:

INSERT INTO spatial_ref_sys (srid, proj4text) VALUES (55000, '+proj=gnom +lat_0=90 +lon_0=-7.646288');

but I get this when I use ST_Transform.

UPDATE data.osm_county SET geom55000 = ST_Transform(geog4326::geometry,55000);

AddToPROJ4SRSCache: could not parse proj4 string '+proj=gnom +lat_0=90 +lon_0=-7.646288' major axis or radius = 0 or not given

I understand your position about things being hard.

At the moment I would be satisfied with a way of taking OSM data and successfully (aka no topological errors) projecting it so that I can achieve what I need to do the projected geometry data.

regards
Simon
--

Martin Davis

unread,
Jan 21, 2019, 4:30:18 PM1/21/19
to Simon Greener, PostGIS Users Discussion
Can you post the WKB of one of the failing geometries, from the loaded data?  It's best to look at the exact input causing the problem.

On a side note, I did notice that when the polygons were merged (using JTS, which works fine), there is only one very small hole in the result [1].  It seems to correspond to an island in the Duff River [2].  It seems unlikely that this is the source of the error, but maybe... It seems like a data anomaly, anyway.

[1] POLYGON ((-8.3812533 54.4642992, -8.3813736 54.4644802, -8.3817665 54.4646982, -8.3821098 54.464834, -8.382223 54.4648443, -8.3824884 54.4647987, -8.3820746 54.4646188, -8.3817225 54.4644208, -8.3814844 54.4643394, -8.3812533 54.4642992))

Paul Ramsey

unread,
Jan 21, 2019, 4:34:08 PM1/21/19
to Simon Greener, PostGIS Users Discussion

> On Jan 21, 2019, at 1:28 PM, Simon Greener <si...@spatialdbadvisor.com> wrote:
>
> I can't find a suitable gnomic srid for PostGIS.
>

You’re going to want one centred on Ireland, not the pole, and I think maybe it’s wanting a geodetic basis, try this:

+proj=gnom +lat_0=53.35 +lon_0=-6.26 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs

Simon Greener

unread,
Jan 21, 2019, 4:46:26 PM1/21/19
to Paul Ramsey, PostGIS Users Discussion
Thanks Paul.

However, I get the same number of distinct st_isvalidreason results.

Is there anything I can do with ogr2ogr when loading to help?

Simon

On Tue, 22 Jan 2019 08:34:02 +1100, Paul Ramsey <pra...@cleverelephant.ca> wrote:

> +proj=gnom +lat_0=53.35 +lon_0=-6.26 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs

--
Regards
Simon
--------------------------------------------------------------------------------------------------------
Spatial Advice & Solutions Architecture
Database Spatial Stored Procedure Designer
Oracle Spatial, SQL Server, PostGIS, MySQL, ArcSDE FME
Awarded "2011 Oracle Spatial Excellence Award for Education and Research"
A: 39 Cliff View Drive, Allens Rivulet, 7150, Tas, Aust
W: www.spdba.com.au
E: si...@spdba.com.au
V: +61 362 396 397
M: +61 418 396 391
GITC Supplier: T1005
Skype: sggreener
Long: 147.20515 (147° 12' 18" E)
Lat: -43.01530 (43° 00' 55" S)
GeoHash: r22em9r98wg
NAC:W80CK 7SWP3

Simon Greener

unread,
Jan 21, 2019, 4:52:35 PM1/21/19
to Martin Davis, PostGIS Users Discussion

Martin Davis

unread,
Jan 21, 2019, 5:39:41 PM1/21/19
to Simon Greener, PostGIS Users Discussion
Not much help, I'm afraid.   I need the geometry in WKBHex.  Not sure offhand how to get that out of PostGIS.

Just post the WKT and let's see what that looks like.

Paul Ramsey

unread,
Jan 21, 2019, 5:40:48 PM1/21/19
to Simon Greener, PostGIS Users Discussion

> On Jan 21, 2019, at 1:46 PM, Simon Greener <si...@spatialdbadvisor.com> wrote:
>
> Thanks Paul.
>
> However, I get the same number of distinct st_isvalidreason results.

Something is awry here, because I get only 'Valid Geometry’ as a validity check result… I loaded your shape file, I have 42 irish counties, I didn’t even try to use geography, just used geometry and they are all valid to start with.

I can even push them through a couple casts and they remain valid (as I would hope they would, since the coordinates shouldn’t change)

select distinct st_isvalidreason(geom::geography::geometry) from county;

P

Paul Ramsey

unread,
Jan 21, 2019, 5:41:50 PM1/21/19
to PostGIS Users Discussion
Just ’select geom’ will return the extended hexwkb (which we should have JTS support as a matter of convenience). If you need pure standard hex wkb, then

select encode(st_asbinary(geom), ‘hex’) 

will do that

P

_______________________________________________

Martin Davis

unread,
Jan 21, 2019, 6:01:51 PM1/21/19
to Simon Greener, PostGIS Users Discussion
Could the problem be the use of the ogr2ogr option -nlt POLYGON ?  Some of the input geometries are multipolygons, and it sounds like the way to handle this with ogr2ogr is to use GEOMETRY or PROMOTE_TO_MULTI [1][2]

Simon (SPDBA) Greener

unread,
Jan 21, 2019, 6:20:50 PM1/21/19
to Paul Ramsey, PostGIS Users Discussion
I'm didn't load into geometry, I used ogr2ogr to load into geography. All work to date has been ok except now I want to aggregate union the polygon parts of bounties with multiple polygon parts. ST_Union is not geographic aware hence all the trouble.

Using geometry as the load target does not appear to help me because of the geodetic vs projected linestring issue.

There are other situations eg ST_Wdithin, that return incorrect results when points fall on boundaries (I had to hack a solution to get correct within results for geographic data.)

I don't my just do the processing for Ireland but also Australia and the USA, the common factor being 4326. So the solution has to be geography not geometry.

Well for as much as I understand all these things?

PS I need a common framework when training across PostGIS, SQL Server Spatial and Oracle Spatial. I also want the math to be correct for the geography data, as it should.

Regards
Simon


Sent from BlueMail
On 22 Jan. 2019, at 09:41, Paul Ramsey <pra...@cleverelephant.ca> wrote:


On Jan 21, 2019, at 1:46 PM, Simon Greener <si...@spatialdbadvisor.com> wrote:

Thanks Paul.

However, I get the same number of distinct st_isvalidreason results.

Something is awry here, because I get only 'Valid Geometry’ as a validity check result… I loaded your shape file, I have 42 irish counties, I didn’t even try to use geography, just used geometry and they are all valid to start with.

I can even push them through a couple casts and they remain valid (as I would hope they would, since the coordinates shouldn’t change)

select distinct st_isvalidreason(geom::geography::geometry) from county;

P


Is there anything I can do with ogr2ogr when loading to help?

Simon

On Tue, 22 Jan 2019 08:34:02 +1100, Paul Ramsey <pra...@cleverelephant.ca> wrote:

+proj=gnom +lat_0=53.35 +lon_0=-6.26 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs


--
Regards
Simon

Simon Greener

unread,
Jan 21, 2019, 6:27:01 PM1/21/19
to Martin Davis, PostGIS Users Discussion
Thanks.

I couldn't see how this would help as this is structural not mathematical.

But I am trying to bring the parts back together.


S

Simon Greener

unread,
Jan 21, 2019, 6:27:23 PM1/21/19
to Martin Davis, PostGIS Users Discussion
Try the ireland.sql file again.
I used Paul's suggested SQL.
S

Simon Greener

unread,
Jan 21, 2019, 10:06:04 PM1/21/19
to PostGIS Users Discussion
I replaced the shapefile load script dropping ogr2ogr in favor of shp2pgshp.

This appears to have given me clean multipolygons as nothing was reported by the following:

select ST_ISValidReason(a.geog4326::geometry)
  from data.osm_county as a 
 where ST_IsValidReason(a.geog4326::geometry) <> 'Valid Geometry'

I discovered though this process that the original shapefile had 10 counties that were duplicated.
I used postgresql to delete the duplicates.

All my SQL appears to be working.

Perhaps Martin is right: there is something wrong with the ogr2ogr load (though I switched from -nlt POLYGON to -nlt PROMOTE_TO_MULTI and it didn't make any difference).

I think this request for help can now be closed.

Thanks to all.

regards
Simon

Martin Davis

unread,
Jan 21, 2019, 11:21:41 PM1/21/19
to PostGIS Users Discussion
Perhaps log this as an OGR issue [1]?  Probably need to bisect the problem a bit more though (e.g. reduce the input to a single failing polygon).





Simon (SPDBA) Greener

unread,
Jan 21, 2019, 11:29:41 PM1/21/19
to PostGIS Users Discussion
I'll do some more checks to see if ogr2ogr really is the problem.
Simon

Sent from BlueMail
Reply all
Reply to author
Forward
0 new messages