[postgis-users] Parallel spatial indexing for GiST?

15 views
Skip to first unread message

Marco Boeringa

unread,
Sep 16, 2020, 10:35:27 AM9/16/20
to postgi...@lists.osgeo.org
Hi all,

This is probably more of a PostgreSQL question than a PostGIS one, but I
have wondered if there is actually any work going on in allowing
PostgreSQL / PostGIS to build GiST type spatial indexes in parallel, and
/ or if this is even logically and technically feasible? According to
the PostgreSQL documentation, only B-tree indexes can be indexed in
parallel.

With the ever growing size of spatial databases like OpenStreetMap, with
tables running into the 100s of million records, spatial indexing using
GiST is one of the major bottle necks in re-creating or reloading a
spatial PostGIS database. The indexing process seems highly CPU bound,
with negligible disk activity for the majority of the time the indexing
process runs, hence being able to take advantage of multiple cores seems
like a possible big win. Nonetheless, there seems little to no mention
of such (future) option for GiST type indexing when searching on the
internet for relevant information.

I have no knowledge of the internals of PostgreSQL, so there may be
technical limitations I am unaware of that make this impossible or
technically very difficult to implement, but it would be nice to hear
something about this even if it is not feasible.

Yes, I know there are BRIN type spatial indexes for PostGIS, which are
comparatively super fast to create and lead to very small indexes even
for ultra large tables, but from the little information and personal
experience I gathered, BRIN seems most suited for Point data only, and
for static, not updated data, due to its requirement of clustered data
for efficiency (actually not a problem in my particular case, since I
don't do updates, but only reloads). The few times I tried to use it for
large, spatially clustered, Polygon data sets, it seemed less efficient
when accessing the data spatially in a GIS, with clearly longer display
times, although I don't have real benchmarks for that.

Most OpenStreetMap related tools like e.g. osm2pgsql also default to
GiST, and probably with good reason.

Marco

_______________________________________________
postgis-users mailing list
postgi...@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

Giuseppe Broccolo

unread,
Sep 16, 2020, 12:55:40 PM9/16/20
to PostGIS Users Discussion
Hi Marco,

Il giorno mer 16 set 2020 alle ore 15:35 Marco Boeringa <ma...@boeringa.demon.nl> ha scritto:
[...]

Yes, I know there are BRIN type spatial indexes for PostGIS, which are
comparatively super fast to create and lead to very small indexes even
for ultra large tables, but from the little information and personal
experience I gathered, BRIN seems most suited for Point data only, and
for static, not updated data, due to its requirement of clustered data
for efficiency (actually not a problem in my particular case, since I
don't do updates, but only reloads). The few times I tried to use it for
large, spatially clustered, Polygon data sets, it seemed less efficient
when accessing the data spatially in a GIS, with clearly longer display
times, although I don't have real benchmarks for that.

Most OpenStreetMap related tools like e.g. osm2pgsql also default to
GiST, and probably with good reason.

About BRIN in PostGIS: it internally works using bounding boxes of geometries,
as GiST, so in principle you can use this index for any geometry type, and as
far as you use intersect, contains, is_contained operators for 2D geometries and
intersects for 3D ones in your geospatial queries.

You are right when you say that BRIN is more suitable for "static" data, because
of how it internally works - creating a sort of summary of which range of tuples are
included in the data pages physically stored, just to use a few words. New entries
added during INSERTs or UPDATEs are properly summarised in BRINs as far as
the new indexed values/geometries are included in ranges/bounding boxes already
present in the index: in case new pages are created with data which does not fall
within the last summarized range, the new ranges are not automatically acquired
in the summary, and the related tuples remain unsummarized until a new summarization
is invoked, automatically through a VACUUM or manually through brin_summarize_range
or brin_summarize_new_values functions. This allows some maintenance of the
index even with non static data, of course with some limitation compared to GiST.

About the performance: being a range index it surely performs worse compared
to Rtree indexes like GiST. How much worse depends from several factors:

1) how the data pages are physically stored: ranges are as more effective as possible
as far as spatially close geometries are adjacently stored even in physical pages the
storage, so the initial import of spatial data should need to be done following some
sorting criteria

2) BRIN granularity: performance starts to be closer to an Rtree one as far as the size
of the block range is small. This can be configured during index creation with the
parameter pages_per_range, i.e. how many pages are summarised per range.
Of course, the smaller the number, the larger is the resulting BRIN and more time
is needed for the creation

GiSTs remain faster even with 2), but I'd suggest checking how the data was originally
imported into the geospatial DB in order to be sure you could benefit as much as possible
from a range index.

Hope it helps,
Giuseppe.

Paul Ramsey

unread,
Sep 16, 2020, 1:06:07 PM9/16/20
to PostGIS Users Discussion, aekor...@gmail.com, Peter Geoghegan

> On Sep 16, 2020, at 7:35 AM, Marco Boeringa <ma...@boeringa.demon.nl> wrote:
>
> Hi all,
>
> This is probably more of a PostgreSQL question than a PostGIS one, but I have wondered if there is actually any work going on in allowing PostgreSQL / PostGIS to build GiST type spatial indexes in parallel, and / or if this is even logically and technically feasible? According to the PostgreSQL documentation, only B-tree indexes can be indexed in parallel.
>
> With the ever growing size of spatial databases like OpenStreetMap, with tables running into the 100s of million records, spatial indexing using GiST is one of the major bottle necks in re-creating or reloading a spatial PostGIS database. The indexing process seems highly CPU bound, with negligible disk activity for the majority of the time the indexing process runs, hence being able to take advantage of multiple cores seems like a possible big win. Nonetheless, there seems little to no mention of such (future) option for GiST type indexing when searching on the internet for relevant information.

Marco,
I do not know if there is active work in the area of making GIST index builds faster, but I have heard discussions of various approaches from people much smarter than I, so I am sure there are potential areas of improvement available. The single-threaded performance of index build might be made faster with some bulk/batch handling of inserts, though how that interacts with the generic GIST API expectation of one-at-a-time insertion I do not know.
Probably the biggest hurdle is just that the number of size-constrained GIST data sets is much smaller than that of BTREE data sets, so it's a lower priority. Certainly the growth in OSM ubiquity is increasing the number of users with very large spatial databases they need to index though, so we can expect more pressure as time goes on.
ATB,
P

Marco Boeringa

unread,
Sep 16, 2020, 2:36:23 PM9/16/20
to postgi...@lists.osgeo.org

Hi Giuseppe,

Thanks for your insights regarding BRIN.

I actually do employ BRIN, but only for Point type geometry, where I have the (subjective) feeling that the performance is least degraded compared to GiST. I also set the 'pages_per_range' parameter to a much smaller value than the default. Even with small values for this parameter, the size and creation times of the resulting index is nothing compared to GiST.

For Polygon data, the few times I tried using a BRIN type spatial index, I had the feeling it was probably some 3-4 times slower than GiST in terms of display times in a GIS, but these aren't hard figures, because I did not really time it. I also had the feeling that there was considerably more disk activity needed to access the relevant geometries.

The data is from osm2pgsql, that initially spatially sorts the data using the default PostGIS spatial sorting / clustering using Hilbert curve. This should be efficient. I derive tables from that, some of which are additionally being spatially clustered depending on the processing they have had (for those tables I actually also need to create GiST type spatial indexes, as the PostgreSQL CLUSTER command cannot use BRIN as input for spatial clustering, due to the nature of the index, it will fail with an error / warning about this when you attempt it, and CLUSTER needs a (spatial) index as input).

osm2pgsql itself already seems to optimize indexing, in the sense that it launches multiple index processes against different tables in parallel. This is a kind of "parallel indexing", but not against a single table / spatial column. For the processes I developed myself, this is not feasible though, and I would benefit of having parallel GiST index creation for a single geometry column.

Marco

Op 16-9-2020 om 18:55 schreef Giuseppe Broccolo:

Marco Boeringa

unread,
Sep 16, 2020, 3:54:29 PM9/16/20
to Paul Ramsey, PostGIS Users Discussion, aekor...@gmail.com, Peter Geoghegan
Hi Paul,

Appreciate your insights. Good to hear there appear to be opportunities
for improvements to GiST index build speed in the future, even if no
active work is being done right now. Yes, I do think a lot of people,
and an increasing number, could benefit from such work. I personally
would certainly applaud any improvements being made, as it is especially
clear that disk speed is not an issue in most of the processing
involved, and disk speed therefor unlikely to become limiting with any
improvements in index creation, meaning there is likely a good
opportunity for improving GiST index build speed.

Marco

Op 16-9-2020 om 19:05 schreef Paul Ramsey:


>
>> On Sep 16, 2020, at 7:35 AM, Marco Boeringa <ma...@boeringa.demon.nl> wrote:
>>
>> Hi all,
>>
>> This is probably more of a PostgreSQL question than a PostGIS one, but I have wondered if there is actually any work going on in allowing PostgreSQL / PostGIS to build GiST type spatial indexes in parallel, and / or if this is even logically and technically feasible? According to the PostgreSQL documentation, only B-tree indexes can be indexed in parallel.
>>
>> With the ever growing size of spatial databases like OpenStreetMap, with tables running into the 100s of million records, spatial indexing using GiST is one of the major bottle necks in re-creating or reloading a spatial PostGIS database. The indexing process seems highly CPU bound, with negligible disk activity for the majority of the time the indexing process runs, hence being able to take advantage of multiple cores seems like a possible big win. Nonetheless, there seems little to no mention of such (future) option for GiST type indexing when searching on the internet for relevant information.
> Marco,
> I do not know if there is active work in the area of making GIST index builds faster, but I have heard discussions of various approaches from people much smarter than I, so I am sure there are potential areas of improvement available. The single-threaded performance of index build might be made faster with some bulk/batch handling of inserts, though how that interacts with the generic GIST API expectation of one-at-a-time insertion I do not know.
> Probably the biggest hurdle is just that the number of size-constrained GIST data sets is much smaller than that of BTREE data sets, so it's a lower priority. Certainly the growth in OSM ubiquity is increasing the number of users with very large spatial databases they need to index though, so we can expect more pressure as time goes on.
> ATB,
> P
>

Marco Boeringa

unread,
Sep 16, 2020, 5:24:19 PM9/16/20
to Peter Geoghegan, aekor...@gmail.com, PostGIS Users Discussion
Thanks Peter,

That certainly looks promising, and it is heartening to see real work
already being done, although there appears to be no mention of Polygon
type data being tested as well, but I don't know if geometry type is
actually relevant to developing a faster GiST (spatial) index build.
Probably not.

10x improvement would be great... love to see that land in PostgreSQL 13!

Marco

Op 16-9-2020 om 22:13 schreef Peter Geoghegan:


> On Wed, Sep 16, 2020 at 12:54 PM Marco Boeringa <ma...@boeringa.demon.nl> wrote:
>> Appreciate your insights. Good to hear there appear to be opportunities
>> for improvements to GiST index build speed in the future, even if no
>> active work is being done right now. Yes, I do think a lot of people,
>> and an increasing number, could benefit from such work. I personally
>> would certainly applaud any improvements being made, as it is especially
>> clear that disk speed is not an issue in most of the processing
>> involved, and disk speed therefor unlikely to become limiting with any
>> improvements in index creation, meaning there is likely a good
>> opportunity for improving GiST index build speed.

> Actually, there is ongoing work to speed up GiST index builds by using
> Z-order + sorting:
>
> https://postgr.es/m/123F5F32-9A85-4D0B...@yandex-team.ru
>
> It reportedly can be as much as 10x faster. Building the index through
> sorting rather than using retail inserts is inherently much faster.
> And, we've put a lot of effort into speeding up B-Tree index
> builds/sorting over several releases, to the point where a B-Tree
> index build can be very I/O bound even on high end hardware. It seems
> possible that GiST could get much of the benefit of that work by
> adopting index builds to use Z-order.
>
> The tricky part may be getting that benefit without significantly
> impacting the final index structure. The idea of teaching GiST to
> build indexes in a way that's a lot closer to B-Tree seems very
> promising, though.

Darafei "Komяpa" Praliaskouski

unread,
Sep 21, 2020, 3:40:28 AM9/21/20
to PostGIS Users Discussion, Alexander Korotkov, Peter Geoghegan
Hello Marco,

Scope for PG13 was fixed half a year ago. This is up for PG14, and needs your care and help to happen - there is a possibility to get it in parallel, since this sorting build is still single thread.

Marco Boeringa

unread,
Sep 27, 2020, 2:25:06 AM9/27/20
to Peter Geoghegan, Alexander Korotkov, PostGIS Users Discussion
Peter, Paul and Alexander,

One last question: is there anything actionable by me regarding this,
e.g. opening a ticket on the PostgreSQL issue tracker?

That is as much as I can do, because I have zero experience with
PostgreSQL development, but if you want me to open an issue describing
the performance issues and benefits to the PostGIS community of having
much faster GiST indexing and referencing this mailing list thread, that
is something I could potentially do. Probably not a whole lot of value
in it, but at least it would document the enhancement request.

Marco

Op 21-9-2020 om 18:39 schreef Peter Geoghegan:
> On Wed, Sep 16, 2020 at 3:17 PM Alexander Korotkov <aekor...@gmail.com> wrote:
>> In fact, GiST indexes are inferior to B-tree indexes for many reasons.
>> For instance, B-tree implements disk-ordered scan during VACUUM, while
>> GiST doesn't. B-tree could implement retail index tuple delete (which
>> will be very useful for pluggable table AMs), but for GiST retail
>> delete is problematic, because its performance is not guaranteed to be
>> log(N). B-tree implements deduplication, but GiST doesn't. And so
>> on. So, most of advances are applied to B-tree, while GiST is
>> retarded. That makes me think it's a good idea to reimplement GiST on
>> top of B-tree.
> A standard B-Tree with Z ordering is sometimes called a UB-Tree. This
> is described in "Modern B-Tree techniques", which promotes UB-Trees as
> having all of the advantages you list here. Though the book also says
> that it probably has somewhat inferior read performance to
> "specialized structures" (which probably refers to loosely ordered
> tree structures like GiST). This matches my own high level intuitions
> about it.
>
>> In short, my idea is to add to the B-tree ability to maintain "union
>> key" (MBR for spatial) in pivot tuple (including leftmost pivot
>> tuple), while navigation of insertions will remain comparison-based
>> (Z-ordered for spatial). Once we have union keys correctly
>> maintained, we can implement a special "union key" scan for B-tree,
>> which would work similar to the GiST scan. Maintaining union keys
>> could be tricky, because we will have to make sure no concurrent split
>> ruined our work on extension of parent union key before we've extended
>> the child union key. But nevertheless, I think it's solvable.
> That sounds hard. It sounds like you more or less want to make the
> structure a true UB-Tree for inserts -- a structure where you cannot
> allow there to be more than one place for any possible key in the
> index. But you also want to maintain a union key for selects, in order
> to make it possible to have a choice of subtree to insert into, just
> like GiST. Is it really possible to do both at the same time? It feels
> like a fundamental trade-off.
>
> I suspect that it would be helpful (if you just did a traditional
> UB-Tree, without the union key stuff) to add specialized suffix
> truncation logic, that picked a split point according to special
> criteria for the conditioned Z-order keys. That might limit the extent
> of the performance hit for reads, without the added complexity of the
> union key thing. This is just a guess.
>
> I don't have a particularly good sense of the costs here. It's obvious
> that the benefits would be very significant, but you don't need me to
> say that. The hard parts are 1.) determining the costs, and 2.)
> determining if the benefits are worth it.
>
>
> --
> Peter Geoghegan

Marco Boeringa

unread,
Jan 23, 2021, 5:25:37 AM1/23/21
to postgi...@lists.osgeo.org
Hi all,

A couple of months ago, I asked a question about GiST spatial indexing
speed for (PostGIS) geometries, and the fact that the current
implementation is slow and not capable of parallel indexing. GiST
indexing is currently one of the major bottlenecks when dealing with
massive spatial datasets like e.g. OpenStreetMap.

Back then, a couple of posters indicated there was ongoing work to
improve this, and that there might be improvements coming up for a
future release of PostgreSQL, e.g. see the remarks of Alexander below as
one example.

Does anyone know what is the current status of all of this work? Is
there any realistic chance of an implementation in the upcoming
PostgreSQL 14? This GiST indexing performance issue appears to have a
long history, but with the ever growing OpenStreetMap database, is
becoming an increasing problem for PostGIS users. I think improvements
in PG14 would be really welcomed by the PostGIS community.

Marco Boeringa

Darafei "Komяpa" Praliaskouski

unread,
Jan 23, 2021, 5:31:07 AM1/23/21
to PostGIS Users Discussion
Hi, 

Here it is, committed to PG14. Works for Postgres's internal `point` type now as far as I know. No PostGIS support yet since that's gonna be in almost a year and committed just a week ago.
https://commitfest.postgresql.org/29/2276/

There is no parallel sort there though so you're welcome to improve that too.

You're welcome to test and report your findings.
--
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/

Marco Boeringa

unread,
Jan 23, 2021, 5:39:36 AM1/23/21
to postgi...@lists.osgeo.org

Hi Darafei,

How does this relate to e.g. PostGIS with polygon or line data?

Faster indexing for point data only is only of limited value, there are already reasonable alternatives for point type data, e.g. BRIN indexes. However, for polygon and line data, GiST does still appear to be the most efficient index for accessing the data in spatial lookups, and tools like e.g. osm2pgsql default to them.

Is this commit just the "groundwork" layed out to implement something like this in PostGIS, or does it require more work on the PostgreSQL side itself before that is even conceivable?

Marco

Op 23-1-2021 om 11:30 schreef Darafei "Komяpa" Praliaskouski:

Giuseppe Broccolo

unread,
Jan 24, 2021, 12:25:34 PM1/24/21
to PostGIS Users Discussion
Hi Marco,

Il giorno sab 23 gen 2021 alle ore 11:39 Marco Boeringa <ma...@boeringa.demon.nl> ha scritto:

Hi Darafei,

How does this relate to e.g. PostGIS with polygon or line data?


It doesn't. The patch Darafei shared in his reply added the possibility to define a further support function to build the GiST, based on sorting the data following some "natural criteria" depending on the data itself, which makes the build faster - as pointed by Darafei, no parallelism for the moment.

Just the support for the internal point data has been included in the patch, based on Z-order.

To add the support for different geometries, like polygons, you would need before to define some sorting algorithm similar to Z-order for points - not sure the same algos can be used for other geometries, maybe reducing them to the related bbox and considering the bbox vertices? Although PostGIS lately included natural sorting of geometries based on Hilbert curves, can someone else confirm this eventually?
 

Faster indexing for point data only is only of limited value, there are already reasonable alternatives for point type data, e.g. BRIN indexes. However, for polygon and line data, GiST does still appear to be the most efficient index for accessing the data in spatial lookups, and tools like e.g. osm2pgsql default to them.


Actually, all the 2D geometries, geographies and 3D points can be indexed using BRIN support, which internally reduces the indexed data to their bboxes. Of course, with all the limitations imposed by using BRINs (few operators supported, no kNN, etc.).
 

Is this commit just the "groundwork" layed out to implement something like this in PostGIS, or does it require more work on the PostgreSQL side itself before that is even conceivable?


Exactly, just the new support to speed up the build has been added. The implementation for PostGIS types should be done separately. This would mean to define this support function and this other one in here. But I think we need to wait for PostgreSQL 14 to be released before.

Considering what is reported in the exchange of emails related to the patch, the expected improvement in the speed of the index build is a factor x5 for the internal point data in PostgreSQL. Should be tested then in PostGIS, as said by Darafei.

Giuseppe.

Darafei "Komяpa" Praliaskouski

unread,
Jan 25, 2021, 1:07:26 AM1/25/21
to PostGIS Users Discussion
To add the support for different geometries, like polygons, you would need before to define some sorting algorithm similar to Z-order for points - not sure the same algos can be used for other geometries, maybe reducing them to the related bbox and considering the bbox vertices? Although PostGIS lately included natural sorting of geometries based on Hilbert curves, can someone else confirm this eventually?

Yep, that's all pieces of multi-year cross-project improvement. We spent some years on algorithms, then we lacked infrastructure, now postgres added infrastructure and after they push out we can start using it. You're welcome to join the ride and help testing.

Giuseppe Broccolo

unread,
Jan 25, 2021, 11:56:00 AM1/25/21
to PostGIS Users Discussion
Hi Darafei,

Il giorno lun 25 gen 2021 alle ore 07:07 Darafei "Komяpa" Praliaskouski <m...@komzpa.net> ha scritto:
Yep, that's all pieces of multi-year cross-project improvement. We spent some years on algorithms, then we lacked infrastructure, now postgres added infrastructure and after they push out we can start using it. You're welcome to join the ride and help testing.

Yep, this is something I'd be interested in. I had the pleasure in the past to contribute to PostGIS with BRIN indexes, and if I can help...why not?

Is there already any related ticket in osgeo/issue in GitHub opened?

Giuseppe.
Reply all
Reply to author
Forward
0 new messages