This is probably more of a PostgreSQL question than a PostGIS one, but I
have wondered if there is actually any work going on in allowing
PostgreSQL / PostGIS to build GiST type spatial indexes in parallel, and
/ or if this is even logically and technically feasible? According to
the PostgreSQL documentation, only B-tree indexes can be indexed in
parallel.
With the ever growing size of spatial databases like OpenStreetMap, with
tables running into the 100s of million records, spatial indexing using
GiST is one of the major bottle necks in re-creating or reloading a
spatial PostGIS database. The indexing process seems highly CPU bound,
with negligible disk activity for the majority of the time the indexing
process runs, hence being able to take advantage of multiple cores seems
like a possible big win. Nonetheless, there seems little to no mention
of such (future) option for GiST type indexing when searching on the
internet for relevant information.
I have no knowledge of the internals of PostgreSQL, so there may be
technical limitations I am unaware of that make this impossible or
technically very difficult to implement, but it would be nice to hear
something about this even if it is not feasible.
Yes, I know there are BRIN type spatial indexes for PostGIS, which are
comparatively super fast to create and lead to very small indexes even
for ultra large tables, but from the little information and personal
experience I gathered, BRIN seems most suited for Point data only, and
for static, not updated data, due to its requirement of clustered data
for efficiency (actually not a problem in my particular case, since I
don't do updates, but only reloads). The few times I tried to use it for
large, spatially clustered, Polygon data sets, it seemed less efficient
when accessing the data spatially in a GIS, with clearly longer display
times, although I don't have real benchmarks for that.
Most OpenStreetMap related tools like e.g. osm2pgsql also default to
GiST, and probably with good reason.
Marco
_______________________________________________
postgis-users mailing list
postgi...@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users
[...]
Yes, I know there are BRIN type spatial indexes for PostGIS, which are
comparatively super fast to create and lead to very small indexes even
for ultra large tables, but from the little information and personal
experience I gathered, BRIN seems most suited for Point data only, and
for static, not updated data, due to its requirement of clustered data
for efficiency (actually not a problem in my particular case, since I
don't do updates, but only reloads). The few times I tried to use it for
large, spatially clustered, Polygon data sets, it seemed less efficient
when accessing the data spatially in a GIS, with clearly longer display
times, although I don't have real benchmarks for that.
Most OpenStreetMap related tools like e.g. osm2pgsql also default to
GiST, and probably with good reason.
brin_summarize_range brin_summarize_new_values functions. This allows some maintenance of the
index even with non static data, of course with some limitation compared to GiST.
About the performance: being a range index it surely performs worse compared
to Rtree indexes like GiST. How much worse depends from several factors:
1) how the data pages are physically stored: ranges are as more effective as possible
as far as spatially close geometries are adjacently stored even in physical pages the
storage, so the initial import of spatial data should need to be done following somesorting criteria
2) BRIN granularity: performance starts to be closer to an Rtree one as far as the sizeof the block range is small. This can be configured during index creation with theparameter pages_per_range, i.e. how many pages are summarised per range.Of course, the smaller the number, the larger is the resulting BRIN and more timeis needed for the creation
GiSTs remain faster even with 2), but I'd suggest checking how the data was originallyimported into the geospatial DB in order to be sure you could benefit as much as possiblefrom a range index.
Hope it helps,Giuseppe.
> On Sep 16, 2020, at 7:35 AM, Marco Boeringa <ma...@boeringa.demon.nl> wrote:
>
> Hi all,
>
> This is probably more of a PostgreSQL question than a PostGIS one, but I have wondered if there is actually any work going on in allowing PostgreSQL / PostGIS to build GiST type spatial indexes in parallel, and / or if this is even logically and technically feasible? According to the PostgreSQL documentation, only B-tree indexes can be indexed in parallel.
>
> With the ever growing size of spatial databases like OpenStreetMap, with tables running into the 100s of million records, spatial indexing using GiST is one of the major bottle necks in re-creating or reloading a spatial PostGIS database. The indexing process seems highly CPU bound, with negligible disk activity for the majority of the time the indexing process runs, hence being able to take advantage of multiple cores seems like a possible big win. Nonetheless, there seems little to no mention of such (future) option for GiST type indexing when searching on the internet for relevant information.
Marco,
I do not know if there is active work in the area of making GIST index builds faster, but I have heard discussions of various approaches from people much smarter than I, so I am sure there are potential areas of improvement available. The single-threaded performance of index build might be made faster with some bulk/batch handling of inserts, though how that interacts with the generic GIST API expectation of one-at-a-time insertion I do not know.
Probably the biggest hurdle is just that the number of size-constrained GIST data sets is much smaller than that of BTREE data sets, so it's a lower priority. Certainly the growth in OSM ubiquity is increasing the number of users with very large spatial databases they need to index though, so we can expect more pressure as time goes on.
ATB,
P
Hi Giuseppe,
Thanks for your insights regarding BRIN.
I actually do employ BRIN, but only for Point type geometry,
where I have the (subjective) feeling that the performance is
least degraded compared to GiST. I also set the 'pages_per_range'
parameter to a much smaller value than the default. Even with
small values for this parameter, the size and creation times of
the resulting index is nothing compared to GiST.
For Polygon data, the few times I tried using a BRIN type spatial index, I had the feeling it was probably some 3-4 times slower than GiST in terms of display times in a GIS, but these aren't hard figures, because I did not really time it. I also had the feeling that there was considerably more disk activity needed to access the relevant geometries.
The data is from osm2pgsql, that initially spatially sorts the data using the default PostGIS spatial sorting / clustering using Hilbert curve. This should be efficient. I derive tables from that, some of which are additionally being spatially clustered depending on the processing they have had (for those tables I actually also need to create GiST type spatial indexes, as the PostgreSQL CLUSTER command cannot use BRIN as input for spatial clustering, due to the nature of the index, it will fail with an error / warning about this when you attempt it, and CLUSTER needs a (spatial) index as input).
osm2pgsql itself already seems to optimize indexing, in the sense that it launches multiple index processes against different tables in parallel. This is a kind of "parallel indexing", but not against a single table / spatial column. For the processes I developed myself, this is not feasible though, and I would benefit of having parallel GiST index creation for a single geometry column.
Marco
Appreciate your insights. Good to hear there appear to be opportunities
for improvements to GiST index build speed in the future, even if no
active work is being done right now. Yes, I do think a lot of people,
and an increasing number, could benefit from such work. I personally
would certainly applaud any improvements being made, as it is especially
clear that disk speed is not an issue in most of the processing
involved, and disk speed therefor unlikely to become limiting with any
improvements in index creation, meaning there is likely a good
opportunity for improving GiST index build speed.
Marco
Op 16-9-2020 om 19:05 schreef Paul Ramsey:
>
>> On Sep 16, 2020, at 7:35 AM, Marco Boeringa <ma...@boeringa.demon.nl> wrote:
>>
>> Hi all,
>>
>> This is probably more of a PostgreSQL question than a PostGIS one, but I have wondered if there is actually any work going on in allowing PostgreSQL / PostGIS to build GiST type spatial indexes in parallel, and / or if this is even logically and technically feasible? According to the PostgreSQL documentation, only B-tree indexes can be indexed in parallel.
>>
>> With the ever growing size of spatial databases like OpenStreetMap, with tables running into the 100s of million records, spatial indexing using GiST is one of the major bottle necks in re-creating or reloading a spatial PostGIS database. The indexing process seems highly CPU bound, with negligible disk activity for the majority of the time the indexing process runs, hence being able to take advantage of multiple cores seems like a possible big win. Nonetheless, there seems little to no mention of such (future) option for GiST type indexing when searching on the internet for relevant information.
> Marco,
> I do not know if there is active work in the area of making GIST index builds faster, but I have heard discussions of various approaches from people much smarter than I, so I am sure there are potential areas of improvement available. The single-threaded performance of index build might be made faster with some bulk/batch handling of inserts, though how that interacts with the generic GIST API expectation of one-at-a-time insertion I do not know.
> Probably the biggest hurdle is just that the number of size-constrained GIST data sets is much smaller than that of BTREE data sets, so it's a lower priority. Certainly the growth in OSM ubiquity is increasing the number of users with very large spatial databases they need to index though, so we can expect more pressure as time goes on.
> ATB,
> P
>
That certainly looks promising, and it is heartening to see real work
already being done, although there appears to be no mention of Polygon
type data being tested as well, but I don't know if geometry type is
actually relevant to developing a faster GiST (spatial) index build.
Probably not.
10x improvement would be great... love to see that land in PostgreSQL 13!
Marco
Op 16-9-2020 om 22:13 schreef Peter Geoghegan:
> On Wed, Sep 16, 2020 at 12:54 PM Marco Boeringa <ma...@boeringa.demon.nl> wrote:
>> Appreciate your insights. Good to hear there appear to be opportunities
>> for improvements to GiST index build speed in the future, even if no
>> active work is being done right now. Yes, I do think a lot of people,
>> and an increasing number, could benefit from such work. I personally
>> would certainly applaud any improvements being made, as it is especially
>> clear that disk speed is not an issue in most of the processing
>> involved, and disk speed therefor unlikely to become limiting with any
>> improvements in index creation, meaning there is likely a good
>> opportunity for improving GiST index build speed.
> Actually, there is ongoing work to speed up GiST index builds by using
> Z-order + sorting:
>
> https://postgr.es/m/123F5F32-9A85-4D0B...@yandex-team.ru
>
> It reportedly can be as much as 10x faster. Building the index through
> sorting rather than using retail inserts is inherently much faster.
> And, we've put a lot of effort into speeding up B-Tree index
> builds/sorting over several releases, to the point where a B-Tree
> index build can be very I/O bound even on high end hardware. It seems
> possible that GiST could get much of the benefit of that work by
> adopting index builds to use Z-order.
>
> The tricky part may be getting that benefit without significantly
> impacting the final index structure. The idea of teaching GiST to
> build indexes in a way that's a lot closer to B-Tree seems very
> promising, though.
One last question: is there anything actionable by me regarding this,
e.g. opening a ticket on the PostgreSQL issue tracker?
That is as much as I can do, because I have zero experience with
PostgreSQL development, but if you want me to open an issue describing
the performance issues and benefits to the PostGIS community of having
much faster GiST indexing and referencing this mailing list thread, that
is something I could potentially do. Probably not a whole lot of value
in it, but at least it would document the enhancement request.
Marco
Op 21-9-2020 om 18:39 schreef Peter Geoghegan:
> On Wed, Sep 16, 2020 at 3:17 PM Alexander Korotkov <aekor...@gmail.com> wrote:
>> In fact, GiST indexes are inferior to B-tree indexes for many reasons.
>> For instance, B-tree implements disk-ordered scan during VACUUM, while
>> GiST doesn't. B-tree could implement retail index tuple delete (which
>> will be very useful for pluggable table AMs), but for GiST retail
>> delete is problematic, because its performance is not guaranteed to be
>> log(N). B-tree implements deduplication, but GiST doesn't. And so
>> on. So, most of advances are applied to B-tree, while GiST is
>> retarded. That makes me think it's a good idea to reimplement GiST on
>> top of B-tree.
> A standard B-Tree with Z ordering is sometimes called a UB-Tree. This
> is described in "Modern B-Tree techniques", which promotes UB-Trees as
> having all of the advantages you list here. Though the book also says
> that it probably has somewhat inferior read performance to
> "specialized structures" (which probably refers to loosely ordered
> tree structures like GiST). This matches my own high level intuitions
> about it.
>
>> In short, my idea is to add to the B-tree ability to maintain "union
>> key" (MBR for spatial) in pivot tuple (including leftmost pivot
>> tuple), while navigation of insertions will remain comparison-based
>> (Z-ordered for spatial). Once we have union keys correctly
>> maintained, we can implement a special "union key" scan for B-tree,
>> which would work similar to the GiST scan. Maintaining union keys
>> could be tricky, because we will have to make sure no concurrent split
>> ruined our work on extension of parent union key before we've extended
>> the child union key. But nevertheless, I think it's solvable.
> That sounds hard. It sounds like you more or less want to make the
> structure a true UB-Tree for inserts -- a structure where you cannot
> allow there to be more than one place for any possible key in the
> index. But you also want to maintain a union key for selects, in order
> to make it possible to have a choice of subtree to insert into, just
> like GiST. Is it really possible to do both at the same time? It feels
> like a fundamental trade-off.
>
> I suspect that it would be helpful (if you just did a traditional
> UB-Tree, without the union key stuff) to add specialized suffix
> truncation logic, that picked a split point according to special
> criteria for the conditioned Z-order keys. That might limit the extent
> of the performance hit for reads, without the added complexity of the
> union key thing. This is just a guess.
>
> I don't have a particularly good sense of the costs here. It's obvious
> that the benefits would be very significant, but you don't need me to
> say that. The hard parts are 1.) determining the costs, and 2.)
> determining if the benefits are worth it.
>
>
> --
> Peter Geoghegan
A couple of months ago, I asked a question about GiST spatial indexing
speed for (PostGIS) geometries, and the fact that the current
implementation is slow and not capable of parallel indexing. GiST
indexing is currently one of the major bottlenecks when dealing with
massive spatial datasets like e.g. OpenStreetMap.
Back then, a couple of posters indicated there was ongoing work to
improve this, and that there might be improvements coming up for a
future release of PostgreSQL, e.g. see the remarks of Alexander below as
one example.
Does anyone know what is the current status of all of this work? Is
there any realistic chance of an implementation in the upcoming
PostgreSQL 14? This GiST indexing performance issue appears to have a
long history, but with the ever growing OpenStreetMap database, is
becoming an increasing problem for PostGIS users. I think improvements
in PG14 would be really welcomed by the PostGIS community.
Marco Boeringa
Hi Darafei,
How does this relate to e.g. PostGIS with polygon or line data?
Faster indexing for point data only is only of limited value, there are already reasonable alternatives for point type data, e.g. BRIN indexes. However, for polygon and line data, GiST does still appear to be the most efficient index for accessing the data in spatial lookups, and tools like e.g. osm2pgsql default to them.
Is this commit just the "groundwork" layed out to implement something like this in PostGIS, or does it require more work on the PostgreSQL side itself before that is even conceivable?
Marco
Hi Darafei,
How does this relate to e.g. PostGIS with polygon or line data?
Faster indexing for point data only is only of limited value, there are already reasonable alternatives for point type data, e.g. BRIN indexes. However, for polygon and line data, GiST does still appear to be the most efficient index for accessing the data in spatial lookups, and tools like e.g. osm2pgsql default to them.
Is this commit just the "groundwork" layed out to implement something like this in PostGIS, or does it require more work on the PostgreSQL side itself before that is even conceivable?
To add the support for different geometries, like polygons, you would need before to define some sorting algorithm similar to Z-order for points - not sure the same algos can be used for other geometries, maybe reducing them to the related bbox and considering the bbox vertices? Although PostGIS lately included natural sorting of geometries based on Hilbert curves, can someone else confirm this eventually?
Yep, that's all pieces of multi-year cross-project improvement. We spent some years on algorithms, then we lacked infrastructure, now postgres added infrastructure and after they push out we can start using it. You're welcome to join the ride and help testing.