Should we make gp_segment_configuration into a view


Jimmy Yih

Jul 5, 2023, 4:35:16 PM
to Greenplum Developers
With GPDB 7 getting closer to catalog freeze, I remembered discussions about making gp_segment_configuration into a view (see the gpdb-dev discussion thread referenced below). Making it a view that calls an internal function would allow us to work on things that were previously blocked. If we don't do it now, we'd have to wait for GPDB 8 or implement questionable hacks.

For me, my main struggle with gp_segment_configuration is when dealing with coordinator segment replicas/copies. The gp_segment_configuration table can only be updated by first promoting and starting up the coordinator segment replica (to manually update the catalog table). This locks away some useful features that a coordinator segment replica could provide... one major thing being hot standby dispatch (e.g. the ability to have a full replica cluster up in hot standby recovery mode and do read-only query dispatches).

So my proposal would be the following:
1. Convert gp_segment_configuration into a view that calls an internal catalog function gp_get_segment_configuration(). The main gp_segment_configuration catalog table would now be called gp_internal_segment_configuration.
2. Move the current gp_segment_configuration choosing logic from getCdbComponentInfo() into the new gp_get_segment_configuration() so that it'd look like this:
   gp_get_segment_configuration()
     => readGpSegConfigFromCatalog()  // if (IsTransactionState())
     => readGpSegConfigFromFTSFiles() // else
Make getCdbComponentInfo use the new gp_get_segment_configuration().

After implementing the above, we would be at functional parity, but gp_segment_configuration could now be extended more easily. For example, we could later have:
   gp_get_segment_configuration()
     => readGpSegConfigFromFlatFile() // if (EnableHotStandby)
     => readGpSegConfigFromCatalog()  // else if (IsTransactionState())
     => readGpSegConfigFromFTSFiles() // else

The above would allow the hot standby dispatcher to create a hot standby cdbgang according to a static flat file, similar to how readGpSegConfigFromFTSFiles() works today... and the user would be able to SELECT from gp_segment_configuration and get an accurate view of their hot standby cluster. This is just one use case, but it could be further extended (more conditional cases, or a GUC setting) to help other use cases that struggle due to the limitations of gp_segment_configuration.
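
For illustration, here is a minimal C sketch of what that reader selection inside gp_get_segment_configuration() could look like. This is only a sketch: readGpSegConfigFromFlatFile() is the assumed new helper from this proposal, and the return type and exact signatures are written from memory rather than copied from the code.

/*
 * Hypothetical sketch of the reader selection behind the proposed
 * gp_get_segment_configuration() set-returning function.
 */
static GpSegConfigEntry *
readGpSegConfig(int *total_dbs)
{
    if (EnableHotStandby)
        return readGpSegConfigFromFlatFile(total_dbs);  /* assumed new helper: static flat file */
    else if (IsTransactionState())
        return readGpSegConfigFromCatalog(total_dbs);   /* normal case: backing catalog table */
    else
        return readGpSegConfigFromFTSFiles(total_dbs);  /* outside a transaction: FTS dump file */
}

The set-returning function would then emit these entries as rows, and the gp_segment_configuration view would be a plain SELECT over it.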

Previous discussion thread reference:
https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/_zIdZlnZKK8/m/4f3TIMBMCgAJ

Shine Zhang

Jul 6, 2023, 11:25:26 AM
to Jimmy Yih, Greenplum Developers

Thanks Jimmy.

So far, all the discussions have centered around a mirrored deployment, with the auto-failover scenario in mind. I don't have direct comments on that.

I'd like to bring up related scenarios: the impact on a mirrorless deployment, and the impact on the historical table `gp_configuration_history`.

Here are my questions:

- Under mirrorless deployments, where FTS is disabled, will `gp_segment_configuration` still be readable?

- Since we'd capture the `gp_segment_configuration` differently, what's the impact on the `gp_configuration_history` table?

Thanks,

Shine


Jimmy Yih

Jul 7, 2023, 6:56:12 PM
to Shine Zhang, Greenplum Developers
> - Under mirrorless deployments, where FTS is disabled, will `gp_segment_configuration` still be readable?

There shouldn't be a difference. Querying gp_segment_configuration in a mirrorless deployment would work the same as in a mirrored deployment: the gp_segment_configuration view would simply return the contents of the new gp_internal_segment_configuration catalog table (which would be populated by gpinitsystem, the same as how it works today).

> - Since we'd capture the `gp_segment_configuration` differently, what's the impact on the `gp_configuration_history` table?

There should be no impact to the `gp_configuration_history` table. Its purpose would still be the same: to record updates from FTS for regular GPDB High Availability events. If necessary, other things could INSERT into the table as well, since it's just a table logging High-Availability-relevant changes... but I imagine non-FTS gp_segment_configuration changes would probably be logged elsewhere if that's even needed (e.g. I think pg_auto_failover has its own events table). If needed/desired, we could also extend this proposal to make `gp_configuration_history` into a view too (so that, for example, we could access the pg_auto_failover events table as a foreign server table)... but I don't see it being a requirement, whereas gp_segment_configuration being a table has made starting some new features harder.

Ashwin Agrawal

Jul 10, 2023, 4:20:45 AM
to Jimmy Yih, Greenplum Developers
In general, having an abstraction for gp_segment_configuration is helpful for longer-term reasons:
- it helps accomplish the really important goal of Cluster Management Utilities not directly updating the table; they should only use provided APIs to modify it
    - currently, the updateSystemConfig() and change_hostname() functions directly update gp_segment_configuration, so we would have to provide interfaces/APIs that expose this update functionality and have those functions use them
    - even internal GPDB code should use the API rather than updating the catalog directly
- under the hood, we can replace the table implementation with anything we wish without callers being affected by the implementation

Some of the action items to propose:
- Clearly articulate the functional APIs (add primary, add mirror, modify config and such ....) required to be exposed by underlying Segment Configuration implementations (Catalog, FlatFile, FTSFile, ...)
- Implement each of them with a BaseClass-and-derived-classes concept, to make it clear which implementation provides which functionality
- Expose a high-level abstraction on top using the BaseClass APIs, with logic to pick the derived-class implementation based on a GUC or some other control (a rough C sketch of this shape follows below)
- KB articles related to port or hostname changes which mention updating gp_segment_configuration directly need to be updated as well to use the interfaces, or better, we should provide utilities that contain all the logic from those KBs
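
To make the BaseClass/derived-classes idea concrete, here is a rough C sketch of one possible shape for the provider abstraction. Everything here is hypothetical and illustrative only: GpSegConfigProviderRoutine, the callback fields, and the three provider variables are not existing GPDB code, and GpSegConfigEntry simply stands in for whatever entry struct the readers return.

/* Hypothetical provider "vtable"; names and fields are illustrative only. */
typedef struct GpSegConfigProviderRoutine
{
    /* read the full segment configuration; returns an array of entries */
    GpSegConfigEntry *(*read_config) (int *total_dbs);

    /* mutations that utilities currently perform via direct SQL on the table */
    void (*add_primary) (int16 contentid, const char *hostname, const char *address,
                         int port, const char *datadir);
    void (*add_mirror) (int16 contentid, const char *hostname, const char *address,
                        int port, const char *datadir);
    void (*update_host_port) (int16 dbid, const char *newhostname,
                              const char *newaddress, int newport);
} GpSegConfigProviderRoutine;

/* "derived classes": one routine struct per implementation */
extern const GpSegConfigProviderRoutine CatalogConfigProvider;  /* backed by the catalog table */
extern const GpSegConfigProviderRoutine FTSFileConfigProvider;  /* backed by the FTS dump file */
extern const GpSegConfigProviderRoutine FlatFileConfigProvider; /* backed by a static flat file */

The high-level abstraction would then route every read or update through whichever routine struct is selected by a GUC or other control, so neither utilities nor internal code ever touch the table directly.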

Question:
- How will stopping or starting (utilities) on the replica cluster choose which gp_segment_configuration implementation to pick?


--
Ashwin Agrawal (VMware)

Shine Zhang

Jul 10, 2023, 11:31:56 AM
to Jimmy Yih, Greenplum Developers

This is great to know. Thanks.

Shivram Mani

Jul 10, 2023, 12:49:27 PM
to Ashwin Agrawal, Jimmy Yih, Greenplum Developers
Really like this proposal introducing a view/function-based abstraction to gp_segment_configuration.
Ashwin, w.r.t. the additional action items that you highlighted, do you want those to be done alongside the above change, or can we as a first step get the proposed change in (unless there is any other concern) and incrementally work on and finalize the follow-up items?

Thanks
Shivram


Kalen Krempely

Jul 10, 2023, 3:55:25 PM
to Ashwin Agrawal, Jimmy Yih, Greenplum Developers
Just as a note, gpupgrade needs to update gp_segment_configuration to "swap" the upgraded cluster with the "original". It uses the following SQL:
"UPDATE gp_segment_configuration SET port = $1, datadir = $2 WHERE content = $3 AND role = $4"

Hopefully, the new APIs will allow that functionality.



Jimmy Yih

Jul 10, 2023, 4:41:02 PM
to Ashwin Agrawal, Greenplum Developers
> - How will stopping or starting (utilities) on the replica cluster choose which gp_segment_configuration implementation to pick?

As stated in the proposal, it could be done by simply checking for EnableHotStandby and maybe recovery mode state. Current dispatcher logic already checks for transaction state so we'd just be extending the logic with more cases. Or we could trivially control/force it with a new reloadable GUC.

In the replica cluster scenario, the gp_segment_configuration info could be stored in a flat file similar to the FTS twophase file. The replica cluster info is generally static and would never change (or at least not until gpexpand is supported, and then you'd just add an entry to the flat file).
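
As a purely illustrative sketch of the flat-file idea (the line format, struct, and function below are hypothetical and not the actual FTS dump-file format), the replica cluster's configuration could be a one-line-per-segment text file that a small reader parses:

/*
 * Hypothetical flat-file reader for a replica cluster's segment configuration.
 * Assumed line format (one segment per line):
 *   dbid content role preferred_role mode status port hostname address datadir
 */
#include <stdio.h>

typedef struct ReplicaSegEntry
{
    int  dbid;
    int  content;
    char role;
    char preferred_role;
    char mode;
    char status;
    int  port;
    char hostname[256];
    char address[256];
    char datadir[1024];
} ReplicaSegEntry;

static int
read_replica_config(const char *path, ReplicaSegEntry *entries, int max_entries)
{
    FILE *fp = fopen(path, "r");
    int   n = 0;

    if (fp == NULL)
        return -1;

    while (n < max_entries &&
           fscanf(fp, "%d %d %c %c %c %c %d %255s %255s %1023s",
                  &entries[n].dbid, &entries[n].content,
                  &entries[n].role, &entries[n].preferred_role,
                  &entries[n].mode, &entries[n].status,
                  &entries[n].port, entries[n].hostname,
                  entries[n].address, entries[n].datadir) == 10)
        n++;

    fclose(fp);
    return n;
}

Supporting gpexpand on the replica would then just mean appending a line per new segment.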

> - Clearly articulate the functional APIs (add primary, add mirror, modify config and such ....) required to be exposed by underlying Segment Configuration implementations (Catalog, FlatFile, FTSFile, ...)

We already have catalog functions to do this:
gp_add_coordinator_standby
gp_add_segment
gp_add_segment_primary
gp_add_segment_mirror
gp_remove_coordinator_standby
gp_remove_segment
gp_remove_segment_mirror

They can be updated later if needed to satisfy any new logic.

Soumyadeep Chakraborty

Jul 10, 2023, 6:48:46 PM
to Jimmy Yih, Ashwin Agrawal, Greenplum Developers
Hey folks,

I realize I am late to this party. Let me share my thoughts nonetheless.

(1) Having a pluggable component fully manage the cluster configuration
throughout the uptime of the cluster is a challenging ask, with complete APIs
etc.

I thought about this a bit and I feel that at this stage of the 7 dev cycle, let's
not introduce this change. There is a fair bit to do to get it right, and I
don't think we have the cycles to develop and test. The server backend won't be
the only thing changing; utilities (and their tests) will have to change and be
tested too (and, as mentioned, KBs, docs, etc.).

I think we should do it in the best way possible and not compromise due to the
amount of time we have left.

I have added some thoughts on a reference implementation below, which I think
we should do for GP8.

(2) There is quite a bit of code around (gangs, dispatch, etc.) which lies in
critical paths. The gang and dispatch code today rely on a cache,
cdb_component_dbs, which is currently invalidated via FTS and even gpexpand! Now,
we would need to add interface routines around these as well.
we would need to add interface routines around these as well.

else if ((cdb_component_dbs->fts_version != ftsVersion ||
          cdb_component_dbs->expand_version != expandVersion))
{
    ...
    cdbcomponent_destroyCdbComponents();
    cdb_component_dbs = getCdbComponentInfo();
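
One possible shape for such an interface routine, heavily hedged: cdbcomponent_getUpToDateComponents(), ClusterConfigProviderGetVersion(), and the config_version field are hypothetical names added for illustration; getCdbComponentInfo() and cdbcomponent_destroyCdbComponents() are the existing calls shown above, and CdbComponentDatabases is assumed to be the cache's existing type.

/*
 * Hypothetical wrapper that hides the staleness check from gang/dispatch
 * callers, so they never consult fts_version/expand_version directly.
 */
static CdbComponentDatabases *
cdbcomponent_getUpToDateComponents(void)
{
    uint64 current_version = ClusterConfigProviderGetVersion(); /* hypothetical */

    if (cdb_component_dbs != NULL &&
        cdb_component_dbs->config_version != current_version)   /* hypothetical field */
    {
        cdbcomponent_destroyCdbComponents();
        cdb_component_dbs = NULL;
    }

    if (cdb_component_dbs == NULL)
        cdb_component_dbs = getCdbComponentInfo();

    return cdb_component_dbs;
}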

PS: I also strongly feel that a dispatcher rewrite is in order. But let's leave
that as a separate topic.

(3) One other way to tackle the problem of cluster config maintenance is to
completely decouple it from the server -> just have one provider of truth: etcd
(or similar).

Instead of having multiple ways to query/update cluster configuration, there
would only be one way (and there should really be one way).

Utilities can directly update/read it and so can the server
backends. gp_segment_configuration can be a view kept for backwards
compatibility and DBA convenience.

And there need not be any catalog table at all. An invalidation mechanism would
be necessary to update the backend local caches, whenever the cluster
configuration changes (can be done via listen/notify etc).

This approach has several benefits, including a single source of truth and no
need for a running GPDB cluster to query/update the configuration, apart from
the more obvious benefits of using etcd. Also, there is no need for the dance
that we do in FTS:

* In phase 2 of 2PC, current xact has been marked to TRANS_COMMIT/ABORT,
* COMMIT_PREPARED or ABORT_PREPARED DTM are performed, if they failed,
* dispather disconnect and destroy all gangs and fetch the latest segment
* configurations to do RETRY_COMMIT_PREPARED or RETRY_ABORT_PREPARED,
* however, postgres disallow catalog lookups outside of xacts.
*
* readGpSegConfigFromFTSFiles() notify FTS to dump the configs from catalog
* to a flat file and then read configurations from that file.

This dance goes away because etcd can have transaction semantics for updates.
The dance in FTS exists only because we try to maintain gp_segment_configuration
in the system catalogs.

I feel that we should strongly explore this architectural option (as opposed to
having etcd as a pluggable provider or having different pluggable providers).

(4) Possible temporary compromise to support read replica clusters:

I think the main requirement for a read replica cluster and hot standby dispatch
is that a mechanism is needed to seed a fresh gp_segment_configuration
that departs
from the primary cluster. What if through a utility, we could modify the
gp_segment_configuration catalog? The main challenge of doing that though is
that gp_segment_configuration is WAL replicated and any updates in the primary
cluster will mess with the read replica cluster.
What if we made gp_segment_configuration UNLOGGED? Making it so
might mean a bit of work for gpactivatestandby, but might help us here.

I can't quite recall what Jimmy did here in the code to read the cluster config
from the catalog in one of our POCs for hot standby dispatch, but maybe it
looked like this:

if (guc)
    configs = readGpSegConfigFromFile(&total_dbs);
else if (IsTransactionState())
    configs = readGpSegConfigFromCatalog(&total_dbs);
else
    configs = readGpSegConfigFromFTSFiles(&total_dbs);

I think this is the simplest thing to do at this point for the 7X release, using
a flat file as an integration mechanism to tackle this one need. It can also be
done post-release as a new feature.

On Mon, Jul 10, 2023 at 1:20AM Ashwin Agrawal <ashwi...@gmail.com> wrote:

> In general, having an abstraction for gp_segment_configuration is helpful for longer-term reasons:
> - it helps accomplish the really important goal of Cluster Management Utilities not directly updating the table; they should only use provided APIs to modify it
> - currently, the updateSystemConfig() and change_hostname() functions directly update gp_segment_configuration, so we would have to provide interfaces/APIs that expose this update functionality and have those functions use them
> - even internal GPDB code should use the API rather than updating the catalog directly
> - under the hood, we can replace the table implementation with anything we wish without callers being affected by the implementation
>
> Some of the action items to propose:
> - Clearly articulate the functional APIs (add primary, add mirror, modify config and such ....) required to be exposed by underlying Segment Configuration implementations (Catalog, FlatFile, FTSFile, ...)
> - Implement each of them with a BaseClass-and-derived-classes concept, to make it clear which implementation provides which functionality
> - Expose a high-level abstraction on top using the BaseClass APIs, with logic to pick the derived-class implementation based on a GUC or some other control
> - KB articles related to port or hostname changes which mention updating gp_segment_configuration directly need to be updated as well to use the interfaces, or better, we should provide utilities that contain all the logic from those KBs

Possible reference implementation for Pluggable Cluster Config Providers:

1. We can have what we call a ClusterConfigProvider, which can be provided by
any extension (or even a dynamically loaded library), and these will be captured
in a new catalog: gp_cluster_config_provider. They can be represented as API
structs such as TableAmRoutine or IndexAmRoutine. This will be the way to
achieve some of the inheritance that you are talking about.

2. Some of the built-in ones (catalog, fts_flat_file) can be burned in at initdb
time, and can be internally provided. All user-facing views, such as
gp_segment_configuration, will be backed by the provided APIs. Non-core ones can
be installed via extensions, using a new DDL like CREATE CLUSTER CONFIG
PROVIDER (akin to CREATE ACCESS METHOD).

3. There should also be a way to switch the provider. For instance we
do this today:

if (IsTransactionState())
    configs = readGpSegConfigFromCatalog(&total_dbs);
else
    configs = readGpSegConfigFromFTSFiles(&total_dbs);

Instead, we could do:
configs = readGpSegConfigFromCurrentProvider(&total_dbs); or something similar.

and readGpSegConfigFromCurrentProvider() can consult the GUC:
cluster_config_provider, which will hold the current provider (default can be
catalog).

Backends (like FTS) which need to read the catalog provider and then update the
fts files provider could change the value of the GUC in process-local memory
before doing the update to the file, and then revert it after the
update is done.
Some overhead, but FTS should not be updating this config so frequently.
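
A minimal sketch of that GUC-driven selection; cluster_config_provider, the enum, and the extension lookup below are hypothetical names from this proposal rather than existing code.

/* Hypothetical GUC backing variable: which provider serves the segment configuration. */
typedef enum
{
    CLUSTER_CONFIG_PROVIDER_CATALOG,
    CLUSTER_CONFIG_PROVIDER_FTS_FLAT_FILE,
    CLUSTER_CONFIG_PROVIDER_EXTENSION
} ClusterConfigProviderType;

static int cluster_config_provider = CLUSTER_CONFIG_PROVIDER_CATALOG;

/* Single entry point callers would use instead of picking a reader themselves. */
static GpSegConfigEntry *
readGpSegConfigFromCurrentProvider(int *total_dbs)
{
    switch (cluster_config_provider)
    {
        case CLUSTER_CONFIG_PROVIDER_CATALOG:
            return readGpSegConfigFromCatalog(total_dbs);
        case CLUSTER_CONFIG_PROVIDER_FTS_FLAT_FILE:
            return readGpSegConfigFromFTSFiles(total_dbs);
        default:
            /* an extension-registered provider would be looked up here,
             * e.g. via the proposed gp_cluster_config_provider catalog */
            return NULL;
    }
}

As described above, FTS could temporarily flip cluster_config_provider to the catalog value in process-local memory while dumping the flat file, and revert it afterwards.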

Additional config options for each specific provider can be added as
extension-specific GUCs (e.g. auto_explain has its own special non-core GUCs) if
the provider so chooses.

4. All of the added indirection that this pluggable provider approach brings can
be a shade detrimental to critical paths such as all gang management and
dispatch code, especially for OLTP queries (see all callers of
readGpSegConfigFromCatalog()). The segment config cache exists for this purpose
today and we can continue to use it. So we would need a way to invalidate these
backend-local caches. That should be part of the framework code that would call
into the provider APIs.

5. gp_configuration_history is a pretty critical entity and would need
interfaces around it as well. It could be made a view now too.

Regards,
Soumyadeep (VMware)

Ashwin Agrawal

Jul 11, 2023, 4:34:57 AM
to Soumyadeep Chakraborty, Jimmy Yih, Greenplum Developers
On Tue, Jul 11, 2023 at 4:18 AM Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
> Hey folks,
>
> I realize I am late to this party. Let me share my thoughts nonetheless.

Not really, the party is just starting.... :-)

> (1) Having a pluggable component fully manage the cluster configuration
> throughout the uptime of the cluster is a challenging ask, with complete APIs
> etc.
>
> I thought about this a bit and I feel that at this stage of the 7 dev cycle, let's
> not introduce this change. There is a fair bit to do to get it right, and I
> don't think we have the cycles to develop and test. The server backend won't be
> the only thing changing; utilities (and their tests) will have to change and be
> tested too (and, as mentioned, KBs, docs, etc.).
>
> I think we should do it in the best way possible and not compromise due to the
> amount of time we have left.

We have not committed to or decided the scope from this aspect for GPDB7
yet. Once we hash out the details, we will know whether it's in or out. My
ask to first come up with a list of functionalities (APIs) to be
provided was purely from this perspective, to highlight that it's not a simple
change where we just convert the table to a view and are done. As soon as we
do that, we are in the business of providing all the capabilities that code
currently gets by directly modifying the table via SQL (like the examples
provided from utilities, upgrade, and the numerous instances of fixing
customer systems ...).

Once the functional requirements are written out and the abstraction is
created, the different ways to implement it (internal catalog table,
some external software, ...) can be traced out.

> (2) There is quite a bit of code around (gangs, dispatch, etc.) which lies in
> critical paths. The gang and dispatch code today rely on a cache,
> cdb_component_dbs, which is currently invalidated via FTS and even gpexpand! Now,
> we would need to add interface routines around these as well.
>
> else if ((cdb_component_dbs->fts_version != ftsVersion ||
>           cdb_component_dbs->expand_version != expandVersion))
> {
>     ...
>     cdbcomponent_destroyCdbComponents();
>     cdb_component_dbs = getCdbComponentInfo();

Exactly, add all these to the functional requirements for which an API needs to be provided.

> (3) One other way to tackle the problem of cluster config maintenance is to
> completely decouple it from the server -> just have one provider of truth: etcd
> (or similar).
>
> Instead of having multiple ways to query/update cluster configuration, there
> would only be one way (and there should really be one way).
>
> Utilities can directly update/read it and so can the server
> backends.

Let me clarify: utilities directly updating means not going via the server but
still using APIs. We have been burnt badly by utilities directly updating the
gp_segment_configuration table (via SQL). So, even if we move it somewhere else
like etcd, SQLite, or whatever, we need them to still only work with APIs,
without direct exposure to the underlying implementation, as we should be able
to easily change the underlying implementation.

> (4) Possible temporary compromise to support read replica clusters:
>
> I think the main requirement for a read replica cluster and hot standby dispatch
> is that a mechanism is needed to seed a fresh gp_segment_configuration that
> departs from the primary cluster. What if through a utility, we could modify the
> gp_segment_configuration catalog? The main challenge of doing that though is
> that gp_segment_configuration is WAL replicated and any updates in the primary
> cluster will mess with the read replica cluster.
> What if we made gp_segment_configuration UNLOGGED? Making it so
> might mean a bit of work for gpactivatestandby, but might help us here.

Making gp_segment_configuration UNLOGGED doesn't solve the problem, it
actually creates more :-) Cluster replication can pick up incremental
changes either via WAL replay or via a direct filesystem-level diff (for
example, pgBackRest's delta restore feature). Plus, making
gp_segment_configuration UNLOGGED clearly means it's not crash-safe, which is
100% unacceptable. An unclean shutdown would truncate that table out,
ouch... So that's a non-starter as an option.

If WAL replay were the only problem, then we could instead have hacked
up something to avoid replay of gp_segment_configuration in the DR cluster
via some mechanism like adding a special bit to the WAL record, or basing
it on the OID recorded in the WAL, or similar mechanisms. But as stated before,
replay is not the only aspect touching it.

> I can't quite recall what Jimmy did here in the code to read the cluster config
> from the catalog in one of our POCs for hot standby dispatch, but maybe it
> looked like this:
>
> if (guc)
>     configs = readGpSegConfigFromFile(&total_dbs);
> else if (IsTransactionState())
>     configs = readGpSegConfigFromCatalog(&total_dbs);
> else
>     configs = readGpSegConfigFromFTSFiles(&total_dbs);

That is exactly what Jimmy is proposing in this thread, and to
accomplish that goal he proposed converting gp_segment_configuration to a
view. Then such switching can be provided via a GUC for generating the
view output either from the catalog or from a file.

> I think this is the simplest thing to do at this point for the 7X release, using
> a flat file as an integration mechanism to tackle this one need. It can also be
> done post-release as a new feature.

It can't be done post-release; as per this thread, we need to convert
gp_segment_configuration to a view. Then, as fallout from converting
gp_segment_configuration to a view, we need to make all direct modifiers of
it adhere to using the APIs (and provide APIs where they don't currently exist,
like the port- or hostname-modification functionality). Or replace them and
point them at the gp_internal_segment_configuration table (which means it's not
really going to be internal :) and won't work for a replica
cluster). When the same functionality is required on a replica cluster,
a separate procedure/utility has to be hashed out to update the
flat file. The need is going to arise, even if not immediately, to support
similar functionality for a replica cluster.

That's where it circles back to what functionality/APIs are
required for cluster configuration and, from that list, what needs
to be implemented for the flat-file configuration for the replica.

We also need to work out what semantics need to be kept in sync with the
current cluster-catalog-based config (for example, DBIDs are currently assigned
by the gp_add_segment_primary() interface... I believe for the flat-file config
we would be providing the dbids...).

How can we make sure to carve out a common utility, like updating a port or
hostname, that works the same for users irrespective of whether it is executed
against the primary or the replica cluster and hides the internal
implementation? We can implement it in iterations over the longer term, but the
plan for how to accomplish it needs to be in place. That would
give us a clear idea of what needs to be implemented right now,
before GA, when converting the table to a view. Most of the flat-file-config
related implementation can come post-GA once other things are
in place.

> 4. All of the added indirection that this pluggable provider approach brings can
> be a shade detrimental to critical paths such as all gang management and
> dispatch code, especially for OLTP queries (see all callers of
> readGpSegConfigFromCatalog()). The segment config cache exists for this purpose
> today and we can continue to use it. So we would need a way to invalidate these
> backend-local caches. That should be part of the framework code that would call
> into the provider APIs.

> 5. gp_configuration_history is a pretty critical entity and would need
> interfaces around it as well. It could be made a view now too.

I was not even going there (as that is for sure GPDB8 material) :-)
Externalizing configuration and decoupling it from the server is one of
the biggest desires. And if we go after that goal, a lot depends on
where we design the abstraction. What you articulated above keeps the
abstraction inside the server/GPDB (as the provider routing happens from the
server, a contradiction to the earlier point raised in the thread that the
config should be utility- and server-independent). Or at least the proposal
assumes the config provider hosting (catalog table, extension, and all other
aspects, even if the config is not directly stored in PostgreSQL) will be
PostgreSQL-based (perhaps on some separate dedicated node), which may not be
the case either :-). Whether it will be etcd, SQLite, PostgreSQL (or
something else) storing the config, who knows for now; and whether the APIs
to interact with it will be SQL, C, Go, or something else is also
unknown as of now. So we will have to think through where the
abstraction is coded and then design the APIs for the externalized config;
I can't say it will be similar to the TableAM API (as that was a
server-centric piece).


--
Ashwin Agrawal (VMware)