On Wed, Mar 31, 2021 at 8:02 PM Hao Wu <ha...@vmware.com> wrote:
> Changes:
>
> rename gp_segment_configuration to gp_internal_segment_configuration.
> create a UDF gp_get_segment_configuration() to get the full segment configuration
> create a view gp_segment_configuration that shows the content from gp_get_segment_configuration().
>
> The view gp_segment_configuration is used outside of the server; it guarantees that we always get the correct segment configuration or an error.
> The function gp_get_segment_configuration() will raise an error if one of the following happens:
>
> The server instance is not the current coordinator acknowledged by the monitor.
> The standby is being promoted.
>
> Details:
> We keep the same content as the previous gp_segment_configuration in gp_internal_segment_configuration. In gp_get_segment_configuration(),
> we follow some simple rules:
>
> We only trust these fields for the coordinator and standby in gp_internal_segment_configuration: dbid, content, hostname, address, port, datadir.
> The other fields (role, preferred_role, mode, status) in gp_internal_segment_configuration might change at runtime, so we must replace these volatile fields before returning.
> We assume that the trusted fields can't change with auto-failover enabled.
>
> First, we look up the monitor to fetch the node status and translate it into the fields of gp_internal_segment_configuration that might change
> at runtime: role, preferred_role, mode, and status. Then, if we find the primary node (which is the coordinator in GPDB),
> we run some validations and replace the volatile fields in memory with the translated values.
>
> To also work when auto-failover is disabled, the UDF gp_get_segment_configuration() first checks the GUC monitor_conn_info to see whether auto-failover is enabled. If it is not, the function simply returns the tuples in gp_internal_segment_configuration.
>
> One thing to mention is that gp_internal_segment_configuration is used directly by the interconnect, and in that path we do not look up the monitor. This is safe because another patch makes the primary remember the coordinator ID, and the QE will refuse any connection that doesn't match that coordinator ID. An additional benefit is that the cluster can still work if the monitor is down.
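To make sure we are reading the proposal the same way, I take the shape
described above to be roughly the following (column list from today's
gp_segment_configuration; the actual DDL in the patch may differ):

    CREATE VIEW gp_segment_configuration AS
        SELECT dbid, content, role, preferred_role, mode, status,
               port, hostname, address, datadir
        FROM gp_get_segment_configuration();
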
Since we are talking about changing gp_segment_configuration, I want to take
this chance to bring up a topic that is unrelated to coordinator auto-failover
but which must be discussed now.
I think we should allow the complete externalization of
gp_segment_configuration. Here is why:
1. If Greenplum is configured with archive-based replication, we need two
separate gp_segment_configuration tables - one to keep track of the primary
cluster (a cluster with only primaries, writing WAL to an archive location) and
the other to keep track of the replica cluster (a cluster with only mirrors,
which consume from the archive). We cannot synthesize the cluster state from
the two clusters into one view, as one cluster may not know about the other.
It is currently not possible to manage the replica cluster with existing GP
utilities such as gpstart, gpstop, etc. This is because the
gp_segment_configuration table is replicated over the WAL stream from the
primary cluster, so its contents describe the primary cluster and not the
replica cluster.
This also prevents hot standby dispatch from ever working on such a cluster.
Thus, we need to supply gp_segment_configuration externally in the replica
cluster.
2. It would be convenient for utilities such as gpconfig to be able to work
without the server being live (to set GUCs that need to be set before the
server is brought up, e.g. archive_command).
3. Anyone (DBA or program) could view gp_segment_configuration without running
the server.
The ways we can externalize the table:
1) External table
2) Foreign server (for pg_auto_failover, this could well be the monitor's
database, perhaps in a separate schema)
The external entity will have to be the source of truth for the cluster
configuration at all times.
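For option 2, a rough sketch of what the wiring could look like (every name,
the schema, and the connection options below are placeholders; I'm assuming
postgres_fdw pointed at the monitor's database):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SCHEMA IF NOT EXISTS gpconfig;

    CREATE SERVER gp_config_server FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'monitor-host', port '5432', dbname 'pg_auto_failover');

    CREATE USER MAPPING FOR CURRENT_USER SERVER gp_config_server
        OPTIONS (user 'autoctl_node');

    -- Hypothetical table living in a separate schema on the monitor.
    CREATE FOREIGN TABLE gpconfig.segment_configuration (
        dbid           smallint,
        content        smallint,
        role           "char",
        preferred_role "char",
        mode           "char",
        status         "char",
        port           integer,
        hostname       text,
        address        text,
        datadir        text
    ) SERVER gp_config_server
      OPTIONS (schema_name 'gpconfig', table_name 'segment_configuration');
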
If we are to go down this road:
* We should expose UDFs to get/update this external entity (rough sketch after
this list).
* Then, we can continue to represent gp_segment_configuration as a view, which
internally resolves to the external source via the UDF.
* FTS, as part of its update step (probeWalRepUpdateConfig()), can update the
external entity and also the local cache of cluster state (I am not familiar
enough with FTS to know where this is).
* We would not have a gp_internal_segment_configuration table in the catalog at
all. None of GP's MPP mechanisms should read from it, including the
interconnect. This does mean a lot of code needs to change: for instance, all
the call sites where we do heap_open(GpSegmentConfigRelationId, ...) will have
to be reworked.
* Mechanisms could be made to rely exclusively on the cache of cluster state
(which is going to be constantly updated by FTS). So we might have to get rid of
readGpSegConfigFromCatalog() and replace it with a mechanism to read from the
cache, backed by the external entity.
* GP utilities will have to be rewired to fetch gp_segment_configuration
directly from the external entity.
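To make the first two bullets concrete, here is the rough shape I have in mind
(signatures are placeholders and the sql-language bodies are only for
illustration; in practice the getter would presumably be a C function backed by
the cache):

    -- Hypothetical getter: resolves to the external entity, e.g. the
    -- foreign table sketched earlier.
    CREATE FUNCTION gp_get_segment_configuration()
    RETURNS TABLE (dbid smallint, content smallint, role "char",
                   preferred_role "char", mode "char", status "char",
                   port integer, hostname text, address text, datadir text)
    LANGUAGE sql STABLE AS $$
        SELECT * FROM gpconfig.segment_configuration;
    $$;

    -- gp_segment_configuration stays a plain view over the getter.
    CREATE VIEW gp_segment_configuration AS
        SELECT * FROM gp_get_segment_configuration();

    -- Hypothetical updater, e.g. for FTS to record state transitions.
    CREATE FUNCTION gp_update_segment_configuration(
        _dbid smallint, _role "char", _preferred_role "char",
        _mode "char", _status "char")
    RETURNS void
    LANGUAGE sql AS $$
        UPDATE gpconfig.segment_configuration
           SET role = _role, preferred_role = _preferred_role,
               mode = _mode, status = _status
         WHERE dbid = _dbid;
    $$;
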
As an alternative to completely externalizing gp_segment_configuration, we
could probably meet halfway: support an external gp_segment_configuration and
hook into gp_get_segment_configuration() so it reads from the external table,
in addition to reading from gp_internal_segment_configuration and the monitor.
However, that would only address reason 1 above, not reasons 2 and 3. I think
we should go all the way and completely externalize it.
Externalizing gp_segment_configuration entirely does have its own set of
concerns, such as fault tolerance for the external table (what happens if it is
inaccessible or corrupt?) or even the foreign server (what if it is down?).
Looking forward to hearing more ideas on the subject.
Regards,
Deep