Auto-failover: changes of gp_segment_configuration

Hao Wu

Mar 31, 2021, 11:02:33 PM
to Greenplum Developers, Asim Praveen, Kalen Krempely
Hi hackers,

With auto-failover enabled in GPDB 7, the coordinator instance may change at runtime, i.e. the standby may be promoted to be the new active coordinator in the cluster. A couple of things must change to support this; one of them is how gp_segment_configuration is handled.

gp_segment_configuration is a catalog table in GPDB that stores the configuration of the segments and the coordinator/standby. With auto-failover,
we may fetch the segment configuration from a node that is no longer the coordinator but acted as one before. We must ensure that
the segment configuration we get is always correct.

So the first step is to identify the currently active coordinator in the cluster. As we discussed before, the monitor
is the only source of truth about who the current coordinator is. This is why we change the way gp_segment_configuration is accessed.

Changes:
  1. Rename gp_segment_configuration to gp_internal_segment_configuration.
  2. Create a UDF gp_get_segment_configuration() that returns the full segment configuration.
  3. Create a view gp_segment_configuration that shows the content of gp_get_segment_configuration().
The view gp_segment_configuration is what is used outside the server; it guarantees that we always get the correct segment configuration or an error.
The function gp_get_segment_configuration() raises an error if either of the following happens:
  1. The server instance is not the current coordinator acknowledged by the monitor.
  2. The standby is being promoted.
Details:
gp_internal_segment_configuration keeps the same content as the previous gp_segment_configuration. In gp_get_segment_configuration(),
we follow some simple rules:
  1. For the coordinator and standby, we only trust these fields in gp_internal_segment_configuration: dbid, content, hostname, address, port, datadir.
  2. The other fields (role, preferred_role, mode, status) might change at runtime, so we must replace these volatile fields before returning.
  3. We assume that the trusted fields can't change while auto-failover is enabled.
First, we query the monitor to fetch the node status and translate it into the fields of gp_internal_segment_configuration that might change
at runtime: role, preferred_role, mode, status. Then, if we find the primary node (which is the coordinator in GPDB),
we run some validations and replace the volatile fields in memory with the translated values.

To keep working without auto-failover, the UDF gp_get_segment_configuration() first checks the GUC monitor_conn_info to see whether auto-failover is enabled. If not, the function simply returns the tuples in gp_internal_segment_configuration.
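
To make the proposed shape concrete, here is a minimal sketch of the view-over-UDF layering (the CREATE FUNCTION below is illustrative; the real function would be defined as an internal C function in the catalog, and the table rename would happen in the catalog definition rather than via DDL):

    -- Sketch only: the C implementation consults the monitor when
    -- monitor_conn_info is set, validates that this instance is the
    -- acknowledged coordinator, and overwrites the volatile fields
    -- (role, preferred_role, mode, status) before returning rows.
    CREATE FUNCTION gp_get_segment_configuration()
        RETURNS SETOF gp_internal_segment_configuration
        AS 'gp_get_segment_configuration' LANGUAGE internal VOLATILE;

    CREATE VIEW gp_segment_configuration AS
        SELECT * FROM gp_get_segment_configuration();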

One thing to mention: gp_internal_segment_configuration is used directly by the interconnect, without looking up the monitor. This is safe because we have another patch that makes the primary remember the coordinator ID, and a QE will refuse any connection that doesn't match that coordinator ID. An additional benefit is that the cluster can still work if the monitor is down.

Any ideas?

Regards,
Hao Wu

dkri...@pivotal.io

Apr 1, 2021, 12:24:50 PM
to Greenplum Developers, ha...@vmware.com, Asim R P, Kalen Krempely
The current CM utilities (gpaddmirrors, gprecoverseg, gpstop, ...) rely on gp_segment_configuration to discover what actions need to be done.  The basic workflow of the CM utilities is:
  • connect to coordinator and get gp_segment_configuration
  • use the information in gp_segment_configuration to decide what to do
    • for gprecoverseg, find the down segments and recover them
    • for gpstop, discover all up segments in the cluster and stop them
    • other utilities behave similarly
So the CM utilities require a reliable mechanism both to
  1. obtain the "current" gp_segment_configuration at the time of a given request
  2. perform some action (say, pg_ctl stop -D primary_one) and then get the "new" gp_segment_configuration
      1. that is, get the gp_segment_configuration guaranteed to contain the effects of actions that happened before reading it
      2. this is the same as getting the "current" gp_segment_configuration after performing some synchronization action.

    For getting the "new" gp_segment_configuration, the synchronization action we currently use is a select gp_request_fts_probe_scan() followed by creating a temp table in a transaction (BEGIN; CREATE TEMP TABLE tt(a int); COMMIT;). After the synchronization action successfully completes, we know we can proceed. We typically loop until that action succeeds, up to some long timeout value.
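
    Spelled out, with my reading of why each step is needed:

        SELECT gp_request_fts_probe_scan();  -- force an immediate FTS probe so the
                                             -- catalog reflects the current cluster state
        BEGIN;
        CREATE TEMP TABLE tt(a int);         -- a small distributed transaction; it can
        COMMIT;                              -- only commit once the cluster is functioning
                                             -- under the (possibly new) configuration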

    Can we use this same synchronization action with auto_failover?  If not, what can we use?

    For getting the "current" gp_segment_configuration, I am concerned about how we are to deal with changes happening while a CM utility is running.  The first time a utility requests gp_segment_configuration, we can retry in a loop for some period of time until we do not get an error.  However, many of our CM utilities call other CM utilities, and each one gets gp_segment_configuration directly, which might change throughout the utility's duration (often several hours).

    So how are we to deal with the coordinator failing over during a CM utility?  

    Our CM utility actions are not, of course, atomic with respect to an auto_failover.  Since the auto_failover can happen "at any time", if we need to perform an action on the coordinator, we need its identity to be stable until our action is complete.  This will come to light during a full review of our code, but I am pretty sure we identify the coordinator and then perform some action based on it (like stopping it).  I am not sure that will still work if the coordinator changes during a run.

    These issues will be discussed further as we review the designs of auto_failover, but I wanted to stress that the CM utilities need an architected way to obtain the "current" gp_segment_configuration as well as to perform a synchronization action on it.

    --D

    Hao Wu

    Apr 5, 2021, 10:40:56 PM
    to David Krieger (Pivotal), Greenplum Developers, Asim Praveen, Kalen Krempely
    > For getting the "new" gp_segment_configuration, the synchronization action we currently use is a select gp_request_fts_probe_scan() followed by creating a temp table in a transaction (BEGIN; CREATE TEMP TABLE tt(a int); COMMIT;). After the synchronization action successfully completes, we know we can proceed. We typically loop until that action succeeds, up to some long timeout value.
    >
    > Can we use this same synchronization action with auto_failover? If not, what can we use?
    It becomes slightly different. Before GP 7, the coordinator/standby entries in gp_segment_configuration didn't change at runtime, so we only cared about the segment entries. With auto-failover that is no longer true: the coordinator/standby entries may change. The synchronization action becomes:
    1. Make sure the (role, mode) of the coordinator in gp_segment_configuration is ('p', 's');
    2. Run gp_request_fts_probe_scan() followed by creating a temp table.
    The role of the coordinator should be 'p'; if it is not, a failover has likely happened. The mode is meaningful with auto-failover: 's' means the coordinator and standby were synchronized at the last check and the GUC synchronous_standby_names is '*'.
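
    Concretely, step 1 could be a check like the following, which must return true before running the probe and the temp table creation (a sketch using the view's columns):

        -- The coordinator entry must be an in-sync primary before we proceed.
        SELECT count(*) = 1 AS coordinator_ok
          FROM gp_segment_configuration
         WHERE content = -1 AND role = 'p' AND mode = 's';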


    > So how are we to deal with the coordinator failing over during a CM utility?
    This is a good question. When the coordinator is down, or a network partition happens between the coordinator and the monitor, the state of the coordinator changes from primary to demote_timeout. We no longer consider it the active coordinator once its state becomes demote_timeout. The standby will then be promoted after a short while. Fetching gp_segment_configuration in that time gap will raise an error, so you can't run most utility tools.

    I admit that this only reduces the time gap; the problem still exists.
    A simple fix (not implemented yet) would be to provide an option to enable/disable promotion. Before running a utility tool, we disable promotion, and we re-enable it after the utility job; the promotion will then finish if the standby should be promoted.
    Keep in mind that the coordinator could go down while the utility tool is running; that probably means the utility can't run as expected while the cluster still wants to fail over.
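
    Purely as an illustration of the shape such an option could take (nothing below exists yet; the function names are invented):

        -- Hypothetical UDFs wrapping a promotion on/off switch in the monitor.
        SELECT gp_autofailover_disable_promotion();  -- before the utility starts
        -- ... the utility does its work ...
        SELECT gp_autofailover_enable_promotion();   -- afterwards; a promotion that
                                                     -- became necessary meanwhile
                                                     -- would proceed at this point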

    Regards,
    Hao Wu




    Soumyadeep Chakraborty

    Apr 5, 2021, 10:47:20 PM
    to Hao Wu, Greenplum Developers, Asim Praveen, Kalen Krempely, dkri...@pivotal.io
    On Wed, Mar 31, 2021 at 8:02 PM Hao Wu <ha...@vmware.com> wrote:
    > Changes:
    >
    > 1. Rename gp_segment_configuration to gp_internal_segment_configuration.
    > 2. Create a UDF gp_get_segment_configuration() that returns the full segment configuration.
    > 3. Create a view gp_segment_configuration that shows the content of gp_get_segment_configuration().
    >
    > The view gp_segment_configuration is what is used outside the server; it guarantees that we always get the correct segment configuration or an error.
    > The function gp_get_segment_configuration() raises an error if either of the following happens:
    >
    > 1. The server instance is not the current coordinator acknowledged by the monitor.
    > 2. The standby is being promoted.
    >
    > Details:
    > gp_internal_segment_configuration keeps the same content as the previous gp_segment_configuration. In gp_get_segment_configuration(),
    > we follow some simple rules:
    >
    > 1. For the coordinator and standby, we only trust these fields in gp_internal_segment_configuration: dbid, content, hostname, address, port, datadir.
    > 2. The other fields (role, preferred_role, mode, status) might change at runtime, so we must replace these volatile fields before returning.
    > 3. We assume that the trusted fields can't change while auto-failover is enabled.
    >
    > First, we query the monitor to fetch the node status and translate it into the fields of gp_internal_segment_configuration that might change
    > at runtime: role, preferred_role, mode, status. Then, if we find the primary node (which is the coordinator in GPDB),
    > we run some validations and replace the volatile fields in memory with the translated values.
    >
    > To keep working without auto-failover, the UDF gp_get_segment_configuration() first checks the GUC monitor_conn_info to see whether auto-failover is enabled. If not, the function simply returns the tuples in gp_internal_segment_configuration.
    >
    > One thing to mention: gp_internal_segment_configuration is used directly by the interconnect, without looking up the monitor. This is safe because we have another patch that makes the primary remember the coordinator ID, and a QE will refuse any connection that doesn't match that coordinator ID. An additional benefit is that the cluster can still work if the monitor is down.

    Since we are talking about changing gp_segment_configuration, I want to take the
    chance to bring up a topic that is unrelated to coordinator auto-failover but
    which must be discussed now.

    I think we should take this chance to allow the complete externalization of
    gp_segment_configuration. Here is why:

    1. If Greenplum is configured with archive based replication, we have the need
    for two separate gp_segment_configuration tables - one to keep track of the
    primary cluster (a cluster with only primaries, writing WAL to an archive
    location) and the other to keep track of the replica cluster (a cluster with
    only mirrors which consume from the archive). We cannot synthesize the cluster
    state from the two clusters into one view, as one cluster may not know about the
    other.

    It is currently not possible to manage the replica cluster with the existing
    GP utilities such as gpstart, gpstop etc. This is because the
    gp_segment_configuration table is replicated over the WAL stream from the
    primary cluster - i.e. it represents the primary cluster and not the
    replica cluster.

    This also prevents hot standby dispatch from ever working on such a cluster.

    Thus, we have a need to supply gp_segment_configuration externally, in the
    replica cluster.

    2. It would be convenient for utilities such as gpconfig to be able to work
    without the server being live (to set GUCs that need to be set before the
    server is brought up, e.g. archive_command).

    3. Anyone (DBA/program) can view gp_segment_configuration without running the
    server.

    The ways we can externalize the table:
    1) External table
    2) Foreign server (for pg_auto_failover, this could well be the monitor's
    database, perhaps in a separate schema; a sketch follows below)

    The external entity will have to be the source of truth for the cluster
    configuration at all times.
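
    To make the foreign server option (2) concrete, here is a rough sketch using
    postgres_fdw (the server, schema, table and user names are all invented for
    illustration; the columns mirror today's gp_segment_configuration):

        -- Illustrative only: the external entity lives in the monitor's database.
        CREATE EXTENSION IF NOT EXISTS postgres_fdw;

        CREATE SERVER gp_config_server FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'monitor-host', dbname 'pg_auto_failover');

        CREATE USER MAPPING FOR CURRENT_USER SERVER gp_config_server
            OPTIONS (user 'autoctl_node');

        CREATE FOREIGN TABLE gp_external_segment_configuration (
            dbid smallint, content smallint, role "char", preferred_role "char",
            mode "char", status "char", port int,
            hostname text, address text, datadir text
        ) SERVER gp_config_server
          OPTIONS (schema_name 'public', table_name 'segment_configuration');

        CREATE VIEW gp_segment_configuration AS
            SELECT * FROM gp_external_segment_configuration;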

    If we are to go down this road:

    * We should expose UDFs to get/update this external entity.

    * Then, we can continue to represent gp_segment_configuration as a view,
    which can internally resolve to the external source with the UDF.

    * FTS, as part of its update step (probeWalRepUpdateConfig()), can update the
    external entity and also the local cache of cluster state (I am not too familiar
    with FTS to know where this is).

    * We don't have a gp_internal_segment_configuration table in the catalog at
    all. None of GP's MPP mechanisms should read from it, including the
    interconnect. This does mean a lot of code needs to be reworked: for instance,
    all the call sites where we do heap_open(GpSegmentConfigRelationId, ...)
    will have to be reworked.

    * Mechanisms could be made to rely exclusively on the cache of cluster state
    (which is going to be constantly updated by FTS). So we might have to get rid of
    readGpSegConfigFromCatalog() and replace it with a mechanism to read from the
    cache, backed by the external entity.

    * GP utilities will have to be rewired to seek gp_segment_configuration
    directly from the external entity.

    As an alternative to completely externalizing gp_segment_configuration, we could
    probably meet halfway: support an external gp_segment_configuration and hook
    into gp_get_segment_configuration() so it reads from the external table, in
    addition to reading from gp_internal_segment_configuration and the monitor.
    However, that will only support reason 1 above and not 2 and 3. I think we
    should go all the way and completely externalize it.

    Externalizing gp_segment_configuration entirely does have its set of concerns -
    such as fault tolerance for the external table (what happens if it's
    inaccessible/corrupt?) or even the foreign server (what if it is down?).

    Look forward to hearing more ideas on the subject.

    Regards,
    Deep

    Hao Wu

    Apr 5, 2021, 11:36:36 PM
    to Soumyadeep Chakraborty, Greenplum Developers, Asim Praveen, Kalen Krempely, David Krieger (Pivotal)
    > Since we are talking about changing gp_segment_configuration, I want to take the
    > chance to bring up a topic that is unrelated to coordinator auto-failover but
    > which must be discussed now.
    >
    > I think we should take this chance to allow the complete externalization of
    > gp_segment_configuration. Here is why:
    >
    > 1. If Greenplum is configured with archive based replication, we have the need
    > for two separate gp_segment_configuration tables - one to keep track of the
    > primary cluster (a cluster with only primaries, writing WAL to an archive
    > location) and the other to keep track of the replica cluster (a cluster with
    > only mirrors which consume from the archive). We cannot synthesize the cluster
    > state from the two clusters into one view, as one cluster may not know about the
    > other.
    >
    > It is currently not possible to manage the replica cluster with the existing
    > GP utilities such as gpstart, gpstop etc. This is because the
    > gp_segment_configuration table is replicated over the WAL stream from the
    > primary cluster - i.e. it represents the primary cluster and not the
    > replica cluster.
    >
    > This also prevents hot standby dispatch from ever working on such a cluster.
    >
    > Thus, we have a need to supply gp_segment_configuration externally, in the
    > replica cluster.
    >
    > 2. It would be convenient for utilities such as gpconfig to be able to work
    > without the server being live (to set GUCs that need to be set before the
    > server is brought up, e.g. archive_command).
    +1, a fully externalized gp_segment_configuration is more appealing. From auto-failover's
    view, it simplifies the handling of gp_segment_configuration, and it's easier for the
    utility tools.
    But the server that hosts gp_segment_configuration should be highly available; otherwise,
    the GPDB cluster is totally unavailable whenever that server is unavailable for any
    reason, which defeats the purpose of auto-failover. Currently, the cluster can still
    run and serve queries if the monitor is down.

    Deep, could you describe more details about gp_segment_configuration in your scenario?
    Thank you.

    Regards,
    Hao Wu



    dkri...@pivotal.io

    Apr 6, 2021, 12:19:41 PM
    to Greenplum Developers, ha...@vmware.com, Asim R P, Kalen Krempely, dkri...@pivotal.io
    On Monday, April 5, 2021 at 7:40:56 PM UTC-7 ha...@vmware.com wrote:
    > With auto-failover that is no longer true: the coordinator/standby entries may change. The synchronization action becomes:
    > 1. Make sure the (role, mode) of the coordinator in gp_segment_configuration is ('p', 's');
    > 2. Run gp_request_fts_probe_scan() followed by creating a temp table.
    > The role of the coordinator should be 'p'; if it is not, a failover has likely happened. The mode is meaningful with auto-failover: 's' means the coordinator and standby were synchronized at the last check and the GUC synchronous_standby_names is '*'.


    I want to be explicit about how an external program (such as the CM utilities) will perform this synchronization action, to make sure the steps will work.

    Is the following correct?

    1). Identify the coordinator from select * from gp_segment_configuration where content = -1 and role = 'p'. Cache the entire table gp_segment_configuration in the utility.
    2). If there is no coordinator (the above query returns no entries), either:
         a). sit in a loop for, say, 1 minute, sleeping for 5 seconds and then retrying 1). until it succeeds. If it still doesn't succeed after 1 minute, fail the utility.
         b). fail the utility and tell the user to "try again later"
    3). If there is a coordinator (there is 1 entry returned above), run gp_request_fts_probe_scan() followed by creating a temp table.
    4). Now we know the cluster is likely in the state reported in 1) and held in the local python variable storing it.

    Note that after 1) and before 4), the cluster can still change state, but we don't handle that explicitly in our utilities.
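
    As a sketch, the server-side parts of those steps would look like this (the retry loop of 2a and the caching live in the utility's own python code):

        -- Step 1: cache the whole table, then identify the coordinator.
        SELECT * FROM gp_segment_configuration;
        SELECT dbid, hostname, port
          FROM gp_segment_configuration
         WHERE content = -1 AND role = 'p';   -- no row => step 2 (retry or fail)

        -- Step 3: the synchronization action.
        SELECT gp_request_fts_probe_scan();
        BEGIN;
        CREATE TEMP TABLE tt(a int);
        COMMIT;
        -- Step 4: the state cached in step 1 is now (likely) what is in effect.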

    If it is case 2), how long does auto_failover take in the best, typical, and worst cases?

    > So how are we to deal with the coordinator failing over during a CM utility?
    > This is a good question. When the coordinator is down, or a network partition happens between the coordinator and the monitor, the state of the coordinator changes from primary to demote_timeout. We no longer consider it the active coordinator once its state becomes demote_timeout. The standby will then be promoted after a short while. Fetching gp_segment_configuration in that time gap will raise an error, so you can't run most utility tools.


    For the current CM utilities, it is not clear to me that we properly handle the case of gp_segment_configuration changing during the duration of the utility - for instance, the coordinator going down after gp_segment_configuration was read but before the command completes.  I'd like to say that in general we do as much as we can and otherwise report an error on what we cannot.  That might be true for primary/mirror but probably not for the coordinator.

    So there are two interrelated issues: some part of the underlying cluster going down during a CM utility call, and gp_segment_configuration changing during the CM utility call.  Note that a segment can go down faster than our gp_segment_configuration updates, so we always have the case of a segment's state changing during a call.  That is, we always have a race condition, since our actual operations on the cluster (addmirrors, etc.) are not atomic.

    auto_failover does not fundamentally change this behavior but does add the additional complication that the coordinator can change.
    > I admit that this only reduces the time gap; the problem still exists.
    > A simple fix (not implemented yet) would be to provide an option to enable/disable promotion. Before running a utility tool, we disable promotion, and we re-enable it after the utility job; the promotion will then finish if the standby should be promoted.
    > Keep in mind that the coordinator could go down while the utility tool is running; that probably means the utility can't run as expected while the cluster still wants to fail over.


    This would preserve the current behavior of the CM utilities.  An explicit promotion via auto_failover will likely require a fair bit of thought as to how to handle it.

    So the most straightforward way to proceed would be to allow a disable and re-enable of auto-failover programmatically.  This will also need to be carefully considered, as it is hard to guarantee that a re-enable of auto_failover will always occur if we call disable, though a python atexit handler might be good enough.

    --David

    Jesse Zhang

    Apr 6, 2021, 12:40:31 PM
    to Hao Wu, Soumyadeep Chakraborty, Greenplum Developers
    I think Deep was alluding to something like ZooKeeper, Consul, etcd,
    or (if a slight chance of stale read is tolerable) Dqlite [1] as a
    "highly available and consistent" substrate. While all four are based
    on a replication and consensus model, if I understand correctly,
    Dqlite might be more appealing here because

    1. It supports querying in SQL (no big deal)
    2. It has better relational semantics (medium-to-big deal: you do
    get a snapshot of the range you're scanning, not of a single key as in a
    key-value store, although we usually only care about one or two keys
    if we were to fit into the K-V model)
    3. It has a nicer C binding (big deal, given how painful it is to find
    the "right" ZK, Consul or even etcd client libraries in C).

    [1] https://dqlite.io/docs/consistency-model

    Soumyadeep Chakraborty

    Apr 6, 2021, 2:31:38 PM
    to Jesse Zhang, Hao Wu, Greenplum Developers
    On Tue, Apr 6, 2021 at 9:40 AM Jesse Zhang <sbj...@gmail.com> wrote:
    >
    > I think Deep was alluding to something like ZooKeeper, Consul, etcd,
    > or (if a slight chance of stale read is tolerable) Dqlite [1] as a
    > "highly available and consistent" substrate. While all four are based
    > on a replication and consensus model, if I understand correctly,
    > Dqlite might be more appealing here because
    >
    > 1. It supports querying in SQL (no big deal)
    > 2. It has better relational semantics (medium-to-big deal: you do
    > get a snapshot of the range you're scanning, not of a single key as in a
    > key-value store, although we usually only care about one or two keys
    > if we were to fit into the K-V model)
    > 3. It has a nicer C binding (big deal, given how painful it is to find
    > the "right" ZK, Consul or even etcd client libraries in C).
    >

    Yes! Jimmy and I had briefly discussed ZooKeeper, but Dqlite seems even better - exactly what Jesse suggested.

    > >
    > > Deep, could you describe more details about gp_segment_configuration in your scenario?
    > > Thank you.

    Sure. It is the first use case I had suggested - two independent GP clusters,
    one replicating to an archive and the other consuming from the archive. We want
    to be able to operate each cluster separately with GP utilities such as
    gpstop/gpstart etc. We want to be able to perform dispatch on both clusters.

    This means that we need to capture two different sets of cluster state, i.e. two
    separate instances of gp_segment_configuration. Thus, we can't use the
    gp_segment_configuration that is WALed over from the first cluster. This is why
    we need an external gp_segment_configuration service which will run on the
    second cluster and will be consulted for dispatch and for running the GP
    utilities.

    Regards,
    Deep

    Kalen Krempely

    Apr 6, 2021, 8:27:16 PM
    to Soumyadeep Chakraborty, pvtl-cont-sbjesse, Hao Wu, Greenplum Developers
    A couple of observations and challenges, and a call to action / proposal.

    * Observations and Challenges:
    - It looks like externalizing gp_segment_configuration, or the concepts behind it, could be very useful to Greenplum as a whole. It touches multiple components, such as pg_auto_failover, disaster recovery, the python cloud-ready utility rewrite, and others.
    - This type of change affects multiple teams across multiple time zones.
    - This feature requires lots of careful planning and consideration, since it's such a fundamental and core change.
    - Architecting this correctly will have huge wins for Greenplum.
    - I don't think just discussing this on the mailing list is enough due diligence for this type of change; it requires more prototyping, leadership, and in-person discussions.


    * Call to Action / Proposal:
    - Create a small, focused team composed of various members to interface with the various teams (including the mailing list), take ownership of fleshing out all the details, and, if it makes sense, move this over the finish line.
    > [1] https://dqlite.io/docs/consistency-model

    Hao Wu

    Apr 7, 2021, 2:26:53 AM
    to David Krieger (Pivotal), Greenplum Developers, Asim Praveen, Kalen Krempely
    > I want to be explicit about how an external program (such as the CM utilities) will perform this synchronization action, to make sure the steps will work.
    >
    > Is the following correct?
    >
    > 1). Identify the coordinator from select * from gp_segment_configuration where content = -1 and role = 'p'. Cache the entire table gp_segment_configuration in the utility.
    > 2). If there is no coordinator (the above query returns no entries), either:
    >      a). sit in a loop for, say, 1 minute, sleeping for 5 seconds and then retrying 1). until it succeeds. If it still doesn't succeed after 1 minute, fail the utility.
    >      b). fail the utility and tell the user to "try again later"
    > 3). If there is a coordinator (there is 1 entry returned above), run gp_request_fts_probe_scan() followed by creating a temp table.
    > 4). Now we know the cluster is likely in the state reported in 1) and held in the local python variable storing it.
    >
    > Note that after 1) and before 4), the cluster can still change state, but we don't handle that explicitly in our utilities.
    >
    > If it is case 2), how long does auto_failover take in the best, typical, and worst cases?
    Yes, but perhaps we should first resolve the consistency of reads/writes of gp_segment_configuration before running the above steps.

    The gp_segment_configuration discussion in this thread comes down to two issues:
    1. Whether or not to externalize gp_segment_configuration (making it highly available) from the coordinator.
    2. How to handle coordinator auto-failover while a utility tool has grabbed gp_segment_configuration and is still running? It's not just about gp_segment_configuration: nearly all utility tools assume that they are running on the same host as the current coordinator; if the coordinator has changed, the behavior of the utility tools is hard to predict.

    Regards,
    Hao Wu



    dkri...@pivotal.io

    Apr 7, 2021, 12:18:49 PM
    to Greenplum Developers, ha...@vmware.com, Asim R P, Kalen Krempely, dkri...@pivotal.io
    On Tuesday, April 6, 2021 at 11:26:53 PM UTC-7 ha...@vmware.com wrote:
    > The gp_segment_configuration discussion in this thread comes down to two issues:
    > 1. Whether or not to externalize gp_segment_configuration (making it highly available) from the coordinator.
    Agreed.  I think Kalen's suggestion of a dedicated team to think through these issues is the right way to proceed here.
    > 2. How to handle coordinator auto-failover while a utility tool has grabbed gp_segment_configuration and is still running? It's not just about gp_segment_configuration: nearly all utility tools assume that they are running on the same host as the current coordinator; if the coordinator has changed, the behavior of the utility tools is hard to predict.
    It's a great point you make here: having the coordinator auto_failover during a CM utility in general changes the host of the coordinator.  That adds a big potential complication.  That's all the more reason to consider having functionality to disable/re-enable auto_failover behavior.  That's not an ideal solution, as it reduces the availability of the cluster.  But as a stable state that adds a great feature, it makes sense: the utilities are not running most of the time, so we'll in general have auto_failover enabled.  We can then work towards supporting a cluster that can fail over at "any" time.

    Jimmy Yih

    Apr 7, 2021, 9:40:32 PM
    to Hao Wu, Greenplum Developers, Asim Praveen, Kalen Krempely
    After thinking about it for a while, the proposal that Hao is putting forward here is an ideal iterative approach.  Converting gp_segment_configuration into a view which only calls a new internal catalog function gp_get_segment_configuration() to get the segment information is nice.  That will allow extending gp_segment_configuration within gp_get_segment_configuration() itself (e.g. gp_get_segment_configuration() could be turned into a wrapper function with different options for obtaining segment information, maybe controlled by a GUC).

    Example:
    SELECT * FROM gp_segment_configuration;
        => gp_get_segment_configuration()
            => gp_get_segment_configuration_autofailover()
            => gp_get_segment_configuration_foreign_server()
            => gp_get_segment_configuration_fts_2pc_file()
            => etc.

    This is similar to how we have readGpSegConfigFromCatalog() and readGpSegConfigFromFTSFiles() right now in getCdbComponentInfo().
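
    A rough sketch of that wrapper shape (the sub-function names come from the example above; the GUC name and the plpgsql body are invented for illustration - the real wrapper would presumably be implemented in C):

        CREATE FUNCTION gp_get_segment_configuration()
        RETURNS SETOF gp_internal_segment_configuration
        LANGUAGE plpgsql AS $$
        BEGIN
            -- Hypothetical GUC selecting the source of segment information.
            CASE current_setting('gp_config_source', true)
                WHEN 'autofailover' THEN
                    RETURN QUERY SELECT * FROM gp_get_segment_configuration_autofailover();
                WHEN 'foreign_server' THEN
                    RETURN QUERY SELECT * FROM gp_get_segment_configuration_foreign_server();
                WHEN 'fts_2pc_file' THEN
                    RETURN QUERY SELECT * FROM gp_get_segment_configuration_fts_2pc_file();
                ELSE
                    RETURN QUERY SELECT * FROM gp_internal_segment_configuration;
            END CASE;
        END;
        $$;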

    - Jimmy
