Tradeoff: Single monitor vs distributed monitor/etcd
The distributed monitor is more complicated but more robust. We could implement a single monitor first, since the distributed monitor contains all the logic of a single monitor, and our focus should be on implementing a robust monitor.
Monitor inside GPDB vs outside GPDB
Both of them have their advantages and disadvantages.
Inside GPDB:
PROS: no deployment effort; the monitor can use the functionality provided by Greenplum.
CONS: any fix for the monitor requires updating the whole GPDB, and a critical error may panic the whole segment. It is not easy to expose the monitor information (how would clients access the monitor?).
Outside GPDB:
PROS: updating the code is independent of GPDB. Complexity is decoupled from GPDB. Easy to expose the monitor information.
CONS: deployment effort; it may require changes to gpstart/gpstop/gpexpand/gpaddmirrors and other management tools.
Currently, we should focus on implementing the core logic of the monitor for a workable GPDB. Moving the code outside of GPDB may be the next step if needed.
Manage master/standby vs all pairs of segments(including master/standby)
FTS works well for the segments; it does not manage the master/standby pair.
MAF will only manage the master/standby pair, but it could take over all the
pairs of segments if needed. To stay isolated from FTS, MAF will mainly focus
on the master/standby pair.
We propose a simple design that makes MAF available with less effort while
keeping it open to other features.
Any ideas?
Regards,
Hao Wu
The above table and cases are hard to follow, as the table only mentions which connection is healthy but doesn't state which is unhealthy. The table needs a better arrangement for readers to grasp the cases being discussed.
The main aspect missing from the proposal is how the client connections will be handled and/or routed to connect to the currently active master. This was the important aspect which preempted the feature last time, hence it is extremely important to cover that aspect in detail while respawning this feature.
The biggest advantage we have with segment primaries and mirrors is that GPDB initiates/routes the connections to them, whereas for master and standby we can't/don't control the same. Whichever approach is chosen, third-party entities like pgpool or anything else need to work with the Monitor to make that decision. Hence, "Not easy to expose the monitor information" becomes a very critical con for the Inside GPDB approach. Visibility of which node is in which state is a strong requirement for many purposes.
I understand a solution based on multiple IP addresses in the libpq connection will be proposed for client routing between master and standby. That falls short, though, in certain situations where master and standby can be active at the same time due to network partitions; there, only the Monitor knows which is the currently active master, so clients will mostly have to consult the Monitor.
It's important to get the architecture right the first time before implementation. Understanding the complexity of dealing with utilities ties our hands many times for new features. Though I feel we should not drop exploring right architecture just because of utilities complexity. So, I liked your consideration for both the approaches. And we should continue to explore more before rushing to go with the simplest or fastest implementation.
I am really looking forward to hearing about the effort under way to externalize cluster configuration. Insight from that would tremendously help us on this front. The master node keeping its own configuration information, and on top of that not reflecting the truth of being in-sync/not-in-sync with the standby, is a pretty embarrassing situation we have currently.
> Manage master/standby vs all pairs of segments(including master/standby)
> FTS works well for the segments, it doesn't care about the pair master/standby.
> MAF will only work for the pair of master/standby, but it may take over all the pairs
> of all segments if needed. To isolate from FTS, MAF will mainly focus on the pair
> of master/standby.
I kind of strongly feel better to have a single implementation and solution for all pairs of segments including master and standby, instead of different implementations of them, as finally functionality is the same.
Thank you Wu Hao, for clarifying the proposal in an off-list discussion. Let me
summarise my understanding.
Unlike FTS, monitor in your design probes both, primary as well as standby. For
standby to be promoted by monitor, *all* of the following conditions must be
met:
(1) monitor cannot connect to primary
(2) standby cannot connect to primary
(3) standby was in-sync with the primary prior to disconnection
Moreover, primary knows the existence of monitor, possibly by means of an always
alive connection, similar to replication connection. If the primary is alive,
it can determine if monitor is disconnected. Upon standby promotion, it can
also determine that standby is disconnected (WAL sender exits). Primary shuts
down as soon as it detects that both, monitor as well as standby are
disconnected.
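The two decision rules summarised above can be sketched as plain predicates. This is a minimal illustration only; all function names are hypothetical, and the boolean inputs stand in for real probe and replication-connection state:

```python
# Sketch of the two failover decision rules described above.
# All names are hypothetical; in a real implementation the flags would
# come from actual probe results and replication-connection state.

def monitor_should_promote(monitor_sees_primary: bool,
                           standby_sees_primary: bool,
                           standby_was_in_sync: bool) -> bool:
    """Monitor promotes the standby only if ALL three conditions hold."""
    return (not monitor_sees_primary
            and not standby_sees_primary
            and standby_was_in_sync)

def primary_should_shut_down(monitor_connected: bool,
                             standby_connected: bool) -> bool:
    """Primary stops once BOTH monitor and standby links are lost."""
    return not monitor_connected and not standby_connected

# Monitor lost the primary, but the standby still streams from it:
# no promotion happens, so no split brain.
assert monitor_should_promote(False, True, True) is False
```

Note that the standby is promoted only when both the monitor and the standby agree the primary is unreachable, which is exactly what prevents a network partition between monitor and primary alone from triggering a failover.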
The scenario raised by Ashwin is monitor cannot connect to primary but it can
connect to standby. If the standby is actively streaming from the primary,
monitor does not promote it. Thus, the problematic scenario that both primary
and standby end up accepting client connections, is avoided by this design.
Monitor continues to probe standby after it is promoted. If the connection
between monitor and standby breaks, the promoted standby shuts itself down, as
per the logic mentioned above, leaving the cluster unavailable. The design
tolerates at the most one node failure.
How can clients connect to the cluster? The new libpq feature that allows
specifying multiple host addresses must be used (it is not clear what the impact
of this requirement would be, especially on legacy applications / ETL tools). The
first address tried should be primary. Primary accepts connections in *any*
of the following cases:
(1) monitor connection is active
(2) standby connection is active
Note that the cluster is available when monitor is down but both primary and
standby are up. If primary is not reachable, the clients automatically try
standby. Standby accepts connections if it is promoted or if hot-standby is
enabled.
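The client-side fallback described above (try the primary first, then the standby) can be sketched as follows. This mimics what libpq does with a multi-host connection string such as `host=master,standby`; the `accepts` predicate is a hypothetical stand-in for a real connection attempt:

```python
# Sketch of client routing over multiple host addresses, as described
# above. accepts() stands in for an actual connection attempt.

def route_connection(addresses, accepts):
    """Return the first address that accepts a connection, else None."""
    for addr in addresses:
        if accepts(addr):
            return addr
    return None

def primary_accepts(monitor_link_up: bool, standby_link_up: bool) -> bool:
    """Primary accepts connections if ANY of its two links is active."""
    return monitor_link_up or standby_link_up

# Monitor down, replication to standby up: primary still serves clients.
assert primary_accepts(monitor_link_up=False, standby_link_up=True)
```

The key property is that the cluster stays available with the monitor down, since the primary keeps accepting connections as long as its replication link to the standby is alive.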
I have one question: how can a disconnected primary be incorporated back in the
cluster? Changes made to its data directory after standby promotion (possibly
by in-progress clients before the primary shut itself down) need to be reverted,
using pg_rewind. Should this process also be automated, possibly triggered by
monitor?
>>> If the standby is actively streaming from the primary,
>> monitor does not promote it. Thus, the problematic scenario that both primary
>> and standby end up accepting client connections, is avoided by this design.
>
> Based on the above, it seems we will not failover if the master's
> connection to segments is also broken along with the monitor but still
> alive with standby. If having a monitor as part of some segment, then
> the link with the monitor and segment will always go down
> together. So, this points out for sure that the monitor needs to be
> separated from segments.
It is hard to see a correlation between the master-to-segments connections and
the placement of the monitor. Let’s say the monitor is running on a segment. If this
segment becomes unreachable from master, connection with monitor is also lost.
If master can reach standby, the cluster continues to operate, except that FTS
on master takes necessary action triggered by the segment’s failure.
> If the monitor is separate from segments, then I am not sure how
> important not being able to reach segments scenario is to cover but
> in cloud env might be possible.
I agree, it is not clear at the moment.
> On 18-Sep-2020, at 11:02 PM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> On Fri, Sep 18, 2020 at 2:59 AM Asim Praveen <pa...@vmware.com> wrote:
>>
>> Primary shuts down as soon as it detects that both, monitor as well as standby are
>> disconnected.
>
> Just to point out self shutting-down behavior should just be leveraged
> for optimization purposes and not for correctness. As shutdown can
> take time or may hang due to some bug. We have seen many instances of
> this in tests, where connections continue being served even after
> receiving a shut-down request for some time. Possibly need some external
> local entity to force kick it out, if it doesn't self shutdown. (pg_auto_failover has
> local non-postgres process on a node which can do this.)
This is a valid concern. A mechanism to shut down the postmaster that works
independently of the postmaster itself is necessary.
>
> I think what will save the day is synchronous replication will still
> be ON for master (when standby is promoted), hence even if master
> continues to accept connections, they will never complete.
Note that the prepare broadcasts sent by such a master to segments will not be
affected even when synchronous replication is ON and standby has been promoted.
The recent change (commit 5b73cefc99) to distributed transaction recovery
bgworker process would periodically abort such transactions on the promoted
standby.
>
>> Monitor continues to probe standby after it is promoted. If the connection
>> between monitor and standby breaks, the promoted standby shuts itself down, as
>> per the logic mentioned above, leaving the cluster unavailable. The design
>> tolerates at the most one node failure.
>
> That seems to be too self-destructive. Why promoted standby (only
> standing master) if able to serve queries should go down?
>
It seems self-destructive, indeed. However, the design enforces that a master
must stop if it loses both the link with the monitor and the link with the
standby. At most one master must be active in the system. The same logic should
apply to the standby upon promotion.
> Multiple host addresses in definitely one way of automating the client
> routing. Not sure if all users will leverage that aspect. In-case
> users wish to route the connections using some other mechanism, I
> think Monitor would be the place to find that information, correct?
> So, it seems important to be able to reach Monitor and extract
> information. Hence, "Not easy to expose the monitor information"
> becomes a very critical con for Inside GPDB approach, right?
That sounds right - monitor is the source of truth on master’s whereabouts and
therefore, running it on a Greenplum segment makes it more susceptible to being
unavailable.
>
> Also, it would be helpful to think through the following aspects:
> - when synchronous replication will be turned on and off. Which means
> discussing the flow for disconnection and reconnection of standby
> (this should be a simple piece)
> - where would utilities like gpstart / gpstop extract information
> about standby
> - how manual activation of standby (if required) might affect the flow
> or will we not provide utility to manually activate the standby
That’s a very good direction to pursue, thank you!
What has been proposed so far also signals that not just shutting the old master down, but also knowing it is down before promoting the standby, is required. I am not sure how that can be accomplished when connectivity with it is lost.
Alternatively, other measures could be evaluated to stop the old master from causing damage even if it stays up. For example, before promoting the standby, the monitor could broadcast who the new master is to the segments, so that they can refuse connections from the old master. For this, the monitor might also need to know where the segments are and have connectivity to them. Besides, across segment restarts or failovers, we need a way to communicate who the new master is, so segments know whose connections to accept.
>
>> I think what will save the day is synchronous replication will still
>> be ON for master (when standby is promoted), hence even if master
>> continues to accept connections, they will never complete.
> Note that the prepare broadcasts sent by such a master to segments will not be
> affected even when synchronous replication is ON and standby has been promoted.
> The recent change (commit 5b73cefc99) to distributed transaction recovery
> bgworker process would periodically abort such transactions on the promoted
> standby.
That's crazy dangerous :)
> Note that the prepare broadcasts sent by such a master to segments will not be
> affected even when synchronous replication is ON and standby has been promoted.
> The recent change (commit 5b73cefc99) to distributed transaction recovery
> bgworker process would periodically abort such transactions on the promoted
> standby.
> That's crazy dangerous :)

If the master lost connectivity to both the monitor and the standby, we probably have no way to shut down the master when promoting the standby. Shutting down the master is not necessary for promoting the standby: when the master detects it has lost the connections to the monitor and the standby, what it needs to do is close all active sessions and refuse to provide service, or shut itself down.

Synchronous replication is always ON unless the monitor explicitly asks master0 (the original master) or master1 (the promoted master) to turn it off; master0/master1 has no right to turn it OFF on its own. An instance that starts in the master role always turns synchronous replication ON, even if the GUC is OFF in the configuration file. After the monitor confirms the promotion has finished, the monitor asks master1 to turn synchronous replication OFF and marks the replication as not in sync. After master0 has been demoted to standby1 and master1 reports the replication to be in sync, the monitor marks the replication between master1 and standby1 in sync again.

Will this introduce any data inconsistency?
> On 30-Sep-2020, at 4:26 AM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> Or some other alternative measures are evaluated to stop the old master from causing damage even if up like a monitor before promoting standby, broadcasting who the new master is in the system to segments. So, that can refuse connection from the old master. For this might need a monitor to also know where segments are and have connectivity to segments. Plus, segments restarts or failover, how to communicate who the new master is to accept connections from.
>
What if the standby, upon promotion, towards the end of the promote sequence, before
starting to accept client connections, performs the following?
1. Standby sends a special libpq message “I’m the new master” to primary
segments.
2. Upon receiving this message, segments terminate all active backend processes
handling user queries.
3. Once all active connections have been terminated (or signalled to terminate),
segments respond back with success.
4. Standby waits until all segments have responded positively.
5. Standby is now ready to accept connections.
Segments don’t need to remember who the active master is. Monitor doesn’t need
to be made aware of segments in the cluster.
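The five-step handshake above can be sketched as a small simulation. The `Segment` class and method names here are hypothetical stand-ins; in reality step 1 would be a special libpq message handled by the segment postmaster:

```python
# Sketch of the "I'm the new master" promotion handshake described above.
# Segment is a stand-in object; names are hypothetical.

class Segment:
    def __init__(self, name):
        self.name = name
        self.active_backends = 3  # pretend user sessions are running

    def on_new_master(self):
        # Step 2: terminate all active user backends.
        self.active_backends = 0
        # Step 3: acknowledge once terminations are signalled.
        return True

def promote_standby(segments):
    # Steps 1 + 4: notify every segment and wait for all acks.
    acks = [seg.on_new_master() for seg in segments]
    if not all(acks):
        raise RuntimeError("some segments did not acknowledge")
    # Step 5: only now start accepting client connections.
    return "accepting connections"

segs = [Segment("seg0"), Segment("seg1")]
assert promote_standby(segs) == "accepting connections"
assert all(s.active_backends == 0 for s in segs)
```

The design point this illustrates: the new master blocks on acknowledgements from every segment before serving clients, so no segment can still be executing work dispatched by the old master once clients connect.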
>
>> It seems self-destructive, indeed. However, the design enforces that a master
>> must stop if it loses both, link with monitor and link with standby. At the
>> most one master must be active in the system. The same logic should apply to
>> standby upon promotion.
>
> Shouldn't the decision to shut-down or not (on connectivity loss) depend on if some other node was in sync and can be failed-over or not ? It is a sole entity and can't failover or activate another master, no point self destructing as going to cause more damage than good.
One way to look at the proposal is it guarantees one node fault tolerance among
master, standby and monitor. If any two or more nodes are down, the system is
not available (not exactly but somewhat similar to double fault in FTS lingo).
> - only mechanism to know who is master in system is by attempting *distributed* connection to master (distributed means like BEGIN which will also reach to all the segments and start the session)
The original proposal defines active master to be the one which can accept new
client connections. The underlying assumption is, the failed master rejects new
connections if both, synchronous replication is not active and monitor is not
reachable. Your suggestion, on the other hand, assumes that segments remember
who the active master is. That implies additional complexity in the standby
promotion workflow proposed above.
> >
> > Also, it would be helpful to think through the following aspects:
> > - when synchronous replication will be turned on and off. Which means
> > discussing the flow for disconnection and reconnection of standby
> > (this should be a simple piece)
> > - where would utilities like gpstart / gpstop extract information
> > about standby
> > - how manual activation of standby (if required) might affect the flow
> > or will we not provide utility to manually activate the standby
>
> That’s a very good direction to pursue, thank you!
>
> I am guessing these are being looked into and the next iteration of the proposal will have these covered.
>
Let me track these and other items that are somewhat well defined as GitHub
issues.
The standby promotion workflow proposed earlier should be changed so that
segments, upon receiving the “I’m the new coordinator” message, not only terminate
active connections, but also remember the coordinator’s identity (IP address and
port). This information should be persisted by a primary segment and also
propagated to its mirror segment. What is the right way to persist this
information? One option is to introduce a new GUC coordinator_conninfo, like
primary_conninfo. In addition to writing the GUC to postgresql.auto.conf, a WAL
record can be emitted so that the mirror segment also does the same thing.
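The file-level effect of persisting the proposed GUC could look roughly like this. Note that `coordinator_conninfo` is the GUC proposed above (it does not exist yet), and the helper function is purely hypothetical; a real implementation would rewrite postgresql.auto.conf the way ALTER SYSTEM does rather than append:

```python
# Hypothetical sketch: persist the proposed coordinator_conninfo GUC to
# postgresql.auto.conf. Appending is a simplification; ALTER SYSTEM
# actually rewrites the file in place.
import os
import tempfile

def persist_coordinator_conninfo(datadir: str, host: str, port: int) -> str:
    line = f"coordinator_conninfo = 'host={host} port={port}'\n"
    path = os.path.join(datadir, "postgresql.auto.conf")
    with open(path, "a") as f:
        f.write(line)
    return path

with tempfile.TemporaryDirectory() as d:
    p = persist_coordinator_conninfo(d, "192.0.2.10", 5432)
    assert "coordinator_conninfo" in open(p).read()
```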
Let's summarize the differences between pg_auto_failover and our current proposal.
- When the connection between the coordinator and the monitor is lost, and the standby can connect to both the monitor and the coordinator, pg_auto_failover will promote the standby, but our proposal doesn't.
- The primary node in pg_auto_failover stores the user data; it directly modifies database objects locally. In Greenplum, the coordinator/master node mainly dispatches queries to the segments.
The second point makes a big difference for split-brain/data consistency. In pg_auto_failover, if synchronous replication is turned on at the primary node, any commit/prepare transaction can't complete. But in Greenplum, things become complicated. If the old coordinator is still alive after promotion, there are two kinds of cases we must take care of:
- If the query is optimized to 1-phase-commit, the transaction will not be blocked by synchronizing the WAL records from coordinator to standby.
- The coordinator may connect to the primaries to modify database objects/states. Like dtx-recovery, FTS, GDD, etc.
Asim and I have different approaches to resolve the above issues.
Asim: let the primaries remember who the active coordinator is, and reject connections from coordinator0 (the old coordinator).
Me: the coordinator maintains a timer; on timeout, the coordinator is disallowed to connect to the segments.
When promotion happens, the promoted coordinator notifies all primaries: "Hey, I'm the coordinator now!" Then all primaries terminate all backends (except the backend handling the notification) and remember the coordinator.
It seems good to combine the two approaches. My doubt is what key we use to distinguish the source node. Using the IP address as the key may be a problem in a multi-address environment. Currently, the destination IP is fixed to be `address` in gp_segment_configuration, but the source IP is not bound, so it's not reliable to tell whether a connection is from the desired coordinator by IP address alone. Adding an additional token may resolve this issue, but it changes the protocols, which may be unacceptable.
What's the downside of sticking with pg_auto_failovers logic? (I understand your proposed enhancement is less disruptive to the work-load as only the monitor is disconnected but coordinator is fine why failover. Still trying to understand if it's just optimization we are proposing here or a really hard requirement. Possibly reasking something we might have discussed earlier.)
Given in GPDB standby can't run queries, I don't know how monitor will know standby is connected to coordinator and replication is flowing. I am guessing pg_auto_failover uses pg_stat_replication kind of queries on standby to monitor the system. Maybe we will have to hack, similar to how we let FTS connections today to let pg_auto_failover queries to go through on standby.
I don't understand how timer and timeout is a solution. Please can you explain how it fixes the situation. I think any timeout solution will have a window/gap where things will go wrong.
Yes, mostly will have to use DBID possibly for this.
> What's the downside of sticking with pg_auto_failover's logic? (I understand your proposed enhancement is less disruptive to the work-load as only the monitor is disconnected but coordinator is fine why failover. Still trying to understand if it's just optimization we are proposing here or a really hard requirement. Possibly reasking something we might have discussed earlier.)
When the network between the monitor and the coordinator is unstable, the monitor may fail to detect the coordinator. The network issue may be temporary while all other connections are good. If we hurry to promote the standby, there would be a window in which the cluster is unavailable.
> Given in GPDB standby can't run queries, I don't know how monitor will know standby is connected to coordinator and replication is flowing. I am guessing pg_auto_failover uses pg_stat_replication kind of queries on standby to monitor the system. Maybe we will have to hack, similar to how we let FTS connections today to let pg_auto_failover queries to go through on standby.
The standby knows the replication state, and it periodically reports to the monitor the health of the replication, how long it has been disconnected, etc.
> I don't understand how timer and timeout is a solution. Please can you explain how it fixes the situation. I think any timeout solution will have a window/gap where things will go wrong.
Sure. An unconfirmed coordinator is disallowed to connect to the primaries.
When an instance starts/restarts as the coordinator, its role is unconfirmed. The role becomes confirmed if replication is established, or if the coordinator role is assigned by the monitor.
After the coordinator role is confirmed, the coordinator maintains a timer (Tc) that records how long it has lost both connections, to the monitor and to the standby. On timeout, the coordinator role becomes undetermined/unconfirmed, i.e. it needs reconfirmation or demotion; see below.
On the standby side, there is also a timer (Ts). It records how long the standby has lost the replication connection to its peer, the coordinator. The standby periodically reports this timer to the monitor.
On the monitor side, there is a timer (Tm). It records how long the monitor has lost the connection to the coordinator. Let Tp = min(Ts, Tm).
From the definition of the timers, Tp is equal to Tc in theory if we don't take detection delay into consideration; in practice we can assume Tp is close to Tc.
Now, we define three time intervals:
Lc: if Tc >= Lc, the coordinator loses its role; the coordinator role needs reconfirmation or demotion.
Lp: if Tp >= Lp, the monitor starts to promote the standby.
Gp := Lp - Lc, the gap between the coordinator losing its role and the beginning of the promotion.
If we set Gp to a safe value, the coordinator will have lost its role before the promotion. When the standby is promoted, it should still notify all primaries to terminate all backends (except the notification connection).
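The timer rule can be sketched in a few lines. This is an illustration of the inequality Tc >= Lc holding before Tp >= Lp = Lc + Gp, under the stated assumption that Tp tracks Tc closely; function names are hypothetical:

```python
# Sketch of the timer rule described above. Tc/Ts/Tm are seconds since
# the respective connections were lost; Lc and Lp are thresholds with
# the safety gap Gp = Lp - Lc.

def coordinator_lost_role(Tc: float, Lc: float) -> bool:
    """Coordinator drops its role once its own timer reaches Lc."""
    return Tc >= Lc

def monitor_may_promote(Ts: float, Tm: float, Lp: float) -> bool:
    """Promotion waits on BOTH the standby's and the monitor's timers."""
    Tp = min(Ts, Tm)
    return Tp >= Lp

Lc, Gp = 10.0, 5.0
Lp = Lc + Gp
# Assuming Tp stays close to Tc: at t=12s the coordinator has already
# dropped its role, but the monitor does not promote yet (window = Gp).
assert coordinator_lost_role(Tc=12.0, Lc=Lc)
assert not monitor_may_promote(Ts=12.0, Tm=12.0, Lp=Lp)
```

The gap Gp is what closes the "window where things go wrong": by the time the monitor acts, the old coordinator has already disqualified itself, provided Gp comfortably exceeds the detection delay between Tc and Tp.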
Hot standby in pg_auto_failover
Hot standby is required by pg_auto_failover, but Greenplum doesn’t fully support hot standby. pg_auto_failover needs hot standby so that the monitor can connect to the standby for read-only queries. The monitor still has to make the connection in utility mode, which means a client can’t connect to the standby for reads. Besides, hot standby is still transaction-read-only; it disallows writes on the standby.
Deployment of the monitor and saving the metadata about pg_auto_failover
The monitor is a new PG instance with the pg_auto_failover extension. We need to choose where to deploy the monitor node and where to save the metadata about auto failover. The metadata contains the host and port of the monitor node and the datadir of the monitor instance. Currently, we choose one of the segment nodes to deploy the monitor, and the metadata about auto failover is explicitly passed to the programs.
Init system
Initializing a Greenplum system with auto failover has two steps:
Init greenplum system without pg_auto_failover support
Configure the existing greenplum system to be auto failover
Note: `pg_autoctl create postgres` can either create a new PG instance or configure an existing PG instance to be a node managed by pg_auto_failover. But we can’t let pg_autoctl create the coordinator instance for us, because `pg_autoctl create postgres` doesn’t only create a PG instance; it also starts it. Before gp_segment_configuration is set properly, we can’t really start the node in dispatch mode. So, we should first create the GPDB nodes and then configure the coordinator and standby for auto failover.
pg_hba.conf
`pg_autoctl create postgres` adds some configuration files inside and outside of the existing pgdatadir, for example `~/.config/pg_autoctl/home/gpadmin/datadirs/<node_name>/pg_autoctl.cfg` and `$pgdata/postgresql-auto-failover.conf`. pg_hba.conf also gets some entries added to allow connections from the monitor. However, some entries are still missing, so the monitor cannot connect to the standby; this needs to be fixed.
pg_rewind
Running pg_rewind via pg_autoctl is buggy; it can’t correctly update the WAL segments and the control data file. Currently, we skip pg_rewind and always use pg_basebackup instead.
pg_basebackup
Currently, pg_basebackup is always used in promotion, but the configuration file for replication (postgresql.auto.conf) isn’t updated. We may add the option `-R` to the pg_basebackup invocation, which generates a new postgresql.auto.conf for us. pg_autoctl also generates a file named `postgresql-auto-failover-standby.conf`, which contains the SSL options. Currently, we overwrite postgresql.auto.conf with the content of postgresql-auto-failover-standby.conf. We need to check which approach is better in later stories.
restart the postmaster by pg_autoctl
For the coordinator and standby, the postmaster of the PG instance is a child process of one of the pg_autoctl processes. When the postmaster is down, pg_autoctl may restart the postmaster process. Currently, restarting the postmaster via pg_autoctl fails.
libpq support for multiple addresses
Since PG 10, libpq supports multiple addresses in the connection string, which enables client-side failover. However, if the first address is not a hot standby, the libpq connection will not try the second address.
Huber and Hao Wu have a patch for the upstream: https://www.postgresql.org/message-id/flat/BN6PR05MB3492948E4FD76C156E747E8BC9160%40BN6PR05MB3492.namprd05.prod.outlook.com
state of the postmaster: ready vs dtmready
pg_autoctl checks the status of the postmaster. In PostgreSQL, `ready` is the state for both primary and secondary when the instance is ready. But in Greenplum, `ready` is an intermediate state of the coordinator. This may cause errors or dump annoying messages. We’d better take care of the states.
let the segments remember the active coordinator
In our previous discussion, when promotion happens, the promoted coordinator sends a notification to all segments (primary and mirror) to tell them “I’m the new coordinator, remember my id”. This logic is totally absent.
status of btree_gist and pg_stat_statements
btree_gist and pg_stat_statements are required by pg_auto_failover. However, they are not compiled and installed.
We need to check their status and understand why they are required by pg_auto_failover.
group_id is a keyword in Greenplum
group_id is a keyword in Greenplum, but it is used as a parameter name in some UDFs, which causes a syntax error when parsing the function. I have opened a GitHub issue upstream (https://github.com/citusdata/pg_auto_failover/issues/509). The upstream is considering providing a PR for compatibility purposes. Currently, I have renamed the parameter group_id to group__id.
Impact on utility tools
As far as I know, the affected utility tools include: gpinitsystem, gpstart, gpstop, gpactivatestandby, gpinitstandby, gpexpand.
Create a new GPDB cluster without auto failover.
Create the monitor instance.
Configure the coordinator and standby to be a pair in a group of the auto failover.
Configure the metadata of the monitor somewhere. E.g. in the coordinator IF we save the metadata in the coordinator/standby.
Get the metadata about the auto failover based on the input parameter
If we get the metadata from a heap in the coordinator node, we retrieve it by connecting to the PG instance in single mode.
Start the monitor instance.
Get the metadata of the current coordinator from the monitor.
Retrieve gp_segment_configuration from the coordinator in single mode.
Start all the segments.
Get the metadata of the coordinator and standby
Start the coordinator and standby. The start options are saved in the configuration files managed by pg_auto_failover.
Get the metadata about the auto failover based on the input parameter.
Retrieve the coordinator and standby metadata from the monitor instance.
Get gp_segment_configuration from the coordinator.
Stop the coordinator and standby
Stop all segments in gp_segment_configuration.
Stop the monitor instance
Besides the normal case, we also have to start/stop the cluster correctly when the monitor is down.
gpexpand also needs some changes. The pgdata of a segment is copied from the coordinator, but the configuration files for auto failover (postgresql-auto-failover.conf & postgresql-auto-failover-standby.conf) should be cleaned out.
With auto failover, gpactivatestandby is not supposed to promote the standby. What will happen if we manually promote the standby? Will pg_auto_failover handle this mistake correctly? Maybe gpactivatestandby should first check whether auto failover is configured and raise an error if the standby is part of auto failover.
If the standby has been added to pg_auto_failover, `gpinitstandby -r` is a troublemaker: the old data directory configured for auto failover may introduce risk.
The next step is to create more detailed stories if we decide to develop auto failover based on pg_auto_failover.
You are welcome to post your concerns or issues.
Regards,
Hao Wu
> Hot standby in pg_auto_failover
> Hot standby is required by pg_auto_failover, but Greenplum doesn’t fully support hot standby. pg_auto_failover needs hot standby so that the monitor can connect to the standby for read only queries. The monitor still has to specify the connection in utility mode, which means the client can’t connect to the standby for read. Besides, the hot standby is still transaction read only, it disallows write on the standby.
> Impact on utility tools
> As far as I know, the affected utility tools contain: gpinitsystem, gpstart, gpstop, gpactivatestandby, gpinitstandby, gpexpand.
To clarify, the latest Greenplum supports enabling hot standby in utility mode, but not in dispatch mode. In pg_auto_failover, it's enough to connect to the hot standby in utility mode, so there is no dependency on fully supporting the hot standby feature, right?
One more thing to add is that we need [new] tools to recover the failed coordinator, standby, or monitor when an error happens, something like gprecoverseg for primaries and mirrors.