Proposal: Master Auto Failover for Greenplum


Hao Wu

Sep 2, 2020, 10:08:31 PM
to Greenplum Developers, Simon Gao
Hi hackers,

This is an old topic: we have discussed this feature before, and the design doc
is available in the greenplum-db/gpdb repository on GitHub.


After some time spent spiking, comparing, and analyzing, I have summarized our
design for Master Auto Failover (MAF) and its trade-offs.

Assumption:
If there is a healthy connection between A and B, then A and B are each
aware of the other's existence.

Design:
We have 3 roles in our design: Monitor, Primary, and Standby (also called
mirror/secondary in some places). The monitor periodically probes the
liveness of the Primary and Standby and maintains the states of the
Primary/Standby pair. The monitor promotes the Standby to a new Primary if
the Primary is unhealthy and the system satisfies some other conditions.


[Figure: the triangle of connections among the monitor (M), the primary (P),
and the standby (S); connection 1 is M-P, 2 is M-S, and 3 is P-S.]

It looks like the FTS in Greenplum, but it's different: the monitor probes
both the Primary and the Standby, to avoid split-brain. See the analysis
below.

The Monitor is implemented as a bgworker running in one of the segment
primaries in Greenplum. The Primary role is served by a backend in the
Greenplum master node, and the Standby role by a backend in the standby
node.

When does promotion happen?

    Healthy connections | Cluster available | Who provides service
    --------------------+-------------------+----------------------------------
    none                | X                 | neither primary nor secondary
    1                   | O                 | primary
    2                   | O                 | promote secondary; primary stops
    3                   | O                 | primary
    1,2                 | O                 | primary
    2,3                 | O                 | primary
    3,1                 | O                 | primary
    1,2,3               | O                 | primary


The numbers in the table above refer to the connections in the 3-role-triangle figure.
We promote the Standby only when the Monitor can connect only to the Standby
and the Standby cannot connect to its Primary. If the Monitor behaved like the FTS
prober process, i.e., promoted the mirror to a new primary whenever the connection to
the primary is lost, then in case {2,3} both the master and the standby could end up
accepting queries after the standby has been promoted.
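The promotion rule in the table can be sketched as a small decision function. This is only an illustrative sketch (the names are mine, not Greenplum code); the flags stand for the health of connections 1 (monitor-primary), 2 (monitor-standby), and 3 (primary-standby):

```python
# Illustrative sketch of the monitor's promotion rule from the table above.
# c1 = monitor-primary link, c2 = monitor-standby link,
# c3 = primary-standby link. All names are hypothetical.

def monitor_decision(c1: bool, c2: bool, c3: bool) -> str:
    """Return which node should serve, per the connectivity table."""
    if not (c1 or c2 or c3):
        return "none"             # no healthy connection: cluster unavailable
    if not c1 and c2 and not c3:
        # Monitor can reach only the standby, and the standby has lost
        # its replication link to the primary: promote the standby.
        return "promote-standby"
    return "primary"              # every other case: the primary serves

assert monitor_decision(False, True, False) == "promote-standby"
assert monitor_decision(True, True, True) == "primary"
```

Note that only the single case {2} triggers a promotion; every other reachable configuration leaves the primary serving, which is what prevents the split-brain scenario of case {2,3}.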

Monitor-and-promote for the master/standby pair is similar to that for a
primary/mirror pair, but for now MAF will focus on the master/standby pair;
the FTS prober continues to handle the primary/mirror pairs.

Data Consistency:
Data consistency is a hard requirement in our design and a prerequisite for
promotion: to safely promote the standby, it must be in sync with the master.
The policy is the same as the FTS's.

Auto re-join MAF:
To let a failed node easily re-join MAF, the monitor dedicatedly maintains
the nodes' states and their transitions.

Tradeoff:
Single monitor vs distributed monitor/etcd

The distributed monitor is more complicated but more robust. We could implement a single monitor first, since the distributed monitor contains all the logic of a single monitor, and we should focus on implementing one robust monitor.

Monitor inside GPDB vs outside GPDB

Both of them have their advantages and disadvantages.

Inside GPDB:

PROS: no deployment effort; it can use the functionality provided by Greenplum.

CONS: any fix to the monitor requires updating the whole GPDB; any critical error may panic the whole segment; it is not easy to expose the monitor information (how would one access the monitor?).


Outside GPDB

PROS: Updating the code is independent of GPDB. Complexity is decoupled from the GPDB. Easy to expose the monitor information.

CONS: deployment effort; it may require changes to gpstart/gpstop/gpexpand/gpaddmirrors and other management tools.


Currently, we should focus on implementing the core logic of the monitor for a workable GPDB. Moving the code outside of GPDB may be the next step if needed.


Manage master/standby vs all pairs of segments(including master/standby)

FTS works well for the segments; it doesn't care about the master/standby pair.
MAF will only work for the master/standby pair, though it could take over all
segment pairs if needed. To stay isolated from FTS, MAF will mainly focus on
the master/standby pair.


We propose a simple design that makes MAF available with modest effort while
keeping it open to extension for other features.


Any ideas?


Regards,

Hao Wu



Ashwin Agrawal

Sep 16, 2020, 7:25:33 PM
to Hao Wu, Greenplum Developers, Simon Gao
The above table and cases are hard to follow, as the table only mentions which connections are healthy but doesn't state which are not. We need a better arrangement of the table to grasp the cases being discussed.

Tradeoff:
Single monitor vs distributed monitor/etcd

The distributed monitor is complicated but more robust. We could implement a single monitor first. Because the distributed monitor contains all the logic of a single monitor, and we should focus on implementing a robust monitor.


Agree, single monitor to start with seems fine. If need can expand to multiple.

Monitor inside GPDB vs outside GPDB

Both of them have their advantages and disadvantages.

Inside GPDB:

PROS: no deployment effort, it can use the functionality provided by Greenplum.

CONS: any fix for the monitor will update the whole GPDB, any critical errors may panic the whole segment. Not easy to expose the monitor information(How to access the monitor?).


Outside GPDB

PROS: Updating the code is independent of GPDB. Complexity is decoupled from the GPDB. Easy to expose the monitor information.

CONS: Deployment effort. It may change the gpstart/gpstop/gpexpand/gpaddmirrors and more management tools.


The main aspect missing from the proposal is how the client connections will be handled and/or routed to the currently active master. This was the important aspect which preempted the feature last time, hence it is extremely important to cover it in detail while reviving this feature.

The biggest advantage we have with segment primaries and mirrors is that GPDB initiates/routes the connections to them, whereas for master and standby we can't/don't control the same. Whatever approach is chosen, third-party entities like pgpool or anything else need to work with the Monitor to make that decision. Hence, "Not easy to expose the monitor information" becomes a very critical con for the Inside GPDB approach. Visibility of which node is in which state is a strong requirement for many purposes.

I understand a solution based on multiple IP addresses in the libpq connection will be proposed for client routing between master and standby. Though that falls short in certain situations where master and standby can be active at the same time due to network partitions; only the Monitor knows which is the currently active master, so clients will mostly have to consult the Monitor.

Currently, we should focus on implementing the core logic of the monitor for a workable GPDB. Moving the code outside of GPDB may be the next step if needed.


It's important to get the architecture right the first time, before implementation. The complexity of dealing with utilities has tied our hands many times for new features, but I feel we should not drop exploring the right architecture just because of the utilities' complexity. So, I liked your consideration of both approaches, and we should continue to explore more before rushing to the simplest or fastest implementation.

I am really looking forward to hearing from the effort under way on externalizing cluster configuration. Insight from that would tremendously help us on this front. The master node keeping its own configuration information, and on top of that not reflecting the truth of sync/not-in-sync with the standby, is a pretty embarrassing situation we have currently.

Manage master/standby vs all pairs of segments(including master/standby)

FTS works well for the segments, it doesn't care about the pair master/standby.

MAF will only work for the pair of master/standby, but it may take over all the pairs

of all segments if needed. To isolate from FTS, MAF will mainly focus on the pair

of master/standby.


I kind of strongly feel it is better to have a single implementation and solution for all pairs of segments, including master and standby, instead of different implementations, as the functionality is ultimately the same.

Hao Wu

Sep 17, 2020, 1:31:59 AM
to Ashwin Agrawal, Greenplum Developers, Simon Gao
Ashwin,

The above table and cases are hard to follow as the table only mentions which connection is healthy but doesn't state which is not healthy. We need better arrangement of the table to grasp the cases being discussed.
Here, 'connection is healthy' means the connection is currently alive. The documentation of some other database products says 'node is healthy'.
It's easy to determine connectivity, but it looks impossible to know whether another node is actually healthy.

The main aspect missing from the proposal is how will the client connections be handled and/or routed to connect currently active master. This was the important aspect which preempted the feature last time, hence extremely important to cover that aspect in-detail while respawning this feature.
The biggest advantage we have with segment primaries and mirrors is GPDB initiates / routes the connections to them. Whereas for master and standby, we can't / don't control the same. Any approach chosen, the third party entities like pgpool or anything else need to work with Monitor to make that decision. Hence, "Not easy to expose the monitor information" becomes a very critical con for Inside GPDB approach. Visibility of what node is in which state is a strong requirement for many purposes.

I understand a solution based on multiple IP addresses in libpq connection will be proposed for client routing between master and standby. Though that falls short in certain situations where master and standby can be active at same time due to n/w partitions. Where only Monitor knows which is currently active master and hence clients mostly will have to consult the Monitor.
Here libpq helps route the client connection to the right master. As I showed in the table, even if both master and standby are active, at most one of them can accept connections and provide service. The table enumerates all possible connectivities among the 3 roles. When the master and standby start, they act according to their previously saved roles, so they may both come up as master, or both as standby; in that case the cluster is not available, because neither the replication connection nor the monitor connection (more accurately, acknowledgement by the monitor) is established. If they come up as one master and one standby, the master can provide service to the client.

It's important to get the architecture right the first time before implementation. Understanding the complexity of dealing with utilities ties our hands many times for new features. Though I feel we should not drop exploring right architecture just because of utilities complexity. So, I liked your consideration for both the approaches. And we should continue to explore more before rushing to go with the simplest or fastest implementation.
Deploying the monitor to a standalone node comes in the next stage; thank you for reminding me to think about it again. We don't need `gpstart` to start the monitor even in the first stage: we can run the monitor manually, without touching the utilities.
I am really looking forward to hearing from the effort under way for externalizing cluster configuration exploration. Insight from that would tremendously help us on this front. Master node keepings its own configuration information and on top of that not reflecting the truth of sync/not-in-sync with standby is a pretty embarrassing situation we have currently.
Yeah, the monitor maintains the real status of the master and standby. The states may be persisted locally or remotely; that doesn't affect MAF. Inside, the monitor will maintain a state machine for each pair of master and standby, and the states can be easily dumped.
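As an illustration of such a per-pair state machine (the state names and transitions below are hypothetical, not the actual design):

```python
# Hypothetical sketch of the per-pair state machine the monitor might keep.
from enum import Enum, auto

class PairState(Enum):
    INIT = auto()         # pair just registered with the monitor
    IN_SYNC = auto()      # standby streaming and caught up
    NOT_IN_SYNC = auto()  # standby behind or disconnected
    PROMOTING = auto()    # standby promotion in progress
    DEGRADED = auto()     # single node serving, no failover possible

# Allowed transitions; anything else is rejected.
TRANSITIONS = {
    PairState.INIT:        {PairState.IN_SYNC, PairState.NOT_IN_SYNC},
    PairState.IN_SYNC:     {PairState.NOT_IN_SYNC, PairState.PROMOTING},
    PairState.NOT_IN_SYNC: {PairState.IN_SYNC, PairState.DEGRADED},
    PairState.PROMOTING:   {PairState.DEGRADED, PairState.IN_SYNC},
    PairState.DEGRADED:    {PairState.NOT_IN_SYNC},
}

def step(state: PairState, target: PairState) -> PairState:
    """Apply a transition, rejecting anything not in the table."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

assert step(PairState.IN_SYNC, PairState.PROMOTING) is PairState.PROMOTING
```

Keeping the legal transitions in an explicit table is also what makes the states easy to dump and inspect.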

Manage master/standby vs all pairs of segments(including master/standby)

FTS works well for the segments, it doesn't care about the pair master/standby.

MAF will only work for the pair of master/standby, but it may take over all the pairs

of all segments if needed. To isolate from FTS, MAF will mainly focus on the pair

of master/standby.


I kind of strongly feel better to have a single implementation and solution for all pairs of segments including master and standby, instead of different implementations of them, as finally functionality is the same.
It's the next step. I have the same feeling as you: it takes more code and effort to maintain similar code twice. But in the first stage, I feel we should focus on the implementation for the master/standby pair and not mess up the FTS. FTS has some differences from the current design, and it would take extra effort to remove those differences and add the new MAF functionality. Once MAF is generally stable, we can make the monitor take over all pairs of segments; that should not take much effort.


This thread mainly introduces the first stage; sorry for not presenting the full picture:
- The monitor is a single process; it could be extended to a distributed monitor if needed.
- The monitor only takes care of the master and standby for now; after MAF is generally stable, it will take over all pairs of segments, including the master and standby.
- The sync/non-sync states and roles are maintained by the monitor; the states may be stored locally on the same node as the monitor or remotely.

Regards,
Hao Wu

Asim Praveen

Sep 18, 2020, 5:59:09 AM
to Hao Wu, Ashwin Agrawal, Greenplum Developers, Simon Gao
Thank you Wu Hao, for clarifying the proposal in an off-list discussion. Let me
summarise my understanding.

Unlike FTS, monitor in your design probes both, primary as well as standby. For
standby to be promoted by monitor, *all* of the following conditions must be
met:

(1) monitor cannot connect to primary
(2) standby cannot connect to primary
(3) standby was in-sync with the primary prior to disconnection

Moreover, primary knows the existence of monitor, possibly by means of an always
alive connection, similar to replication connection. If the primary is alive,
it can determine if monitor is disconnected. Upon standby promotion, it can
also determine that standby is disconnected (WAL sender exits). Primary shuts
down as soon as it detects that both, monitor as well as standby are
disconnected.
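The master-side rule summarised above can be sketched as follows (the function names are hypothetical; this is a sketch of the stated rule, not Greenplum's implementation):

```python
# Sketch of the master's self-fencing rule described above: the master
# keeps serving while it can still see the monitor or the standby, and
# shuts itself down once both links are gone.

def master_may_serve(monitor_link_up: bool, standby_link_up: bool) -> bool:
    return monitor_link_up or standby_link_up

def master_must_shut_down(monitor_link_up: bool, standby_link_up: bool) -> bool:
    return not monitor_link_up and not standby_link_up

# Losing only one of the two links keeps the master up.
assert master_may_serve(True, False)
assert master_must_shut_down(False, False)
```

Together with the monitor's promotion rule, this guarantees at most one node serves: the promotion case requires the standby's link to the master to be down, at which point the master also cannot see the monitor and fences itself.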

The scenario raised by Ashwin is monitor cannot connect to primary but it can
connect to standby. If the standby is actively streaming from the primary,
monitor does not promote it. Thus, the problematic scenario that both primary
and standby end up accepting client connections, is avoided by this design.

Monitor continues to probe standby after it is promoted. If the connection
between monitor and standby breaks, the promoted standby shuts itself down, as
per the logic mentioned above, leaving the cluster unavailable. The design
tolerates at the most one node failure.

How can clients connect to the cluster? The new libpq feature that allows
specifying multiple host addresses must be used (it is not clear what the impact
of this requirement would be, especially on legacy applications / ETL tools). The
first address tried should be the primary. The primary accepts connections in *any*
of the following cases:

(1) monitor connection is active
(2) standby connection is active

Note that the cluster is available when monitor is down but both primary and
standby are up. If primary is not reachable, the clients automatically try
standby. Standby accepts connections if it is promoted or if hot-standby is
enabled.
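For context, the multi-host libpq syntax (available since PostgreSQL 10) looks roughly like this; the host names below are placeholders:

```
postgresql://mdw.example.com:5432,smdw.example.com:5432/gpadmin?target_session_attrs=read-write
```

With `target_session_attrs=read-write`, libpq tries the hosts in order and skips any node that is not accepting read-write sessions, which matches the routing behaviour described above.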

I have one question: how can a disconnected primary be incorporated back in the
cluster? Changes made to its data directory after standby promotion (possibly
by in-progress clients before the primary shut itself down) need to be reverted,
using pg_rewind. Should this process also be automated, possibly triggered by
monitor?

Asim

Hao Wu

Sep 18, 2020, 8:42:25 AM
to Asim Praveen, Ashwin Agrawal, Greenplum Developers, Simon Gao
Thank you Asim for your note.

I have one question: how can a disconnected primary be incorporated back in the
cluster?  Changes made to its data directory after standby promotion (possibly
by in-progress clients before the primary shut itself down) need to be reverted,
using pg_rewind.  Should this process also be automated, possibly triggered by
monitor?
An interesting feature. Let's put it in the second stage.
To bring up the old primary as the new standby, there should be a
long-running process to do this job locally; SSH may be a good
choice. There are 2 cases:
  1. The monitor can connect to the SSH server. The monitor passes the startup parameters to start the server, and the server takes care of catching up with the WAL replication of the active primary. After the primary (the promoted standby) tells the monitor that the standby (the old primary) has caught up with its WAL records, the monitor can safely mark the pair of primary and standby as `sync`.
  2. The monitor can't connect to the primary's node. We can't do anything further except mark the instance down.
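If the monitor can reach the old primary's node, the rewind step Asim mentioned (pg_rewind) might look roughly like the following; the paths and connection parameters are placeholders, and this is a sketch rather than a tested procedure:

```
# On the old master's node (e.g. via SSH from the monitor), after its
# postmaster has stopped: rewind the data directory against the new master.
pg_rewind --target-pgdata=/data/master/gpseg-1 \
          --source-server='host=smdw.example.com port=5432 dbname=postgres'

# Then configure the node as a standby of the new master and start it,
# so it catches up via WAL streaming.
```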

Regards
Hao Wu

From: Asim Praveen <pa...@vmware.com>
Sent: Friday, September 18, 2020 5:59 PM
To: Hao Wu <ha...@vmware.com>
Cc: Ashwin Agrawal <ashwi...@gmail.com>; Greenplum Developers <gpdb...@greenplum.org>; Simon Gao <sim...@vmware.com>
Subject: Re: Proposal: Master Auto Failover for Greenplum
 

Ashwin Agrawal

Sep 18, 2020, 1:32:41 PM
to Asim Praveen, Hao Wu, Greenplum Developers, Simon Gao
On Fri, Sep 18, 2020 at 2:59 AM Asim Praveen <pa...@vmware.com> wrote:
Thank you Wu Hao, for clarifying the proposal in an off-list discussion.  Let me
summarise my understanding.

Thank you Asim and Wu Hao, this is helpful.
 
Unlike FTS, monitor in your design probes both, primary as well as standby.  For
standby to be promoted by monitor, *all* of the following conditions must be
met:

    (1) monitor cannot connect to primary
    (2) standby cannot connect to primary
    (3) standby was in-sync with the primary prior to disconnection

Moreover, primary knows the existence of monitor, possibly by means of an always
alive connection, similar to replication connection.  If the primary is alive,
it can determine if monitor is disconnected.  Upon standby promotion, it can
also determine that standby is disconnected (WAL sender exits).  Primary shuts
down as soon as it detects that both, monitor as well as standby are
disconnected.

Reads very similar to what pg_auto_failover does.

Just to point out, the self-shutdown behavior should be leveraged only as
an optimization, not for correctness, as shutdown can take time or may
hang due to some bug. We have seen many instances of this in tests, where
connections continue being served for some time even after a shutdown
request. We possibly need some external local entity to force it out if it
doesn't shut itself down (pg_auto_failover has a local non-postgres process
on each node which can do this).

I think what will save the day is that synchronous replication will still
be ON for the master (when the standby is promoted), hence even if the master
continues to accept connections, they will never complete.
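The safeguard described here depends on the old master still being configured for synchronous replication; roughly, with settings like the following (values illustrative), commits on the isolated master block indefinitely once its standby is gone:

```
# postgresql.conf on the master (illustrative values)
synchronous_commit = on                # commits wait for the standby's flush
synchronous_standby_names = 'standby'  # the standby that has been promoted away
```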
 
The scenario raised by Ashwin is monitor cannot connect to primary but it can
connect to standby.  If the standby is actively streaming from the primary,
monitor does not promote it.  Thus, the problematic scenario that both primary
and standby end up accepting client connections, is avoided by this design.

Based on the above, it seems we will not fail over if the master's
connection to the segments is broken along with its connection to the
monitor, but the master is still alive with the standby. If the monitor
runs as part of some segment, then the links to the monitor and that
segment will always go down together. So, this points out for sure that
the monitor needs to be separated from the segments.
If the monitor is separate from the segments, then I am not sure how
important the cannot-reach-segments scenario is to cover, but in a cloud
env it might be possible.

Monitor continues to probe standby after it is promoted.  If the connection
between monitor and standby breaks, the promoted standby shuts itself down, as
per the logic mentioned above, leaving the cluster unavailable.  The design
tolerates at the most one node failure.

That seems too self-destructive. Why should the promoted standby (the only
standing master), if able to serve queries, go down?

How can clients connect to the cluster?  The new libpq API feature that allows
specifying multiple host address must be used (it is not clear what the impact
of this requirent would be, especially on legacy applications / ETL tools).  The
first address tried should be primary.  Primary accepts connections in *any*
of the following cases:

    (1) monitor connection is active
    (2) standby connection is active

Note that the cluster is available when monitor is down but both primary and
standby are up.  If primary is not reachable, the clients automatically try
standby.  Standby accepts connections if it is promoted or if hot-standby is
enabled.

Multiple host addresses is definitely one way of automating the client
routing. Not sure if all users will leverage that aspect. In case users
wish to route connections using some other mechanism, I think the Monitor
would be the place to find that information, correct? So, it seems
important to be able to reach the Monitor and extract information. Hence,
"Not easy to expose the monitor information" becomes a very critical con
for the Inside GPDB approach, right?


Also, it would be helpful to think through the following aspects:
- when synchronous replication will be turned on and off. Which means
  discussing the flow for disconnection and reconnection of standby
  (this should be a simple piece)
- where would utilities like gpstart / gpstop extract information
  about standby
- how manual activation of standby (if required) might affect the flow
  or will we not provide utility to manually activate the standby

I have one question: how can a disconnected primary be incorporated back in the
cluster?  Changes made to its data directory after standby promotion (possibly
by in-progress clients before the primary shut itself down) need to be reverted,
using pg_rewind.  Should this process also be automated, possibly triggered by
monitor?

Well, we don't do that today for segments, so I feel we can hold off on that
automation for the master.

Note:
Asim, you used Primary to refer to Master (I am guessing that was
intentional, to avoid the word master). But primary is just confusing in
the GPDB context. We know we want to find a better alternative for
Master (to remove problematic language), but till that alternative is
found I feel I can continue with master. Some other alternative is
welcome, but primary is just too confusing.


--
Ashwin Agrawal (VMware)

Asim Praveen

Sep 21, 2020, 8:03:17 AM
to Ashwin Agrawal, Hao Wu, Greenplum Developers, Simon Gao

> On 18-Sep-2020, at 11:02 PM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> On Fri, Sep 18, 2020 at 2:59 AM Asim Praveen <pa...@vmware.com> wrote:
>>
>> Primary shuts down as soon as it detects that both, monitor as well as standby are
>> disconnected.
>
> Just to point out self shutting-down behavior should just be leveraged
> for optimization purposes and not for correctness. As shutdown can
> take time or may hang due to some bug. We have seen many instances of
> this in tests, where connections continue being served even after
> receiving a shut-down request for some time. Possibly need some external
> local entity to force kick it out, if it doesn't self shutdown. (pg_auto_failover has
> local non-postgres process on a node which can do this.)

This is a valid concern. A mechanism to shut down the postmaster that works
independently of the postmaster itself is necessary.

>
> I think what will save the day is synchronous replication will still
> be ON for master (when standby is promoted), hence even if master
> continues to accept connections, they will never complete.

Note that the prepare broadcasts sent by such a master to segments will not be
affected even when synchronous replication is ON and standby has been promoted.
The recent change (commit 5b73cefc99) to distributed transaction recovery
bgworker process would periodically abort such transactions on the promoted
standby.

>
>> If the standby is actively streaming from the primary,
>> monitor does not promote it. Thus, the problematic scenario that both primary
>> and standby end up accepting client connections, is avoided by this design.
>
> Based on the above, it seems we will not failover if the master's
> connection to segments is also broken along with the monitor but still
> alive with standby. If having a monitor as part of some segment, then
> the link with the monitor and segment will always go down
> together. So, this points out for sure that the monitor needs to be
> separated from segments.

A correlation between master and segments connection and placement of the
monitor is hard to see. Let’s say, monitor is running on a segment. If this
segment becomes unreachable from master, connection with monitor is also lost.
If master can reach standby, the cluster continues to operate, except that FTS
on master takes necessary action triggered by the segment’s failure.

> If the monitor is separate from segments, then I am not sure how
> important not being able to reach segments scenario is to cover but
> in cloud env might be possible.

I agree, it is not clear at the moment.

>
>> Monitor continues to probe standby after it is promoted. If the connection
>> between monitor and standby breaks, the promoted standby shuts itself down, as
>> per the logic mentioned above, leaving the cluster unavailable. The design
>> tolerates at the most one node failure.
>
> That seems to be too self-destructive. Why promoted standby (only
> standing master) if able to serve queries should go down?
>

It seems self-destructive, indeed. However, the design enforces that a master
must stop if it loses both, link with monitor and link with standby. At the
most one master must be active in the system. The same logic should apply to
standby upon promotion.


> Multiple host addresses in definitely one way of automating the client
> routing. Not sure if all users will leverage that aspect. In-case
> users wish to route the connections using some other mechanism, I
> think Monitor would be the place to find that information, correct?
> So, it seems important to be able to reach Monitor and extract
> information. Hence, "Not easy to expose the monitor information"
> becomes a very critical con for Inside GPDB approach, right?

That sounds right - monitor is the source of truth on master’s whereabouts and
therefore, running it on a Greenplum segment makes it more susceptible to being
unavailable.

>
> Also, it would be helpful to think through the following aspects:
> - when synchronous replication will be turned on and off. Which means
> discussing the flow for disconnection and reconnection of standby
> (this should be a simple piece)
> - where would utilities like gpstart / gpstop extract information
> about standby
> - how manual activation of standby (if required) might affect the flow
> or will we not provide utility to manually activate the standby

That’s a very good direction to pursue, thank you!

> Note:
> Asim, you used Primary, to refer to Master (I am guessing that was
> intentional to avoid usage of master word). But primary just confuses
> in the GPDB context. We know we want to find a better alternative for
> Master (to remove problematic language). Though till that alternative
> is found I feel I can continue with master. Or some other alternative is
> welcome but primary is just too confusing.
>

You have guessed it right. Apologies for causing the confusion. I’ve reverted
back to using master.

Asim

Hao Wu

Sep 24, 2020, 7:26:35 AM
to Asim Praveen, Ashwin Agrawal, Greenplum Developers, Simon Gao

>> If the standby is actively streaming from the primary,
>> monitor does not promote it.  Thus, the problematic scenario that both primary
>> and standby end up accepting client connections, is avoided by this design.
>
> Based on the above, it seems we will not failover if the master's
> connection to segments is also broken along with the monitor but still
> alive with standby. If having a monitor as part of some segment, then
> the link with the monitor and segment will always go down
> together. So, this points out for sure that the monitor needs to be
> separated from segments.

A correlation between master and segments connection and placement of the
monitor is hard to see.  Let’s say, monitor is running on a segment.  If this
segment becomes unreachable from master, connection with monitor is also lost.
If master can reach standby, the cluster continues to operate, except that FTS
on master takes necessary action triggered by the segment’s failure.

> If the monitor is separate from segments, then I am not sure how
> important not being able to reach segments scenario is to cover but
> in cloud env might be possible.

I agree, it is not clear at the moment.
The master's connectivity to the primaries may differ from the monitor's view as long as the monitor doesn't live on the same node as the master. The sole source of truth on connectivity between the master and the primaries is the master itself. So we could let the master tell the monitor how many connections to the primaries it cannot establish. If that really happens, the monitor could take it into consideration when deciding whether to promote the standby, even if connectivity among the monitor, master, and standby is healthy.

Ashwin Agrawal

Sep 29, 2020, 6:57:11 PM
to Asim Praveen, Hao Wu, Greenplum Developers, Simon Gao
On Mon, Sep 21, 2020 at 5:03 AM Asim Praveen <pa...@vmware.com> wrote:

> On 18-Sep-2020, at 11:02 PM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> On Fri, Sep 18, 2020 at 2:59 AM Asim Praveen <pa...@vmware.com> wrote:
>> 
>> Primary shuts down as soon as it detects that both, monitor as well as standby are
>> disconnected.
>
> Just to point out self shutting-down behavior should just be leveraged
> for optimization purposes and not for correctness. As shutdown can
> take time or may hang due to some bug. We have seen many instances of
> this in tests, where connections continue being served even after
> receiving a shut-down request for some time. Possibly need some external
> local entity to force kick it out, if it doesn't self shutdown. (pg_auto_failover has
> local non-postgres process on a node which can do this.)

This is a valid concern.  A mechanism to shutdown postmaster that can work
independent of the postmaster itself is necessary.

Based on whatever has been proposed so far, it also seems that not just shutting the old master down, but knowing it is down before promoting the standby, is required. I am not sure how that can be accomplished when connectivity to it is lost.

Or some other alternative measures could be evaluated to stop the old master from causing damage even if it is up: for example, before promoting the standby, the monitor could broadcast who the new master is to the segments, so that they can refuse connections from the old master. For this, the monitor might also need to know where the segments are and have connectivity to them. Plus, on segment restarts or failover, how do we communicate which master to accept connections from?
 

>
> I think what will save the day is synchronous replication will still
> be ON for master (when standby is promoted), hence even if master
> continues to accept connections, they will never complete.

Note that the prepare broadcasts sent by such a master to segments will not be
affected even when synchronous replication is ON and standby has been promoted.
The recent change (commit 5b73cefc99) to distributed transaction recovery
bgworker process would periodically abort such transactions on the promoted
standby.

That's crazy dangerous :-)

>
>> Monitor continues to probe standby after it is promoted.  If the connection
>> between monitor and standby breaks, the promoted standby shuts itself down, as
>> per the logic mentioned above, leaving the cluster unavailable.  The design
>> tolerates at the most one node failure.
>
> That seems to be too self-destructive. Why promoted standby (only
> standing master) if able to serve queries should go down?
>

It seems self-destructive, indeed.  However, the design enforces that a master
must stop if it loses both, link with monitor and link with standby.  At the
most one master must be active in the system.  The same logic should apply to
standby upon promotion.

Shouldn't the decision to shut down or not (on connectivity loss) depend on whether some other node was in sync and can be failed over? If it is the sole surviving entity and can't fail over to or activate another master, there is no point self-destructing, as that would cause more damage than good.
 

> Multiple host addresses in definitely one way of automating the client
> routing. Not sure if all users will leverage that aspect. In-case
> users wish to route the connections using some other mechanism, I
> think Monitor would be the place to find that information, correct?
> So, it seems important to be able to reach Monitor and extract
> information. Hence, "Not easy to expose the monitor information"
> becomes a very critical con for Inside GPDB approach, right?

That sounds right - monitor is the source of truth on master’s whereabouts and
therefore, running it on a Greenplum segment makes it more susceptible to being
unavailable.

The main aspect still unclear is whether the monitor is completely transparent to external entities, sitting within the system and doing its job, or whether it needs to be externally accessible.

Just random thoughts just to take the discussion forward:
- completely stop recording master-standby entry in gp_segment_configuration
- just for utilities like starting / stopping standby, record configuration info in some flat file on master (where standby exists). This remains static and doesn't change really
- monitor runs somewhere (inside the cluster) and records the SYNC state between the master and standby and controls fail-over activity. Mostly no external interface to connect to it.
- only mechanism to know who is master in system is by attempting *distributed* connection to master (distributed means like BEGIN which will also reach to all the segments and start the session)

What issues do folks think on this?


>
> Also, it would be helpful to think through the following aspects:
> - when synchronous replication will be turned on and off. Which means
>   discussing the flow for disconnection and reconnection of standby
>   (this should be a simple piece)
> - where would utilities like gpstart / gpstop extract information
>   about standby
> - how manual activation of standby (if required) might affect the flow
>   or will we not provide utility to manually activate the standby

That’s a very good direction to pursue, thank you!

I am guessing these are being looked into and the next iteration of the proposal will have these covered.


--
Ashwin Agrawal (VMware)

Hao Wu

unread,
Sep 29, 2020, 9:37:02 PM9/29/20
to Ashwin Agrawal, Asim Praveen, Greenplum Developers, Simon Gao
Based on what has been proposed so far, it seems we need not just to shut the old master down but also to know it is down before promoting the standby. I am not sure how that can be accomplished when connectivity to it is lost?

Or some alternative measure could be evaluated to stop the old master from causing damage even if it stays up. For example, before promoting the standby, the monitor could broadcast who the new master is to the segments, so that they can refuse connections from the old master. For this, the monitor might also need to know where the segments are and have connectivity to them. Plus, when segments restart or fail over, how do we communicate which new master to accept connections from?
 

>
> I think what will save the day is synchronous replication will still
> be ON for master (when standby is promoted), hence even if master
> continues to accept connections, they will never complete.

Note that the prepare broadcasts sent by such a master to segments will not be
affected even when synchronous replication is ON and standby has been promoted.
The recent change (commit 5b73cefc99) to distributed transaction recovery
bgworker process would periodically abort such transactions on the promoted
standby.

That's crazy dangerous :)
If the master lost connectivity to both the monitor and the standby, we probably have no way to shut down the master when promoting the standby. Shutting down the master is not necessary for promoting the standby, though. When the master detects that it has lost its connections to both the monitor and the standby, what it needs to do is close all active sessions and refuse to provide service, or shut itself down.
Synchronous replication is always ON unless the monitor explicitly asks master0 (the original master) or master1 (the promoted master) to turn it OFF; master0/master1 has no right to turn it OFF on its own. An instance that starts in the master role always turns synchronous replication ON, even if the GUC is OFF in the configuration file. After the monitor confirms the promotion has finished, it asks master1 to turn synchronous replication OFF and marks the replication not in-sync. After master0 becomes standby1 (i.e. master0 is demoted to standby1) and master1 reports the replication is in-sync, the monitor marks the replication between master1 and standby1 in-sync again.
Will it introduce any data consistency issues?
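For concreteness, the ON/OFF toggling described above might look like the following sketch. The class names are hypothetical, and the exact point at which master1 turns synchronous replication back ON is an assumption, not something spelled out in the proposal:

```python
# Illustrative sketch of the synchronous-replication toggling around a
# promotion. Not Greenplum code; names and the re-enable point are assumptions.

class Node:
    def __init__(self, name):
        self.name = name
        self.role = "standby"
        self.sync_rep = False

    def start_as_master(self):
        # An instance starting in the master role forces sync rep ON,
        # regardless of the GUC in the configuration file.
        self.role = "master"
        self.sync_rep = True

class Monitor:
    def __init__(self):
        self.in_sync = True

    def confirm_promotion(self, master1):
        # Promotion confirmed: allow master1 to commit without a standby.
        master1.sync_rep = False
        self.in_sync = False

    def on_demotion_caught_up(self, master1, standby1):
        # master0 has been demoted to standby1 and master1 reports the
        # replication caught up: mark in-sync and re-enable sync rep.
        standby1.role = "standby"
        master1.sync_rep = True
        self.in_sync = True
```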

Hao Wu


Ashwin Agrawal

unread,
Sep 30, 2020, 7:49:39 PM9/30/20
to Hao Wu, Asim Praveen, Greenplum Developers, Simon Gao
On Tue, Sep 29, 2020 at 6:37 PM Hao Wu <ha...@vmware.com> wrote:
Note that the prepare broadcasts sent by such a master to segments will not be
affected even when synchronous replication is ON and standby has been promoted.
The recent change (commit 5b73cefc99) to distributed transaction recovery
bgworker process would periodically abort such transactions on the promoted
standby.

That's crazy dangerous :)
If the master lost connectivity to both the monitor and the standby, we probably have no way to shut down the master when promoting the standby. Shutting down the master is not necessary for promoting the standby, though. When the master detects that it has lost its connections to both the monitor and the standby, what it needs to do is close all active sessions and refuse to provide service, or shut itself down.
Synchronous replication is always ON unless the monitor explicitly asks master0 (the original master) or master1 (the promoted master) to turn it OFF; master0/master1 has no right to turn it OFF on its own. An instance that starts in the master role always turns synchronous replication ON, even if the GUC is OFF in the configuration file. After the monitor confirms the promotion has finished, it asks master1 to turn synchronous replication OFF and marks the replication not in-sync. After master0 becomes standby1 (i.e. master0 is demoted to standby1) and master1 reports the replication is in-sync, the monitor marks the replication between master1 and standby1 in-sync again.
Will it introduce any data consistency issues?

I am not seeing how this covers the case Asim mentioned earlier about cancelling prepared transactions. Plus, synchronous replication only comes into play at prepare or commit time. If master0 continues to run after master1 is promoted, master0 might, for example, try to create a table with the same name and block master1, or GDD on master0 might kick in and try some fancy things with sessions from master1, neither of which depends on synchronous replication. So even if this may not pose direct data consistency issues, it can potentially cause unnecessary noise and disruption in the cluster.

Asim Praveen

unread,
Oct 5, 2020, 2:22:29 PM10/5/20
to Ashwin Agrawal, Hao Wu, Greenplum Developers, Simon Gao


> On 30-Sep-2020, at 4:26 AM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> Or some other alternative measures are evaluated to stop the old master from causing damage even if up like a monitor before promoting standby, broadcasting who the new master is in the system to segments. So, that can refuse connection from the old master. For this might need a monitor to also know where segments are and have connectivity to segments. Plus, segments restarts or failover, how to communicate who the new master is to accept connections from.
>

What if the standby, upon promotion, towards the end of promote sequence, before
starting to accept client connections, performs the following?

1. Standby sends a special libpq message “I’m the new master” to primary
segments.

2. Upon receiving this message, segments terminate all active backend processes
handling user queries.

3. Once all active connections have been terminated (or signalled to terminate),
segments respond back with success.

4. Standby waits until all segments have responded positively.

5. Standby is now ready to accept connections.

Segments don’t need to remember who the active master is. Monitor doesn’t need
to be made aware of segments in the cluster.
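A minimal sketch of that five-step handshake. The `send` and `wait_ack` callbacks are hypothetical stand-ins for the special libpq message and its acknowledgement; this is not the actual Greenplum protocol:

```python
# Sketch of the five-step promotion handshake described above.

def promote_standby(segments, send, wait_ack):
    # Steps 1-2: announce the new master; each primary segment terminates
    # its active user backends on receipt.
    for seg in segments:
        send(seg, "I'm the new master")
    # Steps 3-4: wait until every segment acknowledges the terminations.
    if not all(wait_ack(seg) for seg in segments):
        raise RuntimeError("promotion aborted: a segment did not acknowledge")
    # Step 5: only now is the promoted standby ready for client connections.
    return "accepting-connections"
```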

>
>> It seems self-destructive, indeed. However, the design enforces that a master
>> must stop if it loses both, link with monitor and link with standby. At the
>> most one master must be active in the system. The same logic should apply to
>> standby upon promotion.
>
> Shouldn't the decision to shut-down or not (on connectivity loss) depend on if some other node was in sync and can be failed-over or not ? It is a sole entity and can't failover or activate another master, no point self destructing as going to cause more damage than good.

One way to look at the proposal is it guarantees one node fault tolerance among
master, standby and monitor. If any two or more nodes are down, the system is
not available (not exactly but somewhat similar to double fault in FTS lingo).

>
> The main aspect still unclear is, the monitor is completely transparent to external entities, sits within the system and does it's job or needs to be provided access to externally.
>
> Just random thoughts just to take the discussion forward:
> - completely stop recording master-standby entry in gp_segment_configuration

Makes sense. As you’ve noted subsequently, the sync state needs to be
maintained by the monitor.

> - just for utilities like starting / stopping standby, record configuration info in some flat file on master (where standby exists). This remains static and doesn't change really

Standby details can be obtained from a running master using pg_stat_replication.
These details must also be accessible to gpstart while the master is not running.
A flat file on the master is a good idea. The flat file can be created upon
standby initialisation and also when bringing up a failed master as standby.

> - monitor runs somewhere (inside the cluster) and records the SYNC state between the master and standby and controls fail-over activity. Mostly no external interface to connect to it.
> - only mechanism to know who is master in system is by attempting *distributed* connection to master (distributed means like BEGIN which will also reach to all the segments and start the session)

The original proposal defines active master to be the one which can accept new
client connections. The underlying assumption is, the failed master rejects new
connections if both, synchronous replication is not active and monitor is not
reachable. Your suggestion, on the other hand, assumes that segments remember
who the active master is. That implies additional complexity in the standby
promotion workflow proposed above.

> >
> > Also, it would be helpful to think through the following aspects:
> > - when synchronous replication will be turned on and off. Which means
> > discussing the flow for disconnection and reconnection of standby
> > (this should be a simple piece)
> > - where would utilities like gpstart / gpstop extract information
> > about standby
> > - how manual activation of standby (if required) might affect the flow
> > or will we not provide utility to manually activate the standby
>
> That’s a very good direction to pursue, thank you!
>
> I am guessing these are being looked into and the next iteration of the proposal will have these covered.
>

Let me track these and other items that are somewhat well defined as GItHub
issues.

Asim

Ashwin Agrawal

unread,
Oct 5, 2020, 8:45:44 PM10/5/20
to Asim Praveen, Hao Wu, Greenplum Developers, Simon Gao
On Mon, Oct 5, 2020 at 11:22 AM Asim Praveen <pa...@vmware.com> wrote:
> On 30-Sep-2020, at 4:26 AM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> Or some other alternative measures are evaluated to stop the old master from causing damage even if up like a monitor before promoting standby, broadcasting who the new master is in the system to segments. So, that can refuse connection from the old master. For this might need a monitor to also know where segments are and have connectivity to segments. Plus, segments restarts or failover, how to communicate who the new master is to accept connections from.


What if the standby, upon promotion, towards the end of promote sequence, before
starting to accept client connections, performs the following?

1. Standby sends a special libpq message “I’m the new master” to primary
segments.

2. Upon receiving this message, segments terminate all active backend processes
handling user queries.

3. Once all active connections have been terminated (or signalled to terminate),                                                                                                                                                               
segments respond back with success.

4. Standby waits until all segments have responded positively.

5. Standby is now ready to accept connections.

Segments don’t need to remember who the active master is.  Monitor doesn’t need
to be made aware of segments in the cluster.

So the primary purpose of that message is just to trigger (not wait for) termination of currently running sessions on promotion. Sounds good to me.


>
>> It seems self-destructive, indeed.  However, the design enforces that a master
>> must stop if it loses both, link with monitor and link with standby.  At the
>> most one master must be active in the system.  The same logic should apply to
>> standby upon promotion.
>
> Shouldn't the decision to shut-down or not (on connectivity loss) depend on if some other node was in sync and can be failed-over or not ? It is a sole entity and can't failover or activate another master, no point self destructing as going to cause more damage than good.

One way to look at the proposal is it guarantees one node fault tolerance among
master, standby and monitor.  If any two or more nodes are down, the system is
not available (not exactly but somewhat similar to double fault in FTS lingo).

Sure, let's make a note of this in the design; it's not a blocker to moving forward. We can always iterate on this behavior as we go. Though, just to correlate with FTS: in case of a double fault we take the conservative approach and *don't take any action*; instead we let the currently acting role serve as best it can in its current role.

> - only mechanism to know who is master in system is by attempting *distributed* connection to master (distributed means like BEGIN which will also reach to all the segments and start the session)

The original proposal defines active master to be the one which can accept new
client connections.  The underlying assumption is, the failed master rejects new
connections if both, synchronous replication is not active and monitor is not
reachable.  Your suggestion, on the other hand, assumes that segments remember
who the active master is.  That implies additional complexity in the standby
promotion workflow proposed above.

Agreed, it is desirable and simpler if we can eliminate segments from this picture. I still don't see how, practically, we will prevent internal processes like FTS, GDD, the DTM recovery process, auto-analyze, or something else in the future from kicking in and doing some damage. There will, I guess, be some timeouts to detect disconnection from the standby or monitor; if those gaps can be filled in, then this sounds good to me.


> >
> > Also, it would be helpful to think through the following aspects:
> > - when synchronous replication will be turned on and off. Which means
> >   discussing the flow for disconnection and reconnection of standby
> >   (this should be a simple piece)
> > - where would utilities like gpstart / gpstop extract information
> >   about standby
> > - how manual activation of standby (if required) might affect the flow
> >   or will we not provide utility to manually activate the standby
>
> That’s a very good direction to pursue, thank you!
>
> I am guessing these are being looked into and the next iteration of the proposal will have these covered.
>

Let me track these and other items that are somewhat well defined as GItHub
issues.

Thanks. I think we should update and use https://github.com/greenplum-db/gpdb/wiki/Master-auto-failover to track high-level refined design, open items and such details as discussion evolves on mailing list.

Asim Praveen

unread,
Oct 20, 2020, 10:54:21 AM10/20/20
to Ashwin Agrawal, Hao Wu, Greenplum Developers, Simon Gao

> On 06-Oct-2020, at 6:15 AM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
>> On Mon, Oct 5, 2020 at 11:22 AM Asim Praveen <pa...@vmware.com> wrote:
>> > On 30-Sep-2020, at 4:26 AM, Ashwin Agrawal <ashwi...@gmail.com> wrote:
>> >
>> > - only mechanism to know who is master in system is by attempting *distributed* connection to master (distributed means like BEGIN which will also reach to all the segments and start the session)
>>
>> The original proposal defines active master to be the one which can accept new
>> client connections. The underlying assumption is, the failed master rejects new
>> connections if both, synchronous replication is not active and monitor is not
>> reachable. Your suggestion, on the other hand, assumes that segments remember
>> who the active master is. That implies additional complexity in the standby
>> promotion workflow proposed above.
>>
> Agreed, it is desirable and simpler if we can eliminate segments from this picture. I still don't see how, practically, we will prevent internal processes like FTS, GDD, the DTM recovery process, auto-analyze, or something else in the future from kicking in and doing some damage. There will, I guess, be some timeouts to detect disconnection from the standby or monitor; if those gaps can be filled in, then this sounds good to me.
>

That’s a decisive point.  We cannot rely on segments not remembering who the
currently acting coordinator is.  The not-so-hypothetical case of a failed
coordinator steadfastly refusing to shut down even after standby promotion may
have a background process continue to reach out to segments.  Segments must be
able to reject such connections.

The standby promotion workflow proposed earlier should be changed so that
segments, upon receiving the “I’m the new coordinator” message, not only
terminate active connections, but also remember the coordinator’s identity (IP
address and port). This information should be persisted by a primary segment
and also propagated to its mirror segment. What is the right way to persist
this information? One option is to introduce a new GUC coordinator_conninfo,
like primary_conninfo. In addition to writing the GUC to postgresql.auto.conf,
a WAL record can be emitted so that the mirror segment does the same thing.

The “I’m the new coordinator” message can be acknowledged only after the mirror
has confirmed flush of the WAL record.
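A rough sketch of the segment-side handling just described. The `auto_conf` dict stands in for postgresql.auto.conf, `wal` for the segment's WAL stream, and `mirror_flush` for waiting on the mirror's flush confirmation; none of these are real Greenplum APIs:

```python
# Sketch of a segment handling "I'm the new coordinator": terminate backends,
# persist coordinator_conninfo, emit a WAL record, ack only after mirror flush.

def handle_new_coordinator(conninfo, auto_conf, wal, mirror_flush,
                           terminate_backends):
    terminate_backends()                          # kill active user backends
    auto_conf["coordinator_conninfo"] = conninfo  # persist the proposed GUC
    wal.append(("set-coordinator", conninfo))     # WAL record for the mirror
    if not mirror_flush(wal):                     # wait for mirror's flush
        return "nack"
    return "ack"  # only now may the message be acknowledged
```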

Asim

Hao Wu

unread,
Oct 20, 2020, 9:51:46 PM10/20/20
to Asim Praveen, Ashwin Agrawal, Greenplum Developers, Simon Gao
The standby promotion workflow proposed earlier should be changed so that
segments, upon receiving “I’m the new coordinator” message, not only terminate
active connections, but also remember the coordinator’s identity (IP address and
port).  This information should be persisted by a primary segment and also
propagated to its mirror segment.  What is the right way to persist this
information?  One option is to introduce a new GUC coordinator_conninfo, like
primary_conninfo.  In addition to writing the GUC to postgresql.auto.conf, a WAL
record can be emitted so that the mirror segment does the same thing.
I disagree with the above workflow. Since neither the monitor nor the standby can connect to the coordinator, the coordinator should also detect that those connections are gone, if it is still alive. The message, "I'm the new coordinator", tells the primaries to terminate all active connections. So, the assumption behind the above discussion is:
  1. Coordinator0 isn't killed.
  2. Even if coordinator0 detects that its connections to the standby and monitor are gone, it is still able to start new FTS/GDD connections to the primaries.

That seems pessimistic. We could add an additional condition for starting new connections to the primaries to avoid this case: only if the coordinator's role is confirmed and it can connect to either the standby or the monitor may it connect to the primaries. Once the coordinator detects that both connections are gone, its role is undetermined and needs reconfirmation.
Confirmation of the coordinator role requires any one of the following conditions:
  1. The role of coordinator is assigned by the monitor.
  2. The standby establishes replication to the node.

Having the segments remember the coordinator is not a good idea; it makes things complicated.

Regards,
Hao Wu

Hao Wu

unread,
Oct 22, 2020, 10:56:31 AM10/22/20
to Asim Praveen, Ashwin Agrawal, Greenplum Developers, Simon Gao
Let's summarize the difference between pg_auto_failover and our current proposal.

  1. When the connection between the coordinator and monitor is lost, and the standby can connect to both monitor and coordinator, pg_auto_failover will promote the standby, but our proposal doesn't.
  2. The primary node in pg_auto_failover stores the user data; it directly modifies database objects locally. In Greenplum, the coordinator/master node mainly dispatches queries to the segments.
The second point makes a big difference for split-brain/data consistency. In pg_auto_failover, if synchronous replication is turned on on the primary node, any commit/prepare transaction can't complete. But in Greenplum, things are more complicated. If the old coordinator is still alive after promotion, there are 2 types of cases we must take care of.

  1. If the query is optimized to a 1-phase commit, the transaction will not be blocked by synchronizing the WAL records from the coordinator to the standby.
  2. The coordinator may connect to the primaries to modify database objects/states. Like dtx-recovery, FTS, GDD, etc.
Asim and I have different approaches to resolve the above issues.
Asim: Let the primaries remember who the active coordinator is, and reject connections from coordinator0 (the old coordinator).
Me: The coordinator maintains a timer. On timeout, the coordinator is disallowed from connecting to the segments.

When promotion happens, the promoted coordinator notifies all primaries: "Hey, I'm the coordinator now!". Then all primaries terminate all backends (except the current backend handling the notification) and remember the coordinator.

It seems good to combine the 2 approaches.
My doubt is what key we use to distinguish the source node. Using the IP address as the key may be a problem in a multi-address environment. Currently, the destination IP is fixed to be the address in gp_segment_configuration, but the source IP is not bound. So it's not reliable to tell whether a connection is from the desired coordinator by IP address alone. Adding an additional token may resolve this issue, but it changes the protocol, which may be unacceptable.


Regards,
Hao Wu

Ashwin Agrawal

unread,
Oct 22, 2020, 5:54:15 PM10/22/20
to Hao Wu, Asim Praveen, Greenplum Developers, Simon Gao
On Thu, Oct 22, 2020 at 7:56 AM Hao Wu <ha...@vmware.com> wrote:
Let's summarize the difference between pg_auto_failover and our current proposal.

  1. When the connection between the coordinator and monitor is lost, and the standby can connect to both monitor and coordinator, pg_auto_failover will promote the standby, but our proposal doesn't.
What's the downside of sticking with pg_auto_failover's logic? (I understand your proposed enhancement is less disruptive to the workload: if only the monitor is disconnected but the coordinator is fine, why fail over? Still trying to understand if it's just an optimization we are proposing here or a really hard requirement. Possibly re-asking something we might have discussed earlier.)

Given that in GPDB the standby can't run queries, I don't know how the monitor will know the standby is connected to the coordinator and replication is flowing. I am guessing pg_auto_failover uses pg_stat_replication-style queries on the standby to monitor the system. Maybe we will have to hack it, similar to how we let FTS connections through today, to let pg_auto_failover queries go through on the standby.

I would really like to see a table/matrix capturing the different node states, the connection statuses between them, and the next actions. A kind of state transition diagram.
  1. The primary node in pg_auto_failover stores the user data; it directly modifies database objects locally. In Greenplum, the coordinator/master node mainly dispatches queries to the segments.
The second point makes a big difference for split-brain/data consistency. In pg_auto_failover, if synchronous replication is turned on on the primary node, any commit/prepare transaction can't complete. But in Greenplum, things are more complicated. If the old coordinator is still alive after promotion, there are 2 types of cases we must take care of.

  1. If the query is optimized to 1-phase-commit, the transaction will not be blocked by synchronizing the WAL records from coordinator to standby.
  2. The coordinator may connect to the primaries to modify database objects/states. Like dtx-recovery, FTS, GDD, etc.
Agreed, that's where the complexity stems from for GPDB compared to single-node PostgreSQL, where for high availability each node is consistent in itself (the only worry is that it may be behind in time). For GPDB, consistency is based on a group of PostgreSQL nodes.
 
Asim and I have different approaches to resolve the above issues.
Asim: Let the primaries remember who is the active coordinator, and reject connections from coordinator0(the old coordinator).
Me: Coordinator maintains a timer. When timeout, the coordinator is disallowed to connect to the segments.

I don't understand how a timer and timeout is a solution. Can you please explain how it fixes the situation? I think any timeout solution will have a window/gap where things will go wrong.

Frankly, I personally don't like either one, even though I had proposed the first as well.

It's a problem we should continue to think about and see how to solve, definitely not a showstopper.  We can move forward in the feature while exploring ideas for this.

When promotion happens, the promoted coordinator notifies all primaries "Hey, I'm the coordinator now!". Then all primaries terminate all backends(except the current backend for notification) and remember the coordinator.

It seems good to combine the 2 approaches.
My doubt is what is the key we use to distinguish the source node. Using IP address as the key may have problem in multi-address environment. Currently, the destination IP is fixed to be address in gp_segment_configuration, but the source IP is not bound. So, it’s not reliable to tell whether the connection is from the desired coordinator by only IP address. Adding additional token may resolve this issue, but it changes the protocols which may be unacceptable.

Yes, we will mostly have to use the DBID for this.

Hao Wu

unread,
Oct 22, 2020, 9:54:36 PM10/22/20
to Ashwin Agrawal, Asim Praveen, Greenplum Developers, Simon Gao
What's the downside of sticking with pg_auto_failovers logic? (I understand your proposed enhancement is less disruptive to the work-load as only the monitor is disconnected but coordinator is fine why failover. Still trying to understand if it's just optimization we are proposing here or a really hard requirement. Possibly reasking something we might have discussed earlier.)
When the network between the monitor and coordinator is unstable, the monitor may fail to detect the coordinator. The network issue may be temporary while all other connections are good. If we hurry to promote the standby, there will be a window during which the cluster is unavailable.

Given in GPDB standby can't run queries, I don't know how monitor will know standby is connected to coordinator and replication is flowing. I am guessing pg_auto_failover uses pg_stat_replication kind of queries on standby to monitor the system. Maybe we will have to hack, similar to how we let FTS connections today to let pg_auto_failover queries to go through on standby.
The standby knows the replication state, and it periodically reports to the monitor the health of the replication, how long it has been disconnected, etc.

I don't understand how timer and timeout is a solution. Please can you explain how it fixes the situation. I think any timeout solution will have a window/gap where things will go wrong.
Sure. An unconfirmed coordinator is disallowed from connecting to the primaries.
  1. When an instance starts/restarts as the coordinator, its role is unconfirmed.
  2. After the role of coordinator is confirmed, the coordinator maintains a timer (Tc). It records how long the coordinator has been without both connections, to the monitor and to the standby. On timeout, the role of coordinator becomes undetermined/unconfirmed, i.e. it needs reconfirmation or demotion. See below.
  3. If the replication is established, the role of coordinator is confirmed.
  4. If the coordinator role is assigned by the monitor, the role is also confirmed.
On the standby side, it also maintains a timer (Ts). This timer records how long the standby has been without the replication connection to its peer/coordinator. The standby periodically reports this timer to the monitor.
On the monitor side, it maintains a timer (Tm). This timer records how long the monitor has been without the connection to the coordinator.
Let Tp = min(Ts, Tm).

From the definition of the timers, Tp is equal to Tc in theory if we ignore detection delay; in practice we can assume Tp is close to Tc.
Now, we define 3 time intervals:
Lc: if Tc >= Lc, the coordinator loses its role; the role of coordinator needs reconfirmation or demotion.
Lp: if Tp >= Lp, the monitor starts to promote the standby.
Gp := Lp - Lc, the gap between the coordinator losing its role and the beginning of promotion.

If we set Gp to a safe value, the coordinator has already lost its role before the promotion.
When the standby is promoted, it should still notify all primaries to terminate all backends (except the notification connection).
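The timer scheme above could be sketched as follows. The concrete Lc/Gp values are arbitrary examples, not part of the proposal:

```python
# Sketch of the timer-based fencing. Tc/Ts/Tm count seconds since the
# respective links went down; threshold values are illustrative only.

LC = 10.0      # Lc: coordinator self-fences after both links are down this long
GP = 5.0       # Gp: safety gap between self-fencing and promotion
LP = LC + GP   # Lp: monitor starts promotion once Tp reaches this

def coordinator_has_role(tc):
    # Tc: how long the coordinator has been without both monitor and standby.
    return tc < LC

def monitor_should_promote(ts, tm):
    # Tp = min(Ts, Tm) approximates Tc; promoting only at Tp >= Lp ensures the
    # coordinator has already fenced itself (assuming small detection delay).
    return min(ts, tm) >= LP
```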

Yes, mostly will have to use DBID possibly for this.
By the time the backend gets the DBID, it is possibly already processing the query, so the rejection would come during query processing rather than at connection time. I'm not sure whether all connections will provide the DBID of the QD.

Regards,
Hao Wu

From: Ashwin Agrawal <ashwi...@gmail.com>
Sent: Friday, October 23, 2020 5:54 AM
To: Hao Wu <ha...@vmware.com>
Cc: Asim Praveen <pa...@vmware.com>; Greenplum Developers <gpdb...@greenplum.org>; Simon Gao <sim...@vmware.com>

Subject: Re: Proposal: Master Auto Failover for Greenplum

Ashwin Agrawal

Oct 23, 2020, 6:41:48 PM
to Hao Wu, Asim Praveen, Greenplum Developers, Simon Gao
On Thu, Oct 22, 2020 at 6:54 PM Hao Wu <ha...@vmware.com> wrote:
What's the downside of sticking with pg_auto_failover's logic? (I understand your proposed enhancement is less disruptive to the workload: if only the monitor is disconnected but the coordinator is fine, why fail over? Still trying to understand whether what we are proposing here is just an optimization or a really hard requirement. Possibly re-asking something we might have discussed earlier.)
When the network between the monitor and coordinator is unstable, the monitor may fail to detect the coordinator. The network issue may be temporary while all other connections are good. If we rush to promote the standby, there would be a window during which the cluster is unavailable.

This doesn't read as a GPDB-specific issue to me. If the concern exists, single-node PostgreSQL essentially faces the same problem. Plus, if the network is so unstable (given GPDB nodes are within a single site), we potentially have many other issues with the cluster. So I don't see any reason to touch the state machine for this purpose, definitely not until we hear about it from the field as a concerning incident on a real production cluster once this solution is deployed.

Given that in GPDB the standby can't run queries, I don't know how the monitor will know the standby is connected to the coordinator and replication is flowing. I am guessing pg_auto_failover uses pg_stat_replication-style queries on the standby to monitor the system. Maybe we will have to hack, similar to how we let FTS connections through today, to let pg_auto_failover queries go through on the standby.
The standby knows the replication state, and it periodically reports to the monitor the health of the replication, how long it has been disconnected, etc.

How?

I don't understand how a timer and timeout is a solution. Can you please explain how it fixes the situation? I think any timeout solution will have a window/gap where things will go wrong.
Sure. An unconfirmed coordinator is not allowed to connect to the primaries.
  1. When an instance starts/restarts as the coordinator, its role is unconfirmed.
  2. After the role of coordinator is confirmed, the coordinator maintains a timer (Tc). It records how long the coordinator has lost both the connection to the monitor and the connection to the standby. On timeout, the role of coordinator becomes undetermined/unconfirmed, i.e. it needs reconfirmation or must demote. See below.
  3. If the replication is established, the role of coordinator is confirmed.
  4. If the coordinator role is assigned by the monitor, the role is also confirmed.
On the standby side, it also maintains a timer (Ts). The timer records how long the standby has lost the replication connection to its peer, the coordinator. The standby periodically reports the timer to the monitor.
On the monitor side, it maintains a timer (Tm). The timer records how long the monitor has lost the connection to the coordinator.
Let Tp = min(Ts, Tm).

From the definition of the timers, Tp is equal to Tc in theory if we don't take the detection delay into consideration; in practice, Tp is close to Tc.
Now, we define 3 time intervals.
Lc: if Tc >= Lc, the coordinator loses its role; the role needs reconfirmation or the coordinator must demote.
Lp: if Tp >= Lp, the monitor starts to promote the standby.
Gp := Lp - Lc, the gap between the coordinator losing its role and the beginning of promotion.

If we set Gp to a sufficiently large (safe) value, the coordinator is guaranteed to have lost its role before promotion begins.
When the standby is promoted, it should still notify all primaries to terminate all backends (except the notification connection).

Thanks for explaining. Sorry, but this is all just too complicated for my brain; it will probably take me a long time to digest. It feels like it will introduce multiple choking points in connecting to segments any time a replication or monitor connection drops. Also, it seems to continue to rely on the disconnected coordinator to behave, and I don't know how much we can trust that. Failed nodes may be facing any kind of issue; I don't think we can trust them to behave.

Hao Wu

Nov 19, 2020, 7:43:35 AM
to Greenplum Developers, Asim Praveen, Simon Gao
Update:

Currently, we are making pg_auto_failover work for Greenplum in development; however, there are dozens of work items remaining. Here is the summary.


Hot standby in pg_auto_failover

Hot standby is required by pg_auto_failover, but Greenplum doesn't fully support hot standby. pg_auto_failover needs hot standby so that the monitor can connect to the standby for read-only queries. The monitor still has to make the connection in utility mode, which means clients can't connect to the standby for reads. Besides, hot standby is still transaction read-only; it disallows writes on the standby.


Deployment of the monitor and saving the metadata about pg_auto_failover

The monitor is a new PG instance with the pg_auto_failover extension. We need to choose where to deploy the monitor node and where to save the metadata about auto failover. The metadata contains the host and port of the monitor node and the datadir of the monitor instance. Currently, we choose one of the segment nodes to deploy the monitor, and the metadata about auto failover is passed to the programs explicitly.


Init system

Initializing a Greenplum system with auto failover has 2 steps:

  1. Init the Greenplum system without pg_auto_failover support

  2. Configure the existing Greenplum system for auto failover

Note: `pg_autoctl create postgres` can either create a new PG instance or configure an existing PG instance to be a node managed by pg_auto_failover. But we can't let pg_autoctl create the coordinator instance for us, because `pg_autoctl create postgres` doesn't only create a PG instance, it also starts it. Before gp_segment_configuration is set properly, we can't really start the node in dispatch mode. So we should first create the GPDB nodes and then configure the coordinator and standby for auto failover.


pg_hba.conf

`pg_autoctl create postgres` adds some configuration files inside and outside of the existing pgdatadir, for example `~/.config/pg_autoctl/home/gpadmin/datadirs/<node_name>/pg_autoctl.cfg` and $pgdata/postgresql-auto-failover.conf. pg_hba.conf also gets some entries added to allow connections from the monitor. However, some entries are still missing, so the monitor cannot connect to the standby. This needs to be fixed.


pg_rewind

Running pg_rewind from pg_autoctl is buggy; it can't correctly update the WAL segments and the controldata file. Currently, we skip pg_rewind and always use pg_basebackup instead.


pg_basebackup

Currently, pg_basebackup is always used in promotion, but the replication configuration file (postgresql.auto.conf) isn't updated. We may add the `-R` option to pg_basebackup, which generates a new postgresql.auto.conf for us. pg_autoctl also generates a file named `postgresql-auto-failover-standby.conf`, which contains the SSL options. Currently, we overwrite postgresql.auto.conf with the content of postgresql-auto-failover-standby.conf. We need to check which approach is better in later stories.


restart the postmaster by pg_autoctl

For the coordinator and standby, the postmaster of the PG instance is a child process of one of the pg_autoctl processes. When the postmaster goes down, pg_autoctl may restart it. Currently, restarting the postmaster via pg_autoctl fails.



libpq support for multiple addresses

Since PG 10, libpq supports multiple addresses in the connection string, which enables client-side failover. However, if the first address is not a hot standby, libpq will not try the second address.

Hubert and Hao Wu have a patch for the upstream: https://www.postgresql.org/message-id/flat/BN6PR05MB3492948E4FD76C156E747E8BC9160%40BN6PR05MB3492.namprd05.prod.outlook.com
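For illustration, the multi-host connection string that libpq accepts looks like the one built below. The helper function is hypothetical; `target_session_attrs=read-write` is the stock libpq mechanism for having the client try each address until it finds the writable server.

```python
# Hypothetical helper that builds a libpq multi-host connection string of
# the form "host=h1,h2 port=p1,p2 ...". With target_session_attrs=read-write,
# libpq (PG 10+) tries each address until it finds a writable server,
# which is exactly the failover pattern discussed above.

def multihost_conninfo(endpoints, dbname, user,
                       target_session_attrs="read-write"):
    hosts = ",".join(h for h, _ in endpoints)
    ports = ",".join(str(p) for _, p in endpoints)
    return (f"host={hosts} port={ports} dbname={dbname} user={user} "
            f"target_session_attrs={target_session_attrs}")

conninfo = multihost_conninfo(
    [("coordinator.example", 5432), ("standby.example", 5432)],
    dbname="postgres", user="gpadmin")
assert conninfo == ("host=coordinator.example,standby.example "
                    "port=5432,5432 dbname=postgres user=gpadmin "
                    "target_session_attrs=read-write")
```

The limitation discussed in this thread is that when the second host is not a hot standby (as in Greenplum), libpq cannot evaluate `target_session_attrs` against it and gives up instead of retrying.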


state of the postmaster: ready vs dtmready

pg_autoctl checks the status of the postmaster. In PostgreSQL, `ready` is the state for both the primary and the secondary when the instance is ready. But in Greenplum, `ready` is an intermediate state of the coordinator (the final state is `dtmready`). This may cause errors or dump annoying messages. We'd better take care of these states.


let the segments remember the active coordinator

In our previous discussion, when promotion happens, the promoted coordinator sends a notification to all segments (primaries and mirrors) telling them: "I'm the new coordinator, remember my id". This logic is currently entirely absent.


status of btree_gist and pg_stat_statements

btree_gist and pg_stat_statements are required by pg_auto_failover. However, they are not compiled and installed.

We need to check their status and understand why they are required by pg_auto_failover.


group_id is a keyword in Greenplum

group_id is a keyword in Greenplum, but pg_auto_failover uses it as a parameter name in some UDFs, which causes a syntax error when parsing the function. I have opened a GitHub issue upstream (https://github.com/citusdata/pg_auto_failover/issues/509). Upstream is considering providing a PR for compatibility purposes. For now, I have renamed the parameter group_id to group__id.


Impact on utility tools

As far as I know, the affected utility tools include: gpinitsystem, gpstart, gpstop, gpactivatestandby, gpinitstandby, gpexpand.

Init the Greenplum cluster

  1. Create a new GPDB cluster without auto failover.

  2. Create the monitor instance.

  3. Configure the coordinator and standby to be a pair in a group of the auto failover.

  4. Configure the metadata of the monitor somewhere, e.g. in the coordinator if we save the metadata in the coordinator/standby.

Start the Greenplum cluster

  1. Get the metadata about the auto failover based on the input parameter

    1. If we get the metadata from a heap table in the coordinator node, we retrieve it by connecting to the PG instance in single mode.

  2. Start the monitor instance.

  3. Get the metadata of the current coordinator from the monitor.

  4. Retrieve gp_segment_configuration from the coordinator in single mode.

  5. Start all the segments.

  6. Get the metadata of the coordinator and standby

  7. Start the coordinator and standby. The start options are saved in the configuration files managed by pg_auto_failover.
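The start sequence above could be sketched as an ordered pipeline. All names below are illustrative; in practice each step would shell out to pg_autoctl, pg_ctl, or psql in utility/single mode.

```python
# Hypothetical orchestration of the gpstart sequence described above.
# Each step is injected as a callable so the ordering itself can be
# exercised without a real cluster.

def start_cluster(steps):
    """Run the start steps in the documented order and return the trace."""
    order = [
        "load_failover_metadata",      # 1. metadata from the input parameter
        "start_monitor",               # 2. start the monitor instance
        "query_current_coordinator",   # 3. ask the monitor who is coordinator
        "read_segment_configuration",  # 4. gp_segment_configuration, single mode
        "start_segments",              # 5. start all segments
        "query_coordinator_standby",   # 6. coordinator/standby metadata
        "start_coordinator_standby",   # 7. options managed by pg_auto_failover
    ]
    trace = []
    for name in order:
        steps[name]()   # in reality: invoke the utility for this step
        trace.append(name)
    return trace

# Usage: record the call order with no-op steps.
step_names = [
    "load_failover_metadata", "start_monitor", "query_current_coordinator",
    "read_segment_configuration", "start_segments",
    "query_coordinator_standby", "start_coordinator_standby"]
trace = start_cluster({name: (lambda: None) for name in step_names})
assert trace[0] == "load_failover_metadata"
assert trace.index("start_monitor") < trace.index("start_segments")
assert trace[-1] == "start_coordinator_standby"
```

The key ordering constraint is that the monitor must be up before anything else is queried, and the coordinator/standby start last so their start options can come from the pg_auto_failover configuration.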

Stop the Greenplum cluster

  1. Get the metadata about the auto failover based on the input parameter.

  2. Retrieve the coordinator and standby metadata from the monitor instance.

  3. Get gp_segment_configuration from the coordinator.

  4. Stop the coordinator and standby

  5. Stop all segments in gp_segment_configuration.

  6. Stop the monitor instance

Besides the normal case, we also have to start/stop the cluster correctly when the monitor is down.

gpexpand

gpexpand also needs some changes. The pgdata of the new segment is copied from the coordinator, but the configuration files about auto failover (postgresql-auto-failover.conf and postgresql-auto-failover-standby.conf) should be cleaned out.
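A sketch of the cleanup gpexpand would need; the function name is illustrative, while the file names are the ones mentioned above.

```python
# Sketch: after copying the coordinator's pgdata template to a new segment,
# strip the pg_auto_failover artifacts so the segment is not mistaken for a
# node managed by pg_auto_failover. The function name is hypothetical.
import os
import tempfile

AUTO_FAILOVER_FILES = (
    "postgresql-auto-failover.conf",
    "postgresql-auto-failover-standby.conf",
)

def scrub_auto_failover_config(pgdata: str) -> list:
    """Remove auto-failover config files copied from the coordinator."""
    removed = []
    for name in AUTO_FAILOVER_FILES:
        path = os.path.join(pgdata, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed

# Usage with a throwaway directory standing in for the copied pgdata:
with tempfile.TemporaryDirectory() as pgdata:
    for name in AUTO_FAILOVER_FILES:
        open(os.path.join(pgdata, name), "w").close()
    removed = scrub_auto_failover_config(pgdata)
assert sorted(removed) == sorted(AUTO_FAILOVER_FILES)
```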

gpactivatestandby

With auto failover, gpactivatestandby is not supposed to promote the standby. What will happen if we promote the standby manually? Will pg_auto_failover handle this mistake correctly? Maybe gpactivatestandby should first check whether auto failover is configured and raise an error if the standby is part of an auto failover setup.
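Such a guard might look like the sketch below. It is hypothetical: using the presence of postgresql-auto-failover.conf as the marker for "managed by pg_auto_failover" is an assumption, not established behavior.

```python
# Hypothetical guard for gpactivatestandby: refuse manual promotion when the
# standby's data directory looks pg_auto_failover-managed. Treating the
# presence of postgresql-auto-failover.conf as the marker is an assumption.
import os
import tempfile

def ensure_not_auto_failover(pgdata: str) -> None:
    marker = os.path.join(pgdata, "postgresql-auto-failover.conf")
    if os.path.exists(marker):
        raise RuntimeError(
            "standby is managed by pg_auto_failover; "
            "manual promotion via gpactivatestandby is not allowed")

# Usage: a plain standby passes; a managed one raises.
with tempfile.TemporaryDirectory() as pgdata:
    ensure_not_auto_failover(pgdata)  # no marker file: no error
    open(os.path.join(pgdata, "postgresql-auto-failover.conf"), "w").close()
    try:
        ensure_not_auto_failover(pgdata)
        raised = False
    except RuntimeError:
        raised = True
assert raised
```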

gpinitstandby

If the standby has been added to pg_auto_failover, `gpinitstandby -r` is a troublemaker: the old data directory, still configured for auto failover, may introduce risk.




The next step is to create more detailed stories if we decide to develop auto failover based on pg_auto_failover.

You are welcome to post your concerns or issues.


Regards,

Hao Wu



Hubert Zhang

Nov 19, 2020, 9:41:50 PM
to Hao Wu, Greenplum Developers, Asim Praveen, Simon Gao
Hi Hao,

Thanks for summarizing all the existing issues in integrating pg_auto_failover with Greenplum.

Hot standby in pg_auto_failover

Hot standby is required by pg_auto_failover, but Greenplum doesn't fully support hot standby. pg_auto_failover needs hot standby so that the monitor can connect to the standby for read-only queries. The monitor still has to make the connection in utility mode, which means clients can't connect to the standby for reads. Besides, hot standby is still transaction read-only; it disallows writes on the standby.

To clarify, the latest Greenplum supports enabling hot standby in utility mode, but not in dispatch mode. And for pg_auto_failover, it's enough to connect to the hot standby in utility mode. So there is no dependency on fully supporting the hot-standby feature, right?

Impact on utility tools

As far as I know, the affected utility tools include: gpinitsystem, gpstart, gpstop, gpactivatestandby, gpinitstandby, gpexpand.

One more thing to add: we need [new] tools to recover the failed coordinator, standby, or monitor when errors happen, something like gprecoverseg for primaries and mirrors.

Thanks,
Hubert

From: Hao Wu <ha...@vmware.com>
Sent: Thursday, November 19, 2020 8:43 PM
To: Greenplum Developers <gpdb...@greenplum.org>
Cc: Asim Praveen <pa...@vmware.com>; Simon Gao <sim...@vmware.com>

Subject: Re: Proposal: Master Auto Failover for Greenplum

Hao Wu

Nov 19, 2020, 10:14:58 PM
to Hubert Zhang, Greenplum Developers, Asim Praveen, Simon Gao
To clarify, the latest Greenplum supports enabling hot standby in utility mode, but not in dispatch mode. And for pg_auto_failover, it's enough to connect to the hot standby in utility mode. So there is no dependency on fully supporting the hot-standby feature, right?
Yes, you are right. The monitor and the local keeper process connect to the coordinator/standby in utility mode. Queries are only about the coordinator/standby itself, so dispatch is not needed. The current support for hot standby seems to be enough for pg_auto_failover.


One more thing to add: we need [new] tools to recover the failed coordinator, standby, or monitor when errors happen, something like gprecoverseg for primaries and mirrors.

Maybe not. If the cluster uses pg_auto_failover, using gpactivatestandby is not recommended. If the coordinator is down, we start it with the same command line, e.g. `pg_autoctl run --pgdata $pgdata`. When the old coordinator starts up as a coordinator, the monitor will order it to run pg_rewind/pg_basebackup and demote to standby.

Regards,
Hao Wu

From: Hubert Zhang <zhu...@vmware.com>
Sent: Friday, November 20, 2020 10:41 AM
To: Hao Wu <ha...@vmware.com>; Greenplum Developers <gpdb...@greenplum.org>