A proposal of master auto-failover


Gang Xiong

Oct 25, 2018, 5:29:09 AM
to gpdb...@greenplum.org
The current status of HA:
1. We can configure a master and a standby master, but a manual operation (gpactivatestandby) is required to activate the standby master when the master is down.
2. We can configure primary segments and mirror segments; an FTS process running on the master node checks the aliveness of all segments and performs a primary/mirror transition when a primary segment is considered down. But a manual operation (gprecoverseg) is required to bring the failed node back into the cluster.

Goal:
1. The master can automatically fail over to the standby master when it is not functioning.
2. The same mechanism can be used to handle both segment primary/mirror transitions and master auto-failover. (Nice to have)
3. The failed node (master or primary segment) can be brought back and rejoin the cluster automatically. (Not included in this proposal)

The proposal v1:
1. Deploy an etcd cluster alongside the GPDB cluster.
    etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. You can set, get, update, delete, and atomically compare-and-swap a value in etcd. You can also set a TTL (time to live) on a key; the key is deleted when the TTL expires.
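The two etcd features the proposal relies on, per-key TTL and compare-and-swap, can be sketched with a toy in-memory stand-in (all names here are illustrative; a real deployment would talk to etcd through a client library):

```python
import time

class ToyEtcd:
    """Minimal in-memory stand-in for etcd: get/put with TTL, plus CAS."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiration: the lease has run out
            return None
        return value

    def get(self, key):
        return self._live(key)

    def put(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl else None
        self._data[key] = (value, expires_at)

    def compare_and_swap(self, key, expected, new, ttl=None):
        """Atomically replace the value only if it currently equals `expected`."""
        if self._live(key) != expected:
            return False
        self.put(key, new, ttl)
        return True

store = ToyEtcd()
store.put("leader/-1", "dbid-1", ttl=0.05)   # master holds the lease on its leader key
assert store.get("leader/-1") == "dbid-1"
time.sleep(0.06)                             # master stops renewing; lease expires
assert store.get("leader/-1") is None
assert store.compare_and_swap("leader/-1", None, "dbid-2")  # standby claims the key
```

The CAS on the expired key is what lets exactly one follower win the takeover even if several notice the expiration at once.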

2. Each postmaster process starts a child process, EC (etcd client).
The master/standby and each primary/mirror segment form a pair; the two members of a pair share a content id, and each instance has its own dbid. For example, the master and standby's content id is -1, the master's dbid is 1, and the standby's dbid is 2. We call the active node of each pair the leader and its peer the follower.

etcd is started before the GPDB cluster. When the GPDB cluster starts, the master reads gp_segment_configuration and writes a leader KV (content id -> leader dbid) for each pair. Each EC watches its own leader key and is notified when the key expires.

The EC process runs a loop:
a. Check the aliveness of the postmaster process. If the postmaster is not functioning, shut down the instance.
b. Check the leader key on etcd; if etcd is unreachable, notify the postmaster to shut down.
    For the leader: if the value is its own dbid or empty, extend the lease; if it is some other value, the mirror has taken over, so notify the postmaster to shut down.
    For the follower: if the value is empty, set its own dbid and trigger a promotion.
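The decision in step b can be sketched as a pure function over the role and the leader key's current value (a sketch; the names and return values are illustrative, not from the proposal):

```python
def ec_step(role, leader_value, my_dbid):
    """Decide what the EC does after reading the leader key for its pair.

    role:         "leader" or "follower"
    leader_value: current value of the leader key, or None if empty/expired
    Returns one of "extend_lease", "shutdown", "promote", "wait".
    """
    if role == "leader":
        # Our own dbid (or an empty key) means we still own the lease.
        if leader_value in (None, my_dbid):
            return "extend_lease"
        # Some other dbid: the mirror has taken over, so step down.
        return "shutdown"
    # Follower: an empty key means the leader's lease expired.
    if leader_value is None:
        return "promote"
    return "wait"

assert ec_step("leader", 1, 1) == "extend_lease"
assert ec_step("leader", 2, 1) == "shutdown"
assert ec_step("follower", None, 2) == "promote"
assert ec_step("follower", 1, 2) == "wait"
```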

3. The leader's EC maintains a variable in shared memory that records the timestamp of the last time the EC talked to etcd (LTE). QD and QE must make sure that (current timestamp < LTE + TTL) before committing a transaction. This guarantees that no transaction is committed after the standby/mirror has been promoted.
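The commit-time guard amounts to a single comparison; a minimal sketch (the shared-memory plumbing is omitted, and the TTL value is just an example from the thread):

```python
import time

TTL = 30.0  # lease TTL in seconds (example value from the discussion)

def can_commit(last_talked_to_etcd, now=None):
    """True iff the lease cannot have expired yet, so no peer can
    have been promoted while this transaction was preparing to commit."""
    now = time.monotonic() if now is None else now
    return now < last_talked_to_etcd + TTL

lte = 100.0
assert can_commit(lte, now=100.0 + 29.9)       # still inside the lease window
assert not can_commit(lte, now=100.0 + 30.0)   # lease may have expired: abort
```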

4. The master's EC watches the leader keys of all segments and updates gp_segment_configuration when it is notified of a primary/mirror transition.

5. When the leader fails to send WAL to the follower through streaming replication, it writes an 'out of sync' status to etcd. The follower checks this status before promoting itself.

6. The follower should wait for some time before taking over, to ride out transient network glitches.

The defect of this proposal:
Ideally, we should trigger a segment transition when any one of the following happens:
1. The segment postmaster process is killed or is not responding.
2. The segment is unreachable from the master.
3. The segment is unreachable from other segments.
Proposal v1 only guarantees that the postmaster is alive and reachable from the etcd cluster, so it is built on this assumption: if node A is not reachable from node B, then it is not reachable from any other node in the cluster. FTS is built on the same assumption; if a segment is alive to the master but not reachable from other segments, FTS cannot detect it.
If this assumption doesn't hold, and it probably doesn't, we need to add a connectivity test in proposal v2.

The proposal v2 (add a connectivity test):
Each EC can send an echo packet to other ECs, and acknowledges with its content id when an echo packet is received.
scenario 1: a segment loses its connection to the master.
a. Check the aliveness of the postmaster process. If the postmaster is not functioning, shut down the instance.
b. The segment's EC probes the master and proceeds to step c when it receives an ACK; otherwise it times out and probes again. The leader notifies the postmaster to shut down when it learns that the follower has taken over.
c. Check the leader key on etcd; if etcd is unreachable, notify the postmaster to shut down.
    For the leader: if the value is its own dbid or empty, extend the lease; if it is some other value, the mirror has taken over, so notify the postmaster to shut down.
    For the follower: if the value is empty, set its own dbid and trigger a promotion.

scenario 2: a segment is connected to the master, but not to some of the other segments.
scenario 3: the master is connected to etcd but has lost connections to some of the segments, while the standby master is connected to all the segments.
We can add probes between segments in step b to cover these scenarios.

scenario 4: the master and one segment go down at the same time.
The mirror segment won't be able to promote itself because it can't connect to the old master. The standby master will promote itself and update the acting master in etcd. The mirror will then probe the new master and promote itself.

Conclusion:
Just as FTS arbitrates for the segments, we need a third party to arbitrate between the master and standby master and avoid split-brain when the network between them goes down, and this third party must itself be highly available. etcd is a good choice, and it is convenient to integrate with K8S. But introducing etcd is also risky: not only do we rely on etcd to provide reliability, we also need to deploy and maintain etcd carefully. We would probably need to include the etcd binary in the installer, distribute it to all the nodes, run and monitor it, and restart it on another node if one of the etcd instances goes down.
To test connectivity, ECs need to send heartbeats between nodes, which is the same job FTS does. So one other option is to use etcd for master/standby and keep using FTS for primary/mirror segments.

MPP Engine team.

Michael Schubert

Oct 25, 2018, 11:43:42 AM
to Gang Xiong, gpdb...@greenplum.org
Besides etcd, are there any other external components considered here? You could certainly do worse than etcd, although a discussion of alternatives to consider might be useful.
 

MPP Engine team.

kon...@gmail.com

Oct 26, 2018, 11:10:15 AM
to Greenplum Developers
+1

Having a common distributed key-value store for all GPDB features would be invaluable.

Here is a list of potential applications for etcd:

1) PXF needs to store and distribute configuration files across multiple GPDB hosts.

2) GPText uses Apache Zookeeper for distributed configuration management.

Also, PXF needs HA capability in order to provide reliable and scalable access to external data. PXF could certainly leverage etcd or an equivalent solution (Zookeeper, or maybe Apache Geode/GemFire).

Thanks,
Kong




Xin Zhang

Oct 26, 2018, 11:45:10 AM
to kon...@gmail.com, gpdb...@greenplum.org
Thanks a lot, Kong and Gang.

Yeah, let's schedule a meeting to see what we can do for the common parts on the server side.

I will send out a meeting request. I am looking at Tuesday night here at PA and Wed morning at BJ.

Thanks a lot,
Shin
--
Shin
Pivotal | Sr. Principal Software Eng, Data R&D

Asim R P

Oct 26, 2018, 1:34:33 PM
to xiong gang, gpdb...@greenplum.org
On Thu, Oct 25, 2018 at 2:29 AM Gang Xiong <gxi...@pivotal.io> wrote:
>
> The current status of HA:
> 1. we can configure a master and a standby master, but it requires the manual operation (gpactivatestandby) to activate the standby master when the master is down.

And what's worse, the database provides no guarantee to a user that
the standby was in-sync when the master failed. The ability to
determine whether the standby is worthy of promotion is the most
severe lacuna of current HA.

> 2. we can configure primary segments and mirror segments, an FTS process running on the master node checks the aliveness of all segments and performs primary/mirror transition when the primary segment is considered down. But it requires the manual operation (gprecoveryseg) to bring the down node back to cluster.

A detail worth noting is FTS probes only the primaries and relies on a
primary to keep state of its peer. In one way, this adds an element
of robustness - if a primary has lost connection to its mirror (even
when, hypothetically, FTS could connect to the mirror), the mirror is
treated as down.

>
> Goal:
> 1. master can auto failover to the standby master when it's not functioning.
> 2. can use one same way to handle segment primary/mirror transition and master auto failover. (Nice to have)
> 3. can automatically bring the failed node(master or primary segment) back and rejoin the cluster. (Not included in this proposal)

That's a very clear articulation of goals, thank you! Let's add the
ability to detect synchronization state of standby when master is not
available explicitly to point 1 above.

> The proposal v1:
> 1. Deploy an etcd cluster in the GPDB cluster.

+1. I've not used etcd before, cannot wait to be more familiar with it.

>
> 2. Each postmaster process starts a child process EC (etcd client).
> The master/standby and the primary/mirror segment are a pair, each pair has the same content id, and each instance has its own dbid. For example, master and standby's content id is -1, and master's dbid is 1, standby's dbid is 2. We call the active node of each pair the leader and call its peer the follower.
>
> etcd is started before GPDB cluster. When the GPDB cluster starts, the master reads gp_segment_configuration and write a leader KV (content id/leader dbid) for each pair. All EC watches its own leader key, and it will get notified on the key expiration.

Thinking of only master-standby pair being managed by etcd to begin
with, why keep master and standby status in gp_segment_configuration?
Can't etcd's key-value store be the only source of truth?

> EC process is a loop:
>
> 6. The follower should wait sometime before take over to endure network glitch.

Can you please elaborate this point?

>
> The defect of this proposal:
> Ideally, we should trigger a segment transition when any one of the following happens:
> 1. The segment postmaster process is killed or is not responding.
> 2. The segment is unreachable from the master.
> 3. The segment is unreachable from other segments.

In addition, even if a postmaster is reachable and alive, if its data
directory cannot be written to (e.g. disk full or I/O error), it is
effectively down. FTS probe handler currently checks this by writing
a page to disk before responding to a probe. FTS uses libpq messages
for probing. This means that if a postmaster cannot spawn a backend
process to handle a probe request, it is considered down.

>
> Conclusion:
> In order to test the connectivity, we need EC to be able to send heartbeat between each node, it's the same as FTS. So one other option is to use etcd for master/standby and keep using FTS for primary/mirror segments.

How about starting with etcd for master/standby and leaving
primary/mirror for FTS as first step?

Asim

Jim Doty

Oct 26, 2018, 3:02:50 PM
to Asim Praveen, Gang Xiong, gpdb...@greenplum.org, GPText, Bharath Sitaraman
On Fri, Oct 26, 2018 at 11:34 AM Asim R P <apra...@pivotal.io> wrote:

> > The proposal v1:
> > 1. Deploy an etcd cluster in the GPDB cluster.
>
> +1.  I've not used etcd before, cannot wait to be more familiar with it.

I would be interested to hear the GPText team's experiences with Zookeeper as a runtime dependency.

Would you recommend looking into Zookeeper, or is it a bad fit for this use case?

Would there be any reasons not to recommend etcd for what is being discussed here?

Would you be in a position to leverage etcd if it came with GPDB[1]?

Do you have stories from customers that saw a failure in GPDB/GPText where the root cause was the Zookeeper process? Were those problems hard to diagnose or resolve? Does the team have thoughts on how much this extra distributed system adds to operational complexity? I am hoping that, regardless of the technology selected, we can head off a difficult operational experience for customers by leveraging GPText's experience.

[1] Would something like https://coreos.com/blog/introducing-zetcd be helpful?

Ivan Novick

Oct 26, 2018, 3:14:54 PM
to James Doty, Asim Praveen, xiong gang, gpdb...@greenplum.org, GPT...@pivotal.io, Bharath Sitaraman
Jim,

GPText is separate Pivotal software and not part of Greenplum Database. It's not relevant to this conversation.

Cheers,
Ivan
--
Ivan Novick, Product Manager Pivotal Greenplum

Gang Xiong

Oct 29, 2018, 5:23:21 AM
to Asim Praveen, gpdb...@greenplum.org
Thanks, Asim.

It would be best if we didn't persist anything in etcd; then we could restart etcd anywhere if it crashed and was unrecoverable. But then we would have to store the out-of-sync information in the segments when the standby is disconnected from the master, which would be tricky.
So I guess you are right: we trust etcd and use it as the only source of truth, and then we should retire gp_segment_configuration.

 
> EC process is a loop:
>
> 6. The follower should wait sometime before take over to endure network glitch.

Can you please elaborate this point?
 For example, if we set the TTL to 30 seconds, the leader may fail to extend the lease
 because it is too busy or the network is unstable, and the follower will notice and start the transition.
The idea is to let the follower wait a little longer before triggering a promotion.


>
> The defect of this proposal:
> Ideally, we should trigger a segment transition when any one of the following happens:
> 1. The segment postmaster process is killed or is not responding.
> 2. The segment is unreachable from the master.
> 3. The segment is unreachable from other segments.

In addition, even if a postmaster is reachable and alive, if its data
directory cannot be written to (e.g. disk full or I/O error), it is
effectively down.  FTS probe handler currently checks this by writing
a page to disk before responding to a probe.  FTS uses libpq messages
for probing.  This means that if a postmaster cannot spawn a backend
process to handle a probe request, it is considered down.
 
  I didn't know that. Thanks for the information. I think we should add more
  checks in "step a", in addition to checking that the postmaster process is alive.

 
>
> Conclusion:
> In order to test the connectivity, we need EC to be able to send heartbeat between each node, it's the same as FTS. So one other option is to use etcd for master/standby and keep using FTS for primary/mirror segments.

How about starting with etcd for master/standby and leaving
primary/mirror for FTS as first step?
   Sounds good.
 

Asim

Gang Xiong

Oct 29, 2018, 5:30:59 AM
to Michael Schubert, gpdb...@greenplum.org
Thanks. Both etcd and Zookeeper fit our requirements. Choosing between them is more like choosing a programming language (Go/C or Java) and a community (Apache or K8S). There are some other alternatives, Consul and so on, which we are going to take a look at.

Gang Xiong

Oct 30, 2018, 6:31:05 AM
to Greenplum Developers
I got some feedback, and some people expressed interest in an option that doesn't require 3rd-party software.
I can think of the following way to use the segments to decide the leader master, but I haven't figured out an easy way to cover all the corner cases.

Proposal v3 (without etcd)
1. Keep using FTS for segment management.
2. The standby:
    sends heartbeats to the master. Once a heartbeat goes unanswered, the standby does a health check and sends a failover request to all primary segments.
    There are 4 kinds of response:
    a. M means vote for the master.
    b. S means vote for the standby.
    c. O means the master is out of sync with the standby.
    d. A timeout means this primary is probably down.
    If the standby receives O, it shuts down. If it gets S from 2/3 of the primaries, it starts the failover. If it gets a timeout, it adds the mirror of that segment to the probe list. Otherwise, it waits a while and sends the request again.
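The standby's decision rule can be sketched as a small tally function (a sketch with illustrative names; only the 2/3 threshold and the four response kinds come from the proposal):

```python
def standby_decision(responses, num_primaries):
    """Decide the standby's next action from the failover-request responses.

    responses: list of 'M' (vote master), 'S' (vote standby),
               'O' (master out of sync with standby), 'T' (timeout)
    Returns 'shutdown', 'failover', 'probe_mirrors', or 'retry'.
    """
    if 'O' in responses:
        return 'shutdown'        # standby is out of sync: it must never promote
    if responses.count('S') >= (2 * num_primaries) / 3:
        return 'failover'        # 2/3 of the primaries vote for the standby
    if 'T' in responses:
        return 'probe_mirrors'   # a primary is likely down: probe its mirror too
    return 'retry'               # inconclusive: wait and ask again

assert standby_decision(['S', 'S', 'M'], 3) == 'failover'
assert standby_decision(['S', 'O', 'S'], 3) == 'shutdown'
assert standby_decision(['M', 'T', 'S'], 3) == 'probe_mirrors'
assert standby_decision(['M', 'M', 'S'], 3) == 'retry'
```

Note that the out-of-sync check comes first: even a winning vote must not promote a stale standby.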

    Segments:
    Mirror segments don't respond to failover requests.
    Primary segments maintain a variable 'standby_out_of_sync'. When a primary segment receives a failover request, it responds O if 'standby_out_of_sync' is true. Otherwise, it probes the master and responds M if the master acknowledges, or else S.

    The master:
    If the master is not able to send WAL to the standby, it marks the standby out of sync. Before writing the next WAL record, the QD (query dispatcher) sends a request to the segments, and the segments set 'standby_out_of_sync' to true. This flag is also passed to the mirror when the mirror is promoted.

3. Once the standby has decided to promote itself, it will:
    a. promote itself, including terminating the walreceiver process and starting the FTS process.
    b. the FTS process makes sure there are enough active segments.
    c. send a request to all primary segments, and the primary segments set 'gp_current_master' to the standby.
    d. accept new connections.

    When a segment postmaster receives a connection request, or a QE (query executor) receives dispatch information from a QD, and the sender is not the same as 'gp_current_master', the postmaster or QE should reject the request and error out.


The problems:
1. The cluster is unavailable when the master and more than 1/3 of the segments are down at the same time. In this situation, the master can't do a primary/mirror transition, and the standby can't get 2/3 of the votes.
2. Split-brain happens when this kind of network partition occurs:
Master      Standby
P1   P2   P3   P4   P5   P6
M1   M2   M3   M4   M5   M6
The master is connected to primary 1 and mirrors 2, 3, 4, 5, 6. The standby is connected to primaries 2, 3, 4, 5, 6 and mirror 1.
The standby can get enough votes to start a failover and promotes mirror 1, while the master promotes mirrors 2, 3, 4, 5, 6 and forms another cluster.




On Tue, Oct 30, 2018 at 10:12 AM Gang Xiong <gxi...@pivotal.io> wrote:
Good to know. Thanks Goutam and Ivan.

On Tue, Oct 30, 2018 at 2:24 AM Ivan Novick <ino...@pivotal.io> wrote:
I was advised by Cloud Foundry engineering that they have had data loss with Consul, and to avoid it.

On Mon, Oct 29, 2018 at 8:55 AM Goutam Tadi <gt...@pivotal.io> wrote:
Hi Gang,
FYI Regarding Consul:
There has been a document since June 2017 that explains the pain points of Consul and the reasons why Cloud Foundry is moving away from it.

I don't know if the same issues can happen to Greenplum also.

Thanks,
Goutam Tadi

Since this document is internal, I limited recipients of this reply to internal folks only.

Scott Kahler

Oct 30, 2018, 12:00:28 PM
to xiong gang, gpdb...@greenplum.org
Automatically failing over the master doesn't matter if the end user clients and all utilities don't know to connect to the "new" master. Any promotion or change in state would need to be made externally visible so that the corresponding host knows to take control of the IP, DNS needs to modify to point to the new host. It would be exceptionally nice to have some sort of facility to execute scripts in location X if the system failover happens.
--

Scott Kahler | Pivotal, Greenplum Product Management  | ska...@pivotal.io | 816.237.0610

Michael Schubert

Oct 30, 2018, 12:10:12 PM
to Scott Kahler, Gang Xiong, gpdb...@greenplum.org
On Tue, Oct 30, 2018 at 12:00 PM Scott Kahler <ska...@pivotal.io> wrote:
Automatically failing over the master doesn't matter if the end user clients and all utilities don't know to connect to the "new" master. Any promotion or change in state would need to be made externally visible so that the corresponding host knows to take control of the IP, DNS needs to modify to point to the new host. It would be exceptionally nice to have some sort of facility to execute scripts in location X if the system failover happens.

The usual suspects for handling that (assuming no existing external LB) are pgbouncer, pgpool or generically haproxy isn't it? Any others?

Scott Kahler

Oct 30, 2018, 1:06:00 PM
to Michael Schubert, xiong gang, gpdb...@greenplum.org
My feeling is that the number of people using GPDB + pgbouncer/pgpool is relatively few and the number that are using it on a system that is not master is even less.

I agree those are avenues. Just straight up load balancer solutions such as a BigIP could be in use too. Also detection via an event system like Nagios/Zabbix are all that some people have, which currently alerts them to do the manual process.

Andreas Scherbaum

Oct 30, 2018, 1:59:26 PM
to Michael Schubert, Scott Kahler, gxi...@pivotal.io, gpdb...@greenplum.org
On Tue, Oct 30, 2018 at 4:10 PM Michael Schubert <msch...@pivotal.io> wrote:
On Tue, Oct 30, 2018 at 12:00 PM Scott Kahler <ska...@pivotal.io> wrote:
Automatically failing over the master doesn't matter if the end user clients and all utilities don't know to connect to the "new" master. Any promotion or change in state would need to be made externally visible so that the corresponding host knows to take control of the IP, DNS needs to modify to point to the new host. It would be exceptionally nice to have some sort of facility to execute scripts in location X if the system failover happens.

The usual suspects for handling that (assuming no existing external LB) are pgbouncer, pgpool or generically haproxy isn't it? Any others?

They all require a third host, which in turn must be HA by itself. That adds more problems and dependencies.

You can also fail over a virtual IP between the master and standby master, but this requires at least two independent connections between master and standby, to exclude network problems.

--

Andreas Scherbaum

Principal Software Engineer

GoPivotal Deutschland GmbH


Hauptverwaltung und Sitz: Am Kronberger Hang 2a, 65824 Schwalbach/Ts., Deutschland

Amtsgericht Königstein im Taunus, HRB 8433

Geschäftsführer: Andrew Michael Cohen, Paul Thomas Dacier

Paul Guo

Oct 31, 2018, 12:33:23 AM
to asche...@pivotal.io, msch...@pivotal.io, ska...@pivotal.io, gxi...@pivotal.io, gpdb...@greenplum.org
It seems that the JDBC driver already has support for that:

jdbc:postgresql://host1:port1,host2:port2/database

If other drivers/clients do not support that, the application itself, proxy-like software, or a VIP (some of which were mentioned) could do it.

As for auto-failover, if we want a real one, I think we will still need consensus services based on standard algorithms (etcd, ZK, whatever) in the short term. It seems that if we use such a standard consensus service, we do not need to make invasive code changes in Greenplum, but could instead provide an independent component, if I understand correctly. We could design APIs to accommodate various consensus services, assuming the logic is similar.




Andreas Scherbaum <asche...@pivotal.io> 于2018年10月31日周三 上午1:59写道:
--
You received this message because you are subscribed to the Google Groups "Greenplum Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-dev+u...@greenplum.org.

Asim R P

Oct 31, 2018, 2:41:14 PM
to xiong gang, gpdb...@greenplum.org
On Tue, Oct 30, 2018 at 3:31 AM Gang Xiong <gxi...@pivotal.io> wrote:
>
> I got some feedback and someone showed their interest in the option that doesn't require 3rd-party software.
> I can think of this way to use segments decide the leader master, but I haven't figure out an easy way to cover all the corner cases.
>
> Proposal v3 (without etcd)

+1 to leverage segments rather than involving a third party.

At a minimum, the state of the master/standby pair must be maintained by
one other segment. The challenge is when the master and this segment go
down at the same time. But isn't this the same as the master as well as
the etcd service going down in your previous solution?

Asim

Ning Yu

Oct 31, 2018, 10:44:16 PM
to Asim R P, xiong gang, gpdb...@greenplum.org
If the etcd service is down, all the segments (including the master and standbys) should kill themselves, so split-brain will not happen.
 

Asim



Paul Guo

Nov 1, 2018, 1:52:25 AM
to Ning Yu, Asim Praveen, gxi...@pivotal.io, gpdb...@greenplum.org
I do not think we need to care that much about the etcd service going down. A service like etcd would actually be configured with HA in a production environment and should thus be able to endure the loss of 1 or more nodes.

If the etcd service is unfortunately "down", then either:

1) the number of down nodes exceeds the quorum and those nodes cannot be automatically recovered, or
2) there is a network partition between the etcd nodes and the Greenplum nodes.

For both of these rare cases, maybe just disable master auto-failover temporarily but keep Greenplum functioning, while waiting for a fix.

Ning Yu <n...@pivotal.io> 于2018年11月1日周四 上午10:44写道:

Asim R P

Nov 1, 2018, 12:00:11 PM
to Paul Guo, Ning Yu, xiong gang, gpdb...@greenplum.org
On Wed, Oct 31, 2018 at 10:52 PM Paul Guo <pau...@gmail.com> wrote:
>
> I do not think we need to care that much about etcd service down. etcd kind of service should be actually configured with HA in production environment and should thus be able to endure loss of 1 or more nodes.

In the same vein, is it worth caring about the case where the master as well
as a primary segment go down at the same time? Let's say we delegate
the responsibility of keeping the master/standby state to seg0, and seg0
and the master both go down. In that case, the cluster is not accessible.
Can this be seen as similar to a double-fault situation where a primary and
its mirror both go down?

Daniel Gustafsson

Nov 1, 2018, 3:20:24 PM
to Gang Xiong, gpdb...@greenplum.org
> On 25 Oct 2018, at 11:28, Gang Xiong <gxi...@pivotal.io> wrote:

> Conclusion:
> Like FTS to the segments, we need a third party to arbitrate between master and standby master and avoid split-brain when their network goes down.

For the record, for those of us who haven't been part of doing this research:
which existing solutions for PostgreSQL HA (like Patroni etc.) have been
evaluated, and what are their drawbacks that justify inventing our own
rather than extending/using something existing?

cheers ./daniel

Gang Xiong

Nov 7, 2018, 5:08:06 AM
to gpdb...@greenplum.org
After discussing with a lot of people, we prefer not to introduce 3rd-party
software like etcd. Instead, we are planning to use segments to do the election
between the master and standby, with a raft-like leader election algorithm.
Here are the details:

1. Designate several segments, which both the master and standby agree on, as
the followers. For example, the 3 primary segments with the smallest dbids on
distinct hosts.

2. The master and standby are the candidates.

3. The candidates send election requests or heartbeats to the followers. The
followers vote on election requests and respond with ACK to heartbeat messages.

4. Both the followers and the candidates have 2 states: E (election) and
H (heartbeating).
* A follower in state E votes for the requester when it gets an election request,
  and goes to state H after voting.
* A follower in state H starts a timer, TimerA; it sends an ACK and resets TimerA
  when receiving a heartbeat, and goes to state E when TimerA times out.

* A candidate in state E sends an election request to all the followers and
  starts a timer, TimerB. If it doesn't get a majority of votes (more than half)
  before TimerB times out, it sleeps for a random time and starts a new election.
  Otherwise, it goes to state H.
* A candidate in state H periodically sends heartbeats to all the followers,
  resets TimerB when sending a heartbeat, and goes to state E if it doesn't get a
  majority of ACKs by the time TimerB times out.

5. The leader.
A candidate becomes the leader when it goes to state H. If the candidate is in
standby mode, it starts a promotion after becoming the leader.

6. Timers.
Suppose the heartbeat cycle is T.

TimerA is 2*T. It defines the period during which a follower won't vote for a
different candidate. For example, if a follower votes at t0, it won't respond to
a new election request until (t0+2*T).

TimerB is t. It defines the time limit between request and response, which must be
bigger than the RTT (round-trip time). We should have t < T.

This setting guarantees that the leader goes back to state E before any follower
goes from H to E, which avoids split-brain.

For example, a candidate starts an election at t0 and becomes the leader before
(t0+t), but its heartbeat doesn't reach the majority of followers, so the leader
goes back to state E before (t0+t+t). The majority of followers voted and went to
state H after t0, and they won't accept a new election request before (t0+2*T).

Another example: the leader sends a heartbeat to the majority of followers at t0,
gets a majority of ACKs, and sends the next heartbeat at (t0+T), but this
heartbeat doesn't reach the majority of followers, so the leader goes to state E
at (t0+T+t). The majority of followers got the first heartbeat after t0 and reset
TimerA, so they won't go to state E before (t0+2*T).
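In both examples the no-split-brain argument reduces to one inequality: the deposed leader reaches state E before any follower's vote lock (TimerA = 2*T) expires. A quick numeric check of the bounds, with assumed example values for T and t:

```python
T = 10.0   # heartbeat cycle (assumed example value)
t = 3.0    # TimerB: request/response time limit; the proposal requires t < T
assert t < T

t0 = 0.0
timer_a = 2 * T   # a follower won't vote for a different candidate for 2*T

# Example 1: election won before t0+t, but subsequent heartbeats fail.
leader_steps_down = t0 + t + t       # leader is back in state E by this time
followers_unlock  = t0 + timer_a     # earliest a follower can vote again
assert leader_steps_down < followers_unlock

# Example 2: last majority heartbeat at t0; the next one (sent at t0+T) is lost.
leader_steps_down = t0 + T + t
followers_unlock  = t0 + timer_a     # TimerA was reset by the t0 heartbeat
assert leader_steps_down < followers_unlock   # holds whenever t < T
```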

7. The leader is out of sync with the peer.
The leader synchronously replicates WAL to its peer; when the peer is
disconnected, the backend processes are blocked until the out-of-sync
information has been written to the majority of followers.
The out-of-sync information is sent to the followers together with a heartbeat;
the followers persist it to disk and send it to the peer in the response to an
election request. The peer stops sending election requests after receiving at
least 1 out-of-sync notification.

8. Adding followers.
The algorithm can tolerate several followers going down. For example, if the cluster
started with 5 followers and 2 of them go down later, the cluster can still work.
The leader can add followers while it is connected to the peer. The steps are:
a) the leader initializes a follower in state H and sends heartbeats to it.
b) the leader tells the peer about the new follower. The peer sends an ACK and starts
including this follower in its elections.
c) the leader gets the ACK and includes the follower in its elections.

The leader can add at most 1 follower at a time, so there is at most a gap of 1
between the leader's and the peer's follower sets, which won't affect the majority result.


Thanks.

Paul Guo

Nov 11, 2018, 7:48:28 PM
to gxi...@pivotal.io, gpdb...@greenplum.org
I read the goal in the first email again and thought a bit more on this. Currently just one standby master has the possibility to be promoted as master, so if you allow just 1 node down at the "same" time in a group of 3 nodes, it appears that the algorithm could be simpler without leader election.
 
For example, you configure a segment mirror as the leader for the decision of switch of master and standby master. Let's assume just 1 node of { the segment mirror, master, master mirror } is database level down. Given the segment mirror routinely monitors the status of master & standby mater.

1) If any of master & standby master is down (either due to real down, or just network partition between master & standby master), query is stalled. Any of the alive nodes will ask segment mirror about the status to decide auto failover or not. If there is network partition between master & standby master only, we can not failover before some steps (maybe intervention from operations, or maybe auto-populate a new standby master node and add it) - this is an issue for both this algorithm and previous leader election algorithm.

2) If the segment mirror is down, gpdb works fine, but we could probably set a segment mirror list and configure it in gpdb cluster in advance, master & standby master could update to use a new leader node using a simple UDF - that switch will be fast.

If one of master & standby master goes down at the same time as the segment mirror, that will be a rare case, given that we can quickly switch to a new leader (see 2 above),
so it seems that the algorithm could be a good balance of engineering effort, user expectation, and user experience.

I said "segment mirror" since a segment mirror has less load than a primary segment. In a real production environment it could be configured on a proper node (e.g. if the cluster runs on VMs or in containers, then the instance should be on a different physical machine from the one that hosts the master & standby master).

If we have more masters/standby masters and thus need to select which one should be promoted, or we really want to tolerate more than 1 down node, then we need a proven consensus algorithm.



Gang Xiong <gxi...@pivotal.io> wrote on Wed, Nov 7, 2018 at 6:08 PM:

Stanley Sung

unread,
Nov 12, 2018, 9:52:22 AM11/12/18
to Greenplum Developers, gxi...@pivotal.io

Asim R P

unread,
Nov 12, 2018, 12:07:32 PM11/12/18
to Paul Guo, xiong gang, gpdb...@greenplum.org
On Sun, Nov 11, 2018 at 4:48 PM Paul Guo <pau...@gmail.com> wrote:
>
> I read the goal in the first email again and thought a bit more on this. Currently just one standby master has the possibility to be promoted as master, so if you allow just 1 node down at the "same" time in a group of 3 nodes, it appears that the algorithm could be simpler without leader election.

+1 to simpler algorithms.

>
> For example, you configure a segment mirror as the leader for the decision of switch of master and standby master. Let's assume just 1 node of { the segment mirror, master, master mirror } is database level down. Given the segment mirror routinely monitors the status of master & standby master.
>
> 1) If any of master & standby master is down (either due to real down, or just network partition between master & standby master), query is stalled. Any of the alive nodes will ask segment mirror about the status to decide auto failover or not. If there is network partition between master & standby master only, we can not failover before some steps (maybe intervention from operations, or maybe auto-populate a new standby master node and add it) - this is an issue for both this algorithm and previous leader election algorithm.
>

Why cannot master-standby pair be treated similar to a primary-mirror
pair in this case? FTS considers a mirror as down if the primary
thinks it's down. FTS doesn't bother connecting to the mirror
directly. Similarly, the segment designated to keep master-standby
status may consider the standby as down if master cannot connect to
it.

> 2) If the segment mirror is down, gpdb works fine, but we could probably set a segment mirror list and configure it in gpdb cluster in advance, master & standby master could update to use a new leader node using a simple UDF - that switch will be fast.
>
> If any of master & standby master and the segment mirror is down at the same time, that will be a rare case given we could fast switch to a new leader (see 2 above)
> so it seems that the algorithm could be a good balance of eng effort, user expectation and user experience.

This rare case of master and the designated segment going down at the
same time should be considered as a second step. Monitoring
master-standby pair from a segment is itself a great improvement from
what we have today.

>
> I said "segment mirror" since segment mirror has less load than segment. In real production environment it could be configured on a proper node (e.g. if the cluster is run on a vm or in a container then the instance should be on another physical machine than that has master & standby master).
>

Using a primary segment seems better because the state kept by a
primary will be replicated by WAL replication to the mirror, as long
as the state changes are logged in WAL.

Asim

Asim R P

unread,
Nov 12, 2018, 12:14:29 PM11/12/18
to xiong gang, gpdb...@greenplum.org
On Wed, Nov 7, 2018 at 2:08 AM Gang Xiong <gxi...@pivotal.io> wrote:
>
> After discussing with a lot of people, we prefer not introducing a 3rd-party
> software like etcd.

Is it because of the implications for initializing the cluster, starting and
stopping it, packaging the 3rd-party software, etc.? Given that a
third-party master-standby monitoring solution for PostgreSQL may already
be deployed, having the capability to integrate with it seems a
valuable feature addition for Greenplum. That is in addition to a
natively implemented monitoring solution.

Asim

Michael Schubert

unread,
Nov 12, 2018, 12:17:24 PM11/12/18
to Stanley Sung, gpdb...@greenplum.org, Gang Xiong
On Mon, Nov 12, 2018 at 9:52 AM Stanley Sung <ys...@pivotal.io> wrote:

That's GPLv2. At the very least we'd prefer things that are Apache 2.0 licensed (same as GPDB) or PostgreSQL/MIT/BSD licensed. While it'd be a standalone binary it's just one more thing to have to deal with and if there are reasonable alternatives with better licenses, that'd be preferred.
 




Gang Xiong

unread,
Nov 13, 2018, 11:40:35 PM11/13/18
to Paul Guo, Asim Praveen, gpdb...@greenplum.org
Hi Paul and Asim,

I explored a little bit more on your idea of using one single segment (for example, segment 0) as the arbiter, which is similar to using FTS to arbitrate primary/mirror segments. The solution is practicable and much easier, except it cannot tolerate any 2 of the 3 nodes (arbiter segment, master, and standby) failing at the same time.

TL;DR
The process:
1. The arbiter segment regularly sends heartbeats to the master and standby. When it detects the master is down, it promotes the standby.
2. When the master finds the standby is not responding, it stops all transactions, then writes the standby's out-of-sync status to the arbiter segment and its mirror. If the status is written successfully, the master resumes the transactions. If the arbiter says 'the standby has already taken over', the master shuts down.
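
The two steps above can be sketched as follows. This is a simplified model, not the proposed implementation: the class names, return values, and in-memory state are all made up, and real heartbeats and status writes would go over libpq and be persisted on the arbiter and its mirror.

```python
# Toy model of the arbiter's two duties: promote the standby when the
# master's heartbeat times out, and record (or refuse to record) the
# standby's out-of-sync status reported by the master.

class Standby:
    def __init__(self):
        self.is_master = False

    def promote(self):
        self.is_master = True

class Arbiter:
    def __init__(self):
        self.standby_promoted = False
        self.standby_out_of_sync = False

    def on_master_heartbeat_timeout(self, standby):
        # step 1: the master looks down, so promote the standby
        standby.promote()
        self.standby_promoted = True

    def record_standby_out_of_sync(self):
        # step 2: the master reports the standby unresponsive; refuse
        # the write if the standby has already taken over, so the old
        # master learns it must shut down instead of continuing alone
        if self.standby_promoted:
            return "standby-has-taken-over"
        self.standby_out_of_sync = True
        return "ok"
```

The key property is that the same arbiter state serializes the two outcomes: the master either gets its out-of-sync write accepted, or learns the standby was promoted, never both.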

Failure mode:
1. The master is down.
The arbiter segment detects it and promotes the standby.

2. The arbiter segment detects the master as down, but the master is actually alive.
The arbiter promotes the standby. After the standby is promoted, the master loses the connection to the walreceiver, so it talks to the arbiter and finds the standby has taken over.

3. The standby is down.
The master loses the connection to the walreceiver, so it writes the out-of-sync status to the arbiter and resumes transactions.

4. The arbiter segment is down.
The master's FTS detects it and negotiates with the standby to assign the arbiter role to a new segment.

5. Master and standby down at the same time
We don't have any solution to handle this kind of double failure.

6. Master and arbiter segment down at the same time
The master fails first and arbiter fails before promoting the standby, or the arbiter fails first and the master fails before starting a new arbiter. In both cases, the system ends up with now valid master.

7. Standby and arbiter segment down at the same time
The standby fails first and the arbiter fails before the master writes the out-of-sync information, or the arbiter fails first and the standby fails before a new arbiter is assigned. In both cases, the master can't continue its work. It requires manual intervention.

Thanks.

Xin Zhang

unread,
Nov 14, 2018, 10:33:06 AM11/14/18
to xiong gang, Paul Guo, Asim Praveen, gpdb...@greenplum.org
Small typo:

6. Master and arbiter segment down at the same time
The master fails first and arbiter fails before promoting the standby, or the arbiter fails first and the master fails before starting a new arbiter. In both cases, the system ends up with NO valid master.

Maybe we can limit the cases to just a single point of failure. To make it tolerate a higher number of failures, we need to extend the cluster HA model further, which might be beyond the discussion of just master failover.

Thanks,
Shin

--

Gang Xiong

unread,
Nov 15, 2018, 10:51:32 PM11/15/18
to Xin Zhang, Daniel Gustafsson, Paul Guo, Asim Praveen, gpdb...@greenplum.org
Thanks all for your inputs.
Some thinking about the solutions:
1. Daniel mentioned PostgreSQL solutions like Patroni. I tried Patroni and I think it could be a good option for Greenplum users too. It requires deploying Patroni as well as etcd/consul/zookeeper, and I think we are more likely to build a native solution as a supplement to FTS for master/standby.
2. Build around a 3rd-party software like etcd. The upside is that we can align the failover of primary/mirror segments and master/standby. But if we only want to handle master/standby, it doesn't seem worth introducing a 3rd-party software.
3. The raft-like solution and the arbiter segment solution are basically the same. The difference is that a solo arbiter segment is less fault-tolerant, but it's much simpler.
I lean toward the arbiter segment solution.

Thanks.

Asim R P

unread,
Nov 16, 2018, 3:52:54 PM11/16/18
to xiong gang, Paul Guo, gpdb...@greenplum.org
On Tue, Nov 13, 2018 at 8:40 PM Gang Xiong <gxi...@pivotal.io> wrote:
>
> I explored a little bit more on your idea of using one single segment (for example, segment 0) as the arbiter, which is similar to using FTS to arbitrate primary/mirror segments. The solution is practicable and much easier, except it cannot tolerate any 2 of the 3 nodes (arbiter segment, master, and standby) failing at the same time.
>

Progressing in simple steps sounds good, +1. As it stands today,
Greenplum is effectively not tolerant to master node failure. With
the proposed simple step, master node failure will be tolerated.
That's a significant improvement.

> TL;DR
> The process:

Thank you for jotting down all the scenarios.

> 2. When the master finds the standby is not responding, it stops all transactions, then writes the standby's out-of-sync status to the arbiter segment and its mirror. If the status is written successfully, the master resumes the transactions. If the arbiter says 'the standby has already taken over', the master shuts down.

Alternatively, master can wait until the next heartbeat from the
arbiter and in response, indicates that the standby is down. That
eliminates the need for master/standby to know the arbiter. FTS
follows this model too. When a mirror goes down, primary indicates
'mirror down' in response to the next probe from FTS daemon.
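
Asim's alternative can be sketched like this: the arbiter never connects to the standby; the master piggybacks the standby's state on its reply to the arbiter's probe, the way a primary reports 'mirror down' to FTS today. The message shapes below are hypothetical.

```python
# Probe-response model: the arbiter learns standby state only from the
# master's answer to its heartbeat, never by contacting the standby.

def master_probe_response(walreceiver_connected):
    # The master answers the arbiter's probe, reporting the standby's
    # state as observed through its WAL replication connection.
    return {"role": "master", "standby_up": walreceiver_connected}

def arbiter_handle_response(resp, config):
    # The arbiter updates its view of the standby from the response,
    # so master and standby never need to know who the arbiter is.
    if resp["standby_up"]:
        config["standby_status"] = "up/in-sync"
    else:
        config["standby_status"] = "down/out-of-sync"
    return config
```

This removes the write path from the master to the arbiter described earlier; the same information flows in the other direction, attached to the probe reply.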

>
> Failure mode:
> 1. The master is down.
> Arbiter segment detects it and promotes standby.
>
> 2. Arbiter segment detects the master down, but the master is actually alive.
> Arbiter segment detects it and promotes standby. After standby is promoted, the master lost the connection to the walreceiver, so it talks to the arbiter and finds the standby has taken over.
>
> 3. The standby is down
> Master lost connection to the walreceiver, so it writes out-of-sync to the arbiter and resume transactions.
>
> 4. The arbiter segment is down
> Master FTS detects it and negotiates with standby to assign the arbiter role to a new segment.
>

If the sync status is recorded in a table, it will be replicated to
the mirror of the arbiter segment. As soon as the mirror is promoted,
it can start acting as the new arbiter.

>
> 6. Standby and arbiter segment down at the same time
> The standby fails first and the arbiter fails before the master write out-of-sync information, or the arbiter fails first and the standby fails before a new arbiter is assigned. In both cases, the master can't continue its work. It requires manual intervention.
>

Yes, manual intervention is needed in this case. FTS on master cannot
make any change to gp_segment_configuration and cannot promote the
arbiter's mirror.

Asim

Ashwin Agrawal

unread,
Nov 19, 2018, 2:06:04 AM11/19/18
to Gang Xiong, Paul Guo, Asim Praveen, gpdb...@greenplum.org

On Nov 13, 2018, at 8:40 PM, Gang Xiong <gxi...@pivotal.io> wrote:

Hi Paul and Asim,

I explored a little bit more on your idea of using one single segment (for example, segment 0) as the arbiter, which is similar to using FTS to arbitrate primary/mirror segments. The solution is practicable and much easier, except it cannot tolerate any 2 of the 3 nodes (arbiter segment, master, and standby) failing at the same time.

TL;DR
The process:
1. The arbiter segment regularly sends heartbeats to the master and standby. When it detects the master is down, it promotes the standby.

If the arbiter can't reach the master, how can we say the master is down? It could just be that the connection between this segment and the master is broken while all the other segments can reach the master. That would unnecessarily promote the standby and cause current transactions to abort.

2. When the master finds the standby is not responding, it stops all transactions, then writes the standby's out-of-sync status to the arbiter segment and its mirror. If the status is written successfully, the master resumes the transactions. If the arbiter says 'the standby has already taken over', the master shuts down.

Failure mode:
1. The master is down.
Arbiter segment detects it and promotes standby.

2. Arbiter segment detects the master down, but the master is actually alive.
Arbiter segment detects it and promotes standby. After standby is promoted, the master lost the connection to the walreceiver, so it talks to the arbiter and finds the standby has taken over.

3. The standby is down
Master lost connection to the walreceiver, so it writes out-of-sync to the arbiter and resume transactions.

4. The arbiter segment is down
Master FTS detects it and negotiates with standby to assign the arbiter role to a new segment.

How does it stop the previous arbiter segment from continuing to run in the system in this case? Today a primary can be marked as down in the configuration but still keep running.

Also, how is manual promotion allowed or disallowed with this scheme? It's a very important protection which doesn't exist today. Are we planning to completely remove gpactivatestandby? It seems we can't, as we should still have a manual option to fail over to the standby.

Jialun Du

unread,
Nov 19, 2018, 2:49:29 AM11/19/18
to aagr...@pivotal.io, gxi...@pivotal.io, pg...@pivotal.io, apra...@pivotal.io, gpdb...@greenplum.org
If the arbiter can't reach the master, how can we say the master is down? It could just be that the connection between this segment and the master is broken while all the other segments can reach the master. That would unnecessarily promote the standby and cause current transactions to abort.
Per failure modes 1 & 2, we will promote the standby in case the master is down, or in case the arbiter segment detects the master as down while the master is actually alive. With this simple solution we can't distinguish whether the master is really down or there is just a connection failure between the arbiter segment and the master. Maybe we should consider some corner cases, like the arbiter detecting the master as down while walrep is still working (the master is alive), in which case we should not promote the standby.

How does it stop the previous arbiter segment from continuing to run in the system in this case? Today a primary can be marked as down in the configuration but still keep running.
The master and standby will keep the dbid of the arbiter; after negotiating, they will record the new dbid and refuse arbitration connections from the old arbiter.
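
The dbid fencing Jialun describes can be sketched as below. The class and method names are illustrative; the point is only that a stale arbiter (old dbid) keeps running but its arbitration messages are rejected.

```python
# Fencing a replaced arbiter by dbid: master and standby remember the
# dbid they negotiated, and ignore arbitration traffic from any other.

class ArbiterView:
    def __init__(self, arbiter_dbid):
        self.arbiter_dbid = arbiter_dbid

    def reassign_arbiter(self, new_dbid):
        # Negotiated between master and standby when FTS detects the
        # current arbiter segment is down.
        self.arbiter_dbid = new_dbid

    def accept_arbitration(self, sender_dbid):
        # The old arbiter may still be running (as a down-marked primary
        # can today), but its messages no longer carry authority.
        return sender_dbid == self.arbiter_dbid
```
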

Also, how is manual promotion allowed or disallowed with this scheme? It's a very important protection which doesn't exist today. Are we planning to completely remove gpactivatestandby? It seems we can't, as we should still have a manual option to fail over to the standby.
We won't completely remove gpactivatestandby. It is necessary in some extreme cases. For example, if the master node is broken and all the data is lost and cannot be recovered, activating the standby is the only choice, though some walrep logs may be missing.

Ashwin Agrawal

unread,
Nov 19, 2018, 3:29:18 AM11/19/18
to Jialun Du, gxi...@pivotal.io, pg...@pivotal.io, apra...@pivotal.io, gpdb...@greenplum.org

On Nov 18, 2018, at 11:49 PM, Jialun Du <j...@pivotal.io> wrote:

If the arbiter can't reach the master, how can we say the master is down? It could just be that the connection between this segment and the master is broken while all the other segments can reach the master. That would unnecessarily promote the standby and cause current transactions to abort.
Per failure modes 1 & 2, we will promote the standby in case the master is down, or in case the arbiter segment detects the master as down while the master is actually alive. With this simple solution we can't distinguish whether the master is really down or there is just a connection failure between the arbiter segment and the master. Maybe we should consider some corner cases, like the arbiter detecting the master as down while walrep is still working (the master is alive), in which case we should not promote the standby.

We need to think very carefully through the sequence of steps and bridge all the holes / race conditions here. The master may be marking this arbiter segment as down in the configuration while, at the same time, the arbiter is promoting the standby.


How does it stop the previous arbiter segment from continuing to run in the system in this case? Today a primary can be marked as down in the configuration but still keep running.
The master and standby will keep the dbid of the arbiter; after negotiating, they will record the new dbid and refuse arbitration connections from the old arbiter.

Also, how is manual promotion allowed or disallowed with this scheme? It's a very important protection which doesn't exist today. Are we planning to completely remove gpactivatestandby? It seems we can't, as we should still have a manual option to fail over to the standby.
We won't completely remove gpactivatestandby. It is necessary in some extreme cases. For example, if the master node is broken and all the data is lost and cannot be recovered, activating the standby is the only choice, though some walrep logs may be missing.

It is agreed that we need to keep gpactivatestandby, though it can't be in its current form. Protection needs to be added to it to fail the promotion if master and standby are not in sync (how to bypass this check for extreme support reasons is a separate topic). So we need to leverage the logic we come up with for auto failover to also implement this protection in gpactivatestandby.


Also, how transactions are forbidden via the master even after the standby is promoted is not explicitly called out. What happens to currently running transactions initiated by the master?


Jialun Du

unread,
Nov 19, 2018, 3:50:07 AM11/19/18
to aagr...@pivotal.io, gxi...@pivotal.io, pg...@pivotal.io, apra...@pivotal.io, gpdb...@greenplum.org
Also, how transactions are forbidden via the master even after the standby is promoted is not explicitly called out. What happens to currently running transactions initiated by the master?
If the standby is promoted, the master will lose the connection to the walreceiver. Then it will pause all running transactions and talk to the arbiter. The master will abort the running transactions after it finds the standby has been promoted.
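
The master-side sequence Jialun outlines can be sketched as follows. The function and answer strings are hypothetical; the sketch only shows the pause-ask-resume/abort decision.

```python
# Master's reaction to losing the walreceiver connection: pause all
# running transactions, ask the arbiter what happened, then either
# resume (standby merely down) or abort and shut down (standby promoted).

def master_on_walreceiver_lost(ask_arbiter, txns):
    for t in txns:
        t["state"] = "paused"
    answer = ask_arbiter()  # blocking question to the arbiter segment
    if answer == "standby-has-taken-over":
        # A promoted standby is now the master: abort everything and stop.
        for t in txns:
            t["state"] = "aborted"
        return "shutdown"
    # The arbiter recorded the standby as out-of-sync: safe to continue.
    for t in txns:
        t["state"] = "running"
    return "resume"
```

Pausing before asking is what closes the window in which the old master could commit transactions the promoted standby never sees.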

Bhuvnesh Chaudhary

unread,
Nov 28, 2018, 4:57:38 PM11/28/18
to j...@pivotal.io, Ashwin Agrawal, gxi...@pivotal.io, Paul Guo, Asim Praveen, gpdb...@greenplum.org
Segments can be a potential bottleneck for arbitration.
For instance, when the segments are highly loaded (high IO/CPU), there can be significant delays which could make the arbiter segment unresponsive.
Haven't we often seen that happen between primaries and mirrors, with the mirror/primary being marked down?
Should this be a separate special segment whose only job is to arbitrate the master-standby relationship?


Thanks,
Bhuvnesh Chaudhary
Mobile: +1-973.906.6976

jpa...@pivotal.io

unread,
Nov 28, 2018, 5:19:59 PM11/28/18
to Greenplum Developers

I like how this proposal is broken out into 6 steps. We could say that the auto-failover feature for master/standby can be broken out into the following 3 high level steps:

1. Probe for state
2. Save state to a system of record
3. Take action when state changes

Ask: Could we agree on an API for the 3 steps above? This would make the implementations swappable. For example, on Kubernetes etcd is highly available and provided for free, so we would like to use it as the system of record for step #2.

Ask: Furthermore the existing code in FTS is a great candidate for reuse in #1.  Would it be possible to isolate just the probing part of FTS for reuse?
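
One possible way to phrase the swappable three-step API, sketched in Python. The interface names are made up for illustration; on Kubernetes the state store could be backed by etcd, and natively by an arbiter segment.

```python
# A pluggable probe / store / act interface, so implementations of each
# step can be swapped independently (FTS-derived prober, etcd-backed
# store, standby-promotion actor, etc.).
from abc import ABC, abstractmethod

class Prober(ABC):
    @abstractmethod
    def probe(self):
        """Step 1: return the currently observed master/standby state."""

class StateStore(ABC):
    @abstractmethod
    def save(self, state):
        """Step 2: persist state in the system of record."""
    @abstractmethod
    def load(self):
        """Return the last persisted state, or None."""

class Actor(ABC):
    @abstractmethod
    def act(self, old, new):
        """Step 3: react to a state change, e.g. promote the standby."""

def failover_tick(prober, store, actor):
    # One iteration of the loop gluing the three steps together.
    new = prober.probe()
    old = store.load()
    store.save(new)
    if old is not None and old != new:
        actor.act(old, new)
    return new
```

With such an interface, only `StateStore` would need an etcd implementation for the Kubernetes case; the prober and actor stay shared.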

Thanks, 
Larry & Jemish


Robert Eckhardt

unread,
Nov 28, 2018, 5:33:04 PM11/28/18
to Jemish Patel, gpdb...@greenplum.org
On Wed, Nov 28, 2018 at 4:20 PM <jpa...@pivotal.io> wrote:
>
>
> I like how this proposal is broken out into 6 steps. We could say that the auto-failover feature for master/standby can be broken out into the following 3 high level steps:
>
> 1. Probe for state
> 2. Save state to system of record
> 3. Take action when state changes
>
> Ask: Could we agree on an API for the 3 steps above? This would make the implementations swappable. For e.g on Kubernetes etcd is highly available and provided for free so we would like to use it as the system of record for step #2.

+1 I think this is a good idea.

I actually think this is fairly well understood as it stands and we
just need to clearly document the state as is.

>
> Ask: Furthermore the existing code in FTS is a great candidate for reuse in #1. Would it be possible to isolate just the probing part of FTS for reuse?

One of the suggestions early on was to extract FTS from the DB and
have it running outside the system and be used for this purpose. IIRC
there was a reason not to go with that solution.

>
> Thanks,
> Larry & Jemish
>
>