Replication slots in Greenplum

Robert Eckhardt

Jan 23, 2019, 6:53:56 PM

to Greenplum Developers

Someone asked in a PR whether there is any documentation, or working
docs that we are using, to help explain the replication slot work.

Replication slots are a Postgres feature, first introduced in 9.4:
https://www.postgresql.org/docs/9.4/catalog-pg-replication-slots.html

"They are a persistent record of the state of a replica that is kept
on the master server even when the replica is offline and
disconnected." This means we can ensure that the WAL needed to bring
a mirror back up still exists on the primary.
https://blog.2ndquadrant.com/postgresql-9-4-slots/

I apologize for some of the wording below. In Greenplum terminology
there are primary/mirror pairs, and there is always a primary. When the
primary fails, the mirror is promoted; there is then no longer a
mirror, but there is still a primary (the former mirror). The lack of
good language for this might make my comments below confusing, so
please ask clarifying questions.

Fundamentally, all we are doing with replication slots is plumbing them
into the utilities and the FTS system so that their usage is automated.
This work consists of a few stories:
* Whenever a mirror is added, a replication slot is created to enable
incremental recovery if the mirror goes down and then comes back up.
* When a mirror is promoted, a replication slot is created to enable
incremental recovery of the old primary via pg_rewind.
* Finally, relevant to the PR discussion: when a primary fails, there
is still a replication slot on it (not sure what to call a dead
primary). When it becomes a mirror, that (stale) replication slot needs
to be dropped so that it doesn't keep retaining logs that are never
consumed, because it isn't a primary anymore.
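
The three stories above can be sketched as a toy state model. This is
only an illustration, not the actual Greenplum code: the Segment class,
the function names, and the slot name "example_slot" are hypothetical,
and it assumes the slot lives on whichever segment currently acts as
primary.

```python
from dataclasses import dataclass, field

SLOT = "example_slot"  # hypothetical slot name, for illustration only


@dataclass
class Segment:
    name: str
    role: str  # "primary", "mirror", or "down"
    slots: set = field(default_factory=set)


def add_mirror(primary, mirror_name):
    """Story 1: adding a mirror creates a slot on the primary."""
    primary.slots.add(SLOT)
    return Segment(mirror_name, "mirror")


def promote(mirror):
    """Story 2: a promoted mirror gets a slot so the old primary can be recovered."""
    mirror.role = "primary"
    mirror.slots.add(SLOT)


def start_as_mirror(segment):
    """Story 3: a segment starting as a mirror drops its stale slot."""
    segment.role = "mirror"
    segment.slots.discard(SLOT)
```

Walking one pair through a failover: add_mirror() leaves the slot on
the primary, promote() puts one on the former mirror, and
start_as_mirror() on the old primary removes the stale one.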

There are a few follow-up stories around providing warnings about
setting up replication slots without using the utilities, and making
sure that only superusers can modify them, but essentially that is it.

-- Rob

Yandong Yao

Jan 23, 2019, 10:40:59 PM

to Robert Eckhardt, Greenplum Developers

On Thu, Jan 24, 2019 at 7:54 AM Robert Eckhardt <reck...@pivotal.io> wrote:

> It was asked in a PR if there was any documentation or working docs
> that we were using in order to help explain the Replication slot work.
>
> Replication slots are a feature from Postgres, first in 9.4
> https://www.postgresql.org/docs/9.4/catalog-pg-replication-slots.html
>
> "They are a persistent record of the state of a replica that is kept
> on the master server even when the replica is offline and
> disconnected." meaning that we can ensure that the WAL Logs that are
> needed to bring a mirror back up exist on the Primary.
>
> https://blog.2ndquadrant.com/postgresql-9-4-slots/

Yeah, the public PG docs about replication slots are easy to find and not a problem. The ask is more about their usage in Greenplum. An overall big picture of HA based on walrep/pg_rewind/replication slots would be super helpful.

> I apologize for some of the wording below. The verbiage used in
> Greenplum is that there are primary/mirror pairs and there is always a
> primary. When the Primary fails the mirror is promoted and there is no
> longer a mirror but there is still a primary that was the mirror. The
> lack of good language for this might cause confusion in my below
> comments, please ask clarifying questions.
>
> Fundamentally all we are doing with Replication Slots is plumbing them
> into the utilities and the FTS system so that the usage is automated.
> This work consists of a few stories.
> * whenever there is a mirror added a replication slot is created to
> enable incremental recovery if the mirror goes down then comes back up
> * when a mirror is promoted then there is a replication slot created
> to enable incremental recovery via pg_rewind of old primary

If the primary fails and the mirror is promoted, is there any tiny window where the mirror's xlog is behind the master's xlog, so that after the mirror is promoted some of the latest changes are missing?

'pg_rewind of old primary': do you mean that when the primary is back and becomes the new mirror, pg_rewind is used to bring the old primary (new mirror) in sync with the old mirror (new primary)?

> * Finally, relevant to the PR discussion, when a primary fails there
> is still a replication slot on that (not sure what to call a dead
> primary) When it becomes a mirror that (stale) replication slot needs
> to be killed so that it doesn't store logs that aren't being flushed
> because it isn't a primary anymore.

This helps a lot in understanding the corresponding PR. Thanks! So the slot will be removed during startup as a mirror (old primary). Then, after startup, pg_rewind is used to bring the mirror up to date with the primary. Is this true? Who will call pg_rewind: FTS, or utilities such as gprecoverseg?

--

Best Regards,

Yandong

Robert Eckhardt

Jan 23, 2019, 10:50:58 PM

to Yandong Yao, Greenplum Developers

No. For two-phase commit to commit a transaction, the xlog needs
to be written to the mirror, so we know it exists on the mirror.

>
> 'pg_rewind of old primary': do you mean when primary is back and becomes the new mirror,
> pg_rewind is used to bring old primary (new mirror) to sync state with old mirror (new primary)?
>>
>> * Finally, relevant to the PR discussion, when a primary fails there
>> is still a replication slot on that (not sure what to call a dead
>> primary) When it becomes a mirror that (stale) replication slot needs
>> to be killed so that it doesn't store logs that aren't being flushed
>> because it isn't a primary anymore.
>
> This helps a lot to understand corresponding PR. Thanks! So it will be removed
> during startup as mirror (old primary). Then after startup, pg_rewind is used
> to bring mirror up to date with primary. Is this true? Who will call pg_rewind? FTS or utilities such as gprecoverseg?

pg_rewind is called by gprecoverseg. I'll let someone else explain
the exact order of events that you mention. I'm not sure whether the
replication slot is created before or after pg_rewind, or how they
interact.

-- Rob

Robert Eckhardt

Jan 23, 2019, 10:55:53 PM

to Yandong Yao, Greenplum Developers

On Wed, Jan 23, 2019 at 10:50 PM Robert Eckhardt <reck...@pivotal.io> wrote:
[Snip]

Sorry I missed 2 questions

> > 'pg_rewind of old primary': do you mean when primary is back and becomes the new mirror,

Yes, exactly.

> > pg_rewind is used to bring old primary (new mirror) to sync state with old mirror (new primary)?

Correct.

[Snip]

-- Rob

Yandong Yao

Jan 24, 2019, 6:06:54 AM

to Robert Eckhardt, Greenplum Developers

>
> If primary failed, and mirror is promoted, is there any tiny window, where mirror's xlog is
> behind of master's xlog, and after mirror is promoted, some of latest changes are missed?

> No, in order for 2 phase commit to commit a transaction the xlog needs
> to be written to the mirror so we know it exists on the mirror.

The primary commits the transaction first, then ships xlog to the mirror. What happens if the primary crashes just after sending xlog to the mirror but before receiving the ACK? Will the transaction commit successfully after the mirror is promoted, or will the mirror abort it?

In this case, the client will treat it as aborted anyway.

> pg_rewind is called by gprecoverseg. Let me let someone else explain
> the exact order of events that you mention. I'm not sure if the
> replication slot is created before or after pg_rewind nor how they
> interact.

Thanks in advance!

--

Best Regards,

Yandong

Ashwin Agrawal

Jan 24, 2019, 2:21:45 PM

to Yandong Yao, Robert Eckhardt, Greenplum Developers

On Thu, Jan 24, 2019 at 3:06 AM Yandong Yao <yy...@pivotal.io> wrote:

>
> If primary failed, and mirror is promoted, is there any tiny window, where mirror's xlog is
> behind of master's xlog, and after mirror is promoted, some of latest changes are missed?

> No, in order for 2 phase commit to commit a transaction the xlog needs
> to be written to the mirror so we know it exists on the mirror.
>
> Primary will commit transaction firstly, then ship xlog to mirror. If primary crashes just after sending xlog to mirror while before receiving ACK. What will happen? Will the mirror commit successfully after mirror is promoted, or mirror abort the transaction?

2PC in Greenplum is coordinated by the QD. If the primary crashes after sending the commit xlog to the mirror but before receiving the ack from the mirror, the primary has not yet sent its ack to the QD. Hence, the QD will retry the commit on the promoted mirror and complete the transaction.

An important point to note: the primary waits for an ack from the mirror in both phases of 2PC, first for prepare and then for commit.

So, the flow is:

QD -> (prepare) -> primary -> (waits for prepare LSN xlog flush) -> mirror
QD <- (ack) <- primary <- (ack) <- mirror
QD (commit)
Phase 1 completes
QD -> (commit) -> primary -> (waits for commit LSN xlog flush) -> mirror
QD <- (ack) <- primary <- (ack) <- mirror
QD (marks done)
Phase 2 completes
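
To make the flow and the crash case concrete, here is a small toy simulation. It is a sketch under assumptions, not Greenplum code: the Mirror and Primary classes, run_2pc(), and the crash_on parameter are all made-up names, and the mirror's xlog flush is modeled as a simple list append.

```python
class PrimaryCrashed(Exception):
    """Raised when the simulated primary dies before acking the QD."""


class Mirror:
    def __init__(self):
        self.xlog = []

    def flush(self, record):
        # Idempotent: a retried record is not written twice.
        if record not in self.xlog:
            self.xlog.append(record)
        return True  # ack


class Primary:
    def __init__(self, mirror, crash_on=None):
        self.mirror = mirror
        self.crash_on = crash_on  # record after which the primary "crashes"
        self.xlog = []

    def handle(self, record):
        self.xlog.append(record)
        acked = self.mirror.flush(record)  # wait for the mirror's flush ack
        if record == self.crash_on:
            raise PrimaryCrashed(record)   # dies before acking the QD
        return acked


def run_2pc(primary, mirror):
    """Play the QD role for one transaction and return its outcome."""
    try:
        primary.handle("prepare")
    except PrimaryCrashed:
        return "aborted"          # phase 1 never completed
    try:
        primary.handle("commit")
    except PrimaryCrashed:
        mirror.flush("commit")    # retry the commit on the promoted mirror
    return "committed"
```

With crash_on="commit" the transaction still ends up committed, because the QD retries against the promoted mirror; with crash_on="prepare" it is aborted.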

Just to point out, for the benefit of other readers: this question is unrelated to replication slots; it is about how 2PC works with WAL replication in general.

> For this case, client will treat it as aborted anyway.

Why would the client treat it as aborted? The QD is our shield against exposing a primary failure to clients in this case. The transaction is aborted only if the primary crashes before phase 1 of 2PC completes.

>
> 'pg_rewind of old primary': do you mean when primary is back and becomes the new mirror,
> pg_rewind is used to bring old primary (new mirror) to sync state with old mirror (new primary)?
>>
>> * Finally, relevant to the PR discussion, when a primary fails there
>> is still a replication slot on that (not sure what to call a dead
>> primary) When it becomes a mirror that (stale) replication slot needs
>> to be killed so that it doesn't store logs that aren't being flushed
>> because it isn't a primary anymore.
>
> This helps a lot to understand corresponding PR. Thanks! So it will be removed
> during startup as mirror (old primary). Then after startup, pg_rewind is used
> to bring mirror up to date with primary. Is this true? Who will call pg_rewind? FTS or utilities such as gprecoverseg?

gprecoverseg calls pg_rewind. FTS can't perform this action, as it has to be a manually initiated event. pg_rewind needs to run first, to roll back the extra transactions on the old primary before it can be converted and connected back as a mirror. Hence that happens as the first step, and afterwards the segment is connected back as a mirror.

pg_rewind, or pg_basebackup (for full mirror recovery), copies over the primary's replication slot. So, when a segment starts as a mirror (irrespective of how it was created), it deletes the internal gp replication slot, since a mirror is not supposed to continue retaining xlog.
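
The ordering described above can be sketched as follows. This is a toy model only: segments are plain dicts, pg_rewind() here is a stand-in that just copies state (the real tool works on data files), and the slot name "example_internal_slot" and all function names are hypothetical.

```python
INTERNAL_SLOT = "example_internal_slot"  # hypothetical name, for illustration


def pg_rewind(target, source):
    """Toy stand-in for pg_rewind: make the target's state match the source."""
    target["wal"] = list(source["wal"])     # extra transactions on target are discarded
    target["slots"] = set(source["slots"])  # the primary's slot is copied along


def start_as_mirror(segment):
    """On startup as a mirror, drop the copied slot so no xlog is retained."""
    segment["role"] = "mirror"
    segment["slots"].discard(INTERNAL_SLOT)


def recover_incremental(old_primary, new_primary):
    """Order used by the recovery utility in this sketch: rewind, then start."""
    pg_rewind(old_primary, new_primary)
    start_as_mirror(old_primary)
    return old_primary
```

After recover_incremental(), the old primary has the new primary's history and no slot of its own, while the slot on the new primary is untouched.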

Yandong Yao

Jan 29, 2019, 9:49:23 PM

to Ashwin Agrawal, Robert Eckhardt, Greenplum Developers

Thanks for the detailed information. What about the same question for master and standby?

"The master commits the transaction first, then ships xlog to the standby. What happens if the master crashes just after sending xlog to the standby but before receiving the ACK? Will the transaction commit successfully after the standby is promoted, or will the standby abort it?"

And what is the behavior for the client application?

> Just wish to point out for benefit of other readers, that this question doesn't relate anything with replication slots, and is discussing general 2PC working with walreplication.

You are right, maybe we should finish this thread after the last question above.

> gprecoverseg calls pg_rewind, FTS can't perform this action as it has to be manually initiated event. pg_rewind needs to be run first to rollback extra transactions on old primary before it can be converted and connected back as mirror. Hence, that happens as first step, and after same that segment is connected back as mirror.
>
> As part of pg_rewind or pg_basebackup (for full mirror recovery) will copy over primaries replication slot. So, during start of segment as mirror (irrespective of how it was created) deletes the internal gp replication slot as not supposed to continue retaining xlog on mirror.

Noted with thanks!

--

Best Regards,

Yandong

Ashwin Agrawal

Jan 30, 2019, 2:08:01 PM

to Yandong Yao, Robert Eckhardt, Greenplum Developers

On Tue, Jan 29, 2019 at 6:49 PM Yandong Yao <yy...@pivotal.io> wrote:

> Thanks for the detailed information. So for about master and standby for same question?
>
> "master will commit transaction firstly, then ship xlog to standby. If master crashes just after sending xlog to standby while before receiving ACK. What will happen? Will the standby commit successfully after standby is promoted, or standby abort the transaction?"

The master crashed after sending xlog but before receiving the ack. The behavior depends on how far the xlog made it on the standby. If the xlog was written and flushed on the standby, then on promotion the transaction will be committed. If the xlog never made it to the standby, then the transaction stands aborted.

Note: since the master crashed after commit but before receiving the ack from the standby, it had not yet sent the commit to the segments for 2PC. So, irrespective of whether the standby aborts or commits on promotion, it will perform the same action on the segments, causing no inconsistency.
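
As a rough sketch of that rule (illustrative only; promote_standby() and its arguments are invented names, and the standby's xlog is modeled as a list of record labels):

```python
def promote_standby(standby_xlog, segments):
    """Decide the transaction's fate on standby promotion.

    The decision depends only on whether the commit record was flushed on
    the standby; the same decision is then applied to every segment.
    """
    decision = "commit" if "commit" in standby_xlog else "abort"
    return decision, {seg: decision for seg in segments}
```

Either way, every segment receives the same action as the promoted standby, which is why no inconsistency arises.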