Recover to latest and promote

Stuart Bishop

unread,

Mar 10, 2017, 12:48:36 PM3/10/17

to wa...@googlegroups.com

Hi.

Is there a way to recover to the latest available target_recovery_time
or target_recovery_xid, and automatically promote? The best I seem to
be able to do is watch the logs and manually promote once wal fetching
starts failing and I've run out of files to replay.

I believe that to do automatic promotion after recovery I need to
specify a target_recovery_time or target_recovery_xid and I can't see
any way to determine that. Unless perhaps I download the wal files,
navigating timeline switches, and analyze them.

--
Stuart Bishop <stu...@stuartbishop.net>
http://www.stuartbishop.net/

Daniel Farina

unread,

Mar 10, 2017, 2:50:05 PM3/10/17

to stu...@stuartbishop.net, wa...@googlegroups.com

On Fri, Mar 10, 2017 at 9:48 AM Stuart Bishop <stu...@stuartbishop.net> wrote:

Hi.

Is there a way to recover to the latest available target_recovery_time
or target_recovery_xid, and automatically promote? The best I seem to
be able to do is watch the logs and manually promote once wal fetching
starts failing and I've run out of files to replay.

I believe that to do automatic promotion after recovery I need to
specify a target_recovery_time or target_recovery_xid and I can't see
any way to determine that. Unless perhaps I download the wal files,
navigating timeline switches, and analyze them.

You can turn standby_mode off, and then the first WAL-E download failure will cause a promotion.

I am moderately cautious on this: if WAL-E, or any wrapping program exit with an unexpected status code, the system will leave recovery and start up.

Postgres treats all exit codes greater than 125 as such a case:

[....]

* However, if the failure was due to any sort of signal, it's best to

* punt and abort recovery. (If we "return false" here, upper levels will

* assume that recovery is complete and start up the database!) It's

* essential to abort on child SIGINT and SIGQUIT, because per spec

* system() ignores SIGINT and SIGQUIT while waiting; if we see one of

* those it's a good bet we should have gotten it too.

*

* On SIGTERM, assume we have received a fast shutdown request, and exit

* cleanly. It's pure chance whether we receive the SIGTERM first, or the

* child process. If we receive it first, the signal handler will call

* proc_exit, otherwise we do it here. If we or the child process received

* SIGTERM for any other reason than a fast shutdown request, postmaster

* will perform an immediate shutdown when it sees us exiting

* unexpectedly.

*

* Per the Single Unix Spec, shells report exit status > 128 when a called

* command died on a signal. Also, 126 and 127 are used to report

* problems such as an unfindable command; treat those as fatal errors

* too.

*/

if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)

proc_exit(1);

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

I have modestly tried to make WAL-E safe to this purpose, but it has never quite sat right with me to trust this mechanism in Postgres to promote a database. Many programs, e.g. wrappers like envdir or whathaveyou, are not guaranteed to emit status codes > 125 in all non-archive-recovery-failure cases.

That said, it will work nearly 100% of the time.

Stuart Bishop

unread,

Mar 13, 2017, 2:08:45 AM3/13/17

to Daniel Farina, wa...@googlegroups.com

On 11 March 2017 at 02:49, Daniel Farina <dan...@fdr.io> wrote:
> On Fri, Mar 10, 2017 at 9:48 AM Stuart Bishop <stu...@stuartbishop.net>
> wrote:
>>
>> Hi.
>>
>> Is there a way to recover to the latest available target_recovery_time
>> or target_recovery_xid, and automatically promote? The best I seem to
>> be able to do is watch the logs and manually promote once wal fetching
>> starts failing and I've run out of files to replay.
>>
>> I believe that to do automatic promotion after recovery I need to
>> specify a target_recovery_time or target_recovery_xid and I can't see
>> any way to determine that. Unless perhaps I download the wal files,
>> navigating timeline switches, and analyze them.
>
>
> You can turn standby_mode off, and then the first WAL-E download failure
> will cause a promotion.
>
> I am moderately cautious on this: if WAL-E, or any wrapping program exit
> with an unexpected status code, the system will leave recovery and start up.
>
> Postgres treats all exit codes greater than 125 as such a case:
>
> [....]
>

> I have modestly tried to make WAL-E safe to this purpose, but it has never
> quite sat right with me to trust this mechanism in Postgres to promote a
> database. Many programs, e.g. wrappers like envdir or whathaveyou, are not
> guaranteed to emit status codes > 125 in all non-archive-recovery-failure
> cases.
>
> That said, it will work nearly 100% of the time.

Ah, great. Thanks for all this. Its really helped clarify my design,
and will avoid automatic promotion when I can.

I believe it would also work if I could download all the WAL files
after downloading the backup, ensuring I had them all in place before
restarting. I think I would need to teach WAL-E to navigate timeline
changes to do that reliably (unless I just waste bandwidth and disk
space and download all future timelines, which is probably good enough
for real life use cases)

Daniel Farina

unread,

Mar 13, 2017, 2:27:35 PM3/13/17

to Stuart Bishop, wa...@googlegroups.com

On Sun, Mar 12, 2017 at 11:08 PM Stuart Bishop <stu...@stuartbishop.net> wrote:

I believe it would also work if I could download all the WAL files
after downloading the backup, ensuring I had them all in place before
restarting. I think I would need to teach WAL-E to navigate timeline
changes to do that reliably (unless I just waste bandwidth and disk
space and download all future timelines, which is probably good enough
for real life use cases)

Another thing you might find useful is the default behavior for this promotion is to merely "pause" recovery. Thus, you have an opportunity to check timestamps and the data for suitability. If you want to continue archive retries, you then can call a function to continue function.

See https://www.postgresql.org/docs/9.6/static/functions-admin.html, xlog_replay_resume.

Reply all

Reply to author

Forward