postgres (slave) stopped doing wal-fetch

27 views
Skip to first unread message

Ben Hitz

unread,
Oct 26, 2016, 3:29:23 PM10/26/16
to wal-e
So this has happened multiple times in the last 48 hours.   I suspect some sort of EC/AWS network errors, but I really would like to be sure.

I bring up a machine that's set to recovery - it has a recovery.conf like this:

# recovery.conf must be linked into data dir to do anything

recovery_target_timeline = 'latest'

restore_command = '/opt/wal-e/bin/wal-e --aws-instance-profile --s3-prefix="$(cat /etc/postgresql/9.3/main/wale_s3_prefix)" wal-fetch "%f" "%p"'

standby_mode = on


So, we when we bring one up - we do a backup push and then build a new machine which connnects and starts restoring it's database.
First it does the backup then starts reading the WALs off of S3.

But for some reason it just ... stopped.  For > 2 hrs.  With no notation in log or anything.
I confimed that the MASTER was indeed correctly writing WAL via wal-push and they were indeed getting up to S3.

This system has worked as is reasonably well for a few years (PG 9.3)... it's possibly that our rate of WAL-push has increased during prime hours significantly.

After poking around, I just restarted postgres on the slave and it kicked right back up again.   However, I am somewhat worried that it will stop.

Any suggestions for increasing logging or debugging this in either wal-e (or postgres?)

Thanks,
Ben
(oh you can view a snippet of the PG log here: http://dpaste.com/09J7ZP2 at least until 10/3 or so)


Daniel Farina

unread,
Oct 26, 2016, 8:27:58 PM10/26/16
to cisc...@gmail.com, wal-e
On Wed, Oct 26, 2016 at 12:29 PM Ben Hitz <cisc...@gmail.com> wrote:
So this has happened multiple times in the last 48 hours.   I suspect some sort of EC/AWS network errors, but I really would like to be sure.

Can you check to see if there is a stuck wal-e process sitting around doing nothing, blocking Postgres from getting more WAL? I haven't seen a non-termination issue in quite some time.

If there is such a stuck process, see if killing it allows replication to resume.

It would be ideal if you can get into that process with `gdb` after having installed some debug symbols as to see the stack trace.

Another common thing that causes elusive errors with WAL-E is out-of-disk space, or indeed any problem that prevents it from writing in a temporary directory like /tmp.

Ben Hitz

unread,
Oct 27, 2016, 11:03:22 AM10/27/16
to wal-e, cisc...@gmail.com
Nope, disks are all fine.
I didn't find a wal-e process, but I am not sure what to look for.

I did see this;

1 S postgres  51951  51950  0  80   0 - 61672 -      00:03 ?        00:00:04 postgres: startup process   recovering 0000001B000007D80000006D waiting 




The master has pushed

0000001B000007D8000000CF


So it's 100s of WAL behind.  It's been idle again for about 12 hrs.

Ben

Ben Hitz

unread,
Oct 27, 2016, 11:04:01 AM10/27/16
to wal-e
Could it possibly be a wal-e version mismatch?  The new slave was created after 10/24, it probably has the lastest version while the master is a few weeks old.

Ben

Daniel Farina

unread,
Oct 27, 2016, 1:22:36 PM10/27/16
to cisc...@gmail.com, wal-e
On Thu, Oct 27, 2016 at 8:16 AM Ben Hitz <cisc...@gmail.com> wrote:
Could it possibly be a wal-e version mismatch?  The new slave was created after 10/24, it probably has the lastest version while the master is a few weeks old.

Unlikely, the format hasn't changed in ages, and can't really since some people have old and/or lots of archives.

WAL-E emits messages on start-up yet you say you cannot see anything. Try invoking wal-e wal-fetch manually in the same way as restore command.

When doing this, as to ensure your environment is more similar to what the Postgres processes have, I suggest using "sudo -i -u postgres" to get an initial log-in shell as the Postgres user.

Benjamin C. Hitz

unread,
Oct 27, 2016, 2:10:33 PM10/27/16
to Daniel Farina, wal-e
It fetches just fine manually.

postgres@ip-1*****:~$ /opt/wal-e/bin/wal-e --aws-instance-profile --s3-prefix="$(cat /etc/postgresql/9.3/main/wale_s3_prefix)" wal-fetch 0000001B000007D700000093 /var/lib/postgresql/9.3/main/pg_xlog/0000001B000007D700000093
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2016-10-27T18:08:55.062847-00 pid=13685
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2016-10-27T18:08:55.091696-00 pid=13685 action=wal-fetch key=s3://encoded-backups-prod/production/wal_005/0000001B000007D700000093.lzo prefix=production/ seg=0000001B000007D700000093

But it doesn’t actually restore it...

I have another machine I brought up a day or so after the first one, it seems to be doing was-fetch (last 1 10 minutes ago), but the WAL it fetched was pushed up to S3 18 hours ago.
Someone in a sister project suggested that maybe we have too many WALs in our backup bucket?

Ben
--
Benjamin C. Hitz
ENCODE DCC * Dept. of Genetics
hi...@stanford.edu



Benjamin C. Hitz

unread,
Oct 27, 2016, 6:24:58 PM10/27/16
to Benjamin C. Hitz, Daniel Farina, wal-e
Further work on this:
I bounced our “candidate” (aka recovery) server again (it was , and it immediately downloaded ALL oustanding WALs from S3 and began applying them. Then, (of course) the master wrote a few more. But these were not applied. They don’t make it into pg_xlog/archive_status

They WERE downloaded (they are in pg_xlog) — presumably by WAL-E, but NO note of them in log file!

Here is a snippet of log for timing:
2016-10-27 01:41:22 UTC LOG: restored log file "0000001B000007D80000006A" from archive
2016-10-27 01:42:22 UTC LOG: restored log file "0000001B000007D80000006B" from archive
2016-10-27 01:43:22 UTC LOG: restored log file "0000001B000007D80000006C" from archive
2016-10-27 01:44:22 UTC LOG: restored log file "0000001B000007D80000006D" from archive
(bounced postgres at 22:06:19)
2016-10-27 22:06:19 UTC LOG: restored log file "0000001B.history" from archive
2016-10-27 22:06:21 UTC LOG: restored log file "0000001B000007D80000006B" from archive
2016-10-27 22:06:21 UTC LOG: restored log file "0000001B000007D80000006C" from archive
2016-10-27 22:06:21 UTC LOG: restored log file "0000001B000007D80000006D" from archive
2016-10-27 22:06:21 UTC LOG: restored log file "0000001B000007D80000006E" from archive
2016-10-27 22:06:21 UTC LOG: restored log file "0000001B000007D80000006F" from archive

It’s like the history is not getting written correctly?

Ben
> --
> You received this message because you are subscribed to a topic in the Google Groups "wal-e" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/wal-e/O-APZYJ-3jw/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to wal-e+un...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages