After upgrading from 0.7.2 to 0.8c1, I've noticed that my typical run of wal-e backup-fetch + wal-fetch in recovery.conf now fails unless I disable prefetching. This ultimately means that PostgreSQL will fail to start after a backup-fetch. The logs from a successful restore / recovery without prefetch and a failed restore / recovery with it are included below.
So far as environment goes, both runs were done from a clean Ubuntu 12.04 VM with an empty PostgreSQL 9.3 cluster directory, and both runs pulled from the same S3 endpoint. I initially suspected that the S3 endpoint was to blame -- it's a local Ceph mirror of what's in AWS -- but the error also occurred when pointing to the original bucket in AWS. I was also able to verify that the WAL file in question (0000000300000042000000F5) does not exist in S3.
So, apart from my own user error, my guess is that the prefetch is writing out something it shouldn't have when trying (and necessarily failing) to fetch the non-existing, N + 1th WAL file. But, that's just a guess. I need to spend a little more time reading through the directories that prefetch creates (and how PostgreSQL interacts with them) to offer any better opinion. My apologies for not doing so before posting this -- especially since fully understanding PostgreSQL's restore_command is Worth the Trouble.
Please let me know if this should be filed as a bug on Github, and also if there are any more details I can provide about the state of this particular WAL-E archive. It's private data, of course, but it is a static archive at this point (i.e. no more WAL files are being written to this particular prefix, nor have they been for 4-5 days), so I should at least be able to provide consistent answers to any questions you might have. This WAL-E restore from a fresh VM is also already automated, so it's easy to re-run this again with extra logging or whatnot, if that helps.
with default (`--prefetch 8`). recovery.conf:
restore_command = '/usr/bin/env AWS_ACCESS_KEY_ID=XXX \
AWS_SECRET_ACCESS_KEY=YYY WALE_S3_ENDPOINT=https+path://Z \
/opt/wal-e/virtualenv/bin/wal-e \
--s3-prefix s3://BUCKET/PREFIX/ wal-fetch "%f" "%p"'
log:
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:31.730119-00 pid=15828 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000EF.lzo prefix=PREFIX/ seg=0000000300000042000000EF
2014-10-21 23:43:31 UTC LOG: restored log file "0000000300000042000000EF" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:32.319833-00 pid=15854
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:32.365143-00 pid=15854 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F0.lzo prefix=PREFIX/ seg=0000000300000042000000F0
2014-10-21 23:43:32 UTC LOG: restored log file "0000000300000042000000F0" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:32.942442-00 pid=15872
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:32.986935-00 pid=15872 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F1.lzo prefix=PREFIX/ seg=0000000300000042000000F1
2014-10-21 23:43:33 UTC LOG: restored log file "0000000300000042000000F1" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:33.507799-00 pid=15893
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:33.549269-00 pid=15893 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F2.lzo prefix=PREFIX/ seg=0000000300000042000000F2
2014-10-21 23:43:33 UTC LOG: restored log file "0000000300000042000000F2" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:34.103761-00 pid=15907
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:34.155012-00 pid=15907 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F3.lzo prefix=PREFIX/ seg=0000000300000042000000F3
2014-10-21 23:43:34 UTC LOG: restored log file "0000000300000042000000F3" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:34.677411-00 pid=15935
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:34.726169-00 pid=15935 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo prefix=PREFIX/ seg=0000000300000042000000F4
2014-10-21 23:43:34 UTC LOG: restored log file "0000000300000042000000F4" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:35.191866-00 pid=15947
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:35.257994-00 pid=15947 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F5.lzo prefix=PREFIX/ seg=0000000300000042000000F5
2014-10-21 23:43:35 UTC FATAL: archive file "0000000300000042000000F5" has wrong size: 0 instead of 16777216
2014-10-21 23:43:35 UTC LOG: startup process (PID 15621) exited with exit code 1
2014-10-21 23:43:35 UTC LOG: terminating any other active server processes
with `--prefetch 0`. recovery.conf:
restore_command = '/usr/bin/env AWS_ACCESS_KEY_ID=XXX \
AWS_SECRET_ACCESS_KEY=YYY WALE_S3_ENDPOINT=https+path://Z \
/opt/wal-e/virtualenv/bin/wal-e \
--s3-prefix s3://BUCKET/PREFIX/ \
wal-fetch --prefetch 0 "%f" "%p"'
log:
STRUCTURED: time=2014-10-21T23:57:09.337312-00 pid=15932 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F5.l
zo prefix=PREFIX/ seg=0000000300000042000000F5 state=complete
2014-10-21 23:57:09 UTC LOG: WAL file is from different database system: WAL file database system identifier is 6072803227874029436, pg_control database system identifier is 5943550808455905278.
2014-10-21 23:57:09 UTC LOG: redo done at 42/F456F620
2014-10-21 23:57:09 UTC LOG: last completed transaction was at log time 2014-10-16 05:30:02.567065+00
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:57:09.533006-00 pid=15939
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2014-10-21T23:57:09.563344-00 pid=15939 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.l
zo prefix=PREFIX/ seg=0000000300000042000000F4 state=begin
wal_e.blobstore.s3.s3_util INFO MSG: completed download and decompression
DETAIL: Downloaded and decompressed "s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo" to "pg_xlog/RECOVERYXLOG"
STRUCTURED: time=2014-10-21T23:57:10.137669-00 pid=15939
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2014-10-21T23:57:10.138563-00 pid=15939 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.l
zo prefix=PREFIX/ seg=0000000300000042000000F4 state=complete
2014-10-21 23:57:10 UTC LOG: restored log file "0000000300000042000000F4" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:57:10.353051-00 pid=15959
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2014-10-21T23:57:10.385350-00 pid=15959 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo prefi
x=PREFIX/ seg=00000004.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is s3://BUCKET/PREFIX/wal_005/00000004.history.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2014-10-21T23:57:10.514732-00 pid=15959
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2014-10-21T23:57:10.515860-00 pid=15959 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo prefi
x=PREFIX/ seg=00000004.history state=complete
2014-10-21 23:57:10 UTC LOG: selected new timeline ID: 4
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:57:10.728339-00 pid=15966
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2014-10-21T23:57:10.761165-00 pid=15966 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo prefi
x=PREFIX/ seg=00000003.history state=begin
wal_e.blobstore.s3.s3_util INFO MSG: completed download and decompression
DETAIL: Downloaded and decompressed "s3://BUCKET/PREFIX/wal_005/00000003.history.lzo" to "pg_xlog/RECOVERYHISTORY"
STRUCTURED: time=2014-10-21T23:57:10.988149-00 pid=15966
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2014-10-21T23:57:10.989599-00 pid=15966 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo prefix=PREFIX/ seg=00000003.history state=complete
2014-10-21 23:57:11 UTC LOG: restored log file "00000003.history" from archive
2014-10-21 23:57:11 UTC LOG: archive recovery complete
2014-10-21 23:57:11 UTC LOG: database system is ready to accept connections