Cannot wal-push

769 views
Skip to first unread message

Paul Tiseo

unread,
Aug 12, 2014, 4:48:38 PM8/12/14
to wa...@googlegroups.com
On Ubuntu 14, I have a strange feeling that something is a little off, after following the instructions on the Github site and the Opbeat blog.

postgres does not seem to want to run the archive command properly.

If I manually run envdir /etc/wal-e.d/env wal-e wal-push /var/lib/postgresql/9.3/main/pg_xlog/00000001000000000000001A, it seems to work.

If I manually run envdir /etc/wal-e.d/env wal-e wal-push pg_xlog/00000001000000000000001A while in/var/lib/postgresql/9.3/main, it seems to work.

But, my archive_command, 'envdir /etc/wal-e.d/env wal-e wal-push %p', seems to error out.

Also, envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.3/main does backup to S3 (I do end up with a file in S3), but gets stuck with "NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to be archived" errors. I also end up with a .backup file in pg_xlog.

I'm assuming that whatever is jamming wal-push in the archive_command is hitting an ending step of backup-push? Any suggestions?

PS: Not to confuse the issue, but it may be the same root problem. The command  envdir /etc/wal-e.d/env wal-e backup-list shows nothing but headers, even though I have files in WAL and base directories.


Cody Caughlan

unread,
Aug 12, 2014, 4:51:32 PM8/12/14
to paulx...@gmail.com, wa...@googlegroups.com
Try using absolute paths in all of the commands in your archive_command, e.g.:

archive_command = '/usr/local/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push %'

I've seen similar problems when it comes down to $PATH issues.

--
You received this message because you are subscribed to the Google Groups "wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wal-e+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Paul Tiseo

unread,
Aug 12, 2014, 5:04:49 PM8/12/14
to wa...@googlegroups.com, paulx...@gmail.com
Cody:

As per your suggestion, I tried to manually run envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /var/lib/postgresql/9.3/main and it still failed, waiting for WAL segments to be archived and dangling .backup files, which I assume indicated incomplete execution.

Daniel Farina

unread,
Aug 12, 2014, 6:00:28 PM8/12/14
to paulx...@gmail.com, wal-e
On Tue, Aug 12, 2014 at 1:48 PM, Paul Tiseo <paulx...@gmail.com> wrote:
> On Ubuntu 14, I have a strange feeling that something is a little off, after
> following the instructions on the Github site and the Opbeat blog.
>
> postgres does not seem to want to run the archive command properly.
>
> If I manually run envdir /etc/wal-e.d/env wal-e wal-push
> /var/lib/postgresql/9.3/main/pg_xlog/00000001000000000000001A, it seems to
> work.
>
> If I manually run envdir /etc/wal-e.d/env wal-e wal-push
> pg_xlog/00000001000000000000001A while in/var/lib/postgresql/9.3/main, it
> seems to work.
>
> But, my archive_command, 'envdir /etc/wal-e.d/env wal-e wal-push %p', seems
> to error out.

Do you have any logs from that?

Try syslog or the pg_log directory.

> I'm assuming that whatever is jamming wal-push in the archive_command is hitting an ending step of backup-push?

Yup.

Paul Tiseo

unread,
Aug 12, 2014, 7:16:23 PM8/12/14
to wa...@googlegroups.com, paulx...@gmail.com
I kinda looked there already and didn't see any notable error messages. When I do a manual backup-push, I have to hit a ctrl-C to interrupt it, so syslog seems to only contain non-error messages:

Aug 12 22:44:00 ip-10-123-148-194 wal_e.operator.backup INFO     MSG: start upload postgres version metadata#012        DETAIL: Uploading to s3://production_postgresql/basebackups_005/base_000000010000000000000020_00000040/extended_version.txt.#012        STRUCTURED: time=2014-08-12T22:44:00.429147-00 pid=4964
Aug 12 22:44:00 ip-10-123-148-194 wal_e.operator.backup INFO     MSG: postgres version metadata upload complete#012        STRUCTURED: time=2014-08-12T22:44:00.553103-00 pid=4964
Aug 12 22:44:00 ip-10-123-148-194 wal_e.worker.upload INFO     MSG: beginning volume compression#012        DETAIL: Building volume 0.#012        STRUCTURED: time=2014-08-12T22:44:00.844788-00 pid=4964
Aug 12 22:44:02 ip-10-123-148-194 wal_e.worker.upload INFO     MSG: begin uploading a base backup volume#012        DETAIL: Uploading to "s3://production_postgresql/basebackups_005/base_000000010000000000000020_00000040/tar_partitions/part_00000000.tar.lzo".#012        STRUCTURED: time=2014-08-12T22:44:02.007589-00 pid=4964
Aug 12 22:44:02 ip-10-123-148-194 wal_e.worker.upload INFO     MSG: finish uploading a base backup volume#012        DETAIL: Uploading to "s3://production_postgresql/basebackups_005/base_000000010000000000000020_00000040/tar_partitions/part_00000000.tar.lzo" complete at 15728.8KiB/s. #012        STRUCTURED: time=2014-08-12T22:44:02.459143-00 pid=4964

The log in /var/log/postgresql is likewise uninformative. The error from archive_command is repeated instances of this:

2014-08-12 23:11:00.520 UTC,,,30174,,53ea0337.75de,2679,,2014-08-12 12:06:15 UTC,,0,LOG,00000,"archive command failed with exit code 111","The failed archive command was: envdir /etc/wal-e.d/env wal-e wal-push pg_xlog/000000010000000000000009.00000028.backup",,,,,,,,""
2014-08-12 23:11:00.520 UTC,,,30174,,53ea0337.75de,2680,,2014-08-12 12:06:15 UTC,,0,WARNING,01000,"archiving transaction log file ""000000010000000000000009.00000028.backup"" failed too many times, will try again later",,,,,,,,,""

The above file being complained about is found in an archive_status folder in pg_xlog.

Daniel Farina

unread,
Aug 12, 2014, 7:29:32 PM8/12/14
to Paul Tiseo, wal-e
On Tue, Aug 12, 2014 at 4:16 PM, Paul Tiseo <paulx...@gmail.com> wrote:
> I kinda looked there already and didn't see any notable error messages. When
> I do a manual backup-push, I have to hit a ctrl-C to interrupt it, so syslog
> seems to only contain non-error messages:
>
> Aug 12 22:44:00 ip-10-123-148-194 wal_e.operator.backup INFO MSG: start
> upload postgres version metadata#012 DETAIL: Uploading to
> s3://production_postgresql/basebackups_005/base_000000010000000000000020_00000040/extended_version.txt.#012
> STRUCTURED: time=2014-08-12T22:44:00.429147-00 pid=4964
> Aug 12 22:44:00 ip-10-123-148-194 wal_e.operator.backup INFO MSG:
> postgres version metadata upload complete#012 STRUCTURED:
> time=2014-08-12T22:44:00.553103-00 pid=4964
> Aug 12 22:44:00 ip-10-123-148-194 wal_e.worker.upload INFO MSG:
> beginning volume compression#012 DETAIL: Building volume 0.#012
> STRUCTURED: time=2014-08-12T22:44:00.844788-00 pid=4964
> Aug 12 22:44:02 ip-10-123-148-194 wal_e.worker.upload INFO MSG: begin
> uploading a base backup volume#012 DETAIL: Uploading to
> "s3://production_postgresql/basebackups_005/base_000000010000000000000020_00000040/tar_partitions/part_00000000.tar.lzo".#012
> STRUCTURED: time=2014-08-12T22:44:02.007589-00 pid=4964
> Aug 12 22:44:02 ip-10-123-148-194 wal_e.worker.upload INFO MSG: finish
> uploading a base backup volume#012 DETAIL: Uploading to
> "s3://production_postgresql/basebackups_005/base_000000010000000000000020_00000040/tar_partitions/part_00000000.tar.lzo"
> complete at 15728.8KiB/s. #012 STRUCTURED:
> time=2014-08-12T22:44:02.459143-00 pid=4964

These are for base backups, so not very interesting given your problem.

> The log in /var/log/postgresql is likewise uninformative. The error from
> archive_command is repeated instances of this:
>
> 2014-08-12 23:11:00.520 UTC,,,30174,,53ea0337.75de,2679,,2014-08-12 12:06:15
> UTC,,0,LOG,00000,"archive command failed with exit code 111","The failed
> archive command was: envdir /etc/wal-e.d/env wal-e wal-push
> pg_xlog/000000010000000000000009.00000028.backup",,,,,,,,""
> 2014-08-12 23:11:00.520 UTC,,,30174,,53ea0337.75de,2680,,2014-08-12 12:06:15
> UTC,,0,WARNING,01000,"archiving transaction log file
> ""000000010000000000000009.00000028.backup"" failed too many times, will try
> again later",,,,,,,,,""

Exit code 111: this is probably emitted by the shell attempting to run
WAL-E rather than WAL-E itself.. Like Cody wrote, consider using
absolute paths to wal-e if $PATH isn't quite right as seen from the
respect of the postgres user. Here it shows you are relying on $PATH
being configured for Postgres: envdir /etc/wal-e.d/env **wal-e**
wal-push. (double-star emphasis mine). Consider not doing that.

Also possible is that the wal-e executable bit is not set for the relevant user.

To test the theory that this is a $PATH resolution problem, you may
want to adjust your archive_command like this to get more output:

"type wal-e; [...the rest...]"

or possibly:

"which wal-e; [...the rest...]"

Paul Tiseo

unread,
Aug 12, 2014, 8:55:50 PM8/12/14
to wa...@googlegroups.com, paulx...@gmail.com
Under the postgres user manually, and when added in archive_command, type returns /usr/local/bin/wal-e, which is correct. The log (not the csv) also says "envdir: fatal: unable to run wal-e: file does not exist".

BTW, in the original post, the wal-push commands were run under the postgres user. And, they work that way. It's on the tail end of a manual backup-push or under the archive_command context that I get a fail. Very confused. Is there any way to put wal-e into a more verbose mode?

Attached is my last postgres .log file after the restart for adding the type item.
postgresql-2014-08-13_004013.log

Paul Tiseo

unread,
Aug 12, 2014, 9:01:14 PM8/12/14
to wa...@googlegroups.com, paulx...@gmail.com

Also, I noted it seems to be hung up on "IOError: [Errno 2] No such file or directory: 'pg_xlog/000000010000000000000009.00000028.backup". Please note that here's my current pg_xlog content:

drwx------  3 postgres postgres     4096 Aug 12 22:44 .
drwx------ 16 postgres postgres     4096 Aug 13 00:40 ..
-rw-------  1 postgres postgres 16777216 Aug 12 20:59 00000001000000000000001C
-rw-------  1 postgres postgres 16777216 Aug 12 20:59 00000001000000000000001D
-rw-------  1 postgres postgres      332 Aug 12 20:59 00000001000000000000001D.00000028.backup
-rw-------  1 postgres postgres 16777216 Aug 12 21:02 00000001000000000000001E
-rw-------  1 postgres postgres      332 Aug 12 21:02 00000001000000000000001E.00000028.backup
-rw-------  1 postgres postgres 16777216 Aug 12 22:44 00000001000000000000001F
-rw-------  1 postgres postgres 16777216 Aug 12 22:44 000000010000000000000020
-rw-------  1 postgres postgres      332 Aug 12 22:44 000000010000000000000020.00000028.backup
-rw-------  1 postgres postgres 16777216 Aug 13 00:24 000000010000000000000021
-rw-------  1 postgres postgres 16777216 Aug 13 00:40 000000010000000000000022
-rw-------  1 postgres postgres 16777216 Aug 13 00:40 000000010000000000000023
drwx------  2 postgres postgres     4096 Aug 13 00:40 archive_status


And, here's archive_status directory:

-rw------- 1 postgres postgres    0 Aug 12 12:06 000000010000000000000009.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 12:41 00000001000000000000000B.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 13:23 00000001000000000000000D.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 13:29 00000001000000000000000F.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 13:38 000000010000000000000011.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 13:57 000000010000000000000013.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 16:10 000000010000000000000015.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 16:29 000000010000000000000017.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 20:02 000000010000000000000019.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 20:34 00000001000000000000001B.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 20:59 00000001000000000000001C.ready
-rw------- 1 postgres postgres    0 Aug 12 20:59 00000001000000000000001D.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 20:59 00000001000000000000001D.ready
-rw------- 1 postgres postgres    0 Aug 12 21:02 00000001000000000000001E.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 21:02 00000001000000000000001E.ready
-rw------- 1 postgres postgres    0 Aug 12 22:44 00000001000000000000001F.ready
-rw------- 1 postgres postgres    0 Aug 12 22:44 000000010000000000000020.00000028.backup.ready
-rw------- 1 postgres postgres    0 Aug 12 22:44 000000010000000000000020.ready
-rw------- 1 postgres postgres    0 Aug 13 00:24 000000010000000000000021.ready
-rw------- 1 postgres postgres    0 Aug 13 00:40 000000010000000000000022.ready

Daniel Farina

unread,
Aug 13, 2014, 11:45:45 AM8/13/14
to Paul Tiseo, wal-e
On Tue, Aug 12, 2014 at 5:55 PM, Paul Tiseo <paulx...@gmail.com> wrote:
> Under the postgres user manually, and when added in archive_command, type
> returns /usr/local/bin/wal-e, which is correct. The log (not the csv) also
> says "envdir: fatal: unable to run wal-e: file does not exist".
>
> BTW, in the original post, the wal-push commands were run under the postgres
> user. And, they work that way. It's on the tail end of a manual backup-push
> or under the archive_command context that I get a fail. Very confused. Is
> there any way to put wal-e into a more verbose mode?

envdir is telling you what's wrong on the tin:

$ envdir /tmp/empty does-not-exist
envdir: fatal: unable to run does-not-exist: file does not exist

> Attached is my last postgres .log file after the restart for adding the type
> item.

That said, your log file suggests it is not saying that error
nowadays. Something has changed.

Okay, I think you are close to working: the problem is somehow your
archive_status directory has gotten out of synchronization with your
pg_xlog directory, so the archiver is spewing confusion about
uploading a non-existent file, because archive_status is what Postgres
uses to remember what to upload. However, regular segments appearing
now and then seem to upload fine.

I don't know how that would have happened, but deleting the
000000010000000000000009.00000028.backup.ready out of archive_status
file seems advisable to me. It looks like you have some postdated
backups already (e.g. 000000010000000000000020.00000028.backup).

There seems to be more evidence of this desynchronization in your file
listings, so you may want to delete all *.backup.ready files and then
take one fresh backup for good measure.

Paul Tiseo

unread,
Aug 13, 2014, 10:22:57 PM8/13/14
to wa...@googlegroups.com, paulx...@gmail.com
That did it.

I compared and some WAL had transferred to S3, but at one point, the rest did not. I know early on, I had put in a bad S3 prefix but corrected it. Presumably, that put the system in a bad state.

And, so, I cleared archive_status of all files, and the rest started transferring at the next cycle. I ran a backup-push, and it did not get stuck. backup-list shows one file now, because I finally ended up with a JSON file.

Looks like we're good now. Thanks all for the help.

Daniel Farina

unread,
Aug 13, 2014, 10:42:21 PM8/13/14
to Paul Tiseo, wal-e
On Wed, Aug 13, 2014 at 7:22 PM, Paul Tiseo <paulx...@gmail.com> wrote:
> That did it.
>
> I compared and some WAL had transferred to S3, but at one point, the rest
> did not. I know early on, I had put in a bad S3 prefix but corrected it.
> Presumably, that put the system in a bad state.
>
> And, so, I cleared archive_status of all files, and the rest started
> transferring at the next cycle. I ran a backup-push, and it did not get
> stuck. backup-list shows one file now, because I finally ended up with a
> JSON file.
>
> Looks like we're good now. Thanks all for the help.

Glad to help out! This the first report I've seen of archive_status
desynchronization. If it shows up for other reporters (or you) I'll
look a bit more for a bug.

Also sucky that there's an ugly stack trace that occurs in this case,
but maybe I'll wait for more complainants (or a contributed patch)
before expending the complexity given the rarity so far.

Stuart Bishop

unread,
Aug 13, 2014, 10:55:43 PM8/13/14
to paulx...@gmail.com, wa...@googlegroups.com
On 13 August 2014 07:55, Paul Tiseo <paulx...@gmail.com> wrote:
> Under the postgres user manually, and when added in archive_command, type
> returns /usr/local/bin/wal-e, which is correct. The log (not the csv) also
> says "envdir: fatal: unable to run wal-e: file does not exist".

/usr/local/bin is likely not in the $PATH being used by the postgres
process. This is likely different to the path you are manually testing
with, as that session will be a login shell.

As others have stated, putting the full path to the wal-e binary in
your archive_command will likely fix WAL archiving
(archive_command='envdir /etc/wal-e.d/env /usr/local/bin/wal-e
wal-push %p', which I don't think you tried yet).

First get WAL archiving working, so you are no longer getting those
111 errors spit out in your postgresql log file. Only then will your
backups have a hope.



--
Stuart Bishop <stu...@stuartbishop.net>
http://www.stuartbishop.net/

Daniel Farina

unread,
Aug 13, 2014, 11:18:10 PM8/13/14
to stu...@stuartbishop.net, wal-e
This response came out of order because of the spam filter. Hopefully
it has learned you now. Sorry I was slow at plucking it out of the filter.
Reply all
Reply to author
Forward
0 new messages