I started using WAL-E, and things went fine for about 4 hours until Postgres suddenly appeared to shut down. Connections show in the logs as "terminating connection due to administrator command". Nothing obvious precedes that in the Postgres logs, just:
Aug 3 04:19:07 prd-pg1 postgres[12577]: [3-1] 2016-08-03 00:19:07 EDT [12577-4] LOG: received fast shutdown request
Aug 3 04:19:07 prd-pg1 postgres[12577]: [4-1] 2016-08-03 00:19:07 EDT [12577-5] LOG: aborting any active transactions
Aug 3 04:19:07 prd-pg1 postgres[12582]: [3-1] 2016-08-03 00:19:07 EDT [12582-2] LOG: autovacuum launcher shutting down
No one issued any command to PG that would have initiated a shutdown. This server had previously been running for 8 months without any unscheduled downtime, and now there is an event 4 hours after I started using wal-e. I don't know whether that's a coincidence. I also saw a message in my logs involving repmgr, which we use for our clustering:
Aug 3 04:20:07 prd-pg1 postgres[25349]: [3-1] 2016-08-03 00:20:07 EDT [25349-1] repmgr@[unknown] WARNING: terminating connection because of crash of another server process
Aug 3 04:20:07 prd-pg1 postgres[25349]: [3-2] 2016-08-03 00:20:07 EDT [25349-2] repmgr@[unknown] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
I'm wondering whether the process that exited abnormally was the archive_command. I'm using the archive_command pretty much verbatim from the README:
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
It seems to be working just fine apart from that one event so far, but I would like to avoid random issues if possible. Would it be unsafe for archiving if I wrapped the archive_command in a script that captures any errors that occur?
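For concreteness, here is a rough sketch of the kind of wrapper I have in mind. The script path, the WAL_PUSH_LOG variable and the log location are made up for illustration; the only real piece is the wal-e command from above. It passes wal-e's exit status straight through, since Postgres treats a nonzero exit from archive_command as a failed archive and retries the segment later.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/bin/wal-push-wrapper.sh
# (the path is an assumption). It would be referenced from postgresql.conf as:
#   archive_command = '/usr/local/bin/wal-push-wrapper.sh %p'

# Assumed log location; the postgres user must be able to write to it.
WAL_PUSH_LOG="${WAL_PUSH_LOG:-/var/log/postgresql/wal-push.log}"

archive_wal() {
    wal_path="$1"

    # Run the same command as in the README, capturing stdout and stderr.
    output=$(envdir /etc/wal-e.d/env wal-e wal-push "$wal_path" 2>&1)
    rc=$?

    # Only log on failure, with a timestamp, the exit status and the output.
    if [ "$rc" -ne 0 ]; then
        printf '%s wal-push failed (exit %s) for %s: %s\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" "$rc" "$wal_path" "$output" \
            >> "$WAL_PUSH_LOG"
    fi

    # Hand wal-e's exit status back unchanged so Postgres still sees failures.
    return "$rc"
}

# Only act when a WAL path was passed in (Postgres substitutes %p).
if [ "$#" -gt 0 ]; then
    archive_wal "$1"
    exit $?
fi
```

The point of returning the status unchanged is that the wrapper should never turn a failed push into an apparent success; it only adds a record of what went wrong.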
Thanks,
Steve