Hi - can anyone help me please?
I've set up a PostgreSQL database to both archive WAL files and push a nightly base backup to S3. All of the files show up in S3. I'm testing the backups, and the restore fails with 'invalid checkpoint record'.
My setup is as follows:
1 server running Ubuntu 14.04 in EC2, PostgreSQL 9.3, WAL-E 0.7.3
/etc/postgresql/9.3/main/postgresql.conf contains:
archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push /var/lib/postgresql/9.3/main/%p'
archive_mode = 'on'
archive_timeout = '60'
data_directory = '/var/lib/postgresql/9.3/main'
wal_level = 'archive'
The postgres user's crontab contains:
45 4 * * * envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /var/lib/postgresql/9.3/main || logger -p local2.emerg 'ERROR Full Postgres backup has failed. Immediate action required.'
wal-e env contains the S3 keys and WALE_S3_PREFIX = s3://<bucket>/live/postgresql/svr04, where <bucket> is the S3 bucket name.
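In case the layout matters, /etc/wal-e.d/env is a standard envdir directory, one file per variable (key values redacted):

```
/etc/wal-e.d/env/
    AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY
    WALE_S3_PREFIX      # contains s3://<bucket>/live/postgresql/svr04
```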
I've been running this setup for a few days now, so I have multiple full backups and WAL archives, all present in S3. I'm testing that I can restore from this backup.
To test restore I do the following:
- run a Chef postgres cookbook using Test Kitchen on a Vagrant/VirtualBox VM on my laptop. This recreates the live environment.
- log into the vagrant vm
- stop postgres
- update the wal-e S3 access key and secret to a pair that has read permission (the backup key doesn't have read permission; both keys can list)
- remove the contents of /var/lib/postgresql/9.3/main completely.
- su - postgres
- envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/9.3/main LATEST
wal_e.worker.s3.s3_worker INFO MSG: beginning partition download
DETAIL: The partition being downloaded is part_00000000.tar.lzo.
HINT: The absolute S3 key is live/postgresql/svr04/basebackups_005/base_000000010000001B0000006C_00000040/tar_partitions/part_00000000.tar.lzo.
STRUCTURED: time=2015-02-11T12:00:50.299401-00 pid=27943
- create a recovery.conf file as per the wal-e info page.
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
I've tried with and without a recovery_target_time equal to the time of the backup; it makes no difference.
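For completeness, the whole recovery.conf I've been testing with amounts to just:

```
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
# optionally, matching the base backup time from the logs below:
# recovery_target_time = '2015-02-11 04:45:01'
```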
- start postgres
Postgres won't start, here are the logs:
Feb 11 10:53:10 vagrant postgres[26779]: [2-1] 2015-02-11 10:53:10 GMT LOG: database system was interrupted; last known up at 2015-02-11 04:45:01 GMT
Feb 11 10:53:10 vagrant postgres[26779]: [3-1] 2015-02-11 10:53:10 GMT LOG: starting point-in-time recovery to 2015-02-11 04:45:01+00
Feb 11 10:53:10 vagrant postgres[26779]: [4-1] 2015-02-11 10:53:10 GMT LOG: invalid checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-1] 2015-02-11 10:53:10 GMT FATAL: could not locate required checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-2] 2015-02-11 10:53:10 GMT HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
Feb 11 10:53:10 vagrant postgres[26778]: [2-1] 2015-02-11 10:53:10 GMT LOG: startup process (PID 26779) exited with exit code 1
Feb 11 10:53:10 vagrant postgres[26778]: [3-1] 2015-02-11 10:53:10 GMT LOG: aborting startup due to startup process failure
If I do exactly the same procedure with two Vagrant VMs (i.e. back up one to S3 and restore to the other), everything goes swimmingly, and I'm using the same S3 keys, buckets, policies, etc.
Looking around, most comments on "invalid checkpoint record" seem to involve WAL files not being archived during the backup, but as far as I can see that's fine here. I assume that if postgres called restore_command to fetch a WAL file and the fetch failed, I'd see the wal-e invocation in the logs.
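One manual check I can think of: recovery reads backup_label to find the checkpoint, so I could pull out the segment it names and fetch it by hand to see whether the fetch itself works. A rough sketch (the LSNs below are made-up placeholders, not my real backup_label; the segment name is the one from the backup-fetch output above):

```shell
# Sketch: find the WAL segment recovery will ask restore_command for first.
# backup_label contents here are illustrative placeholders.
cat > /tmp/backup_label <<'EOF'
START WAL LOCATION: 1B/6C000028 (file 000000010000001B0000006C)
CHECKPOINT LOCATION: 1B/6C000060
EOF

# Pull the segment file name out of the START WAL LOCATION line:
segment=$(sed -n 's/^START WAL LOCATION: .*(file \(.*\))$/\1/p' /tmp/backup_label)
echo "$segment"   # 000000010000001B0000006C

# Then (on the restore VM; this hits S3) try fetching it by hand:
# envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch "$segment" /tmp/"$segment"
```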
I'd be grateful if anyone could help, or even just give me a more detailed description of what 'invalid checkpoint record' means.
Daniel