WAL-E test restores fail with "invalid checkpoint record"

281 views
Skip to first unread message

Daniel

unread,
Feb 11, 2015, 7:12:15 AM2/11/15
to wa...@googlegroups.com

Hi - can anyone help me please?

I've set up a postgresql database to both archive WAL files and run a nightly base backup to S3.  All files are available.  I'm testing the backups, and I get a failure with 'invalid checkpoint record'.

My setup is as follows:

1 server running ubuntu 14.04 in EC2, postgres 9.3, wal-e 0.7.3

/etc/postgresql/9.3/main/postgresql.conf contains:

archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push /var/lib/postgresql/9.3/main/%p'
archive_mode = 'on'
archive_timeout = '60'
data_directory = '/var/lib/postgresql/9.3/main'
wal_level = 'archive'

postgres' crontab contains:

45 4 * * * envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /var/lib/postgresql/9.3/main || logger -p local2.emerg 'ERROR Full Postgres backup has failed. Immediate action required.'

wal-e env contains the S3 keys and WALE_S3_PREFIX = s3://<bucket>/live/postgresql/svr04  where stuff in <bucket> is the s3 bucket.

I've been running this setup for a few days now, so I have multiple full and WAL archive backups all on present.  I'm testing that I can restore from this backup.

To test restore I do the following:

- run a chef postgres cookbook using test kitchen on a vagrant / virtual box running on my laptop.  This recreates the live environment.
- log into the vagrant vm
- stop postgres
- update the wal-e S3 accesskey and secret to one that has permission to read (the backup key doesn't have this permission, both keys can list)
- remove the contents of /var/lib/postgresql/9.3/main completely.
- su - postgres
- envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/9.3/main LATEST

wal_e.worker.s3.s3_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000000.tar.lzo.
        HINT: The absolute S3 key is live/postgresql/svr04/basebackups_005/base_000000010000001B0000006C_00000040/tar_partitions/part_00000000.tar.lzo.
        STRUCTURED: time=2015-02-11T12:00:50.299401-00 pid=27943

- create a recovery.conf file as per the wal-e info page.
  restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
  
  I've tried with and without a recovery_target_time equal to the time of the backup and it doesn't change things.
  
- start postgres

Postgres won't start, here are the logs:

Feb 11 10:53:10 vagrant postgres[26779]: [2-1] 2015-02-11 10:53:10 GMT   LOG:  database system was interrupted; last known up at 2015-02-11 04:45:01 GMT
Feb 11 10:53:10 vagrant postgres[26779]: [3-1] 2015-02-11 10:53:10 GMT   LOG:  starting point-in-time recovery to 2015-02-11 04:45:01+00
Feb 11 10:53:10 vagrant postgres[26779]: [4-1] 2015-02-11 10:53:10 GMT   LOG:  invalid checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-1] 2015-02-11 10:53:10 GMT   FATAL:  could not locate required checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-2] 2015-02-11 10:53:10 GMT   HINT:  If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
Feb 11 10:53:10 vagrant postgres[26778]: [2-1] 2015-02-11 10:53:10 GMT   LOG:  startup process (PID 26779) exited with exit code 1
Feb 11 10:53:10 vagrant postgres[26778]: [3-1] 2015-02-11 10:53:10 GMT   LOG:  aborting startup due to startup process failure

If I do exactly the same procedure with two vagrant VMs (ie backup one to s3 and restore the other) everything goes swimmingly well, I'm using the same S3 keys, buckets, policies etc.

Looking around, most comments on "invalid checkpoint record" seem to be to do with WAL files not getting archived during the backup, however as far as I can see that's ok.  I assume that if postgres called the recovery_command to fetch a wal backup and that failed, I'd see the call to wal-e in the logs.

I'd be grateful if anyone could help, or even just give me a more detailed description of what 'invalid checkpoint record' means.

Daniel

Jeff Frost

unread,
Feb 11, 2015, 12:39:27 PM2/11/15
to dstr...@gmail.com, wa...@googlegroups.com

> On Feb 11, 2015, at 4:12 AM, Daniel <dstr...@gmail.com> wrote:
>
>
>
> - create a recovery.conf file as per the wal-e info page.
> restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
>
> I've tried with and without a recovery_target_time equal to the time of the backup and it doesn't change things.
>
> - start postgres
>
> Postgres won't start, here are the logs:
>
> Feb 11 10:53:10 vagrant postgres[26779]: [2-1] 2015-02-11 10:53:10 GMT LOG: database system was interrupted; last known up at 2015-02-11 04:45:01 GMT
> Feb 11 10:53:10 vagrant postgres[26779]: [3-1] 2015-02-11 10:53:10 GMT LOG: starting point-in-time recovery to 2015-02-11 04:45:01+00
> Feb 11 10:53:10 vagrant postgres[26779]: [4-1] 2015-02-11 10:53:10 GMT LOG: invalid checkpoint record
> Feb 11 10:53:10 vagrant postgres[26779]: [5-1] 2015-02-11 10:53:10 GMT FATAL: could not locate required checkpoint record
> Feb 11 10:53:10 vagrant postgres[26779]: [5-2] 2015-02-11 10:53:10 GMT HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
> Feb 11 10:53:10 vagrant postgres[26778]: [2-1] 2015-02-11 10:53:10 GMT LOG: startup process (PID 26779) exited with exit code 1
> Feb 11 10:53:10 vagrant postgres[26778]: [3-1] 2015-02-11 10:53:10 GMT LOG: aborting startup due to startup process failure
>
> If I do exactly the same procedure with two vagrant VMs (ie backup one to s3 and restore the other) everything goes swimmingly well, I'm using the same S3 keys, buckets, policies etc.
>
> Looking around, most comments on "invalid checkpoint record" seem to be to do with WAL files not getting archived during the backup, however as far as I can see that's ok. I assume that if postgres called the recovery_command to fetch a wal backup and that failed, I'd see the call to wal-e in the logs.
>
> I'd be grateful if anyone could help, or even just give me a more detailed description of what 'invalid checkpoint record' means.

Based on the fact that your log indicates no WAL files were restored, I’d guess your wal-fetch is failing for some reason.

Usually wal-e will log some stuff to syslog. Can you look in /var/log/syslog and see if there are any failure reports and hopefully a helpful error?


Daniel Strawson

unread,
Feb 11, 2015, 1:08:47 PM2/11/15
to wa...@googlegroups.com, dstr...@gmail.com
Hi thanks for your response.  I've been waiting for my message to be mod'd so I can respond.  In the mean time I've worked it out.

Its annoyingly simple - the wal-e command was installed in /usr/local/bin and wasn't in the path, I fixed and suddenly everythings now a lot happier!  I remember now that I had the same issue with the archive command going the other way but the error message or docs were clearer and it was fixed without noticing

There wasn't any other information about not being able to run the recovery command in syslog that I could see.

Thanks once again, and sorry it was such a trivial thing.!

Daniel

Daniel Farina

unread,
Feb 11, 2015, 1:17:41 PM2/11/15
to dstr...@gmail.com, wal-e
On Wed, Feb 11, 2015 at 10:08 AM, Daniel Strawson <dstr...@gmail.com> wrote:
> Thanks once again, and sorry it was such a trivial thing.!

No worries, please keep writing in...the archives will provide value
to those who are searching for that error message.

miguel...@medulla.com.au

unread,
Oct 10, 2015, 3:13:53 PM10/10/15
to wal-e, dstr...@gmail.com
Hi Daniel,

I'm expriencing same issue here:  restores fail with "invalid checkpoint record".
What path are you talking about? Can you please tell me how have you fix it?  
Thanks,
Miguel
Reply all
Reply to author
Forward
0 new messages