Error after error when trying to recover

2,461 views
Skip to first unread message

Bob Ardalan

unread,
Apr 2, 2015, 5:52:19 PM4/2/15
to pgba...@googlegroups.com
Hello, first of all thanks for all of your work on Barman, and I apologize in advance for what is probably user error.

0. We're not doing Continuous WAL Archiving because we already have that setup to our hot standby with streaming replication.  We just want to use barman for nightly backups without PITR.

1. I took 2 backups using the command without error:
[barman@dbbackup01 ~]$ barman backup pg01

2. I recovered using the command without error: 
[barman@dbbackup01 ~]$ barman recover pg01 latest /data/pgdata

3. When I try to start the recovered database, I get the following error:
"invalid checkpoint record, could not locate required checkpoint record,  If you are not restoring from a backup, try removing the file "/data/pgdata/backup_label".

4. So I rename the backup_label file, but when I try to start the recovered database, I get the following error:
"invalid primary checkpoint record, invalid secondary checkpoint record, could not locate a valid checkpoint record"

5. So I run pg_resetxlog even though everyone online says it's a mistake, but they also say that there is no other solution.  But when I try to start the recovered database, I get the following error:
"could not access status of transaction 677405110, Could not open file "pg_clog/0286": No such file or directory."

6. At this point, I'm at a loss.  Some people online are recreating the pg_clog file with all 00's or all 01's to assume non-commits or commits, but I'd really rather not even be making either assumption.  I feel like we have a very vanilla setup here, but nobody else is having these issues online...so it has to be me, right?

7. I will reply to my own post with barman diagnose and the actual transactions that I was using.

Thank you for your help.

Bob Ardalan

unread,
Apr 2, 2015, 5:55:37 PM4/2/15
to pgba...@googlegroups.com
[barman@dbbackup01 ~]$ barman diagnose
{
    "global": {
        "config": {
            "barman_home": "/data/barman",
            "barman_user": "barman",
            "configuration_files_directory": "/etc/barman.d",
            "log_file": "/var/log/barman/barman.log",
            "network_compression": "true",
            "reuse_backup": "link"
        },
        "system_info": {
            "barman_ver": "1.4.0",
            "kernel_ver": "Linux dbbackup01 3.10.0-123.6.3.el7.x86_64 #1 SMP Wed Aug 6 21:12:36 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux",
            "python_ver": "Python 2.7.5",
            "release": "RedHat Linux CentOS Linux release 7.0.1406 (Core)",
            "rsync_ver": "rsync  version 3.0.9  protocol version 30",
            "ssh_ver": "OpenSSH_6.4p1, OpenSSL 1.0.1e-fips 11 Feb 2013"
        }
    },
    "servers": {
        "pg01": {
            "backups": {
                "20150330T171039": {
                    "backup_id": "20150330T171039",
                    "backup_label": "None",
                    "begin_offset": "15640",
                    "begin_time": "Mon Mar 30 17:10:41 2015",
                    "begin_wal": "000000010000010E00000071",
                    "begin_xlog": "10E/71003D18",
                    "config_file": "/var/lib/pgsql/9.3/data/postgresql.conf",
                    "deduplicated_size": 26548391872,
                    "end_offset": "10211312",
                    "end_time": "Mon Mar 30 17:30:27 2015",
                    "end_wal": "000000010000010F000000AF",
                    "end_xlog": "10F/AF9BCFF0",
                    "error": null,
                    "hba_file": "/var/lib/pgsql/9.3/data/pg_hba.conf",
                    "ident_file": "/var/lib/pgsql/9.3/data/pg_ident.conf",
                    "mode": "default",
                    "pgdata": "/var/lib/pgsql/9.3/data",
                    "server_name": "pg01",
                    "size": 208517995737,
                    "status": "DONE",
                    "tablespaces": [
                        [
                            "anems_data",
                            18613,
                            "/data/pgdata/anems_data"
                        ],
                        [
                            "anems_idx",
                            18614,
                            "/data/pgdata/anems_idx"
                        ]
                    ],
                    "timeline": 1,
                    "version": 90306
                },
                "20150330T194853": {
                    "backup_id": "20150330T194853",
                    "backup_label": "None",
                    "begin_offset": "40",
                    "begin_time": "Mon Mar 30 19:48:54 2015",
                    "begin_wal": "000000010000011900000009",
                    "begin_xlog": "119/9000028",
                    "config_file": "/var/lib/pgsql/9.3/data/postgresql.conf",
                    "deduplicated_size": 26535447416,
                    "end_offset": "6196232",
                    "end_time": "Mon Mar 30 20:08:39 2015",
                    "end_wal": "000000010000011A0000004D",
                    "end_xlog": "11A/4D5E8C08",
                    "error": null,
                    "hba_file": "/var/lib/pgsql/9.3/data/pg_hba.conf",
                    "ident_file": "/var/lib/pgsql/9.3/data/pg_ident.conf",
                    "mode": "default",
                    "pgdata": "/var/lib/pgsql/9.3/data",
                    "server_name": "pg01",
                    "size": 219627346065,
                    "status": "DONE",
                    "tablespaces": [
                        [
                            "anems_data",
                            18613,
                            "/data/pgdata/anems_data"
                        ],
                        [
                            "anems_idx",
                            18614,
                            "/data/pgdata/anems_idx"
                        ]
                    ],
                    "timeline": 1,
                    "version": 90306
                }
            },
            "config": {
                "active": true,
                "backup_directory": "/data/barman/pg01",
                "backup_options": "exclusive_backup",
                "bandwidth_limit": null,
                "barman_home": "/data/barman",
                "basebackup_retry_sleep": 30,
                "basebackup_retry_times": 0,
                "basebackups_directory": "/data/barman/pg01/base",
                "compression": null,
                "conninfo": "host=postgres01 user=postgres",
                "custom_compression_filter": null,
                "custom_decompression_filter": null,
                "description": "pg01 PostgreSQL Database",
                "immediate_checkpoint": false,
                "incoming_wals_directory": "/data/barman/pg01/incoming",
                "last_backup_maximum_age": null,
                "lock_file": "/data/barman/pg01/pg01.lock",
                "minimum_redundancy": 0,
                "name": "pg01",
                "network_compression": true,
                "post_archive_script": null,
                "post_backup_script": null,
                "pre_archive_script": null,
                "pre_backup_script": null,
                "retention_policy": null,
                "retention_policy_mode": "auto",
                "reuse_backup": "link",
                "ssh_command": "ssh postgres@postgres01",
                "tablespace_bandwidth_limit": null,
                "wal_retention_policy": "main",
                "wals_directory": "/data/barman/pg01/wals"
            },
            "status": {
                "archive_command": "rsync -az %p postgres@postgres02:/data/pgarchive/%f",
                "archive_mode": "on",
                "config_file": "/var/lib/pgsql/9.3/data/postgresql.conf",
                "current_xlog": "000000010000021F00000021",
                "data_directory": "/var/lib/pgsql/9.3/data",
                "hba_file": "/var/lib/pgsql/9.3/data/pg_hba.conf",
                "ident_file": "/var/lib/pgsql/9.3/data/pg_ident.conf",
                "last_archived_wal": "000000010000021F00000020",
                "pgespresso_installed": false,
                "server_txt_version": "9.3.6"
            },
            "system_info": {
                "kernel_ver": "Linux postgres01 3.10.0-123.6.3.el7.x86_64 #1 SMP Wed Aug 6 21:12:36 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux",
                "python_ver": "Python 2.7.5",
                "release": "RedHat Linux CentOS Linux release 7.0.1406 (Core)",
                "rsync_ver": "rsync  version 3.0.9  protocol version 30",
                "ssh_ver": "OpenSSH_6.4p1, OpenSSL 1.0.1e-fips 11 Feb 2013"
            }
        }
    }
}

Bob Ardalan

unread,
Apr 2, 2015, 6:00:14 PM4/2/15
to pgba...@googlegroups.com
[barman@dbbackup01 ~]$ barman recover pg01 latest /data/pgdata
Starting local restore for server pg01 using backup 20150330T194853
Destination directory: /data/pgdata
        18613, anems_data, /data/pgdata/anems_data
        18614, anems_idx, /data/pgdata/anems_idx
Using safe horizon time for smart rsync copy: 2015-03-30 19:48:54.596957+00:00
Copying the base backup.
Copying required wal segments.
Generating archive status files
Disabling dangerous settings in destination directory.
The archive_command was set to 'false' to prevent data losses.

Your PostgreSQL server has been successfully prepared for recovery!

Please review network and archive related settings in the PostgreSQL
configuration file before starting the just recovered instance.

Recovery completed successful.


[barman@dbbackup01 ~]$ exit
[bob.ardalan@dbbackup01 /data]$ chown -R postgres:postgres /data/pgdata
[bob.ardalan@dbbackup01 /data]$ su - postgres
[postgres@dbbackup01 /data/pgdata]$ vi postgresql.conf   #(in order to change the port and the shared_buffers)


[postgres@dbbackup01 /data/pgdata]$ pg_ctl start -wD /data/pgdata
waiting for server to start....< 2015-04-02 21:06:55.782 UTC >LOG:  redirecting log output to logging collector process
< 2015-04-02 21:06:55.782 UTC >HINT:  Future log output will appear in directory "pg_log".
.... stopped waiting
pg_ctl: could not start server
Examine the log output.


[postgres@dbbackup01 /data/pgdata]$ cat pg_log/postgresql-Thu.log
< 2015-04-02 21:06:55.784 UTC >LOG:  database system was interrupted; last known up at 2015-03-30 20:05:28 UTC
< 2015-04-02 21:06:55.785 UTC >LOG:  invalid checkpoint record
< 2015-04-02 21:06:55.785 UTC >FATAL:  could not locate required checkpoint record
< 2015-04-02 21:06:55.785 UTC >HINT:  If you are not restoring from a backup, try removing the file "/data/pgdata/backup_label".
< 2015-04-02 21:06:55.787 UTC >LOG:  startup process (PID 18877) exited with exit code 1
< 2015-04-02 21:06:55.787 UTC >LOG:  aborting startup due to startup process failure


[postgres@dbbackup01 /data/pgdata]$ mv backup_label backup_label.20150402   #(trying to solve the above error. why is this file here?)


[postgres@dbbackup01 /data/pgdata]$ pg_ctl start -wD /data/pgdata
waiting for server to start....< 2015-04-02 21:09:51.845 UTC >LOG:  redirecting log output to logging collector process
< 2015-04-02 21:09:51.845 UTC >HINT:  Future log output will appear in directory "pg_log".
.... stopped waiting
pg_ctl: could not start server
Examine the log output.


[postgres@dbbackup01 /data/pgdata]$ cat pg_log/postgresql-Thu.log | grep -P '^(?!\< 2015-04-02 21:06)'
< 2015-04-02 21:09:51.848 UTC >LOG:  database system was interrupted; last known up at 2015-03-30 20:05:28 UTC
< 2015-04-02 21:09:51.848 UTC >LOG:  invalid primary checkpoint record
< 2015-04-02 21:09:51.848 UTC >LOG:  invalid secondary checkpoint record
< 2015-04-02 21:09:51.848 UTC >PANIC:  could not locate a valid checkpoint record
< 2015-04-02 21:09:51.848 UTC >LOG:  startup process (PID 18919) was terminated by signal 6: Aborted
< 2015-04-02 21:09:51.848 UTC >LOG:  aborting startup due to startup process failure


[postgres@dbbackup01 /data/pgdata]$ pg_resetxlog /data/pgdata
The database server was not shut down cleanly.
Resetting the transaction log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.

[postgres@dbbackup01 /data/pgdata]$ pg_resetxlog -f /data/pgdata   #(everyone on the internet says that pg_resetxlog is a mistake, but they have no other solution)
Transaction log reset


[postgres@dbbackup01 /data/pgdata]$ pg_ctl start -wD /data/pgdata
waiting for server to start....< 2015-04-02 21:25:54.741 UTC >LOG:  redirecting log output to logging collector process
< 2015-04-02 21:25:54.741 UTC >HINT:  Future log output will appear in directory "pg_log".
.... stopped waiting
pg_ctl: could not start server
Examine the log output.


[postgres@dbbackup01 /data/pgdata]$ cat pg_log/postgresql-Thu.log | grep -P '^(?!\< 2015-04-02 (21:06|21:09))'
< 2015-04-02 21:25:54.745 UTC >LOG:  database system was shut down at 2015-04-02 21:25:30 UTC
< 2015-04-02 21:25:54.748 UTC >FATAL:  could not access status of transaction 677405110
< 2015-04-02 21:25:54.748 UTC >DETAIL:  Could not open file "pg_clog/0286": No such file or directory.
< 2015-04-02 21:25:54.749 UTC >LOG:  startup process (PID 19147) exited with exit code 1
< 2015-04-02 21:25:54.749 UTC >LOG:  aborting startup due to startup process failure


#(and now i'm missing a pg_clog file?)

Gabriele Bartolini

unread,
Apr 3, 2015, 4:42:47 AM4/3/15
to pgba...@googlegroups.com
Hi Bob,

  thanks for writing and for your nice words.

  Before continuing with the rest of your items, I will focus on item "0" and a feature in our roadmap.

  A physical backup in Postgres is made up not only of a copy of the data files. It includes also WAL files (see http://docs.pgbarman.org/#introduction). Also, a pre-requisite of Barman is that continuous archiving is set between the master and the backup server (see http://docs.pgbarman.org/#getting_started). Data files, without WAL files cannot make a consistent snapshot. In other words, Barman without continuous archiving does not work.

  Based on your email, I'd be really glad if you could consider funding the development, testing and maintenance of the WAL hub functionality for Barman.

  With such a feature, Barman will continuously collect WAL files through archive_command, while your standby servers will be connected via streaming replication to the master. In case of temporary connection issues, by setting a proper restore command, the standby could fetch the required WAL files from the barman server. This will grant you an almost 'infinite' stream of WAL files as a fallback option for your standby.

  I spoke about this feature in my last talk in New York last week for PGConfUS. The feature is highlighted on slide 72 at http://goo.gl/7HwgKZ .

  This feature would be a very handy addition to Barman and Postgres for business continuity. I'd really love that one or more companies already using Barman - even in this list - would step forward and contact us. ;)

Regards,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia - Managing Director
 PostgreSQL Training, Services and Support
 gabriele....@2ndQuadrant.it | www.2ndQuadrant.it

--
--
You received this message because you are subscribed to the "Barman for PostgreSQL" group.
To post to this group, send email to pgba...@googlegroups.com
To unsubscribe from this group, send email to
pgbarman+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/pgbarman?hl=en?hl=en-GB

---
You received this message because you are subscribed to the Google Groups "Barman, Backup and Recovery Manager for PostgreSQL" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pgbarman+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bob Ardalan

unread,
Apr 3, 2015, 10:52:36 AM4/3/15
to pgba...@googlegroups.com
Hi Gabriele, 
  1. Thank you for your prompt reply. 

  2. Would you please elaborate on which WAL logs are required in order to create a consistent backup.  For example, if I want to create backups for April 1st, 2nd, and 3rd (on midnight of each night), which WAL logs would definitely have to be there in order to successfully recover each backup?  
    a. Do I require all of the WAL logs since the creation of the database? (seems impractical)
    b. Do I require the WAL log from just before the backups started?
    c. Do I require the WAL logs generated while the backups are taking place?
    d. Do I require the WAL log from just after the backups finished?
    e. Do I require all of the WAL logs that were created after the backups finished going forward indefinitely? (seems impractical)

    I ask because the scenarios (a) and (e) don't seem practical for a disaster recovery situation where continuous WAL archiving may have failed because the master failed.  Do you know what I mean?

  3. Do I have to run a warm standby on the backup server?  Or can these WAL logs will just sit there untouched until I try to recover the database.  Or do I need to set up postgresql on the backup server to continuously read the WAL logs as they come in?

  4. I'm thinking of putting 2 rsync commands in the "archive_command".  One to rsync WAL logs to the standby server, and a second to rsync WAL logs to the backup server.  That way the WAL logs just go to both places.  (We really like the idea of our standby server being independent of the backup server).  Do you see a reason why this method won't work?
    archive_command = 'rsync -az %p postgres@postgres02:/data/pgarchive/%f ; rsync -az %p barman@dbbackup01:INCOMING_WALS_DIRECTORY/%f'
Thank you again for sharing your knowledge,
   Bob


MP

unread,
Apr 20, 2015, 9:31:01 AM4/20/15
to pgba...@googlegroups.com
2] all info in postgresql manual pages about backup/recovery PITR
3] both cases, in other words, you need WALs only when you run server which is in "recovery" mode - and replication is recovery mode too.
4] read postgresql manual about archive_command - so instead ";" you use "&&" or write all in one script and that script is in archive_command.

Dne pátek 3. dubna 2015 16:52:36 UTC+2 Bob Ardalan napsal(a):
Reply all
Reply to author
Forward
0 new messages