Restore with pgbackrest and Create Cluster based on Backup fails partially


Mat thias

Apr 12, 2021, 10:15:51 AM
to Postgres Operator
Hi, 
I'm having trouble with restore and with creating a cluster based on backups.

In the following attempt I communicate directly with the API (without the pgo client). For this, I recorded the pgo client's communication during the restore command with Wireshark and took over the necessary parameters.

The following data was used:
URL: https://IP_ADRESSE/restore
Data:
{"FromCluster":"pgo-benchmark-test-exp","PITRTarget":"2021-03-22 10:29:08.000000+01","RestoreOpts":"--type=time --set=20210322-092855F","BackrestStorageType":"","ClientVersion":"4.6.1","Namespace":"exp"}

Hint: in case the BackrestStorageType field is confusing: I emulated the pgo client based on the Wireshark capture. Only fields such as nodelabel, which were also empty, were left out. Tests without the BackrestStorageType field return the same error.
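
For reference, the call as a whole looks roughly like this (the URL and JSON are the ones above; the basic-auth credentials and the -k flag are assumptions about how the apiserver is reached, not values from my capture):

curl -k -X POST "https://IP_ADRESSE/restore" \
  -H "Content-Type: application/json" \
  -u "<pgouser>:<password>" \
  -d '{"FromCluster":"pgo-benchmark-test-exp","PITRTarget":"2021-03-22 10:29:08.000000+01","RestoreOpts":"--type=time --set=20210322-092855F","BackrestStorageType":"","ClientVersion":"4.6.1","Namespace":"exp"}'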

-> Show Backup:
cluster: pgo-benchmark-test-exp
storage type: posix

stanza: db
status: ok
cipher: none

db (current)
    wal archive min/max (12-1)

    full backup: 20210322-092855F
        timestamp start/stop: 2021-03-22 10:28:55 +0100 CET / 2021-03-22 10:29:08 +0100 CET
        wal start/stop: 000000010000000000000004 / 000000010000000000000004
        database size: 54.1MiB, backup size: 54.1MiB
        repository size: 6.2MiB, repository backup size: 6.2MiB
        backup reference list:

    incr backup: 20210322-092855F_20210323-115542I
        timestamp start/stop: 2021-03-23 12:55:42 +0100 CET / 2021-03-23 12:55:45 +0100 CET
        wal start/stop: 0000000100000000000000EA / 0000000100000000000000EA
        database size: 54.5MiB, backup size: 14.7MiB
        repository size: 6.2MiB, repository backup size: 1.7MiB
        backup reference list: 20210322-092855F

The bootstrap pod goes into an error state shortly after start-up and shuts down. The subsequent pod then runs permanently with the familiar error message:

Thu Mar 25 09:01:06 UTC 2021 INFO: Correct the issue, remove '/pgdata/pgo-benchmark-test-exp.initializing', and try again
Thu Mar 25 09:01:06 UTC 2021 INFO: Your data might be in: /pgdata/pgo-benchmark-test-exp_2021-03-25-08-54-00
Thu Mar 25 09:01:16 UTC 2021 WARN: Detected an earlier failed attempt to initialize

Could you help me solve this issue?

Many greetings


Attachment:
Log-Information: 
/tmp:5432 - rejecting connections
/tmp:5432 - rejecting connections
/tmp:5432 - no response
2021-03-25 08:54:00,685 INFO: removing initialize key after failed attempt to bootstrap the cluster
2021-03-25 08:54:00,734 INFO: renaming data directory to /pgdata/pgo-benchmark-test-exp_2021-03-25-08-54-00
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 170, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 138, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 108, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 111, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1452, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1346, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1242, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1235, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

Jonathan S. Katz

Apr 17, 2021, 9:59:09 AM
to Mat thias, Postgres Operator
Hi,

Using this:

Thu Mar 25 09:01:06 UTC 2021 INFO: Correct the issue, remove '/pgdata/pgo-benchmark-test-exp.initializing', and try again
Thu Mar 25 09:01:06 UTC 2021 INFO: Your data might be in: /pgdata/pgo-benchmark-test-exp_2021-03-25-08-54-00
Thu Mar 25 09:01:16 UTC 2021 WARN: Detected an earlier failed attempt to initialize

Have you tried to remove the "/pgdata/pgo-benchmark-test-exp.initializing" file from the data directory?
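
A rough sketch of what that could look like (the pod name is a placeholder for whichever pod mounts the /pgdata volume; the namespace is taken from your request above):

kubectl -n exp exec -it <cluster-pod> -- rm /pgdata/pgo-benchmark-test-exp.initializing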

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Mat thias

Apr 21, 2021, 7:32:47 AM
to Postgres Operator, Jonathan, Postgres Operator, Mat thias
Hi,
thank you very much for your answer.

Yes, getting out of this state is possible, but usually only with the following steps:
1. remove the backup folder on the pod (the one mentioned in the error message)
2. pgo-client: pgo restore xyz
3. remove the initialize file and delete the pod

The new pod will then perform the restore, but unfortunately only a standard restore without the time and snapshot details.
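
For reference, the workaround as commands looks roughly like this (the pod name is a placeholder; the paths are the ones from the error message above):

kubectl -n exp exec -it <cluster-pod> -- rm -rf /pgdata/pgo-benchmark-test-exp_2021-03-25-08-54-00
pgo restore pgo-benchmark-test-exp -n exp
kubectl -n exp exec -it <cluster-pod> -- rm /pgdata/pgo-benchmark-test-exp.initializing
kubectl -n exp delete pod <cluster-pod>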

The problem is that there must be an issue passing the data (snapshot and time) to the API server. The '?' in the path suggests faulty data to me, but I can't find it in the JSON that is sent to the apiserver. (/pgdata/pgo-benchmark-test-exp_2021-03-25-08-54-00)

As I said, it is not about the pgo-client (cli) but about the direct communication with the apiserver (curl).

The JSON used:
{"FromCluster":"pgo-benchmark-test-exp","PITRTarget":"2021-03-22 10:29:08.000000+01","RestoreOpts":"--type=time --set=20210322-092855F","BackrestStorageType":"","ClientVersion":"4.6.1","Namespace":"exp"}

Many greetings

Jonathan S. Katz

Apr 21, 2021, 11:39:56 AM
to Mat thias, Postgres Operator
On Wed, Apr 21, 2021 at 7:32 AM 'Mat thias' via Postgres Operator <postgres...@crunchydata.com> wrote:
Hi,
thank you very much for your answer.

Yes, getting out of this state is possible, but usually only with the following steps:
1. remove the backup folder on the pod (the one mentioned in the error message)
2. pgo-client: pgo restore xyz
3. remove the initialize file and delete the pod

The new pod will then perform the restore, but unfortunately only a standard restore without the time and snapshot details.

The problem is that there must be an issue passing the data (snapshot and time) to the API server. The '?' in the path suggests faulty data to me, but I can't find it in the JSON that is sent to the apiserver. (/pgdata/pgo-benchmark-test-exp_2021-03-25-08-54-00)

As I said, it is not about the pgo-client (cli) but about the direct communication with the apiserver (curl).

The JSON used:
{"FromCluster":"pgo-benchmark-test-exp","PITRTarget":"2021-03-22 10:29:08.000000+01","RestoreOpts":"--type=time --set=20210322-092855F","BackrestStorageType":"","ClientVersion":"4.6.1","Namespace":"exp"}

What encoding are you using? The Operator containers are built against en_us.UTF8.

Thanks,

Jonathan 

Mat thias

May 1, 2021, 6:56:30 AM
to Postgres Operator, jonath...@crunchydata.com, Postgres Operator, Mat thias
Hi, 
Thank you very much for the advice. I have explicitly set the encoding in the communication again, and we are one step further :)

Unfortunately, one error remains. It now occurs both via the API and via the pgo CLI client.
On the first restore attempt, the following message appears in the log:

2021-05-01 08:36:49,061 INFO: Running custom bootstrap script: /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh primary
Sat May 1 08:36:49 UTC 2021 INFO: Valid PGDATA dir found for primary, a delta restore will be peformed
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pgo-restore-exp-rmgv' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
Sat May 1 08:36:49 UTC 2021 ERROR: pgBackRest primary Creation: pgBackRest restore failed when creating primary
2021-05-01 08:36:49,096 INFO: removing initialize key after failed attempt to bootstrap the cluster
2021-05-01 08:36:49,144 INFO: renaming data directory to /pgdata/pgo-restore-exp-rmgv_2021-05-01-08-36-49

The error message is clear, but I have not yet been able to find a solution or workaround.
-> Shutting down the primary pod beforehand (scaling the deployment to 0) does not help (waiting time 5-30 min.). The error always comes back on the first attempt.

To solve this, I have to restart the bootstrapping pod (removing everything inside /pgdata/) and run the restore again.

Is there a solution or workaround so that the restore works directly? 

Many greetings and have a nice weekend

Jonathan S. Katz

May 1, 2021, 5:19:12 PM
to Mat thias, Postgres Operator
On Sat, May 1, 2021 at 6:56 AM 'Mat thias' via Postgres Operator <postgres...@crunchydata.com> wrote:
Hi, 
Thank you very much for the advice. I have explicitly set the encoding in the communication again, and we are one step further :)

Unfortunately, one error remains. It now occurs both via the API and via the pgo CLI client.
On the first restore attempt, the following message appears in the log:

2021-05-01 08:36:49,061 INFO: Running custom bootstrap script: /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh primary
Sat May 1 08:36:49 UTC 2021 INFO: Valid PGDATA dir found for primary, a delta restore will be peformed
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pgo-restore-exp-rmgv' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
Sat May 1 08:36:49 UTC 2021 ERROR: pgBackRest primary Creation: pgBackRest restore failed when creating primary
2021-05-01 08:36:49,096 INFO: removing initialize key after failed attempt to bootstrap the cluster
2021-05-01 08:36:49,144 INFO: renaming data directory to /pgdata/pgo-restore-exp-rmgv_2021-05-01-08-36-49

The error message is clear, but I have not yet been able to find a solution or workaround.
-> Shutting down the primary pod beforehand (scaling the deployment to 0) does not help (waiting time 5-30 min.). The error always comes back on the first attempt.

That's not the correct way to shut down the cluster. There is a "--shutdown" flag on the "pgo update cluster" command:
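
For example (cluster name and namespace taken from earlier in this thread; --startup is the corresponding flag to bring it back up afterwards):

pgo update cluster pgo-benchmark-test-exp --shutdown -n exp
pgo update cluster pgo-benchmark-test-exp --startup -n exp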


Some of the shutdown workflow is covered in the documentation:

 
To solve this, I have to restart the bootstrapping pod (removing everything inside /pgdata/) and run the restore again.

Is there a solution or workaround so that the restore works directly? 

The question I ask is what is going on to cause either Postgres to be running or the presence of a PID file in there. This could also occur if you are trying to do things manually within the Pod -- trying to manually circumvent PGO functionality can cause issues if you're not careful. 

Jonathan

Mat thias

May 3, 2021, 7:36:04 AM
to Postgres Operator, jonath...@crunchydata.com, Postgres Operator, Mat thias

Hi, 
Sorry, I must have written all of that up in a somewhat confusing way.

The current restore problem is based on this error message in the log: 

>> 2021-05-01 08:36:49,061 INFO: Running custom bootstrap script: /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh primary
>> Sat May 1 08:36:49 UTC 2021 INFO: Valid PGDATA dir found for primary, a delta restore will be peformed
>> ERROR: [038]: unable to restore while PostgreSQL is running
>> HINT: presence of 'postmaster.pid' in '/pgdata/pgo-restore-exp-rmgv' indicates PostgreSQL is running.
>> HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
>> Sat May 1 08:36:49 UTC 2021 ERROR: pgBackRest primary Creation: pgBackRest restore failed when creating primary
>> 2021-05-01 08:36:49,096 INFO: removing initialize key after failed attempt to bootstrap the cluster
>> 2021-05-01 08:36:49,144 INFO: renaming data directory to /pgdata/pgo-restore-exp-rmgv_2021-05-01-08-36-49

This problem manifests itself in the error state of the restore pod, which is then replaced by a new one. That one then runs in a continuous loop again, as the .initializing file is present in the /pgdata folder.

I resolve the continuous loop as follows:
-> remove the file and delete the pod
-> the new pod can then perform the restore.

But I find it problematic to always have to intervene in the restore like this. For this reason, I have made the following attempts to get the restore to work directly on the first try.

1. pgo update cluster xyz --shutdown and wait a while (in case that key is what is checked to determine whether PG is still running).
-> The restore pod then, completely unexpectedly, reports that there is no connection to the pgBackRest pod. The restore is therefore not possible.

-> Based on this attempt, it seemed worth trying to shut down only the DB pod, which leads to attempt 2.

2. Scale the pod deployment (DB pod) to zero and wait a while, as in the first attempt.
-> The restore pod still reports that Postgres appears to be running and goes into the error state, which can be fixed as described above.


This behaviour occurs both when using the pgo client and when using the Apiserver directly. 

My question now is how this behaviour can be avoided. The desired behaviour would be that the restore is carried out directly the first time. The problem here is checking whether Postgres is still running. 

Many greetings

Mat thias

May 14, 2021, 5:38:20 PM
to Postgres Operator, Mat thias, jonath...@crunchydata.com, Postgres Operator
Hi,
unfortunately I still have the problem that a restore via the pgo client, as well as directly via the API, crashes with the error message:

ERROR: [038]: unable to restore while PostgreSQL is running

Can anyone help me with this?

Currently in use:
Version: 4.6.2
on OpenShift 4.7


Many greetings and have a nice weekend



Samir Faci

May 16, 2021, 3:24:52 PM
to Mat thias, Postgres Operator, jonath...@crunchydata.com
This isn't ideal; it's always easier to use the Operator's tooling, but if everything else fails you can always fall back to the plain Postgres tools. It might be worth giving it a go.

kubectl -n pgo port-forward svc/dashboard 5432:5432
pg_dump -h localhost -U postgres  -f prod.sql -p 5432 dashboard   ## backup
psql -h localhost -W -U postgres  -f prod.sql dashboard    ## restore

Of course, if you can't take a backup of the data and need to restore, this doesn't help you unless you already have an existing dataset you can use or a pg_dump available.

You can also try this pattern using your existing database snapshot.

pgo create cluster  -n pgo -u dbadmin --password SECRET  --password-superuser SECRET  --pvc-size=20Gi  --pgbackrest-pvc-size=100Gi   dashboard-prod-20-100  --restore-from="dashboard" --restore-opts "--type=time --target='2021-05-04 10:11:30-07'"

In my example above, my current cluster is named dashboard and I'm creating a new one called dashboard-prod-20-100. The DNS name would then use dashboard-prod-20-100 instead of my previous name, dashboard. The database itself, though, is still going to be called dashboard.

This will create a new cluster from your dataset; then you'll simply need to update the references, likely a K8s Secret/ConfigMap, in order for the app to pick up the changes.

 Essentially: "rollout restart deployment <YourApp>"


Neither of these is ideal, but they are worth considering if nothing else works.




Jonathan S. Katz

May 16, 2021, 3:34:31 PM
to Samir Faci, Mat thias, Postgres Operator
Hi,

A few things to unpack:

unfortunately I still have the problem that a restore via the pgo client, as well as directly via the API, crashes with the error message:
ERROR: [038]: unable to restore while PostgreSQL is running

Without logs indicating a crash, I don't know what "API crash" this is referring to, or where that error message is from.
The error, however, is accurate: you can't restore while Postgres is running. If you are doing a restore-in-place, "pgo restore" does take care of that for you.
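
For a restore-in-place with a point-in-time target, that looks roughly like this (cluster name, target, and backup set taken from earlier in the thread; flag names as in the 4.x pgo CLI):

pgo restore pgo-benchmark-test-exp -n exp --pitr-target="2021-03-22 10:29:08.000000+01" --backup-opts="--type=time --set=20210322-092855F"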

However, Samir's guidance later on, doing a restore to a new cluster, is likely the better approach for a lot of reasons, especially as you will minimize the risk of data loss. That approach does presume that your current cluster has a valid backup set to restore from, which I'm unsure if we've determined.

On Sun, May 16, 2021 at 3:22 PM Samir Faci <sa...@es.net> wrote:
This isn't ideal; it's always easier to use the Operator's tooling, but if everything else fails you can always fall back to the plain Postgres tools. It might be worth giving it a go.

kubectl -n pgo port-forward svc/dashboard 5432:5432
pg_dump -h localhost -U postgres  -f prod.sql -p 5432 dashboard   ## backup
psql -h localhost -W -U postgres  -f prod.sql dashboard    ## restore

Good thought, but I would recommend "pg_dumpall" to ensure all the data is captured. If you only want to capture a single database, and it's of a nontrivial size, I would recommend using "pg_dump -Fd" so you can leverage a parallel dump, followed by "pg_restore" for a parallel restore.

Note that this is a logical backup; based upon the size of the database, both the "backup" and "restore" can be incredibly slow. The physical backups that PGO manages use pgBackRest, which are both faster to backup/restore and have parallelism capabilities as well.
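
Roughly, that pattern looks like this (the dump directory, job count, and database name are illustrative):

pg_dump -h localhost -p 5432 -U postgres -Fd -j 4 -f /tmp/dashboard.dump dashboard
pg_restore -h localhost -p 5432 -U postgres -j 4 -d dashboard /tmp/dashboard.dump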

Of course, if you can't take a backup of the data and need to restore, this doesn't help you unless you already have an existing dataset you can use or a pg_dump available.

You can also try this pattern using your existing database snapshot.

pgo create cluster  -n pgo -u dbadmin --password SECRET  --password-superuser SECRET  --pvc-size=20Gi  --pgbackrest-pvc-size=100Gi   dashboard-prod-20-100  --restore-from="dashboard" --restore-opts "--type=time --target='2021-05-04 10:11:30-07'"

In my example above, my current cluster is named dashboard and I'm creating a new one called dashboard-prod-20-100. The DNS name would then use dashboard-prod-20-100 instead of my previous name, dashboard. The database itself, though, is still going to be called dashboard.

This will create a new cluster from your dataset; then you'll simply need to update the references, likely a K8s Secret/ConfigMap, in order for the app to pick up the changes.

What Samir suggests above is what I generally recommend in any disaster recovery (DR) scenario: it's generally better to create a new cluster from the existing data so you can inspect it without altering the original data set.

Thanks,

Jonathan 

Mat thias

May 21, 2021, 1:03:10 PM
to Postgres Operator, jonath...@crunchydata.com, Mat thias, Postgres Operator, Samir Faci
Hi, 
Thank you very much for your message and your advice.

Basic information: 
As mentioned, the errors refer to the use of pgo restore via the pgo client and to direct communication with the operator API.
Thus, the error occurs with pgo restore (pgo client), although pgo should take care of stopping the DB.


Furthermore, as already mentioned in the first messages about the problem, problems also occur when using pgo create cluster xyz --restore-from=abc.

I have tried the following command via pgo-client: 
- pgo create cluster restore-470-douc-three --restore-from=pgo-poc-documentstore-exp
- pgo create cluster restore-470-douc-drei --restore-from=pgo-poc-documentstore-exp --restore-opts "--type=time --target='2021-05-21 13:10:30-07'"

Each bootstrap pod crashes and prints the following errors in the pod's log: 


2021-05-21 11:10:45,527 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-05-21 11:10:45,531 INFO: Lock owner: None; I am restore-470-douc-drei-bootstrap-v4b8l
2021-05-21 11:10:45,660 INFO: trying to bootstrap a new cluster
2021-05-21 11:10:45,660 INFO: Running custom bootstrap script: /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh primary
Fri May 21 11:10:45 UTC 2021 INFO: Empty PGDATA dir found for primary, a non-delta restore will be peformed
WARN: repo1: [FileMissingError] unable to load info file '/backrestrepo/pgo-poc-documentstore-exp-backrest-shared-repo/backup/db/backup.info' or '/backrestrepo/pgo-poc-documentstore-exp-backrest-shared-repo/backup/db/backup.info.copy':
      FileMissingError: raised from remote-0 protocol on 'pgo-poc-documentstore-exp-backrest-shared-repo.exp.svc.cluster.local.': unable to open missing file '/backrestrepo/pgo-poc-documentstore-exp-backrest-shared-repo/backup/db/backup.info' for read
      FileMissingError: raised from remote-0 protocol on 'pgo-poc-documentstore-exp-backrest-shared-repo.exp.svc.cluster.local.': unable to open missing file '/backrestrepo/pgo-poc-documentstore-exp-backrest-shared-repo/backup/db/backup.info.copy' for read
      HINT: backup.info cannot be opened and is required to perform a backup.
      HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
Fri May 21 11:10:45 UTC 2021 ERROR: pgBackRest primary Creation: pgBackRest restore failed when creating primary
2021-05-21 11:10:45,959 INFO: removing initialize key after failed attempt to bootstrap the cluster
2021-05-21 11:10:46,009 INFO: renaming data directory to /pgdata/restore-470-douc-drei_2021-05-21-11-10-46

These problems also occur when the S3 information is supplied.

The pgBackRest pods mentioned do not use local storage, only S3 for backups.

Can you help me with this? 

 Many greetings

Mat thias

May 21, 2021, 1:03:12 PM
to Postgres Operator, jonath...@crunchydata.com, Mat thias, Postgres Operator, Samir Faci

Hi all,
please excuse me; I tried for a long time, and of course the realisation came shortly after sending the mail :/

I forgot --repo-type=s3 in the restore-opts, so the restore was attempted from the pgBackRest PVC, which is of course (correctly) empty.

My mistake, I will try everything else based on this and get back to you, but I think the problem should be solved. 
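
For completeness, the corrected attempt looks roughly like this (same names as above, with the repo type added to the restore options):

pgo create cluster restore-470-douc-drei --restore-from=pgo-poc-documentstore-exp --restore-opts "--type=time --target='2021-05-21 13:10:30-07' --repo-type=s3"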

Many thanks and best regards


Jonathan S. Katz

May 21, 2021, 1:04:13 PM
to Mat thias, Postgres Operator, Samir Faci
Glad it works.

For the record, the error you encountered typically occurs when there is an invalid or missing backup.
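
A quick way to check for that is the same command used earlier in the thread, pointed at the cluster you are restoring from:

pgo show backup pgo-poc-documentstore-exp -n exp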

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 

