Hi,
We are evaluating this operator, so I'm working through the PGO examples. To understand how the operator handles HA, I added a configuration with 2 replicas and followed the instructions to verify self-healing, as shown in Tutorial -> High Availability.
By running:
I get this:
JIC, the PRIMARY POD is hippo-instance1-tdxx
My assumption (maybe I'm completely wrong) is that with replicas=2 we have an HA setup with 1 primary and 1 follower.
Thus:
hippo-instance1-tdxx - Primary
hippo-instance1-ksbq - Slave
Before continuing with the 2 tests mentioned in the tutorial, I wanted to understand the status of the secondary pod. To do so, I opened a terminal in each pod. In the primary pod I can get in and launch psql:
When I do the same in the secondary pod, I get this:
So it seems we may have an issue in the secondary pod. Looking at the pod logs, I found something strange in the secondary pod:
Primary POD hippo-instance1-tdxx:
Secondary POD hippo-instance1-ksbq:
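So the error seems to be related to pg_basebackup (apparently a tool for taking base backups of a running database cluster), which cannot resolve the primary's host name: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known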
In my case I'm using OKE on Oracle Cloud Infrastructure (OCI), which is CNCF-certified. I also tried this with MicroK8s and saw similar behavior (slave pod with the same error in its log).
Thus, I'd like to ask:
JIC, I initially posted this same question in the PGO Examples repo, and I got this response:
hippo-instance1-tdxx-0.hippo-pods is suspicious -- I'm unsure where .hippo-pods is coming from given it appears you've deployed to the postgres-operator namespace. I would potentially check your networking layer. I'll run some local tests against this. It appears that there is something going on with the DNS resolution in your environment, I would investigate that.
That said, this repo is specifically for the examples around PGO. Please see the Support page for where you can ask support questions or report bugs.
Separately, the terminology is "secondary", "replica" or "follower".
Thanks in advance for your help.
Best
Javier
Secondary POD hippo-instance1-ksbq:
...
2021-10-27 15:38:20,836 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:20,836 WARNING: Trying again in 5 seconds
2021-10-27 15:38:30,780 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:30,780 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:38:40,813 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:40,813 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:38:45,857 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:45,857 ERROR: failed to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:38:45,857 INFO: Removing data directory: /pgdata/pg13
2021-10-27 15:38:50,781 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:51,024 INFO: trying to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:39:00,779 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:00,852 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:39:10,810 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:10,811 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:39:11,054 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:39:11,054 WARNING: Trying again in 5 seconds
...
Thus, I'd like to ask:
- Are my assumptions correct? (With replicas=2 I have 1 Primary/Master and 1 Secondary/Follower, and both should be available.)
- In ideal conditions, should the secondary pod hippo-instance1-ksbq-0 report 3/3 in its Ready status?
- In ideal conditions, should both the primary and secondary pods respond to a psql query?
- Based on the logs, I assume there's a problem with the networking. Are there any additional steps I might be missing, or any advice for troubleshooting this issue? FYI, I just followed the steps as described in the tutorial.
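
For the troubleshooting question, one way to narrow this down is to repeat the failing name lookup in isolation. The following is only a minimal sketch, not something from the tutorial. It assumes a Python 3 interpreter is available wherever it is run (ideally inside the secondary pod), that the namespace is postgres-operator as mentioned in the response quoted above, and that the cluster domain is the common default cluster.local. The helper names candidate_names and resolves are made up for illustration. Inside a pod, a short name such as hippo-instance1-tdxx-0.hippo-pods is normally expanded through the search domains in /etc/resolv.conf, so comparing it with the fully qualified form may show whether the search path or the DNS record itself is the problem.

```python
import socket


def candidate_names(short_name, namespace, cluster_domain="cluster.local"):
    """Return the short name and its fully qualified Kubernetes DNS form.

    namespace and cluster_domain are assumptions about the environment;
    adjust them to match the actual cluster.
    """
    fqdn = f"{short_name}.{namespace}.svc.{cluster_domain}"
    return [short_name, fqdn]


def resolves(host, port=5432):
    """Return True if the system resolver can translate host to an address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Same class of failure as "Name or service not known" in the log.
        return False


if __name__ == "__main__":
    # Host name taken from the pg_basebackup error in the secondary pod's log.
    names = candidate_names("hippo-instance1-tdxx-0.hippo-pods", "postgres-operator")
    for name in names:
        status = "resolves" if resolves(name) else "does NOT resolve"
        print(f"{name}: {status}")
```

If the fully qualified name resolves but the short one does not, the search domains in the pod's /etc/resolv.conf would be the first thing to look at. If neither resolves, the cluster DNS or the hippo-pods service itself is the more likely place to investigate.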