Error on 2-Replica Cluster - Bootstrap from leader - Could not translate hostname


Javier Roca

Oct 27, 2021, 12:54:18 PM
to Postgres Operator

Hi,

We are evaluating this operator. While working through the PGO examples I tried to understand how its HA works, so I added a configuration with 2 replicas:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  postgresVersion: 13
  instances:
    - name: instance1
      replicas: 2
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 1Gi

.. and followed the instructions to verify self-healing as shown in Tutorial->High Availability.

Listing the cluster's pods, I get this:

NAME READY STATUS RESTARTS AGE
hippo-instance1-ksbq-0 2/3 Running 0 97m
hippo-instance1-tdxx-0 3/3 Running 0 96m

JIC, the PRIMARY POD is hippo-instance1-tdxx

My assumption (maybe I'm completely wrong) is that with replicas=2 we have an HA setup with 1 Primary / 1 Follower.

Thus:
hippo-instance1-tdxx - Primary
hippo-instance1-ksbq - Slave

Before continuing with the 2 tests mentioned in the tutorial I tried to understand the secondary pod status.. In order to understand this, I tried opening a terminal in each pod, and in primary pod I can enter and launch PSQL:

bash-4.4$ psql
psql (13.4)
Type "help" for help.

postgres=# SELECT NOT pg_catalog.pg_is_in_recovery() is_primary;
 is_primary
------------
 t
(1 row)
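As a complementary check from the primary's side, one can ask which replicas are actually streaming. This is only a sketch: the pod name comes from my listing above, and `pg_stat_replication` is a standard PostgreSQL view, nothing PGO-specific.

```shell
# Ask the primary which replicas are connected for streaming replication.
# An empty result means the replica never managed to connect.
kubectl -n postgres-operator exec -it hippo-instance1-tdxx-0 -c database -- \
  psql -c 'SELECT client_addr, state, sync_state FROM pg_stat_replication;'
```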

And once I do the same in the secondary pod, I get this:

bash-4.4$ psql
psql: error: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/postgres/.s.PGSQL.5432"?

So it seems we may have some issue in the secondary pod. Analyzing the pods' logs, I found something strange in the secondary pod:

Primary POD hippo-instance1-tdxx:

...
2021-10-27 15:36:10,787 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
2021-10-27 15:36:21,017 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
2021-10-27 15:36:30,785 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
2021-10-27 15:36:40,787 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
...

Secondary POD hippo-instance1-ksbq:

...
2021-10-27 15:38:20,836 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:20,836 WARNING: Trying again in 5 seconds
2021-10-27 15:38:30,780 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:30,780 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:38:40,813 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:40,813 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:38:45,857 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:45,857 ERROR: failed to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:38:45,857 INFO: Removing data directory: /pgdata/pg13
2021-10-27 15:38:50,781 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:51,024 INFO: trying to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:39:00,779 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:00,852 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:39:10,810 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:10,811 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:39:11,054 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:39:11,054 WARNING: Trying again in 5 seconds
...

So the error seems related to pg_basebackup (apparently a tool for taking base backups of a running database cluster): it cannot resolve the primary's host name:

pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known

In my case I'm using OKE on Oracle's OCI cloud, which is CNCF-certified. I also tried this with MicroK8s and saw similar behavior (the replica pod logs the same error).

Thus, I'd like to ask:

  1. Are my assumptions correct (with replicas=2 I have 1 Primary/Master and 1 Secondary/Follower, and both should be available)?
  2. In ideal conditions, should the secondary POD hippo-instance1-ksbq-0 report 3/3 in Ready status?
  3. In ideal conditions, should both Primary and Secondary PODs respond to a psql query?
  4. Based on the logs, I assume there's something wrong with the networking. Are there any additional steps I might be missing, or any advice to troubleshoot this issue? FYI, I just followed the steps as described in the tutorial.
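For question 4, this is the kind of DNS check I have in mind (a sketch only; it assumes `nslookup` or `getent` is available inside the database container image):

```shell
# Open a shell in the replica's database container (names from the listing above):
kubectl -n postgres-operator exec -it hippo-instance1-ksbq-0 -c database -- bash

# Inside the pod: try the short per-pod name served by the headless Service...
nslookup hippo-instance1-tdxx-0.hippo-pods
# ...and the fully qualified form, to rule out search-domain problems:
nslookup hippo-instance1-tdxx-0.hippo-pods.postgres-operator.svc.cluster.local
# If nslookup is not installed, getent exercises the same resolver path:
getent hosts hippo-instance1-tdxx-0.hippo-pods
```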

JIC, I initially posted this same question in the PGO Examples repo, and I got this response:

hippo-instance1-tdxx-0.hippo-pods is suspicious -- I'm unsure where .hippo-pods is coming from given it appears you've deployed to the postgres-operator namespace. I would potentially check your networking layer. I'll run some local tests against this. It appears that there is something going on with the DNS resolution in your environment, I would investigate that.

That said, this repo is specifically for the examples around PGO. Please see the Support page for where you can ask support questions or report bugs.

Separately, the terminology is "secondary", "replica" or "follower".

Thanks in advance for your help.

Best

Javier

Jonathan S. Katz

Oct 27, 2021, 1:30:04 PM
to Javier Roca, Postgres Operator
Hi,

Comments inline:

That assumption is correct.
 

Thus:
hippo-instance1-tdxx - Primary
hippo-instance1-ksbq - Replica (rather than "Slave")

You can check this explicitly with:

kubectl -n postgres-operator get pods \
 

Secondary POD hippo-instance1-ksbq:

...
2021-10-27 15:38:20,836 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:20,836 WARNING: Trying again in 5 seconds
2021-10-27 15:38:30,780 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:30,780 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:38:40,813 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:40,813 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:38:45,857 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:45,857 ERROR: failed to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:38:45,857 INFO: Removing data directory: /pgdata/pg13
2021-10-27 15:38:50,781 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:51,024 INFO: trying to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:39:00,779 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:00,852 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:39:10,810 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:10,811 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:39:11,054 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:39:11,054 WARNING: Trying again in 5 seconds
...

So the error seems related to pg_basebackup (apparently a tool for taking base backups of a running database cluster): it cannot resolve the primary's host name:

pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known

I would recommend checking a few things:

    kubectl -n postgres-operator get svc
 
Is there a "hippo-pods" Service available?

Additionally, are there any errors in the "postgres-operator" Pod logs?
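Both checks together, as a sketch (resource names assumed from the defaults in this thread). One detail worth knowing: a pod that is not yet Ready only gets a DNS record behind a headless Service if that Service publishes not-ready addresses, so the endpoints listing tells you whether a record for the bootstrapping replica can exist at all.

```shell
# Is the headless "hippo-pods" Service there, and which pod IPs back it?
kubectl -n postgres-operator get svc hippo-pods
kubectl -n postgres-operator get endpoints hippo-pods

# Does it publish addresses for pods that are not yet Ready? (The replica
# is still bootstrapping, so its DNS record depends on this.)
kubectl -n postgres-operator get svc hippo-pods \
  -o jsonpath='{.spec.publishNotReadyAddresses}{"\n"}'

# Operator logs, filtered for errors ("deploy/pgo" is the default name):
kubectl -n postgres-operator logs deploy/pgo | grep -iE 'error|fail'
```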

Thus, I'd like to ask:

  1. Are my assumptions correct (with replicas=2 I have 1 Primary/Master and 1 Secondary/Follower, and both should be available)?
Yes
  2. In ideal conditions, should the secondary POD hippo-instance1-ksbq-0 report 3/3 in Ready status?
Yes
  3. In ideal conditions, should both Primary and Secondary PODs respond to a psql query?
Yes
  4. Based on the logs, I assume there's something wrong with the networking. Are there any additional steps I might be missing, or any advice to troubleshoot this issue? FYI, I just followed the steps as described in the tutorial.
I would also check the Postgres Operator logs to see if there were any reconciliation issues.

Thanks,

Jonathan 

Javier Roca

Oct 27, 2021, 5:22:45 PM
to Postgres Operator, jonath...@crunchydata.com, Postgres Operator, Javier Roca
Dear Jonathan,

Hi. Thanks for your previous answers! I redid a clean installation on the cluster, and I still have the same issue:

NAME                     READY   STATUS    RESTARTS   AGE   ROLE
hippo-instance1-7t78-0   3/3     Running   0          32m   master
hippo-instance1-pwvz-0   2/3     Running   0          33m   
 
The follower is still 2/3, and the logs show the same issue:

 kubectl logs pod/hippo-instance1-pwvz-0 --namespace=postgres-operator --container=database --since=0

2021-10-27 20:40:50,153 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-10-27 20:40:50,160 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:40:50,242 INFO: trying to bootstrap from leader 'hippo-instance1-7t78-0'
2021-10-27 20:40:51,759 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:40:51,783 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
2021-10-27 20:41:01,791 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:41:01,791 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known
2021-10-27 20:41:10,272 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 20:41:10,273 WARNING: Trying again in 5 seconds
2021-10-27 20:41:11,770 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:41:11,770 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
2021-10-27 20:41:21,758 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:41:21,758 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
2021-10-27 20:41:31,794 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:41:31,794 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known
2021-10-27 20:41:35,300 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 20:41:35,300 ERROR: failed to bootstrap from leader 'hippo-instance1-7t78-0'
2021-10-27 20:41:35,300 INFO: Removing data directory: /pgdata/pg13
2021-10-27 20:41:41,771 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:41:41,937 INFO: trying to bootstrap from leader 'hippo-instance1-7t78-0'
2021-10-27 20:41:51,772 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:41:51,794 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
2021-10-27 20:42:01,794 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0
2021-10-27 20:42:01,794 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known
2021-10-27 20:42:01,965 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 20:42:01,966 WARNING: Trying again in 5 seconds
....

As you recommended, I did

kubectl -n postgres-operator get svc

NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
hippo-ha          ClusterIP   10.96.62.185    <none>        5432/TCP   39m
hippo-ha-config   ClusterIP   None            <none>        <none>     38m
hippo-pods        ClusterIP   None            <none>        <none>     39m
hippo-primary     ClusterIP   None            <none>        5432/TCP   39m
hippo-replicas    ClusterIP   10.96.192.179   <none>        5432/TCP   39m

About the operator pod logs, I saw an error (in red):

kubectl logs pod/pgo-b95d7bbd-dcn9z --namespace=postgres-operator --container=operator --since=0

time="2021-10-27T20:28:42Z" level=debug msg="debug flag set to true" file="cmd/postgres-operator/main.go:62" func=main.main version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="metrics server is starting to listen" addr=":8080" file="sigs.k8s.io/controlle...@v0.8.3/pkg/log/deleg.go:130" func="log.(*DelegatingLogger).Info" version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="starting controller runtime manager and will wait for signal to exit" file="cmd/postgres-operator/main.go:83" func=main.main version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="starting metrics server" file="sigs.k8s.io/controlle...@v0.8.3/pkg/manager/internal.go:385" func="manager.(*controllerManager).serveMetrics.func2" path=/metrics version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:44Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:46Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0
time="2021-10-27T20:28:46Z" level=info msg="Starting Controller" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:173" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:28:46Z" level=info msg="Starting workers" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:211" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0 worker count=2
time="2021-10-27T20:28:46Z" level=debug msg=deleting file="internal/controller/postgrescluster/controller.go:139" func="postgrescluster.(*Reconciler).Reconcile" name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster result="{Requeue:false RequeueAfter:0s}" version=5.0.3-0
E1027 20:28:50.367436       1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)
E1027 20:28:51.742217       1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: failed to list *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)
E1027 20:28:54.192785       1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: failed to list *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)
E1027 20:28:57.496771       1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: failed to list *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)
time="2021-10-27T20:36:20Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-pwvz name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:37:20Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-7t78 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:37:20Z" level=debug msg="reconciled instance set" file="internal/controller/postgrescluster/instance.go:988" func="postgrescluster.(*Reconciler).scaleUpInstances" instance-set=instance1 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:37:20Z" level=debug msg=Normal file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/recorder/recorder.go:98" func="recorder.(*Provider).getBroadcaster.func1.1" message="created pgBackRest repository host StatefulSet/hippo-repo-host" object="{PostgresCluster postgres-operator hippo dc803837-62b3-4fb7-bb95-e27c4dab6763 postgres-operator.crunchydata.com/v1beta1 1412169 }" reason=RepoHostCreated version=5.0.3-0
time="2021-10-27T20:39:21Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:299" func="postgrescluster.(*Reconciler).Reconcile" name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:39:21Z" level=debug msg="patched cluster status" file="internal/controller/postgrescluster/controller.go:171" func="postgrescluster.(*Reconciler).Reconcile.func2" name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:41:21Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-pwvz name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:42:21Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-7t78 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0
time="2021-10-27T20:42:21Z" level=debug msg="reconciled instance set" file="internal/controller/postgrescluster/instance.go:988" func="postgrescluster.(*Reconciler).scaleUpInstances" instance-set=instance1 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

Any help will be welcome! As I mentioned, I'm using Oracle's OKE out of the box without any tweaks (no network settings changed).

JIC, I'm using a fork from 23 days ago: https://github.com/chapaco/postgres-operator-examples?organization=chapaco&organization=chapaco

Kind Regards!

Marc Palacín Marfil

Nov 22, 2023, 1:24:44 PM
to Postgres Operator, Javier Roca, jonath...@crunchydata.com, Postgres Operator

Hi all! Sorry to resurrect this old post, but as of today I'm facing the same issues Javier described in this thread, with exactly the same errors in the Postgres pods.

My Kubernetes stack is as follows:
  • Kubernetes version 1.28.2
  • Cluster infrastructure:
    • 1 control-plane node (named `nearbyone-singlenode-cluster.novalocal`)
    • 2 worker nodes (named `nearbyone-worker-1.novalocal` and `nearbyone-worker-2.novalocal`)
  • Pod network CNI: calico v3.26.1
Here's a summary of the deployed cluster without the High Availability example applied, pretty vanilla apart from the NFS provisioner required for the Persistent Volume Claims:
```
$ kubectl get all -A
NAMESPACE           NAME                                                                 READY   STATUS    RESTARTS   AGE   IP               NODE                                     NOMINATED NODE   READINESS GATES
default             pod/kubernetes-ingress-d5f66d8fc-8f4x7                               1/1     Running   0          24h   10.244.130.1     nearbyone-worker-1.novalocal             <none>           <none>
default             pod/kubernetes-ingress-d5f66d8fc-zv8zn                               1/1     Running   0          24h   10.244.114.65    nearbyone-worker-2.novalocal             <none>           <none>
default             pod/nfs-subdir-external-provisioner-79dfbfd847-9kkxj                 1/1     Running   0          19h   10.244.114.68    nearbyone-worker-2.novalocal             <none>           <none>
default             pod/nfs-subdir-external-provisioner-79dfbfd847-pv96c                 1/1     Running   0          19h   10.244.130.4     nearbyone-worker-1.novalocal             <none>           <none>
kube-system         pod/calico-kube-controllers-7ddc4f45bc-xlfj7                         1/1     Running   0          37h   10.244.172.195   nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/calico-node-8xjsq                                                1/1     Running   0          37h   10.0.30.242      nearbyone-worker-2.novalocal             <none>           <none>
kube-system         pod/calico-node-brkl5                                                1/1     Running   0          37h   10.0.30.234      nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/calico-node-gwjtf                                                1/1     Running   0          37h   10.0.30.229      nearbyone-worker-1.novalocal             <none>           <none>
kube-system         pod/coredns-5dd5756b68-f2pkl                                         1/1     Running   0          37h   10.244.172.193   nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/coredns-5dd5756b68-w7xq2                                         1/1     Running   0          37h   10.244.172.194   nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/etcd-nearbyone-singlenode-cluster.novalocal                      1/1     Running   37         37h   10.0.30.234      nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/kube-apiserver-nearbyone-singlenode-cluster.novalocal            1/1     Running   10         37h   10.0.30.234      nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/kube-controller-manager-nearbyone-singlenode-cluster.novalocal   1/1     Running   9          37h   10.0.30.234      nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/kube-proxy-2r9nx                                                 1/1     Running   0          37h   10.0.30.234      nearbyone-singlenode-cluster.novalocal   <none>           <none>
kube-system         pod/kube-proxy-549jh                                                 1/1     Running   0          37h   10.0.30.229      nearbyone-worker-1.novalocal             <none>           <none>
kube-system         pod/kube-proxy-vfrp7                                                 1/1     Running   0          37h   10.0.30.242      nearbyone-worker-2.novalocal             <none>           <none>
kube-system         pod/kube-scheduler-nearbyone-singlenode-cluster.novalocal            1/1     Running   10         37h   10.0.30.234      nearbyone-singlenode-cluster.novalocal   <none>           <none>
postgres-operator   pod/dnsutils                                                         1/1     Running   0          24m   10.244.130.9     nearbyone-worker-1.novalocal             <none>           <none>
postgres-operator   pod/pgo-6cc745c948-52s8z                                             1/1     Running   0          19h   10.244.130.2     nearbyone-worker-1.novalocal             <none>           <none>

NAMESPACE     NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                    AGE   SELECTOR
default       service/kubernetes           ClusterIP   10.96.0.1        <none>        443/TCP                                                    37h   <none>
default       service/kubernetes-ingress   NodePort    10.108.136.182   <none>        80:31307/TCP,443:31782/TCP,1024:31396/TCP,6060:32104/TCP   24h   app.kubernetes.io/instance=kubernetes-ingress,app.kubernetes.io/name=kubernetes-ingress
kube-system   service/kube-dns             ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP                                     37h   k8s-app=kube-dns

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE   CONTAINERS    IMAGES                               SELECTOR
kube-system   daemonset.apps/calico-node   3         3         3       3            3           kubernetes.io/os=linux   37h   calico-node   docker.io/calico/node:v3.26.1        k8s-app=calico-node
kube-system   daemonset.apps/kube-proxy    3         3         3       3            3           kubernetes.io/os=linux   37h   kube-proxy    registry.k8s.io/kube-proxy:v1.28.3   k8s-app=kube-proxy

NAMESPACE           NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                        IMAGES                                                                           SELECTOR
default             deployment.apps/kubernetes-ingress                2/2     2            2           24h   kubernetes-ingress-controller     haproxytech/kubernetes-ingress:1.10.9                                            app.kubernetes.io/instance=kubernetes-ingress,app.kubernetes.io/name=kubernetes-ingress
default             deployment.apps/nfs-subdir-external-provisioner   2/2     2            2           19h   nfs-subdir-external-provisioner   registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2               app=nfs-subdir-external-provisioner,release=nfs-subdir-external-provisioner
kube-system         deployment.apps/calico-kube-controllers           1/1     1            1           37h   calico-kube-controllers           docker.io/calico/kube-controllers:v3.26.1                                        k8s-app=calico-kube-controllers
kube-system         deployment.apps/coredns                           2/2     2            2           37h   coredns                           registry.k8s.io/coredns/coredns:v1.10.1                                          k8s-app=kube-dns
postgres-operator   deployment.apps/pgo                               1/1     1            1           19h   operator                          registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.4.3-0   postgres-operator.crunchydata.com/control-plane=postgres-operator

NAMESPACE           NAME                                                         DESIRED   CURRENT   READY   AGE   CONTAINERS                        IMAGES                                                                           SELECTOR
default             replicaset.apps/kubernetes-ingress-d5f66d8fc                 2         2         2       24h   kubernetes-ingress-controller     haproxytech/kubernetes-ingress:1.10.9                                            app.kubernetes.io/instance=kubernetes-ingress,app.kubernetes.io/name=kubernetes-ingress,pod-template-hash=d5f66d8fc
default             replicaset.apps/nfs-subdir-external-provisioner-79dfbfd847   2         2         2       19h   nfs-subdir-external-provisioner   registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2               app=nfs-subdir-external-provisioner,pod-template-hash=79dfbfd847,release=nfs-subdir-external-provisioner
kube-system         replicaset.apps/calico-kube-controllers-7ddc4f45bc           1         1         1       37h   calico-kube-controllers           docker.io/calico/kube-controllers:v3.26.1                                        k8s-app=calico-kube-controllers,pod-template-hash=7ddc4f45bc
kube-system         replicaset.apps/coredns-5dd5756b68                           2         2         2       37h   coredns                           registry.k8s.io/coredns/coredns:v1.10.1                                          k8s-app=kube-dns,pod-template-hash=5dd5756b68
postgres-operator   replicaset.apps/pgo-6cc745c948                               1         1         1       19h   operator                          registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.4.3-0   pod-template-hash=6cc745c948,postgres-operator.crunchydata.com/control-plane=postgres-operator
```
You can ignore the `dnsutils` Pod; it is just a utility container I use for DNS debugging.

I'm using the `kustomize/high-available` example in the `postgres-operator-examples` repository without modifications:
```
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-ha
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.4-1
  postgresVersion: 15
  instances:
    - name: pgha1
      replicas: 2
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/instance-set: pgha1
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.47-1
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:ubi8-1.19-5
      replicas: 2
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/role: pgbouncer
```
Once deployed, after a few minutes of initialization, this is the status of the pods and services:
```
$ kubectl -n postgres-operator get pods -o wide
NAME                                  READY   STATUS    RESTARTS   AGE     IP              NODE                           NOMINATED NODE   READINESS GATES
dnsutils                              1/1     Running   0          54m     10.244.130.9    nearbyone-worker-1.novalocal   <none>           <none>
hippo-ha-pgbouncer-7b67d94d88-h8tjt   2/2     Running   0          5m6s    10.244.130.10   nearbyone-worker-1.novalocal   <none>           <none>
hippo-ha-pgbouncer-7b67d94d88-mbbdt   2/2     Running   0          5m6s    10.244.114.76   nearbyone-worker-2.novalocal   <none>           <none>
hippo-ha-pgha1-5bgg-0                 4/4     Running   0          7m47s   10.244.114.74   nearbyone-worker-2.novalocal   <none>           <none>
hippo-ha-pgha1-chg8-0                 3/4     Running   0          7m7s    10.244.130.11   nearbyone-worker-1.novalocal   <none>           <none>
hippo-ha-repo-host-0                  2/2     Running   0          7m6s    10.244.114.75   nearbyone-worker-2.novalocal   <none>           <none>
pgo-6cc745c948-52s8z                  1/1     Running   0          20h     10.244.130.2    nearbyone-worker-1.novalocal   <none>           <none>

$ kubectl -n postgres-operator get svc -o wide
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE     SELECTOR
hippo-ha-ha          ClusterIP   10.108.230.220   <none>        5432/TCP   9m26s   <none>
hippo-ha-ha-config   ClusterIP   None             <none>        <none>     8m46s   <none>
hippo-ha-pgbouncer   ClusterIP   10.108.234.25    <none>        5432/TCP   6m5s    postgres-operator.crunchydata.com/cluster=hippo-ha,postgres-operator.crunchydata.com/role=pgbouncer
hippo-ha-pods        ClusterIP   None             <none>        <none>     9m26s   postgres-operator.crunchydata.com/cluster=hippo-ha
hippo-ha-primary     ClusterIP   None             <none>        5432/TCP   9m26s   <none>
hippo-ha-replicas    ClusterIP   10.110.165.8     <none>        5432/TCP   9m26s   postgres-operator.crunchydata.com/cluster=hippo-ha,postgres-operator.crunchydata.com/role=replica
```

The non-ready Pod (`hippo-ha-pgha1-chg8-0`) shows this error:
```
2023-11-16 11:05:34,951 INFO: Lock owner: hippo-ha-pgha1-5bgg-0; I am hippo-ha-pgha1-chg8-0
2023-11-16 11:05:34,952 INFO: bootstrap from leader 'hippo-ha-pgha1-5bgg-0' in progress
pg_basebackup: error: could not translate host name "hippo-ha-pgha1-5bgg-0.hippo-ha-pods" to address: Name or service not known
2023-11-16 11:05:40,099 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2023-11-16 11:05:40,099 ERROR: failed to bootstrap from leader 'hippo-ha-pgha1-5bgg-0'
2023-11-16 11:05:40,099 INFO: Removing data directory: /pgdata/pg15
```
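To rule out a general DNS problem, I wanted to check from the `dnsutils` Pod whether the exact name `pg_basebackup` failed on actually resolves. A small sketch of that check follows; the `cluster.local` cluster domain is the Kubernetes default and an assumption on my part:

```shell
# Reconstruct the name pg_basebackup tried to resolve (taken from the log
# above), plus its fully qualified form. "cluster.local" is the default
# cluster domain and an assumption here.
pod="hippo-ha-pgha1-5bgg-0"
svc="hippo-ha-pods"
ns="postgres-operator"
short="${pod}.${svc}"
fqdn="${short}.${ns}.svc.cluster.local"
echo "${fqdn}"

# From the dnsutils Pod, both forms should resolve to the leader's Pod IP
# (10.244.114.74 in the listing above):
#   kubectl -n postgres-operator exec dnsutils -- nslookup "${fqdn}"
#   kubectl -n postgres-operator exec dnsutils -- nslookup "${short}.${ns}"
```

If the fully qualified name resolves from `dnsutils` but the short `<pod>.<service>` form does not resolve from inside the failing Pod, that would point at the Pod's resolver search domains rather than at CoreDNS itself.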

So `pg_basebackup` is trying to resolve the leader's hostname through the `hippo-ha-pods` service and failing. That service is described as follows:

```
$ kubectl -n postgres-operator describe svc hippo-ha-pods
Name:              hippo-ha-pods
Namespace:         postgres-operator
Labels:            postgres-operator.crunchydata.com/cluster=hippo-ha
Annotations:       <none>
Selector:          postgres-operator.crunchydata.com/cluster=hippo-ha
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Session Affinity:  None
Events:            <none>
```
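For context on what the missing IPs mean: a Service whose `spec.clusterIP` is `None` is headless, so instead of a single virtual IP, DNS is expected to publish one A record per backing Pod. A minimal sketch of that check (the `None` value is copied from the describe output above; on a live cluster it would come from the `kubectl` command in the comment):

```shell
# A Service with spec.clusterIP set to "None" is headless: DNS publishes
# per-Pod A records instead of a single virtual IP. The value below is
# copied from the describe output above; live, it would come from:
#   kubectl -n postgres-operator get svc hippo-ha-pods \
#     -o jsonpath='{.spec.clusterIP}'
cluster_ip="None"
if [ "$cluster_ip" = "None" ]; then
  echo "headless"
else
  echo "ClusterIP"
fi

# For the per-Pod A records to exist, the Pod IPs must show up as endpoints:
#   kubectl -n postgres-operator get endpoints hippo-ha-pods
```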
No IPs are listed here (the ClusterIP is `None`), so I assume this is treated as a headless service?
Any help would be appreciated, as this is blocking the deployment of other components as well. Thanks!

Best regards,
Marc
On Wednesday, October 27, 2021 at 11:22:45 PM UTC+2, Javier Roca wrote: