Error on 2-Replica Cluster - Bootstrap from leader - Could not translate hostname


Javier Roca

Oct 27, 2021, 12:54:18 PM

to Postgres Operator

Hi,

We are evaluating this operator, so I have been working through the PGO examples. To understand how high availability works, I added a configuration with 2 replicas:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.4-1
  postgresVersion: 13
  instances:
    - name: instance1
      replicas: 2
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.35-0
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi

I then followed the steps in Tutorial -> High Availability to verify self-healing.

By running:

kubectl -n postgres-operator get pods \
  --selector=postgres-operator.crunchydata.com/cluster=hippo,postgres-operator.crunchydata.com/instance-set

I get this:

NAME READY STATUS RESTARTS AGE
hippo-instance1-ksbq-0 2/3 Running 0 97m
hippo-instance1-tdxx-0 3/3 Running 0 96m

JIC, the PRIMARY POD is hippo-instance1-tdxx

My assumption is (maybe I'm completely wrong), that with replicas=2 we have an HA setup with 1 Primary / 1 Follower

Thus:
hippo-instance1-tdxx - Primary
hippo-instance1-ksbq - Slave

Before continuing with the two tests described in the tutorial, I wanted to understand the status of the secondary pod. I opened a terminal in each pod. In the primary pod I can start psql:

bash-4.4$ psql
psql (13.4)
Type "help" for help.

postgres=# SELECT NOT pg_catalog.pg_is_in_recovery() is_primary;
 is_primary
------------
 t
(1 row)

And once I do the same in the secondary pod, I get this:

bash-4.4$ psql
psql: error: could not connect to server: No such file or directory
	Is the server running locally and accepting connections on Unix domain socket "/tmp/postgres/.s.PGSQL.5432"?

So there seems to be a problem in the secondary pod. Looking at the pod logs, I found something strange in the secondary:

Primary POD hippo-instance1-tdxx:

...
2021-10-27 15:36:10,787 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
2021-10-27 15:36:21,017 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
2021-10-27 15:36:30,785 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
2021-10-27 15:36:40,787 INFO: no action. I am (hippo-instance1-tdxx-0) the leader with the lock
...

Secondary POD hippo-instance1-ksbq:

...
2021-10-27 15:38:20,836 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:20,836 WARNING: Trying again in 5 seconds
2021-10-27 15:38:30,780 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:30,780 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:38:40,813 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:40,813 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:38:45,857 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:38:45,857 ERROR: failed to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:38:45,857 INFO: Removing data directory: /pgdata/pg13
2021-10-27 15:38:50,781 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:38:51,024 INFO: trying to bootstrap from leader 'hippo-instance1-tdxx-0'
2021-10-27 15:39:00,779 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:00,852 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
2021-10-27 15:39:10,810 INFO: Lock owner: hippo-instance1-tdxx-0; I am hippo-instance1-ksbq-0
2021-10-27 15:39:10,811 INFO: bootstrap from leader 'hippo-instance1-tdxx-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
2021-10-27 15:39:11,054 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2021-10-27 15:39:11,054 WARNING: Trying again in 5 seconds
...

So the error seems to come from pg_basebackup (a tool for taking base backups of a running database cluster), which cannot resolve the primary's host name:

pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known
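The name in this error is a short name of the form `<pod>.<service>`, and the resolver inside the pod expands it using the search domains from `/etc/resolv.conf`. The following is a minimal Python sketch of that expansion, not part of the original post. It assumes typical Kubernetes defaults (`ndots:5`, cluster domain `cluster.local`) and a pod running in the `postgres-operator` namespace; the function name `candidate_names` and the `SEARCH` list are illustrative.

```python
def candidate_names(name: str, search_domains: list[str], ndots: int = 5) -> list[str]:
    """Return the names a stub resolver would try for `name`, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: the name is tried as-is before the search list.
        return [name] + expanded
    # Too few dots: the search list is tried first, the bare name last.
    return expanded + [name]


# Assumed values: a pod in the `postgres-operator` namespace of a cluster
# that uses the default `cluster.local` domain.
SEARCH = [
    "postgres-operator.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

for fqdn in candidate_names("hippo-instance1-tdxx-0.hippo-pods", SEARCH):
    print(fqdn)
```

Under these assumptions the first candidate is the per-pod record behind the `hippo-pods` Service. If that name does not resolve from inside the replica pod, the problem most likely sits in cluster DNS rather than in PostgreSQL itself.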

In my case I'm using OKE on Oracle's cloud (OCI), which is CNCF-certified. I also tried this with MicroK8s and saw the same behavior (the replica pod shows the same error in its log).

Thus, I'd like to ask:

Are my assumptions OK (with replicas=2 I have 1 Primary/Master and 1 Secondary/Follower, and both should be available)?
In ideal conditions, should the secondary pod hippo-instance1-ksbq-0 report 3/3 in Ready status?
In ideal conditions, should both the primary and secondary pods respond to a psql query?
Based on the logs, I assume something is wrong with the networking. Are there any additional steps I might be missing, or any advice to troubleshoot this issue? FYI, I just followed the steps as described in the tutorial.

JIC, I initially posted this same question in the PGO Examples repo, and I got this response:

hippo-instance1-tdxx-0.hippo-pods is suspicious -- I'm unsure where .hippo-pods is coming from given it appears you've deployed to the postgres-operator namespace. I would potentially check your networking layer. I'll run some local tests against this. It appears that there is something going on with the DNS resolution in your environment, I would investigate that.

That said, this repo is specifically for the examples around PGO. Please see the Support page for where you can ask support questions or report bugs.

Separately, the terminology is "secondary", "replica" or "follower".

Thanks in advance for your help.

Best

Javier

Jonathan S. Katz

Oct 27, 2021, 1:30:04 PM

to Javier Roca, Postgres Operator

Hi,

Comments inline:

That assumption is correct.

Thus:
hippo-instance1-tdxx - Primary
hippo-instance1-ksbq - ~~Slave~~ Replica

You can check this explicitly with:

kubectl -n postgres-operator get pods \
  --selector=postgres-operator.crunchydata.com/cluster=hippo,postgres-operator.crunchydata.com/instance-set \
  -L postgres-operator.crunchydata.com/role

pg_basebackup: error: could not translate host name "hippo-instance1-tdxx-0.hippo-pods" to address: Name or service not known

I would recommend checking a few things:

kubectl -n postgres-operator get svc

Is there a "hippo-pods" Service available?

Additionally, are there any errors in the "postgres-operator" Pod logs?

Thus, I'd like to ask:
Are my assumptions ok (with replicas=2 I have 1 Primary/Master and 1 Secondary/Follower , and both should be available?)

Yes

In ideal conditions, should the secondary POD hippo-instance1-ksbq-0 report 3/3 in Ready status?

Yes

In ideal conditions, should both the primary and secondary pods respond to a psql query?

Yes

Based on the logs, I assume something is wrong with the networking. Are there any additional steps I might be missing, or any advice to troubleshoot this issue? FYI, I just followed the steps as described in the tutorial.

I would also check the Postgres Operator logs to see if there were any reconciliation issues.

Thanks,

Jonathan


Javier Roca

Oct 27, 2021, 5:22:45 PM

to Postgres Operator, jonath...@crunchydata.com, Postgres Operator, Javier Roca

Dear Jonathan,

Hi. Thanks for your previous answers! I did a clean reinstallation on the cluster, and I still have the same issue.

kubectl -n postgres-operator get pods \
  --selector=postgres-operator.crunchydata.com/cluster=hippo,postgres-operator.crunchydata.com/instance-set \
  -L postgres-operator.crunchydata.com/role

NAME                     READY   STATUS    RESTARTS   AGE   ROLE
hippo-instance1-7t78-0   3/3     Running   0          32m   master
hippo-instance1-pwvz-0   2/3     Running   0          33m
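The second pod has an empty ROLE column, which is consistent with a replica that never finished bootstrapping (an assumption: the role label appears to be set only once the member is actually running). As a rough illustration, not part of the original post, this Python sketch picks such pods out of the table above. The helper name `pods_without_role` is hypothetical, and the whitespace-based parsing assumes simple cell values with no embedded spaces.

```python
def pods_without_role(table: str) -> list[str]:
    """Return pod names whose trailing ROLE column is empty."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    missing = []
    for line in lines[1:]:
        fields = line.split()
        # An empty trailing ROLE column yields one field fewer than the header.
        if len(fields) < len(header):
            missing.append(fields[0])
    return missing


# Output copied from the command above.
OUTPUT = """\
NAME                     READY   STATUS    RESTARTS   AGE   ROLE
hippo-instance1-7t78-0   3/3     Running   0          32m   master
hippo-instance1-pwvz-0   2/3     Running   0          33m
"""

print(pods_without_role(OUTPUT))
```

Splitting on whitespace is fragile (a RESTARTS cell such as `1 (5m ago)` would break it), so for anything beyond a quick check the JSON output of kubectl is a better input.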

The follower is still at 2/3, and its logs show the same issue:

kubectl logs pod/hippo-instance1-pwvz-0 --namespace=postgres-operator --container=database --since=0

2021-10-27 20:40:50,153 INFO: No PostgreSQL configuration items changed, nothing to reload.

2021-10-27 20:40:50,160 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:40:50,242 INFO: trying to bootstrap from leader 'hippo-instance1-7t78-0'

2021-10-27 20:40:51,759 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:40:51,783 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

2021-10-27 20:41:01,791 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:41:01,791 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known

2021-10-27 20:41:10,272 ERROR: Error when fetching backup: pg_basebackup exited with code=1

2021-10-27 20:41:10,273 WARNING: Trying again in 5 seconds

2021-10-27 20:41:11,770 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:41:11,770 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

2021-10-27 20:41:21,758 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:41:21,758 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

2021-10-27 20:41:31,794 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:41:31,794 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known

2021-10-27 20:41:35,300 ERROR: Error when fetching backup: pg_basebackup exited with code=1

2021-10-27 20:41:35,300 ERROR: failed to bootstrap from leader 'hippo-instance1-7t78-0'

2021-10-27 20:41:35,300 INFO: Removing data directory: /pgdata/pg13

2021-10-27 20:41:41,771 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:41:41,937 INFO: trying to bootstrap from leader 'hippo-instance1-7t78-0'

2021-10-27 20:41:51,772 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:41:51,794 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

2021-10-27 20:42:01,794 INFO: Lock owner: hippo-instance1-7t78-0; I am hippo-instance1-pwvz-0

2021-10-27 20:42:01,794 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress

pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known

2021-10-27 20:42:01,965 ERROR: Error when fetching backup: pg_basebackup exited with code=1

2021-10-27 20:42:01,966 WARNING: Trying again in 5 seconds

....
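The log repeats the same cycle: bootstrap attempt, failed name lookup, data directory removed, retry. To confirm that every failure involves the same host name, the lookups can be tallied. This is a small Python sketch, not from the original post; the helper name `unresolved_hosts` is hypothetical and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the pg_basebackup/libpq message for a failed host name lookup.
UNRESOLVED = re.compile(r'could not translate host name "([^"]+)" to address')


def unresolved_hosts(log_text: str) -> Counter:
    """Count how often each host name failed to resolve in a log excerpt."""
    return Counter(UNRESOLVED.findall(log_text))


LOG = """\
2021-10-27 20:41:01,791 INFO: bootstrap from leader 'hippo-instance1-7t78-0' in progress
pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known
2021-10-27 20:41:10,272 ERROR: Error when fetching backup: pg_basebackup exited with code=1
pg_basebackup: error: could not translate host name "hippo-instance1-7t78-0.hippo-pods" to address: Name or service not known
"""

for host, count in unresolved_hosts(LOG).items():
    print(host, count)
```

A single host name in the tally suggests that one missing DNS record accounts for all of the failures.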

As you recommended, I ran:

kubectl -n postgres-operator get svc

NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
hippo-ha          ClusterIP   10.96.62.185    <none>        5432/TCP   39m
hippo-ha-config   ClusterIP   None            <none>        <none>     38m
hippo-pods        ClusterIP   None            <none>        <none>     39m
hippo-primary     ClusterIP   None            <none>        5432/TCP   39m
hippo-replicas    ClusterIP   10.96.192.179   <none>        5432/TCP   39m

In the operator pod logs, I saw an error (the "Failed to watch" lines below):

kubectl logs pod/pgo-b95d7bbd-dcn9z --namespace=postgres-operator --container=operator --since=0

time="2021-10-27T20:28:42Z" level=debug msg="debug flag set to true" file="cmd/postgres-operator/main.go:62" func=main.main version=5.0.3-0

time="2021-10-27T20:28:44Z" level=info msg="metrics server is starting to listen" addr=":8080" file="sigs.k8s.io/controlle...@v0.8.3/pkg/log/deleg.go:130" func="log.(*DelegatingLogger).Info" version=5.0.3-0

time="2021-10-27T20:28:44Z" level=info msg="starting controller runtime manager and will wait for signal to exit" file="cmd/postgres-operator/main.go:83" func=main.main version=5.0.3-0

time="2021-10-27T20:28:44Z" level=info msg="starting metrics server" file="sigs.k8s.io/controlle...@v0.8.3/pkg/manager/internal.go:385" func="manager.(*controllerManager).serveMetrics.func2" path=/metrics version=5.0.3-0

time="2021-10-27T20:28:44Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0

time="2021-10-27T20:28:45Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0

time="2021-10-27T20:28:46Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.3-0

time="2021-10-27T20:28:46Z" level=info msg="Starting Controller" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:173" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:28:46Z" level=info msg="Starting workers" file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/controller/controller.go:211" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0 worker count=2

time="2021-10-27T20:28:46Z" level=debug msg=deleting file="internal/controller/postgrescluster/controller.go:139" func="postgrescluster.(*Reconciler).Reconcile" name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster result="{Requeue:false RequeueAfter:0s}" version=5.0.3-0

E1027 20:28:50.367436 1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)

E1027 20:28:51.742217 1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: failed to list *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)

E1027 20:28:54.192785 1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: failed to list *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)

E1027 20:28:57.496771 1 reflector.go:138] k8s.io/clie...@v0.20.8/tools/cache/reflector.go:167: Failed to watch *v1beta1.PostgresCluster: failed to list *v1beta1.PostgresCluster: the server could not find the requested resource (get postgresclusters.postgres-operator.crunchydata.com)

time="2021-10-27T20:36:20Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-pwvz name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:37:20Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-7t78 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:37:20Z" level=debug msg="reconciled instance set" file="internal/controller/postgrescluster/instance.go:988" func="postgrescluster.(*Reconciler).scaleUpInstances" instance-set=instance1 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:37:20Z" level=debug msg=Normal file="sigs.k8s.io/controlle...@v0.8.3/pkg/internal/recorder/recorder.go:98" func="recorder.(*Provider).getBroadcaster.func1.1" message="created pgBackRest repository host StatefulSet/hippo-repo-host" object="{PostgresCluster postgres-operator hippo dc803837-62b3-4fb7-bb95-e27c4dab6763 postgres-operator.crunchydata.com/v1beta1 1412169 }" reason=RepoHostCreated version=5.0.3-0

time="2021-10-27T20:39:21Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:299" func="postgrescluster.(*Reconciler).Reconcile" name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:39:21Z" level=debug msg="patched cluster status" file="internal/controller/postgrescluster/controller.go:171" func="postgrescluster.(*Reconciler).Reconcile.func2" name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:41:21Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-pwvz name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:42:21Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=hippo-instance1-7t78 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

time="2021-10-27T20:42:21Z" level=debug msg="reconciled instance set" file="internal/controller/postgrescluster/instance.go:988" func="postgrescluster.(*Reconciler).scaleUpInstances" instance-set=instance1 name=hippo namespace=postgres-operator reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

Any help will be welcome! As I mentioned, I'm just using Oracle's OKE out of the box without any tweaks (no network settings changed).

JIC, I'm using a fork from 23 days ago, https://github.com/chapaco/postgres-operator-examples?organization=chapaco&organization=chapaco

Kind Regards!

Marc Palacín Marfil

Nov 22, 2023, 1:24:44 PM

to Postgres Operator, Javier Roca, jonath...@crunchydata.com, Postgres Operator

Hi all! Sorry for resurrecting this old post, but as of today I'm facing the same issue that Javier described in this thread, with exactly the same errors in the Postgres pods.

My Kubernetes stack is as follows:

Kubernetes version 1.28.2
Cluster infrastructure:
- 1 control-plane node (named `nearbyone-singlenode-cluster.novalocal`)
- 2 worker nodes (named `nearbyone-worker-1.novalocal` and `nearbyone-worker-2.novalocal`)
Pod network CNI: calico v3.26.1

Here's a summary of the deployed cluster without the High Availability example applied, pretty vanilla apart from the NFS provisioner required for the Persistent Volume Claims:

```

$ kubectl get all -A

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default pod/kubernetes-ingress-d5f66d8fc-8f4x7 1/1 Running 0 24h 10.244.130.1 nearbyone-worker-1.novalocal <none> <none>
default pod/kubernetes-ingress-d5f66d8fc-zv8zn 1/1 Running 0 24h 10.244.114.65 nearbyone-worker-2.novalocal <none> <none>
default pod/nfs-subdir-external-provisioner-79dfbfd847-9kkxj 1/1 Running 0 19h 10.244.114.68 nearbyone-worker-2.novalocal <none> <none>
default pod/nfs-subdir-external-provisioner-79dfbfd847-pv96c 1/1 Running 0 19h 10.244.130.4 nearbyone-worker-1.novalocal <none> <none>
kube-system pod/calico-kube-controllers-7ddc4f45bc-xlfj7 1/1 Running 0 37h 10.244.172.195 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/calico-node-8xjsq 1/1 Running 0 37h 10.0.30.242 nearbyone-worker-2.novalocal <none> <none>
kube-system pod/calico-node-brkl5 1/1 Running 0 37h 10.0.30.234 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/calico-node-gwjtf 1/1 Running 0 37h 10.0.30.229 nearbyone-worker-1.novalocal <none> <none>
kube-system pod/coredns-5dd5756b68-f2pkl 1/1 Running 0 37h 10.244.172.193 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/coredns-5dd5756b68-w7xq2 1/1 Running 0 37h 10.244.172.194 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/etcd-nearbyone-singlenode-cluster.novalocal 1/1 Running 37 37h 10.0.30.234 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/kube-apiserver-nearbyone-singlenode-cluster.novalocal 1/1 Running 10 37h 10.0.30.234 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/kube-controller-manager-nearbyone-singlenode-cluster.novalocal 1/1 Running 9 37h 10.0.30.234 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/kube-proxy-2r9nx 1/1 Running 0 37h 10.0.30.234 nearbyone-singlenode-cluster.novalocal <none> <none>
kube-system pod/kube-proxy-549jh 1/1 Running 0 37h 10.0.30.229 nearbyone-worker-1.novalocal <none> <none>
kube-system pod/kube-proxy-vfrp7 1/1 Running 0 37h 10.0.30.242 nearbyone-worker-2.novalocal <none> <none>
kube-system pod/kube-scheduler-nearbyone-singlenode-cluster.novalocal 1/1 Running 10 37h 10.0.30.234 nearbyone-singlenode-cluster.novalocal <none> <none>
postgres-operator pod/dnsutils 1/1 Running 0 24m 10.244.130.9 nearbyone-worker-1.novalocal <none> <none>
postgres-operator pod/pgo-6cc745c948-52s8z 1/1 Running 0 19h 10.244.130.2 nearbyone-worker-1.novalocal <none> <none>

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 37h <none>
default service/kubernetes-ingress NodePort 10.108.136.182 <none> 80:31307/TCP,443:31782/TCP,1024:31396/TCP,6060:32104/TCP 24h app.kubernetes.io/instance=kubernetes-ingress,app.kubernetes.io/name=kubernetes-ingress
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 37h k8s-app=kube-dns

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
kube-system daemonset.apps/calico-node 3 3 3 3 3 kubernetes.io/os=linux 37h calico-node docker.io/calico/node:v3.26.1 k8s-app=calico-node
kube-system daemonset.apps/kube-proxy 3 3 3 3 3 kubernetes.io/os=linux 37h kube-proxy registry.k8s.io/kube-proxy:v1.28.3 k8s-app=kube-proxy

NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
default deployment.apps/kubernetes-ingress 2/2 2 2 24h kubernetes-ingress-controller haproxytech/kubernetes-ingress:1.10.9 app.kubernetes.io/instance=kubernetes-ingress,app.kubernetes.io/name=kubernetes-ingress
default deployment.apps/nfs-subdir-external-provisioner 2/2 2 2 19h nfs-subdir-external-provisioner registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2 app=nfs-subdir-external-provisioner,release=nfs-subdir-external-provisioner
kube-system deployment.apps/calico-kube-controllers 1/1 1 1 37h calico-kube-controllers docker.io/calico/kube-controllers:v3.26.1 k8s-app=calico-kube-controllers
kube-system deployment.apps/coredns 2/2 2 2 37h coredns registry.k8s.io/coredns/coredns:v1.10.1 k8s-app=kube-dns
postgres-operator deployment.apps/pgo 1/1 1 1 19h operator registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.4.3-0 postgres-operator.crunchydata.com/control-plane=postgres-operator

NAMESPACE NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
default replicaset.apps/kubernetes-ingress-d5f66d8fc 2 2 2 24h kubernetes-ingress-controller haproxytech/kubernetes-ingress:1.10.9 app.kubernetes.io/instance=kubernetes-ingress,app.kubernetes.io/name=kubernetes-ingress,pod-template-hash=d5f66d8fc
default replicaset.apps/nfs-subdir-external-provisioner-79dfbfd847 2 2 2 19h nfs-subdir-external-provisioner registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2 app=nfs-subdir-external-provisioner,pod-template-hash=79dfbfd847,release=nfs-subdir-external-provisioner
kube-system replicaset.apps/calico-kube-controllers-7ddc4f45bc 1 1 1 37h calico-kube-controllers docker.io/calico/kube-controllers:v3.26.1 k8s-app=calico-kube-controllers,pod-template-hash=7ddc4f45bc
kube-system replicaset.apps/coredns-5dd5756b68 2 2 2 37h coredns registry.k8s.io/coredns/coredns:v1.10.1 k8s-app=kube-dns,pod-template-hash=5dd5756b68
postgres-operator replicaset.apps/pgo-6cc745c948 1 1 1 19h operator registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.4.3-0 pod-template-hash=6cc745c948,postgres-operator.crunchydata.com/control-plane=postgres-operator

```

You can ignore the `dnsutils` Pod, as it is just a DNS utilities container for debugging purposes.

I'm using the `kustomize/high-availability` example from the `postgres-operator-examples` repository without modifications:

```

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-ha
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.4-1
  postgresVersion: 15
  instances:
    - name: pgha1
      replicas: 2
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/instance-set: pgha1
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.47-1
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:ubi8-1.19-5
      replicas: 2
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/role: pgbouncer

```

Once deployed, after some minutes of initialization, this is the status of both the pods and the services:

```

$ kubectl -n postgres-operator get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dnsutils 1/1 Running 0 54m 10.244.130.9 nearbyone-worker-1.novalocal <none> <none>
hippo-ha-pgbouncer-7b67d94d88-h8tjt 2/2 Running 0 5m6s 10.244.130.10 nearbyone-worker-1.novalocal <none> <none>
hippo-ha-pgbouncer-7b67d94d88-mbbdt 2/2 Running 0 5m6s 10.244.114.76 nearbyone-worker-2.novalocal <none> <none>
hippo-ha-pgha1-5bgg-0 4/4 Running 0 7m47s 10.244.114.74 nearbyone-worker-2.novalocal <none> <none>
hippo-ha-pgha1-chg8-0 3/4 Running 0 7m7s 10.244.130.11 nearbyone-worker-1.novalocal <none> <none>
hippo-ha-repo-host-0 2/2 Running 0 7m6s 10.244.114.75 nearbyone-worker-2.novalocal <none> <none>
pgo-6cc745c948-52s8z 1/1 Running 0 20h 10.244.130.2 nearbyone-worker-1.novalocal <none> <none>

$ kubectl -n postgres-operator get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
hippo-ha-ha ClusterIP 10.108.230.220 <none> 5432/TCP 9m26s <none>
hippo-ha-ha-config ClusterIP None <none> <none> 8m46s <none>
hippo-ha-pgbouncer ClusterIP 10.108.234.25 <none> 5432/TCP 6m5s postgres-operator.crunchydata.com/cluster=hippo-ha,postgres-operator.crunchydata.com/role=pgbouncer
hippo-ha-pods ClusterIP None <none> <none> 9m26s postgres-operator.crunchydata.com/cluster=hippo-ha
hippo-ha-primary ClusterIP None <none> 5432/TCP 9m26s <none>
hippo-ha-replicas ClusterIP 10.110.165.8 <none> 5432/TCP 9m26s postgres-operator.crunchydata.com/cluster=hippo-ha,postgres-operator.crunchydata.com/role=replica

```

The Non-Ready Pod shows this error:

```

2023-11-16 11:05:34,951 INFO: Lock owner: hippo-ha-pgha1-5bgg-0; I am hippo-ha-pgha1-chg8-0
2023-11-16 11:05:34,952 INFO: bootstrap from leader 'hippo-ha-pgha1-5bgg-0' in progress
pg_basebackup: error: could not translate host name "hippo-ha-pgha1-5bgg-0.hippo-ha-pods" to address: Name or service not known
2023-11-16 11:05:40,099 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2023-11-16 11:05:40,099 ERROR: failed to bootstrap from leader 'hippo-ha-pgha1-5bgg-0'

2023-11-16 11:05:40,099 INFO: Removing data directory: /pgdata/pg15

```

So this means that it is trying to translate the leader hostname by using the `hippo-ha-pods` service. This service is described as follows:

```

$ kubectl -n postgres-operator describe svc hippo-ha-pods
Name: hippo-ha-pods
Namespace: postgres-operator
Labels: postgres-operator.crunchydata.com/cluster=hippo-ha
Annotations: <none>
Selector: postgres-operator.crunchydata.com/cluster=hippo-ha
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: None
IPs: None
Session Affinity: None
Events: <none>

```

No IPs are included here, so I must assume this is treated as a headless service?
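On the headless question: in Kubernetes, a Service whose cluster IP is `None` is by definition a headless Service, so DNS for it points at the backing pods rather than at a virtual IP. A minimal Python sketch of that check follows. The `SAMPLE` data is a hand-written, trimmed stand-in for the JSON form of the Service list, using names and IPs from the table earlier in this post, and `headless_services` is a hypothetical helper name.

```python
import json


def headless_services(service_list: dict) -> list[str]:
    """Return names of Services whose spec.clusterIP is the literal string "None"."""
    return [
        item["metadata"]["name"]
        for item in service_list.get("items", [])
        if item.get("spec", {}).get("clusterIP") == "None"
    ]


# Trimmed, hand-written stand-in for the Service list in JSON form.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "hippo-ha-ha"},      "spec": {"clusterIP": "10.108.230.220"}},
  {"metadata": {"name": "hippo-ha-pods"},    "spec": {"clusterIP": "None"}},
  {"metadata": {"name": "hippo-ha-primary"}, "spec": {"clusterIP": "None"}}
]}
""")

print(headless_services(SAMPLE))
```

Being headless is expected for `hippo-ha-pods`, so the missing IP is not in itself the fault. The open question is why the per-pod name behind that Service does not resolve.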

Any help would be appreciated, as this is blocking the deployment of other components as well. Thanks!

Best regards,

Marc

On Wednesday, October 27, 2021 at 23:22:45 UTC+2, Javier Roca wrote:
