unable to connect via pgbouncer


Paul Karayan

Feb 12, 2021, 5:39:11 PM
to Postgres Operator
Hi all -

I'll file this as an issue (bug) if I'm not doing something obviously wrong, but wanted to send here first... also to get my feet wet around here :)

thank you!
- PK

**Describe the bug**
Unable to connect to the Postgres database via pgBouncer.

This occurs whether pgBouncer is provisioned together with the cluster or added to an existing cluster.

I am unable to connect with testuser (as in the documentation), with a newly provisioned user, or with the postgres superuser.

**To Reproduce**
Steps to reproduce the behavior:

A. Adding pgBouncer to an existing cluster

1. create database cluster
pgo create cluster profig-pgo

# confirm database is running
pgo test profig-pgo

2. create pgbouncer

pgo create pgbouncer profig-pgo
# confirm pgbouncer service is available
kubectl -n pgo get svc
# and we see profig-pgo-pgbouncer with port 5432 exposed

3. further validate that pgbouncer works by connecting to it
pgo show pgbouncer profig-pgo
PGPASSWORD=<excised> psql -h localhost -p 5432 -U pgbouncer pgbouncer
# get a psql prompt on pgbouncer

4. retrieve testuser password
kubectl -n pgo port-forward svc/postgres-operator 8443:8443
pgo show user -n pgo testuser --show-system-accounts


5. demonstrate we can connect directly to database
kubectl -n pgo port-forward svc/profig-pgo 5432:5432
PGPASSWORD="<excised>" psql -h localhost -p 5432 -U testuser profig-pgo
# get a psql prompt

6. attempt to connect via pgbouncer

kubectl -n pgo port-forward svc/profig-pgo-pgbouncer 5432:5432
PGPASSWORD="<excised>" psql -h localhost -p 5432 -U testuser profig-pgo
# error as detailed below


B. pgBouncer provisioned with the cluster

1. pgo create cluster profig-pgo --pgbouncer

# confirm database is running
pgo test profig-pgo

# confirm pgbouncer service is available
kubectl -n pgo get svc
# and we see profig-pgo-pgbouncer with port 5432 exposed

THEN: follow steps 3-6 under A above.


Notes:

User details
---

I show both of the above using testuser, per the documentation here:

Connecting with a newly provisioned user or with the postgres superuser is done similarly;
let me know if you want the steps I took detailed.

to create the new user:
pgo create user --selector=name=profig-pgo --username "profig"


**Expected behavior**

I can connect through pgBouncer to the Postgres database and get a psql prompt.


**Actual behavior**

response:
psql: error: ERROR:  query_wait_timeout

checking the pgbouncer pod logs:

2021-02-12 00:32:49.359 UTC [13] WARNING DNS lookup failed: profig: result=0
2021-02-12 00:32:49.359 UTC [13] LOG S-0x56208c5b9970: profig/pgbouncer@(bad-af):0 closing because: server DNS lookup failed (age=0s)
2021-02-12 00:33:04.357 UTC [13] LOG S-0x56208c5b9970: profig/pgbouncer@(bad-af):0 closing because: server DNS lookup failed (age=0s)
2021-02-12 00:33:04.357 UTC [13] LOG S-0x56208c5b9970: profig/pgbouncer@(bad-af):0 closing because: server DNS lookup failed (age=0s)
2021-02-12 00:33:04.357 UTC [13] LOG S-0x56208c5b9970: profig/pgbouncer@(bad-af):0 closing because: server DNS lookup failed (age=0s)
2021-02-12 00:33:18.023 UTC [13] LOG C-0x56208c5b2bf0: profig/(nouser)@127.0.0.1:46354 closing because: query_wait_timeout (age=120s)
2021-02-12 00:33:18.023 UTC [13] WARNING C-0x56208c5b2bf0: profig/(nouser)@127.0.0.1:46354 pooler error: query_wait_timeout
2021-02-12 00:33:18.023 UTC [13] LOG C-0x56208c5b2bf0: profig/(nouser)@127.0.0.1:46354 closing because: query_wait_timeout (age=120s)
2021-02-12 00:33:18.023 UTC [13] WARNING C-0x56208c5b2bf0: profig/(nouser)@127.0.0.1:46354 pooler error: query_wait_timeout


**Screenshots**
n/a

**Please tell us about your environment:**

* Operating System: Amazon Linux 2
* Where is this running ( Local, Cloud Provider): Amazon EKS
* Storage being used (NFS, Hostpath, Gluster, etc): EBS
* PGO Version: 4.6.0
* PostgreSQL Version: 13.1
* Platform (Docker, Kubernetes, OpenShift): Kubernetes
* Platform Version: 1.18

**Additional context**
n/a

Jonathan S. Katz

Feb 14, 2021, 11:48:32 AM
to Paul Karayan, Postgres Operator
Hi Paul,

Do you have any Network Policies or Pod Security Policies in place that might affect connectivity between Pods? The pgBouncer Pod logs appear to be showing DNS issues, which could be a sign that there is a setting that needs to be tweaked on that end.
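
If it helps, a couple of quick checks along these lines can rule out the obvious suspects (the namespace, Service, and Pod names below are taken from Paul's repro and are illustrative; substitute your own, and note that `nslookup` may not be present in every image):

```shell
# List any NetworkPolicies in the cluster namespace that could
# restrict Pod-to-Pod traffic
kubectl -n pgo get networkpolicy

# PodSecurityPolicies are cluster-scoped
kubectl get psp

# Try resolving the Postgres cluster Service from inside the
# pgBouncer Pod (substitute your actual pgBouncer Pod name;
# requires nslookup to be available in the image)
kubectl -n pgo exec -it <pgbouncer-pod> -- nslookup profig-pgo
```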

Thanks,

Jonathan

Jonathan S. Katz

Crunchy Data
Enterprise PostgreSQL 


Jonathan S. Katz

Feb 16, 2021, 2:17:03 PM
to Simon Trigona, Postgres Operator, Paul Karayan
Hi,

You cannot ping a Service directly from a Pod (https://stackoverflow.com/questions/50852542/kubernetes-cannot-ping-another-service) so I don't believe that test will tell you much. Here is a blog post that I skimmed with a few tools you can use to test the network:


Similar to the question I raised with Paul, do you have any Network Policies or Pod Security Policies in place that might affect connectivity between Pods? The pgBouncer Pod logs appear to be showing DNS issues, which could be a sign that there is a setting that needs to be tweaked on that end.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 



On Tue, Feb 16, 2021 at 2:01 PM Simon Trigona <simon....@gmail.com> wrote:
I'm also experiencing the same issue and have had to bypass pgbouncer for the time being.

My environment
  • Operating System: Ubuntu 20.04.2
  • Where is this running: local, bare metal
  • Storage: Longhorn
  • PGO Version: 4.6.0
  • Postgres Version: 12.5
  • Platform: Kubernetes (k3s HA cluster)
  • Platform version: 1.20.2+k3s1
Additional Context
I am migrating from a local microk8s cluster to a new local k3s cluster, so this was a fresh install of PGO / k3s - no version upgrades. I did have issues with other Alpine-based images resolving external/public hosts, so I switched to Debian-based images (nodejs) and the issues were resolved.
I can shell into the pgbouncer pod and ping the service (postgres cluster name is "ws"):
bash-4.4$ ping ws
PING ws.dev.svc.cluster.local (10.43.43.78) 56(84) bytes of data.


Logs
2021-02-16 18:51:07.334 UTC [14] LOG C-0x55b915401980: postgres/(nouser)@10.42.0.0:9687 closing because: pgbouncer cannot connect to server (age=0s)
2021-02-16 18:51:07.334 UTC [14] WARNING C-0x55b915401980: postgres/(nouser)@10.42.0.0:9687 pooler error: pgbouncer cannot connect to server
2021-02-16 18:51:07.334 UTC [14] LOG C-0x55b915401980: postgres/(nouser)@10.42.0.0:9687 closing because: pgbouncer cannot connect to server (age=0s)
2021-02-16 18:51:07.334 UTC [14] WARNING C-0x55b915401980: postgres/(nouser)@10.42.0.0:9687 pooler error: pgbouncer cannot connect to server
2021-02-16 18:51:09.535 UTC [14] WARNING DNS lookup failed: ws: result=0
2021-02-16 18:51:09.535 UTC [14] LOG S-0x55b9154087d0: postgres/pgbouncer@(bad-af):0 closing because: server DNS lookup failed (age=2s)
2021-02-16 18:51:09.535 UTC [14] WARNING DNS lookup failed: ws: result=0
2021-02-16 18:51:09.535 UTC [14] LOG S-0x55b9154087d0: postgres/pgbouncer@(bad-af):0 closing because: server DNS lookup failed (age=2s)

Jonathan S. Katz

Feb 18, 2021, 4:47:04 PM
to Simon Trigona, Postgres Operator, Paul Karayan
Hi Simon,

Thanks for the thorough investigation.

Let's try to figure this out based on the list of changes to pgBouncer in 4.6.

- TLS support. I presume that you are just using the stock pgBouncer configuration, so this is not set up in your environment. That said, the configuration changes as a result of this should not affect the above
- Service Type. The Service Type of pgBouncer can now be independently controlled and set with the --service-type flag on pgo create pgbouncer and pgo update pgbouncer. This can also be modified directly on a custom resource. This could potentially affect the above, though I have some doubts. That said -- what are the service types for the PG cluster and pgBouncer?
- Ensure pgBouncer Secret is created for a standby cluster.
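
For the Service Type question, checking and setting it looks roughly like this (cluster name is illustrative; the --service-type flag is the one mentioned above):

```shell
# See the current Service types for both the PG cluster and pgBouncer
kubectl -n pgo get svc

# Set the pgBouncer Service type explicitly (cluster name illustrative)
pgo update pgbouncer profig-pgo --service-type=ClusterIP
```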

We've been using the same major version of pgBouncer for the past few releases.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 



On Tue, Feb 16, 2021 at 7:21 PM Simon Trigona <simon....@gmail.com> wrote:
Hi Jonathan,

Regarding ping - I understand that packet loss to a service is expected, however it was more that the hostname is resolving accurately to the correct IP from within the pod, but pgbouncer itself appears to not be resolving. That being said, I did confirm that I can ping the pod directly.

There is only one Pod Security Policy in this cluster and it appears to be from Longhorn, so I don't think it's relevant.

I started to try a few things to try and diagnose the issue. I swapped out the pgbouncer image of the existing deployment and got the following results:
  • centos7-12.3-4.4.0 - works
  • centos7-12.5-4.4.2 - works
  • centos7-12.5-4.5.1 - works
  • centos8-12.4-4.5.0 - works **
  • centos8-12.4-4.5.1 - works **
  • centos8-12.5-4.6.0 - does not work
** These images work; however, there are logs about connecting/disconnecting to the nameservers. I don't know if this is just more verbose logging or a change in behaviour. Here are the logs:
Wed Feb 17 00:11:11 UTC 2021 INFO: Custom users.txt file detected..
Wed Feb 17 00:11:11 UTC 2021 INFO: Custom pgbouncer.ini file detected..
Wed Feb 17 00:11:11 UTC 2021 INFO: Starting pgBouncer..
2021-02-17 00:11:11.393 UTC [14] LOG kernel file descriptor limit: 1048576 (hard: 1048576); max_client_conn: 100, max expected fd use: 112
2021-02-17 00:11:11.393 UTC [14] LOG kernel file descriptor limit: 1048576 (hard: 1048576); max_client_conn: 100, max expected fd use: 112
2021-02-17 00:11:11.397 UTC [14] LOG listening on 0.0.0.0:5432
2021-02-17 00:11:11.398 UTC [14] LOG listening on [::]:5432
2021-02-17 00:11:11.398 UTC [14] LOG listening on unix:/tmp/.s.PGSQL.5432
2021-02-17 00:11:11.398 UTC [14] LOG process up: PgBouncer 1.14.0, libevent 2.1.8-stable (epoll), adns: evdns2, tls: OpenSSL 1.1.1c FIPS  28 May 2019
2021-02-17 00:11:11.397 UTC [14] LOG listening on 0.0.0.0:5432
2021-02-17 00:11:11.398 UTC [14] LOG listening on [::]:5432
2021-02-17 00:11:11.398 UTC [14] LOG listening on unix:/tmp/.s.PGSQL.5432
2021-02-17 00:11:11.398 UTC [14] LOG process up: PgBouncer 1.14.0, libevent 2.1.8-stable (epoll), adns: evdns2, tls: OpenSSL 1.1.1c FIPS  28 May 2019
2021-02-17 00:11:12.964 UTC [14] LOG C-0x563b6e998260: (nodb)/(nouser)@10.42.0.0:19860 registered new auto-database: db=ws
2021-02-17 00:11:12.964 UTC [14] LOG C-0x563b6e998260: (nodb)/(nouser)@10.42.0.0:19860 registered new auto-database: db=ws
[msg] Nameserver 10.43.0.10:53 has failed: request timed out.
[msg] All nameservers have failed
2021-02-17 00:11:13.044 UTC [14] LOG S-0x563b6e99f1a0: ws/pgbo...@10.43.43.78:5432 new connection to server (from 10.42.1.73:49428)
2021-02-17 00:11:13.044 UTC [14] LOG S-0x563b6e99f1a0: ws/pgbo...@10.43.43.78:5432 new connection to server (from 10.42.1.73:49428)
2021-02-17 00:11:13.058 UTC [14] LOG C-0x563b6e998260: ws/wsq...@10.42.0.0:19860 login attempt: db=ws user=wsquery tls=no
2021-02-17 00:11:13.058 UTC [14] LOG C-0x563b6e998260: ws/wsq...@10.42.0.0:19860 login attempt: db=ws user=wsquery tls=no
2021-02-17 00:11:13.115 UTC [14] LOG S-0x563b6e99f3d0: ws/wsq...@10.43.43.78:5432 new connection to server (from 10.42.1.73:49430)
2021-02-17 00:11:13.115 UTC [14] LOG S-0x563b6e99f3d0: ws/wsq...@10.43.43.78:5432 new connection to server (from 10.42.1.73:49430)
[msg] Nameserver 10.43.0.10:53 is back up
2021-02-17 00:11:28.770 UTC [14] LOG C-0x563b6e998490: ws/wsq...@10.42.0.0:37367 login attempt: db=ws user=wsquery tls=no
2021-02-17 00:11:28.770 UTC [14] LOG C-0x563b6e998490: ws/wsq...@10.42.0.0:37367 login attempt: db=ws user=wsquery tls=no
[msg] Nameserver 10.43.0.10:53 has failed: request timed out.
[msg] All nameservers have failed
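
For reference, the image swap described above was done in place on the pgBouncer Deployment, along these lines (the namespace, Deployment, container name, and registry path are examples from my setup and may differ in yours):

```shell
# Swap the pgBouncer image on the existing Deployment in place.
# Check the actual Deployment name first with: kubectl -n dev get deploy
kubectl -n dev set image deployment/ws-pgbouncer \
  pgbouncer=registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos7-12.5-4.5.1

# Wait for the rollout to complete, then re-test the connection
kubectl -n dev rollout status deployment/ws-pgbouncer
```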



Simon

Jonathan S. Katz

Feb 18, 2021, 5:11:49 PM
to Simon Trigona, Postgres Operator, Paul Karayan
Hi Simon & Paul,

A quick follow-up: we may have discovered something on our end. I will follow up once we have more details, but we may have a root cause.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Jonathan S. Katz

Feb 18, 2021, 5:33:41 PM
to Simon Trigona, Postgres Operator, Paul Karayan
Hi Simon & Paul,

First, again thank you for your patience and the thorough test. The key was this line:

2021-02-17 00:11:11.398 UTC [14] LOG process up: PgBouncer 1.14.0, libevent 2.1.8-stable (epoll), adns: evdns2, tls: OpenSSL 1.1.1c FIPS  28 May 2019

Here is what happened:

Between 4.5.1 and 4.6.0, we changed the specific pgBouncer package that was being used to install pgBouncer into the pgBouncer container (I'm paid by the number of times I say "pgBouncer" in a sentence). In 4.5.1, the container was both using a newer version of pgBouncer (1.15) and was properly built against evdns, which is used by pgBouncer to make DNS lookups.

However, due to the change of package in 4.6.0, pgBouncer was not built against evdns, which is leading to the reported issues. I am still unsure why I do not see this in my environment, but at least we know this much.

While we evaluate how we want to handle this, there is a workaround: the pgBouncer container itself does not contain any Operator-specific logic, and as such, you can just retag the 4.5.1 container as 4.6.0, e.g.:


Understandably that is not ideal for a number of reasons, but this will allow you to use pgBouncer with 4.6.0.
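
A retag would look something like the following (the registry path and tags are illustrative; adjust them for your environment and the Postgres version you run):

```shell
# Pull the 4.5.1 pgBouncer image and retag it as 4.6.0, so the
# Operator deploys the working (evdns-linked) build under the
# tag it expects (registry path and tags are illustrative)
docker pull registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-12.5-4.5.1
docker tag  registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-12.5-4.5.1 \
            registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-12.5-4.6.0

# If your cluster pulls from a private registry, push the retagged
# image there as well
```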

Please let me know if you have any questions.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Simon Trigona

Feb 19, 2021, 7:43:10 AM
to Postgres Operator, jonath...@crunchydata.com, Postgres Operator, Paul Karayan, Simon Trigona
Hi Jonathan,

Thanks for tracking this down! I'm glad the mystery is solved.

Currently we're just running 4.6.0 in our dev environment so I will update the pgBouncer deployment to use the 4.5.1 container as you suggested and wait for a solution before deploying to our other environments.

Simon

Jonathan S. Katz

Feb 19, 2021, 2:46:10 PM
to Simon Trigona, Postgres Operator, Paul Karayan
Hi Simon,

That sounds great. Again, I will note that the pgBouncer container itself is "dumb" in a sense -- the management logic resides in the Operator. If you had to move forward with 4.6 (and I will note 4.6.1 just went out), you could safely use the pgBouncer 4.5.1 container.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Jonathan

Apr 15, 2021, 9:19:41 AM
to Postgres Operator, Jonathan, Postgres Operator, Paul Karayan, Simon Trigona
Hi,

I wanted to briefly revisit this as there have been some developments since this was first raised.

This issue came up again in a customer environment and we conducted a full analysis of it.

The tl;dr: prior to PGO release 4.6.2/4.5.2/4.4.3, if your pgBouncer ran in an environment with broken DNS (particularly IPv6), you would encounter the issue that was described.


The remediation that we used in the latest pgBouncer container builds was switching the async DNS library from c-ares to evdns. However, this remediation is for working around broken DNS: the proper fix is to check your DNS environment to ensure it is properly working and fix it ;-)
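
A quick way to sanity-check in-cluster DNS is to run a throwaway debug Pod and resolve the cluster Service over both IPv4 and IPv6 (the images, namespace, and Service name below are illustrative):

```shell
# Resolve the cluster Service from a throwaway Pod (image illustrative)
kubectl run -it --rm dns-debug --image=busybox --restart=Never -- \
  nslookup profig-pgo.pgo.svc.cluster.local

# Check AAAA (IPv6) resolution explicitly, since the broken-DNS
# cases here involved IPv6; netshoot ships dig (image illustrative)
kubectl run -it --rm dns-debug --image=nicolaka/netshoot --restart=Never -- \
  dig AAAA profig-pgo.pgo.svc.cluster.local
```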

Thanks,

Jonathan