Postgres Connections Randomly Dropping


Nigel Gutzmann

Apr 25, 2018, 12:59:51 PM
to Google Cloud SQL discuss
I have a Django and Celery application running inside Google Kubernetes Engine. I am connecting to my Cloud SQL instance (Postgres) through a Kubernetes service running the Cloud SQL Proxy. Database connections and queries generally work fine, but occasionally we get bursts of errors from connections breaking. They are raised in Python like this:

OperationalError: could not connect to server: Connection refused Is the server running on host "cloudsql-proxy-service" and accepting TCP/IP connections on port 3306?

or

OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

I can't find anything in the Cloud SQL instance logs that might explain it. There are some messages like this in the Cloud SQL Proxy logs:

2018/04/24 18:55:18 Instance <project_name>:us-central1:<instance_name> closed connection

But I can't reliably correlate the timestamps of those messages with the Python errors. I have tried setting CONN_MAX_AGE and TCP keepalives like this in Django's settings.py:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': '<db_name>',
        'USER': os.environ.get('DB_USER', None),
        'HOST': os.environ.get('DB_HOST', None),
        'PORT': os.environ.get('DB_PORT', None),
        'PASSWORD': os.environ.get('DB_PASSWORD', None),
        'CONN_MAX_AGE': int(os.environ.get('CONN_MAX_AGE', 0)),
        'OPTIONS': {
            'keepalives': 1,
            'keepalives_idle': 480,
            'keepalives_interval': 10,
            'keepalives_count': 3,
        },
    },
}

But that didn't seem to make a difference. We still get the same errors in bunches, about 20 errors over the span of 2-3 minutes, 2-3 times per day.

Dinesh (Google Platform Support)

Apr 25, 2018, 5:17:18 PM
to Google Cloud SQL discuss
Since you say you only receive these errors for 2-3 minutes at a time, a few times a day, I suspect your instance might be going through maintenance updates (which require an instance restart) at those times. Please review the operation logs of your PostgreSQL instance; you can view them in the Cloud Console, in the instance details view under the Operations tab. If you find that the instance was updated at the same time, that would explain these errors.

If that is the real cause of these errors, I would recommend configuring the Maintenance window and Maintenance timing settings to avoid surprises in the future.

Let me know if this helps.

Regards,

Nigel Gutzmann

Apr 25, 2018, 5:38:43 PM
to Google Cloud SQL discuss
Thanks for the suggestion Dinesh, but I don't think that's it. We are getting the errors much more frequently than there are events in the operations tab. Typically we get 2-3 bursts of errors per day, but there is generally just one operation: a backup. The timing of the operations also doesn't reliably coincide with the errors. Regardless, if the operations were a problem, would turning on HA fix the issue, or would that not help?

Any other ideas about why we could be seeing these connection problems?

Nigel G.

Dinesh (Google Platform Support)

Apr 26, 2018, 4:49:39 PM
to Google Cloud SQL discuss
Hi Nigel,

Thanks for the feedback. For a PostgreSQL instance, the failover instance is located in a different zone within the same configured region, so if the master cannot serve data from its primary zone, it fails over and continues serving data from the secondary zone. Per the Cloud SQL documentation, if an instance configured for HA experiences an outage or becomes unresponsive, Cloud SQL automatically switches to serving data from the secondary zone. It is recommended to configure all instances that contain production data for HA.

Regarding the connection problem, please check and verify the following points:

1. The default TCP/IP port that the Postgres server listens on is 5432. Is there a specific reason you are using port 3306, which is the default port for MySQL? Please correct this if it is a mistake.

2. Note that Cloud SQL has a maximum concurrent connections limit based on instance memory. Please check whether your instance is exceeding that connection limit during peak hours (see the check sketched after this list).

3. Please refer to this StackOverflow thread, which covers a similar problem and suggests troubleshooting steps.
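
For point 2, a minimal way to check this is to compare pg_stat_activity against max_connections. A sketch, assuming psycopg2 (which Django's Postgres backend already uses) and the same environment variables and placeholders as the settings.py above:

import os
import psycopg2

# Placeholders and env vars mirror the Django settings shown earlier.
conn = psycopg2.connect(
    host=os.environ.get('DB_HOST'),
    port=os.environ.get('DB_PORT'),
    dbname='<db_name>',
    user=os.environ.get('DB_USER'),
    password=os.environ.get('DB_PASSWORD'),
)
with conn.cursor() as cur:
    # Connections currently open on the instance.
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    # The instance's configured connection limit.
    cur.execute("SHOW max_connections;")
    limit = int(cur.fetchone()[0])
print("connections in use: %s of %s" % (in_use, limit))
conn.close()

Running this during peak hours would show whether you are anywhere near the limit.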

If the above does not help, please share your project number, PostgreSQL instance name, instance operation logs, and PostgreSQL error logs, along with the issue date, time, and duration, via a private email.

Regards,

Nigel Gutzmann

May 8, 2018, 4:31:31 PM
to Google Cloud SQL discuss
I sent a private reply, thanks Dinesh

Nigel Gutzmann

May 11, 2018, 1:59:17 PM
to Google Cloud SQL discuss
Hi Dinesh, did you get my private reply? I just want to make sure I sent it correctly; I'm not sure I'm using this group the right way.

Nigel G

Dinesh (Google Platform Support)

May 11, 2018, 6:40:00 PM
to Google Cloud SQL discuss
Hi Nigel, 

Yes, I received your private message and have replied on that thread.

Regards,

Nigel Gutzmann

May 22, 2018, 2:50:57 PM
to Google Cloud SQL discuss
For anyone who may be experiencing a similar situation, here's what the problem was!

Dinesh pointed out that the OperationalErrors and connection issues we were experiencing coincided with Kubernetes node pool autoscaling events. When the GKE cluster autoscaled, pods were sometimes rescheduled onto different nodes, and some of the cloudsql-proxy pods were terminated (specifically, we were running a DaemonSet, and when the node pool downsized, cloudsql-proxy pods would terminate). However, we have very long-running tasks that don't complete before the termination grace period of the cloudsql-proxy pods ends, so the pods were being removed and the database connections between the corresponding long-running tasks and the cloudsql-proxy pods were breaking prematurely. The issue was resolved by extending the termination grace period and adding a preStop command to delay the teardown of the cloudsql-proxy pods, so that all connections can drain before the pods are removed.
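
Roughly, the relevant part of the DaemonSet pod spec ends up looking like this (the 600-second values are just placeholders; size them to your longest-running task):

spec:
  template:
    spec:
      # Must be at least as long as the preStop sleep, or the kubelet
      # will force-kill the proxy before connections have drained.
      terminationGracePeriodSeconds: 600
      containers:
        - name: cloudsql-proxy
          # ... image and proxy command unchanged ...
          lifecycle:
            preStop:
              exec:
                # Keep the proxy alive so in-flight connections can finish.
                command: ["sleep", "600"]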

Nigel G.

Zach Hanna

Jun 28, 2018, 9:04:32 PM
to Google Cloud SQL discuss
Nigel,
Do you mind sharing the lifecycle preStop command you used to gracefully drain the connections before teardown?

Nigel Gutzmann

Jun 28, 2018, 9:07:31 PM
to Google Cloud SQL discuss
Hi Zach,

No problem, it's below:

          lifecycle:
            preStop:
              exec:
                command: ["sleep", "<time that you think your longest connection should last in seconds>"]