Performance problems migrating from NAT gateway to Cloud NAT/Private Google Access


Craig Finch

Feb 8, 2019, 8:48:27 PM
to gce-discussion
We've experienced performance problems with Cloud NAT, and I'm wondering if anyone else has seen similar problems.

Base configuration:
We have a VPC network with subnets (Private Google Access is not enabled). A pool of web servers with private IP addresses connects to a Cloud SQL instance via Cloud SQL Proxy. We have a NAT/gateway instance that handles all outbound traffic, and its return traffic, for instances in the VPC with private IP addresses. Inbound requests and their responses are handled by a GCP load balancer. The NAT/gateway instance in this configuration is small (2 vCPUs, 2.5 GB memory) but has performed very well under load for 2+ years.

The change:
I created a Cloud NAT gateway for the region. After business hours, I deleted the route that directed traffic to our NAT instance. Load on the NAT instance dropped as Cloud NAT took over. I tested key features of the web application, and everything worked fine when load on the web application was low.

The problem:
Overnight, Cloud SQL Proxy started experiencing problems on one web server. The next morning, as traffic picked up, active connections to the Cloud SQL DB spiked and our web application started experiencing intermittent outages. Cloud SQL Proxy crashed on the web server that had problems overnight, and the socket file for the primary SQL server disappeared. I restarted Cloud SQL Proxy on that server and the socket file came back, but the performance problems continued. I reverted the change by re-creating the route to direct traffic back to the old NAT instance. Performance immediately returned to normal.

Why did the move to Cloud NAT cause these problems? Cloud NAT should be far more performant than our little NAT instance. Should we have restarted Cloud SQL Proxy after changing the NAT to establish new connections? 

Routing traffic to Cloud NAT also enables Private Google Access automatically. Could this change have caused the problem? I'm going to try enabling Private Google Access for our testing environments to see if it causes problems independently from Cloud NAT.

Any suggestions or shared experience will be appreciated.

Tariq (Google Cloud Support)

Feb 17, 2019, 5:16:19 PM
to gce-discussion
I recommend reporting this as a private issue on the issue tracker [1], as it appears to be a specific issue that might require investigation by the Cloud NAT team (the gce-discussion group is for general GCE discussion).

Craig Finch

Feb 19, 2019, 6:18:01 PM
to gce-discussion
Tariq,
Thanks for the response. We realized that another problem, unrelated to the Cloud NAT change, happened around the same time. We think we found a bug in Google's instance CPU monitoring, which is affecting Cloud Load Balancer when CPU utilization is used to allocate requests among instances. I have a group of 3 instances and I'm watching the CPU load via Stackdriver Monitoring and via the "Monitoring" tab for each instance in the Cloud Console web interface. Several times per day, the LB suddenly directs up to 95% of traffic to a single instance, loading it up to 80 or 90% CPU and affecting our application performance. The other two instances are at less than 5% CPU while this is happening. I can see the difference in CPU load in Stackdriver, but in the Cloud Console, the CPU on ALL hosts is nearly zero. I can log into the instance and verify that the Stackdriver CPU load is right and the CPU monitoring in Cloud Console is wrong, so there's clearly a bug. I can only infer that this bug is throwing off load balancing, but I'm pretty sure that's what's happening.
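
For illustration only, the two symptoms described above (one instance hot while the rest are idle, and two monitoring sources disagreeing) could be flagged with a small check like the sketch below. It assumes the CPU readings have already been collected into plain dictionaries as fractions between 0 and 1. The function names, instance names, sample values, and the 10% tolerance are hypothetical; the 80% and 5% thresholds come from the figures in this post. It does not call any Google Cloud API.

```python
from typing import Dict, List


def find_skewed_instances(
    cpu_by_instance: Dict[str, float],
    hot_threshold: float = 0.80,   # "80 or 90% CPU" from the post
    idle_threshold: float = 0.05,  # "less than 5% CPU" from the post
) -> List[str]:
    """Return instances that are hot while every other instance is idle."""
    skewed = []
    for name, cpu in cpu_by_instance.items():
        others = [v for n, v in cpu_by_instance.items() if n != name]
        if cpu >= hot_threshold and others and all(v < idle_threshold for v in others):
            skewed.append(name)
    return skewed


def sources_disagree(
    source_a: Dict[str, float],
    source_b: Dict[str, float],
    tolerance: float = 0.10,  # hypothetical tolerance, tune as needed
) -> List[str]:
    """Return instances whose readings differ by more than `tolerance`
    between two monitoring sources (only instances present in both)."""
    common = source_a.keys() & source_b.keys()
    return sorted(n for n in common if abs(source_a[n] - source_b[n]) > tolerance)


# Hypothetical sample readings shaped like the situation described above.
stackdriver = {"web-1": 0.85, "web-2": 0.03, "web-3": 0.04}
console = {"web-1": 0.01, "web-2": 0.02, "web-3": 0.01}

print(find_skewed_instances(stackdriver))
print(sources_disagree(stackdriver, console))
```

Running a check like this against both sources at the same timestamps would give concrete evidence to attach to a bug report.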

    Craig 

Germán (Google Cloud Support)

Apr 3, 2019, 3:21:57 PM
to gce-discussion
Hello Craig,

Please report this as a private issue on the issue tracker [1], as it appears to be a specific issue that might require investigation by the product team (the gce-discussion group is for general GCE discussion).
