We've experienced performance problems with Cloud NAT, and I'm wondering if anyone else has seen similar problems.
Base configuration:
We have a VPC network with subnets (Private Google Access is not enabled). A pool of web servers with private IP addresses connects to a Cloud SQL instance via Cloud SQL Proxy. A NAT/gateway instance handles all outbound traffic, and the related return traffic, for instances in the VPC that have only private IP addresses. Inbound requests to the web servers, and the related responses, go through a GCP load balancer. The NAT/gateway instance in this configuration is small (2 vCPUs, 2.5 GB memory) but has performed very well under load for 2+ years.
The change:
I created a Cloud NAT gateway for the region. After business hours, I deleted the route that directed traffic to our NAT instance. Load on the NAT instance dropped as Cloud NAT took over. I tested key features of the web application, and everything worked fine when load on the web application was low.
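For anyone who wants to reproduce the change, here is a minimal sketch of roughly equivalent `gcloud` commands, built as a dry run in Python so nothing is executed. Every resource name and the region are placeholders, not our real values, and the sketch assumes a NAT gateway with automatically allocated external IPs that covers all subnet ranges, which may not match every setup.

```python
import shlex

# Placeholder names for illustration only; none of these come from the real setup.
REGION = "us-central1"
NETWORK = "example-vpc"
ROUTER = "example-nat-router"
NAT = "example-cloud-nat"
OLD_ROUTE = "example-nat-instance-route"


def cutover_commands():
    """Return the gcloud invocations for the cutover, in order, as argv lists."""
    return [
        # Cloud NAT is attached to a Cloud Router in the same region.
        ["gcloud", "compute", "routers", "create", ROUTER,
         f"--network={NETWORK}", f"--region={REGION}"],
        # Assumption: auto-allocated NAT IPs, applied to all subnet ranges.
        ["gcloud", "compute", "routers", "nats", "create", NAT,
         f"--router={ROUTER}", f"--region={REGION}",
         "--auto-allocate-nat-external-ips", "--nat-all-subnet-ip-ranges"],
        # Removing the custom route stops traffic from going to the NAT instance.
        ["gcloud", "compute", "routes", "delete", OLD_ROUTE, "--quiet"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in cutover_commands():
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review the order before running anything: the route is deleted last, so the NAT instance keeps carrying traffic until Cloud NAT exists.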
The problem:
Overnight, Cloud SQL Proxy started experiencing problems on one web server. The next morning, as traffic picked up, active connections to the Cloud SQL DB spiked and our web application started experiencing intermittent outages. Cloud SQL Proxy crashed on the web server that had problems overnight, and the socket file for the primary SQL server disappeared. I restarted Cloud SQL Proxy on that server and the socket file came back, but the performance problems continued. I reverted the change by re-creating the route to direct traffic back to the old NAT instance. Performance immediately returned to normal.
Why did the move to Cloud NAT cause these problems? Cloud NAT should be far more performant than our little NAT instance. Should we have restarted Cloud SQL Proxy after changing the NAT to establish new connections?
Configuring Cloud NAT for a subnet also enables Private Google Access on that subnet automatically. Could this change have caused the problem? I'm going to try enabling Private Google Access for our testing environments, without Cloud NAT, to see if it causes problems on its own.
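The Private Google Access test could be scripted along these lines. This is a sketch only: the subnet name and region are hypothetical, and the helper just builds and prints the `gcloud` commands rather than running them.

```python
import shlex

# Hypothetical test-environment values; substitute real ones before use.
TEST_SUBNET = "example-test-subnet"
TEST_REGION = "us-central1"


def private_google_access_commands(subnet, region, enable=True):
    """Build gcloud argv lists to toggle Private Google Access and check the result."""
    flag = ("--enable-private-ip-google-access" if enable
            else "--no-enable-private-ip-google-access")
    return [
        # Toggle Private Google Access on a single subnet.
        ["gcloud", "compute", "networks", "subnets", "update", subnet,
         f"--region={region}", flag],
        # Read the setting back to confirm the change took effect.
        ["gcloud", "compute", "networks", "subnets", "describe", subnet,
         f"--region={region}", "--format=value(privateIpGoogleAccess)"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in private_google_access_commands(TEST_SUBNET, TEST_REGION):
        print(shlex.join(argv))
```

Calling the helper with `enable=False` produces the commands to switch the setting back off, which keeps the test reversible.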
Any suggestions or shared experience will be appreciated.