Hello,
As per the subject, I am writing to report a problem we have had for the past 48 hours on a project we urgently need to deploy to staging and production.
Here is the situation:
1) We have a project with a few Google App Engine services
2) Repositories are on GitLab
3) Deploys happen through GitLab CI
4) The dev branch is deployed to our own GCP account and project
5) The staging and production branches are deployed to the client's GCP account, to the staging and production projects respectively
6) All deploys worked properly until Monday: we deployed new versions to dev, then to staging, then to prod (the project is still private but going live ASAP)
7) Suddenly, deploys to staging and production began giving this error:
After research and multiple attempts, I gave up and decided to wait as I suspected this was due to some temporary malfunction on GAE's part.
8) The next day the error shown at point 7 was gone, but the service update step of the deploy began timing out, and it has timed out on every one of the dozens of attempts I have made over the past 24 to 36 hours
9) The dev environment continues to deploy successfully with acceptable deploy times (under 10 minutes for each service pipeline)
I have done a fair amount of googling without finding a definitive answer. One thread on SO implied that this could be related to IP quotas, but we have hit those in the past and always received this specific error in those circumstances:
ERROR: (gcloud.app.deploy) INVALID_ARGUMENT: The following quotas were exceeded: IN_USE_ADDRESSES (quota: 8, used: 8 + needed: 2).
I don't even think the step that's causing us issues now is reached if the IP quota is exceeded.
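To rule the quota case out mechanically in the CI job rather than by eye, something like the following could scan the deploy output for that message. This is only a sketch: the pattern is inferred from the single error line quoted above, and the function name `parse_quota_errors` is my own, not part of gcloud.

```python
import re

# Pattern inferred from the one quota error quoted above (an assumption,
# not a documented gcloud output format).
QUOTA_RE = re.compile(
    r"(?P<name>[A-Z_]+) \(quota: (?P<quota>\d+), "
    r"used: (?P<used>\d+) \+ needed: (?P<needed>\d+)\)"
)


def parse_quota_errors(deploy_output):
    """Return one dict per exceeded quota found in the deploy output."""
    results = []
    for match in QUOTA_RE.finditer(deploy_output):
        results.append({
            "name": match.group("name"),
            "quota": int(match.group("quota")),
            "used": int(match.group("used")),
            "needed": int(match.group("needed")),
        })
    return results


sample = (
    "ERROR: (gcloud.app.deploy) INVALID_ARGUMENT: The following quotas "
    "were exceeded: IN_USE_ADDRESSES (quota: 8, used: 8 + needed: 2)."
)
print(parse_quota_errors(sample))
```

An empty list for the current failing deploys would support the point that the timeout is a different failure from the quota one.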
When I review the GCP UI during the very long wait, I see that the new version is created, but traffic is not redirected to it (because the new version never finishes booting). All I see in the logs of this temporary version is the first "create version" log line. A few minutes after the deploy fails, this version gets deleted and traffic stays with the previous version.
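The same state should also be visible from the CLI while the deploy is hanging. Below is a rough sketch that only builds the gcloud commands I would run, without executing them; the project, service, and version values are placeholders, not our real ones, and the helper name `inspection_commands` is made up for illustration.

```python
import shlex


def inspection_commands(project, service, version):
    """Build (but do not run) gcloud commands for inspecting a stuck version."""
    return [
        # List the service's versions to see whether the new one exists
        # and how traffic is split.
        ["gcloud", "app", "versions", "list",
         f"--service={service}", f"--project={project}"],
        # Show the details of the newly created version.
        ["gcloud", "app", "versions", "describe", version,
         f"--service={service}", f"--project={project}"],
        # List App Engine operations, which include version creation.
        ["gcloud", "app", "operations", "list", f"--project={project}"],
        # Read whatever logs the new version has produced so far.
        ["gcloud", "app", "logs", "read",
         f"--service={service}", f"--version={version}",
         f"--project={project}"],
    ]


# Placeholder values, purely for illustration.
for cmd in inspection_commands("my-staging-project", "api", "20240101t000000"):
    print(shlex.join(cmd))
```

Building the commands as lists keeps them easy to pass to `subprocess.run` from a CI script if someone wants to capture the output alongside the deploy log.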
Needless to say, this is causing enormous frustration, and as a long-time GAE user I am very concerned, having never run into consistent timeouts of this sort.
As this seems to be an issue that multiple Node.js flex users have had every now and then, it would be useful to understand (1) what could be causing this and (2) how to debug it when the deploy log gives no details about the failure.
This is time-sensitive, so any help is deeply appreciated.
Thanks,
Salvatore