Instance gets terminated without warming up another


Alan deLespinasse

Aug 3, 2020, 1:26:37 PM
to Google App Engine
Python 3.7, Standard Environment. Here's my entire app.yaml:

runtime: python37
instance_class: F4
automatic_scaling:
 min_instances: 1
 max_instances: 10

inbound_services:
- warmup



The problem: several times a day, the instance (there's only one) gets terminated, seemingly at random. That in itself is fine; I understand that App Engine doesn't guarantee a single instance will last forever. What's not fine is that the new instance doesn't start warming up until after the old one stops, so any real request made during warmup sees very high latency, sometimes over 30 seconds. Shouldn't GAE wait to terminate the old instance until the replacement has finished warming up?

I'm pretty sure the process isn't crashing or running out of memory or anything. Most of the time, this service just handles one request every minute (it's a cron job), and the terminate signals seem to happen at random times in between those requests.

Maybe there's just something I don't understand about GAE configuration.

I could increase min_instances to 2, of course, which I assume would make this much less likely, but I'd rather not. This isn't a high-traffic or (currently) mission critical server. On the other hand, having occasional requests take 30+ seconds can be pretty annoying.

Is there something else I can do?


Typical logs are below (stdout, stderr, and request_log, all versions). You can see the cron job at the top at 22:39:00, 22:40:00, 22:41:00.

At 22:41:47 is the terminate signal, and at 22:41:48 the warmup request for the new instance. There's an exception right after that that I assume is triggered by the process terminating (related to a Firestore connection, I think).

Then at 22:42:00, the next cron request happens, and I think it actually triggers a second new instance because there's no instance currently ready to go. If it just waited for the first one to finish warming up, this request would have taken about 19 seconds. But since it instead starts a second one, it takes 34 seconds, and increases my bill as well.


Alan deLespinasse

Aug 3, 2020, 2:08:06 PM
to Google App Engine
I'm also not clear on why it would take on the order of 30 seconds to warm up an instance, but I assume it's an unavoidable property of GAE. There's nothing in my code that would make it take so long, unless it has to install the dependencies every time or something.

Elliott (Cloud Platform Support)

Aug 12, 2020, 9:00:55 PM
to Google App Engine
Hello Alan,

As you have pointed out from the documentation: (I know you’ve read it but please bear with me.)

App Engine attempts to keep manual and basic scaling instances running indefinitely. However, at this time there is no guaranteed uptime for manual and basic scaling instances. Hardware and software failures that cause early termination or frequent restarts can occur without prior warning and can take considerable time to resolve; thus, you should construct your application in a way that tolerates these failures.

You would like to know whether App Engine can hold off on shutting down the instance until a new one has spun up. This is not covered in the public documentation, and given what is happening in your scenario, I would have to reach out to an App Engine specialist to confirm. I want to make sure.

You have provided a workaround to this behavior to set up more instances but you would like to know what are your other options. (Again, please bear with me.)

From the documentation I included you may:

Reduce the amount of time it takes for your instances to restart or for new ones to start.

For long-running computations, periodically create checkpoints so that you can resume from that state.

Your app should be "stateless" so that nothing is stored on the instance.

Use queues for performing asynchronous task execution.

If you configure your instances to manual scaling:
Use load balancing across multiple instances.

Configure more instances than required to handle normal traffic.
Write fall-back logic that uses cached results when a manual scaling instance is unavailable.

To summarize, with the information I have right now, you have two options. The first and more direct one is to add another instance, which would affect your budget; the other is to apply the advice above to your application.

You also have another issue where it may take 30 seconds to warm up an instance. This information is private and would require looking at your project. I can help you as much as I can using this forum but sharing specific information about your project is not good for security.

I will wait for your response.




Alan deLespinasse

Aug 12, 2020, 11:54:16 PM
to Google App Engine
Hi, thanks for your response!

I would love to decrease the startup time. I assumed there wasn't any way I could decrease it much, since I'm basically just running a simple Flask server with mostly default app configuration. It does have some third-party package dependencies, but I don't know of any reason why they'd be that slow to load. When I run it locally for development, it starts up in less than a second. (I don't use gunicorn locally; maybe I'll try that and see if it's slow.)

It's been a while since I ran a simple Standard Environment Python app with automatic scaling, and when I did it was probably old-school Python 2.x. I think they used to start up pretty quickly, but I didn't know if it would be the same in newer environments.

I'm on vacation this week, but maybe next week I'll experiment with adding some extra logging before any dependencies are loaded, and when Flask actually starts serving. It would be interesting to see the timing on that.
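
A minimal sketch of that kind of startup logging, assuming it sits at the very top of the app's main module. The `mark` and `timed_import` helpers are hypothetical names, and `json` and `decimal` are stand-ins for the real third-party dependencies, so that the example runs anywhere.

```python
import importlib
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("startup")

# Taken as early as possible, before any heavy imports.
_T0 = time.monotonic()


def mark(label):
    """Log and return the seconds elapsed since this module started loading."""
    elapsed = time.monotonic() - _T0
    log.info("startup: %s at +%.3fs", label, elapsed)
    return elapsed


def timed_import(name):
    """Import a module by name and log how long the import took."""
    start = time.monotonic()
    module = importlib.import_module(name)
    duration = time.monotonic() - start
    log.info("startup: import %s took %.3fs", name, duration)
    return module, duration


mark("process start")
# Stand-ins for the app's real dependencies.
json_mod, json_secs = timed_import("json")
decimal_mod, decimal_secs = timed_import("decimal")
ready_at = mark("ready to serve")
```

Comparing the "ready to serve" timestamp with the time of the warmup request in the request log should show how much of the delay is spent in the app's own imports and how much happens before the code even starts running. For a local breakdown, Python 3.7+ also accepts `python -X importtime`, which prints per-module import times to stderr.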

I could play with gunicorn configuration, but I assumed the defaults were reasonable.

Meanwhile, my coworkers and I actually decided it was a good idea to increase the minimum instances anyway, so this isn't really bothering us now, but it would be nice to solve the mystery. 

Thanks!

Alan

Alan deLespinasse

Aug 13, 2020, 9:32:38 AM
to Google App Engine
Possibly relevant, but basically comes to the conclusion of "increase the minimum instances": https://stackoverflow.com/questions/57054132/improving-cold-start-up-times-on-google-app-engine-running-django-on-python-3-7

Also, is there any way to tell why the instance was terminated? There's nothing in the logs that makes it clear to me. I've been assuming that it was being preempted by some other process. For some reason it seems to happen a lot more frequently in our production environment than in our staging environment.

Elliott (Cloud Platform Support)

Aug 13, 2020, 9:12:04 PM
to Google App Engine

Hello Alan,


I can check whether there was a common issue at the time the new instance was created, but if you want us to look specifically at your logs, the best way forward is to open a support case. We offer affordable plans here[1], such as Role-Based Developer support, which may interest you. Otherwise, you may open an issue on the free, community-based Google Issue Tracker[2], although it is meant for reporting bugs rather than for individual support.


You asked if there is a way to know for yourself why the instance was terminated. Beyond what you can find in Stackdriver, we can see more on our end. With a support case, we can offer a more detailed view of the possible cause.


The link you shared with me here[3] is very helpful in your scenario. If you would like to discuss a point in particular, please respond to this thread. We’re always interested.


I will wait for your reply.


[1] https://cloud.google.com/support

[2] https://issuetracker.google.com/issues/new?component=187191&template=0

[3] https://stackoverflow.com/questions/57054132/improving-cold-start-up-times-on-google-app-engine-running-django-on-python-3-7

