"The process handling this request has unexpectedly died... (Error Code 203)"


Charles Batty-Capps

Aug 4, 2018, 11:53:21 PM
to Google App Engine
To whomever at Google with knowledge of this error message,
We are using App Engine Standard with a mix of Java 7 and Java 8 (as of Tuesday our production environment will be entirely on Java 8). We use basic scaling, generally with a maximum of ~20 instances, and we rarely see services scale above 5 instances. We frequently see this error in our logs:

The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)

We used to think this was due to an OOM (perhaps an OOM is one possible root cause), but we've been seeing the error more and more frequently, and when it happened the Cloud Console showed the services at low memory usage, with no spike in memory or any other anomaly. So it seems fairly safe to assume that this error has multiple root causes (perhaps any Java Error?).

The error seems to correlate with high traffic, but as mentioned above, the services hitting it are nowhere near their configured maximum number of instances.
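
For context, our per-service scaling configuration is essentially the following appengine-web.xml (values shown here are representative rather than exact):

<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <runtime>java8</runtime>
  <threadsafe>true</threadsafe>
  <basic-scaling>
    <!-- services rarely get anywhere near this cap -->
    <max-instances>20</max-instances>
  </basic-scaling>
</appengine-web-app>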

So my questions to you are:
  • What are all the root causes of this issue?
  • How can we troubleshoot it?
  • FYI, I don't believe this is a performance problem; we've done a lot of performance work and requests are generally under 500ms for all endpoints, except for a few that may take up to 10s under load. When we see this error, the request time is often under 100ms, and we haven't been seeing any 60s timeouts.

Some troubleshooting info
This is happening mostly in 2 of our microservices, and in deferred tasks of a third. The one that sees the error most often seems to scale the number of "active" instances up and down fairly frequently. I'm not sure how "active" is determined, beyond the obvious criterion of whether the instance is receiving traffic.


This happens for a wide variety of requests: requests from our app to our mobile proxy service, requests between services, and deferred tasks. It happens for some slow requests, some fast requests, background threads, etc., so it's quite difficult to pinpoint the cause. We could do some blanket work to "generally improve performance," but that's a rather inefficient way to attack the problem. I appreciate any help on this matter. I may also create a support ticket, but the ticket system often doesn't provide very useful information.

Here are some example requests that all had this error:

[screenshots of the failing requests omitted]

Thanks for any help!

Charles Batty-Capps

Aug 5, 2018, 12:06:12 AM
to Google App Engine
I would add that when this error happens, it seems to have a cascading effect where multiple requests that were executing very close together in time all have the same failure. In other words, it happens in "clumps" where each clump seems to have a discrete root cause that leads to all the requests in the clump failing together.

Charles Batty-Capps

Aug 5, 2018, 12:07:24 AM
to Google App Engine
We see "spikes" of 5xx errors -- all of the 5xx errors being due to this issue - 


Nick

Aug 5, 2018, 4:53:43 PM
to Google App Engine
When your application fails to handle a request (i.e. an exception propagates out of the request handler rather than an error code being returned), App Engine treats it as unrecoverable.

This leads it to discard the instance, and to do so it has to terminate any concurrent requests on that instance. Any App Engine API call in flight on those requests will then fail and be recorded as a 203.

So your causation is probably the wrong way around.
I'd look closely for errors in your logs immediately preceding the 203 on the same instance - the stack trace should tell you which application error is causing this. You probably have a couple of systemic, data-driven failures (like an NPE) in a service that is causing it.

Note that you can probably still turn off concurrent request handling per service, so you might be able to isolate this more effectively at the cost of running more instances - not sure about that.
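
If memory serves, that's the threadsafe element in appengine-web.xml for the Java standard runtime - roughly the following, but double-check the docs before relying on it:

<!-- appengine-web.xml: with threadsafe=false this service handles one
     request per instance at a time (no concurrent requests) -->
<threadsafe>false</threadsafe>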

Charles Batty-Capps

Aug 5, 2018, 7:56:04 PM
to Google App Engine
Your post may have given me an idea, but the root cause isn't what you think. We use a REST framework for all our services (currently the latest version of Jersey), so I am 100% certain that an unexpected application-level exception results in a 500 error and the stack trace is logged by an exception handler in Jersey. When this happens we never see anything in the logs except the process-death error.

It stands to reason that an exception might be thrown in the Jersey layer before the request reaches our application code, and that somehow nothing gets logged in that case.

I'll investigate options with Jersey. Maybe there's a thread pool that needs reconfiguring or something.
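
For reference, the kind of catch-all mapper I'm relying on is roughly the following (simplified; class and logger names here are just illustrative):

import java.util.logging.Level;
import java.util.logging.Logger;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.ExceptionMapper;
import javax.ws.rs.ext.Provider;

// Catch-all mapper: logs any uncaught exception thrown from a resource
// method and turns it into a plain 500 response instead of letting it
// propagate out of the request handler.
@Provider
public class CatchAllExceptionMapper implements ExceptionMapper<Throwable> {
    private static final Logger log =
        Logger.getLogger(CatchAllExceptionMapper.class.getName());

    @Override
    public Response toResponse(Throwable t) {
        log.log(Level.SEVERE, "Unhandled exception in resource method", t);
        return Response.status(Response.Status.INTERNAL_SERVER_ERROR).build();
    }
}

One caveat I still need to verify: a mapper like this only sees exceptions thrown after Jersey has dispatched to our code, which is why an earlier failure inside the container could plausibly go unlogged.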

Dan S (Cloud Platform Support)

Aug 6, 2018, 5:13:27 PM
to Google App Engine
There are many possible causes for this kind of issue, such as slow-loading apps, performance settings, warmup requests, delays associated with logging, and delays associated with UrlFetch. You can find more details about the root causes and how to avoid these errors in the following documentation [1].

Also note that in Java the runtime may not terminate the process after a deadline is exceeded, which could cause problems for future requests to the same instance. "To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit," as per the docs [2].
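
As an illustration only (not an official recommendation), in the Java standard runtime you can check how much of the request deadline remains and stop early with some headroom, for example:

import com.google.apphosting.api.ApiProxy;

public class DeadlineAwareWork {
    // Sketch: check the remaining request time and stop early with
    // headroom, rather than relying on the deadline being hit.
    static boolean shouldStopSoon() {
        long remainingMillis =
            ApiProxy.getCurrentEnvironment().getRemainingMillis();
        return remainingMillis < 5000; // 5s of headroom; threshold is illustrative
    }
}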

Charles Batty-Capps

Aug 6, 2018, 8:54:35 PM
to Google App Engine
I can look into this, but I'm not sure that you carefully read my post. I'm getting a "process has died unexpectedly" error, NOT a "deadline exceeded error".

Dan S (Cloud Platform Support)

Aug 8, 2018, 4:19:47 PM
to Google App Engine
Hello Charles,

In fact, I understood that part of your issue, and I believe Nick gave a good answer to it. However, my concern and my answer relate to the HTTP 500 server errors displayed in your screenshots. Sorry for not making that clear.

Charles Batty-Capps

Aug 8, 2018, 5:19:02 PM
to Google App Engine
Yep, and thanks for the info. Our 500 errors were due to the "process death" issue, though, and their latencies were well under the 60s deadline. So JFYI, I don't think the deadline exceeded error is related to what we're experiencing. 

I will investigate when I have a chance to see whether our REST framework (latest version of Jersey) is somehow throwing uncaught exceptions before our application code runs. That's currently my best guess about what's going on, because an uncaught exception anywhere in our application code would be logged and result in a 500 error.

Nick

Aug 9, 2018, 6:16:52 PM
to Google App Engine
I took a quick look at the docs but couldn't find an answer, so you may want to test this for yourself; it may well be that returning a 500 has the same result as not handling the exception - instance termination.

Just deploy a test app that returns a 500 (possibly any 5xx) - App Engine specifically tells you in the logs if other requests will be killed.
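
Something as small as this should do for the test (sketch only):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal handler that always returns a 500, so you can watch in the logs
// what the scheduler does to the instance and its other in-flight requests.
public class AlwaysFiveHundredServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, "test 500");
    }
}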

edgaral...@google.com

Aug 10, 2018, 6:46:07 PM
to google-a...@googlegroups.com

When an instance returns too many sequential 5xx errors, our instance scheduling system will consider the instance unhealthy. The instance scheduler will then terminate this instance.


When an instance gets terminated while it still has a request queued, the queued request will throw the 203 error. While the instance is being terminated, no new requests get queued for that instance by our scheduler.


This means that the 203 only gets thrown when there's a request queued when the instance gets terminated, and only for the request that was queued.


The root cause is (or should be) that our instance scheduler terminates an instance that serves too many sequential 5xx errors, and this is expected and desired behavior.

The problem that actually needs to be addressed is the high incidence of 5xx errors. You could filter the logs by the instance ID and look at the 5xx errors prior to the instance shutdown to verify this claim.
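
For example, an advanced log filter along these lines (the instance ID is a placeholder) should surface the 5xx requests served by a given instance before it was shut down:

resource.type="gae_app"
protoPayload.instanceId="YOUR_INSTANCE_ID"
protoPayload.status>=500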



Charles Batty-Capps

Aug 10, 2018, 11:35:00 PM
to Google App Engine
Thanks edg...@google.com,

You've given the first genuinely useful answer! It makes more sense that this could be due to the rate of 500 errors in our services. 

However, the service in question has only twenty-eight 5xx errors in the past week that are not due to "The process handling this request unexpectedly died."
In contrast, it has about 200 errors in the past week that ARE due to "The process handling this request unexpectedly died."

I'm a bit incredulous that 28 5xx errors produced a high enough error rate to generate all these "process died" errors, so I suspect there is some non-ideal behavior in the instance scheduler, or a bug on Google's end.

That being said, thank you, because this gives me a lead at least. 

We do have 2 suspicious 500-error requests in the past week that resulted in 204 error codes: "A problem was encountered with the process that handled this request, causing it to exit."
So I'll address that issue; it's one place where we're still using a raw HttpServlet instead of a proper REST framework.
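
The fix there is probably just a blanket guard in that servlet so nothing escapes the handler, roughly like this (servlet and method names are illustrative):

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LegacyEndpointServlet extends HttpServlet {
    private static final Logger log =
        Logger.getLogger(LegacyEndpointServlet.class.getName());

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        try {
            handle(req, resp); // existing handler logic
        } catch (Exception e) {
            // Log and answer with a 500 ourselves rather than letting the
            // exception escape the servlet.
            log.log(Level.SEVERE, "Unhandled exception in legacy endpoint", e);
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
        }
    }

    private void handle(HttpServletRequest req, HttpServletResponse resp)
            throws Exception {
        // ... existing endpoint logic ...
    }
}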

As I said, I am still finding it very hard to believe that there's not an issue on Google's end. We're getting hundreds and hundreds of these 203 errors, far away in time from the legitimate 500 errors that are due to our application code.

Thanks,
Charles

edgaral...@google.com

Aug 14, 2018, 5:37:11 PM
to Google App Engine
If, filtering by instance ID, you do not find 5xx errors leading up to the instance unexpectedly dying, it might be worth opening an issue tracker report to investigate further. Since the root cause explained earlier does not seem to apply here, troubleshooting would require looking at which services the workload uses and at what is happening right before the instance dies. If you believe this is an issue on Google's end, I invite you to open a private thread from here:

Provide this Google Groups link for reference, as well as your Project ID and the names of the affected resources. The thread will be private, meaning only you and the Google Cloud Platform Support team will be able to see it. They will be able to help you troubleshoot the issue and confirm whether this is an issue with the service. Alternatively, if you have a support package, I would also suggest opening a ticket with Google Cloud Platform Support: