Hey Damith,
I believe I can shed some light on this situation. As explained in the docs,
If the DeadlineExceededError is caught but a response is not produced quickly enough (you have less than a second), the request is aborted and a 500 internal server error is returned.
Another possible cause of 104 errors is:
In the Java runtime, if the DeadlineExceededException is not caught, an uncatchable HardDeadlineExceededError is thrown. The instance is terminated in both cases, but the HardDeadlineExceededError does not give any time margin to return a custom response. To make sure your request returns within the allowed time frame, you can use the ApiProxy.getCurrentEnvironment().getRemainingMillis() method to checkpoint your code and return if you have no time left. The Java runtime page contains an explanation of how to use this method. If concurrent requests are enabled through the "threadsafe" flag, every other running concurrent request is killed with error code 104.
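To make the checkpointing idea concrete, here is a minimal sketch rather than a definitive implementation. The class name, the method name and the 2-second safety margin are my own assumptions, and the remaining-time source is passed in as a LongSupplier so the snippet compiles without the App Engine SDK. Inside a real request handler you would pass () -> ApiProxy.getCurrentEnvironment().getRemainingMillis() instead.

```java
import java.util.List;
import java.util.function.LongSupplier;

/** Sketch: run work in small steps and stop once the request is low on time. */
final class DeadlineCheckpoint {
    // Assumed safety margin (hypothetical value); it should cover the time
    // needed to build and send the response after the loop exits.
    static final long SAFETY_MARGIN_MILLIS = 2_000;

    /**
     * Runs the steps in order and returns how many completed.
     * On App Engine, remainingMillis would typically be
     * () -> ApiProxy.getCurrentEnvironment().getRemainingMillis().
     */
    static int runUntilLowOnTime(List<Runnable> steps, LongSupplier remainingMillis) {
        int completed = 0;
        for (Runnable step : steps) {
            // Checkpoint before each step: bail out while there is still
            // enough time left to return a normal (possibly partial) response.
            if (remainingMillis.getAsLong() < SAFETY_MARGIN_MILLIS) {
                break;
            }
            step.run();
            completed++;
        }
        return completed;
    }
}
```

The caller can compare the returned count with the number of steps to decide whether to return a partial result or hand the remainder off elsewhere.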
In line with the first passage quoted above, it's possible that Datastore latency (the kind that would make Datastore calls run long enough for the request itself to hit a deadline error) is also making AppStats take too long to record the DeadlineExceeded exception, so that the response is not produced "quickly enough", leading to the observed error.
In general, it appears that code written to handle a DeadlineExceeded error, while a good thing, can make it easier to tolerate a system that routinely runs very close to the deadline, to the point where a slight increase in Datastore latency causes a certain proportion of requests to fail. Because AppStats is used to capture exceptions, and the deadline itself is announced via an exception (which sets the AppStats machinery to work, possibly for long enough to cause a hard crash), this can lead to some moderately complex failure scenarios.
I believe this entire class of errors could be avoided by finding whatever long-running activity is causing the requests to run so close to the deadline and moving it to a Task Queue, or to some other form of processing that doesn't take place directly within the App Engine request handler. Another option would be to switch to Basic scaling, which does not have a 60-second deadline.
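As a rough illustration of moving the slow work out of the request handler, here is a small sketch under stated assumptions. DeferredWork and TaskEnqueuer are hypothetical names of mine, not App Engine types; they only stand in for whatever queue you use so that the snippet compiles on its own. With the App Engine Task Queue API, the enqueue step would look something like QueueFactory.getDefaultQueue().add(TaskOptions.Builder.withUrl("/worker").param("payload", payload)), where "/worker" is an assumed handler URL.

```java
/** Sketch: the request handler only enqueues the slow work and returns at once. */
final class DeferredWork {
    /** Hypothetical stand-in for a task queue client (not an App Engine type). */
    interface TaskEnqueuer {
        void enqueue(String payload);
    }

    /**
     * Hands the payload to the queue instead of processing it inline, so the
     * request finishes well before the deadline. The slow Datastore work then
     * runs later in whatever worker consumes the queue.
     */
    static String handleRequest(String payload, TaskEnqueuer queue) {
        queue.enqueue(payload);
        return "queued";
    }
}
```

The point of the design is that the handler's running time no longer depends on Datastore latency, only on how long the enqueue call takes.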