Server 5xx Errors in the Dashboard - Invisible in the Logs - Likely Causing Taskqueue to Fail

Kaan Soral

Mar 19, 2014, 7:36:30 PM
to google-a...@googlegroups.com
I have a custom mapreduce implementation and noticed that some entities weren't being processed when they should have been, so I've spent the last 24 hours investigating the issue.
I built a system to pinpoint exactly what is failing. I was 99% sure the taskqueue was silently failing without anyone noticing, and my main goal was to prove it.

In my initial test, I detected that the number of workers that actually ran as a result of taskqueue.add was less than the number of taskqueue.add executions.
I then improved the logging routine, which uses sharded memcache counters, to use more shards so I could be sure this was really the case, deploying improvements 2-3 times.
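
Roughly, the logging counters work like this - a simplified sketch, the names and shard count here are illustrative, not my exact code:

    import random
    from google.appengine.api import memcache

    NUM_SHARDS = 20  # more shards -> less contention on a single counter key

    def incr_counter(name):
        # Pick a random shard and atomically increment it.
        shard = random.randint(0, NUM_SHARDS - 1)
        memcache.incr('%s-shard-%d' % (name, shard), initial_value=0)

    def get_counter(name):
        # Sum all shards to read the total.
        keys = ['%s-shard-%d' % (name, i) for i in range(NUM_SHARDS)]
        return sum(memcache.get_multi(keys).values())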

After the logging system was perfected, I tested the system 3 times. All 3 times the mapreduce system processed all 259656 entities in about a minute - very impressive, no tasks lost - so I couldn't prove my theory.
(In all of my earlier trials I was logging each step, and the entity count was always less than the actual amount; one time it was 200023, missing about 20% of the entities, which suggested the taskqueue was flawed.)

However, during these perfect trials there were no Server 5xx errors, whereas during my earlier, pre-logging trials there were unexplained Server 5xx errors.
My hunch is that the re-deployments caused the app to be relocated to healthy machines, while the 5xx errors and the invisible taskqueue issues happened on unhealthy ones.

Any ideas?
In the meantime, I will try again later
server5xx.png

Kaan Soral

Mar 19, 2014, 8:18:07 PM
to google-a...@googlegroups.com
In all of the above tests no actual work was done, so the mapreduce routine was pretty lightweight. After those tests, I used the routine to do some real work:
TOTAL test8
Splitter Out: 100 In: 100
Splitter Workers Out: 127 In: 127
Workers Out: 8192 In: 8183
Elements: 247666

247666 should have been 259656. The gap between "Workers Out" (8192) and "Workers In" (8183) indicates that every taskqueue.add succeeded, yet not all of the added tasks actually ran.
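
To clarify what Out/In mean: a counter is bumped right before each taskqueue.add and again at the top of the worker handler, roughly like this (simplified sketch, illustrative URL and queue name):

    import webapp2
    from google.appengine.api import taskqueue

    # incr_counter is the sharded memcache helper sketched in my first post.

    def spawn_worker(params):
        incr_counter('workers_out')   # counted when the task is added
        taskqueue.add(url='/worker', params=params, queue_name='mapreduce')

    class WorkerHandler(webapp2.RequestHandler):
        def post(self):
            incr_counter('workers_in')  # counted when the task actually runs
            # ... process a batch of entities here ...

So any gap between Out and In means a task was added but never executed.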

During all of this, the error and warning logs are clean, and as far as the code can tell, everything works as expected - no exceptions were caught anywhere.

Seems like a fatal/critical issue to me
server5xx-2.png

Kaan Soral

Mar 20, 2014, 9:34:30 AM
to google-a...@googlegroups.com
The tests above were done on my new app. I've now realized that my old app, the one that has been generating my revenue, also has these Server 5xx errors, likely costing money with each missed task, which would explain why its performance has dropped lately.

It seems these issues only occur when traffic/load spikes. I will file an issue later on; in the meantime it would be great to know whether others see these mysterious Server 5xx errors in their logs too.

Kaan Soral

Mar 21, 2014, 10:51:59 AM
to google-a...@googlegroups.com
Please star this issue: http://code.google.com/p/googleappengine/issues/detail?id=10721&thanks=10721&ts=1395413451

In years of using App Engine, this is the most critical issue I've faced.

Kaan Soral

Apr 9, 2014, 9:04:06 PM
to google-a...@googlegroups.com
For those who are interested or stumble onto this thread in the future:

It seems that when an instance fails to serve a request for a taskqueue task, that internal failure is counted as a task failure (this shouldn't happen, but it does).
Each such failure silently consumes one of the task's task_retry_limit attempts. In my opinion it shouldn't, as it is outside the application's control and can happen an unknown number of times; only code exceptions and critical instance deaths should count against the retry limit.
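
For context, the retry limit I'm talking about is the one you can set per task (or per queue in queue.yaml), roughly like this - illustrative values and names:

    from google.appengine.api import taskqueue

    # Illustrative: a task whose retry budget is 3 attempts.
    taskqueue.add(
        url='/worker',
        params={'shard': 7},
        queue_name='mapreduce',
        retry_options=taskqueue.TaskRetryOptions(task_retry_limit=3),
    )
    # The problem: an instance-level "aborted after waiting too long" failure
    # consumes one of those 3 attempts even though no application code ever ran.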

The event is logged for only a very short time as "Request was aborted after waiting too long to attempt to service your request." This should be an error log, but it is an info log, and it disappears within about 10 minutes (I have a high log storage capacity, so it shouldn't disappear, but it does).
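
If anyone wants to capture these entries before they vanish, something along these lines could be run periodically - a rough sketch using the logservice API, and I'm assuming here that the aborted requests are visible through it:

    import time
    from google.appengine.api import logservice

    def find_aborted_requests(window_seconds=600):
        # Scan request logs from the last window_seconds and collect any
        # app log line mentioning the aborted-request message.
        end = time.time()
        hits = []
        for req in logservice.fetch(start_time=end - window_seconds,
                                    end_time=end,
                                    include_app_logs=True):
            for line in req.app_logs:
                if 'aborted after waiting too long' in (line.message or ''):
                    hits.append((req.resource, req.status, line.message))
        return hits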

