I have a custom mapreduce implementation and noticed that some entities weren't being processed when they should have been, so I've spent the last 24 hours investigating the issue.
I built a system to pinpoint exactly what is failing. I was 99% sure taskqueue was silently losing tasks without anyone noticing, and my main purpose was to prove this.
In my initial test, I detected that the number of workers that actually ran as a result of taskqueue.add was lower than the number of taskqueue.add executions.
I then improved the logging routine, which uses memcached, to use more shards so I could be sure this was the case, and deployed the improvements 2-3 times.
After the logging system was perfected, I tested the system 3 times. All 3 times the mapreduce system parsed all 259656 entities in about a minute. Very impressive: no tasks were lost, and I couldn't prove my theory.
(In all of my trials before I was logging each step, the entity count was always less than the actual amount. One time it was 200023, missing about 23% of the entities, which suggested that taskqueue is flawed.)
However, in these perfect trials there were no Server 5xx errors, whereas during my pre-logging trials there were unexplained Server 5xx errors.
My hunch is that by re-deploying I caused the app to be relocated to a good place, while the 5xx errors and the invisible taskqueue issues happened on a bad sector.
Any ideas?
In the meantime, I will try again later.