This problem sometimes occurs when a single queue holds a large number of tasks, or when many tasks are enqueued at once. In some cases, many tasks fail within a short period, which pushes the queue's retry backoff to the default maximum of one hour and stalls the queue. In other cases, simply having more than 1,000 tasks in a single queue can cause transient contention in the underlying scheduler.
There are some ways to mitigate this:
- Shard your queues: instead of adding all your tasks to a single queue, distribute them among several queues.
- Avoid adding many tasks simultaneously, especially tasks that are scheduled to execute immediately. If possible, schedule tasks to execute in the future, at least 5 minutes after they are added.
- In addition, when adding large numbers of tasks that start at or near the same time, make sure that all the add operations have completed before the scheduled leasing or execution time.
- Add tasks from a single thread. If all task add calls are made sequentially, the risk of contention between calls is minimized.
- Give scheduled tasks different scheduled times. The size of the interval between them does not matter; for example, 8:00:00 PM, 8:00:01 PM, and 8:00:02 PM create three different 'buckets' for the tasks.
- Back off on errors when adding tasks. If a task add call fails, or specific tasks within a request fail, wait before retrying, preferably using exponential backoff.
- If you're following these guidelines and still see the queue stall from time to time, you can adjust max_backoff_seconds in your queue configuration.
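
Several of the guidelines above (sharding, staggered schedule times, sequential adds, and exponential backoff) can be combined in one place. The following is a minimal sketch rather than a reference implementation: `add_task(queue_name, payload, eta)` is a hypothetical callable standing in for whatever add call your client library provides, and the queue name prefix `work`, the shard count of 4, and the retry limits are illustrative assumptions.

```python
import random
import time
import zlib
from datetime import datetime, timedelta, timezone


def shard_queue_name(task_key, base="work", num_shards=4):
    """Map a task key to one of num_shards queue names, stably across runs."""
    shard = zlib.crc32(task_key.encode("utf-8")) % num_shards
    return f"{base}-{shard}"


def staggered_etas(start, count, step_seconds=1):
    """Return `count` distinct scheduled times, step_seconds apart."""
    return [start + timedelta(seconds=i * step_seconds) for i in range(count)]


def add_with_backoff(add_task, *args, max_attempts=5, base_delay=1.0,
                     max_delay=60.0, sleep=time.sleep):
    """Call add_task(*args), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return add_task(*args)
        except Exception:
            # Assumption: every error is retryable. A real client should
            # catch only the transient errors its library documents.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(delay / 2, delay))  # jitter spreads retries


def enqueue_batch(add_task, payloads, lead_time=timedelta(minutes=5), now=None):
    """Add string payloads one at a time, from the calling thread only."""
    start = (now or datetime.now(timezone.utc)) + lead_time
    etas = staggered_etas(start, len(payloads))
    for payload, eta in zip(payloads, etas):
        # Each task gets its own shard and its own scheduled time.
        add_with_backoff(add_task, shard_queue_name(payload), payload, eta)
```

Calling `enqueue_batch(add_task, ["a", "b", "c"])` would schedule the three tasks one second apart, starting five minutes from now, so all adds have time to complete before the first scheduled execution time. The payload doubles as the shard key here only to keep the sketch short; any stable task identifier would serve.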