All,
Diagnosis:
I pulled webops in to look at a couple of things [1][2], and after some
digging we discovered that the celery nodes are overloaded with tasks.
When celery is overloaded, saving a document is unreliable, because every
save connects to celery to queue *more* tasks.
The celery task code we introduced recently has a bug that causes tasks
to spawn more tasks, so the queue gradually grows, and the problem gets
worse the longer we go without clearing the task queue.
Treatment:
Last time we inadvertently worked around the issue by increasing the
memory limit and restarting the celery nodes. We've restarted the celery
nodes again, but they will keep filling up until we squash the bug in
the new celery task code. So we've also wrapped the celery task code in
a try/except block [3], so that when a celery node *is* overloaded, it
won't kill the document save request.
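
For anyone who wants the idea without reading the pull request, here is
a minimal sketch of the pattern. It is not the actual kuma code: the
names save_document, store, enqueue_task and BrokerUnavailable are
made up for illustration, and a plain dict stands in for the database.

```python
import logging

log = logging.getLogger("docs.save")


class BrokerUnavailable(Exception):
    """Hypothetical stand-in for whatever error an overloaded queue raises."""


def save_document(doc, store, enqueue_task):
    """Persist the document first, then try to queue follow-up work.

    `store` is a dict acting as the database; `enqueue_task` is any callable
    that queues a background task (in a real app, something like a Celery
    task's .delay).
    """
    store[doc["slug"]] = doc  # the save itself happens before any queueing
    try:
        enqueue_task(doc["slug"])
        queued = True
    except Exception:
        # Queueing is best-effort: log the failure and let the save succeed.
        log.exception("could not queue follow-up task for %s", doc["slug"])
        queued = False
    return queued


def overloaded_enqueue(slug):
    """Simulate an overloaded task queue that refuses new work."""
    raise BrokerUnavailable("queue is full")


store = {}
result = save_document(
    {"slug": "HTML/Element", "body": "..."}, store, overloaded_enqueue
)
```

The trade-off is that a follow-up task can be silently skipped when the
queue is down, so the failure is logged for later cleanup.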
Prognosis:
Good news - we have a diagnosis that makes sense and we have a quick fix
in place.
Bad news - we don't have the final fix, and we still need help to know
if we're on the right track. We have to keep monitoring for issues,
especially around saving documents. Please keep trying to save
documents, and add comments to bugs [4][5] or open new bugs if you see
new issues. Be as detailed as possible.
Thanks,
-L
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=901971
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=901960
[3] https://github.com/mozilla/kuma/pull/1258
[4] https://bugzilla.mozilla.org/show_bug.cgi?id=866459
[5] https://bugzilla.mozilla.org/show_bug.cgi?id=866458