UPDATE: reliability issues

Luke Crouch

Aug 6, 2013, 3:29:24 PM
to dev...@lists.mozilla.org, Jake Maul, Stormy Peters
All,

Diagnosis:

I pulled webops in to look at a couple things [1][2], and after some
digging, we discovered that the celery nodes are overloaded with tasks.
When celery is overloaded, writing any documents is finicky, because we
connect to celery to add *more* tasks each time.

There's a bug in new code we introduced with new celery tasks recently
that causes tasks to spawn more tasks, so the queue gradually grows, and
the problem escalates the longer we go without clearing the task queue.
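
A toy model of that growth, as a rough sketch only: the worker count, arrival rate, and spawn count below are made-up numbers for illustration, not measurements from our celery nodes.

```python
from collections import deque


def simulate(ticks, incoming_per_tick, workers, spawn_per_task):
    """Return the queue length after each tick of a toy task queue.

    All parameters are hypothetical; this models the failure mode,
    not the real kuma/celery setup.
    """
    queue = deque()
    sizes = []
    for _ in range(ticks):
        # New work arrives, e.g. from document saves.
        queue.extend(["save"] * incoming_per_tick)
        # Each worker handles at most one task per tick.
        for _ in range(min(workers, len(queue))):
            queue.popleft()
            # The bug: a finished task puts more tasks on the queue.
            queue.extend(["spawned"] * spawn_per_task)
        sizes.append(len(queue))
    return sizes


healthy = simulate(ticks=5, incoming_per_tick=2, workers=4, spawn_per_task=0)
buggy = simulate(ticks=5, incoming_per_tick=2, workers=4, spawn_per_task=1)
```

With no spawning the queue drains every tick; with each task spawning just one more, the backlog never shrinks and grows with every batch of new saves, which matches what we see between restarts.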

Treatment:

Last time we inadvertently fixed the issue by virtue of increasing the
memory limit and restarting the celery nodes. We've restarted the celery
nodes again, but they will keep filling up until we squish the bug in
the new celery task code. So, we've also wrapped the celery task code in
a try/except block [3] so when the celery node *is* overloaded, it won't
kill the document save request.
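
Roughly, the shape of that guard looks like the sketch below. This is not the code from [3]; `save_document`, `enqueue_task`, and `BrokerUnavailable` are hypothetical names, with `enqueue_task` standing in for a celery task's `.delay()` call.

```python
import logging

log = logging.getLogger("kuma.sketch")


class BrokerUnavailable(Exception):
    """Hypothetical stand-in for the error an overloaded broker raises."""


def save_document(doc, enqueue_task):
    """Save a document, then try to queue its follow-up task.

    `doc` is a plain dict here and `enqueue_task` is any callable;
    both are assumptions made for the sake of the example.
    """
    doc["saved"] = True  # the save itself happens first
    try:
        enqueue_task(doc["id"])
        doc["queued"] = True
    except Exception:
        # Broker overloaded or unreachable: log it and carry on, so
        # the save request still succeeds.
        log.exception("could not queue post-save task for %s", doc["id"])
        doc["queued"] = False
    return doc
```

The trade-off is that a failed enqueue silently skips the follow-up work for that save, so the log line is the only trace that something was dropped.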

Prognosis:

Good news - we have a diagnosis that makes sense and we have a quick fix
in place.

Bad news - we don't have the final fix, and we still need help to know
if we're on the right track. We have to keep monitoring issues and
especially issues around saving documents. Please keep trying to save
documents, add comments to bugs [4][5] or open new bugs if you see new
issues. Be as detailed as possible.

Thanks,
-L


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=901971
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=901960
[3] https://github.com/mozilla/kuma/pull/1258
[4] https://bugzilla.mozilla.org/show_bug.cgi?id=866459
[5] https://bugzilla.mozilla.org/show_bug.cgi?id=866458

Luke Crouch

Aug 6, 2013, 3:33:26 PM
to dev...@lists.mozilla.org, Jake Maul, dev-mdc, Stormy Peters
+dev-mdc

Jean-Yves Perrier

Aug 6, 2013, 6:38:36 PM
to dev...@lists.mozilla.org
Luke! That's awesome news.

I will try to pay closer attention to changes in the pattern/frequency
of the problem(s). Tonight only one symptom occurred (empty page), 100%
of the time when clicking "Save". I filed a new bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=902177

This looks very promising: it means that what you are doing is indeed
affecting the symptoms.

I will let you know of any change in the pattern of errors. I'm
sometimes asynchronous with you guys, so it may be interesting :-)

Feel free to ask for specific details, though.

(Out of the office tomorrow)

Have a good night!
--
Jean-Yves
--
Jean-Yves Perrier
Technical Writer / Mozilla Developer Network

Jannis Leidel

Aug 7, 2013, 7:27:20 AM
to Luke Crouch, dev...@lists.mozilla.org, Stormy Peters, Jake Maul

I've started to dive into the call stack and have a first trial fix for the issue:

https://github.com/mozilla/kuma/pull/1262

This should prevent force-triggering the rendering when trying to build the JSON data for the search index update.

I'm frankly not sure if this will fix it since it's a multi-level rabbit hole and I wasn't able to reproduce the problem locally (yet). I'd appreciate sanity checking the patch and then deploying it to stage to see if it does work.

BTW, if we enable the rabbitmq management console on the server with "/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management && sudo service rabbitmq-server restart", we should be able to see the growing queue by visiting port 55672 on that host. That is, unless someone at ops has already done so; I'm unfamiliar with the exact celery configuration.
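
Once the management plugin is on, the queue depth could also be polled from a script instead of the web console. A minimal sketch, assuming the plugin's HTTP API is reachable on the same port: the host name and credentials in the usage note are placeholders, and I haven't checked any of this against our actual setup.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """GET /api/queues from the management plugin and decode the JSON.

    base_url, user and password are placeholders supplied by the caller.
    """
    req = urllib.request.Request(base_url + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def queue_depths(queues):
    """Map each queue name to its total message count."""
    return {q["name"]: q.get("messages", 0) for q in queues}
```

Something like `queue_depths(fetch_queues("http://rabbit-host:55672", "user", "password"))`, run every minute or so, should show whether the backlog keeps climbing after a restart.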

Thanks!
Jannis

Luke Crouch

Aug 7, 2013, 11:30:08 AM
to Jannis Leidel, dev...@lists.mozilla.org, Stormy Peters, Jake Maul
Jake,

Can we enable the rabbitmq management console on the server so that devs
can see the queue? It will help us to know exactly how the tasks are
spawned.

Thanks,
-L

John Karahalis

Aug 7, 2013, 11:31:33 AM
to Luke Crouch, dev...@lists.mozilla.org, Stormy Peters, Jake Maul
Wanted to add to what Luke said.

If you notice this issue resurfacing, tell us on mdn-drivers@. Bug comments are helpful for pinpointing the technical cause of the problem, but only a conversation on mdn-drivers@ will notify the right people to ensure we make the correction a priority.

--
John Karahalis
Mozilla
openjck.com
