Celery freezes until node server is restarted


choube...@gmail.com

Aug 23, 2017, 11:55:21 AM
to rabbitmq-users
Hi,

We have two apps running on Heroku - Python (Django) and Node. Both of these apps use Celery for asynchronous tasks with RabbitMQ. Lately we have started to observe that Celery tasks don't run until the node server is restarted. We see the number of connections keep increasing while the publish and deliver rates drop to 0. Restarting the worker dynos doesn't help, but restarting the node server does. Also, we use WebSockets for in-app chat, and we see a lot of "App crashed" errors, mostly for the endpoint "/socket.io?EIO=3&transport=websocket".

Does anyone have any idea why this could be happening? I understand this is very little information to understand the behaviour, but my main problem is actually around debugging this, which is why I don't have any logs for it. I'd also love ideas on how to trace the events and states of a task while it's in RabbitMQ, because I'm not sure how to run rabbitmqctl for my app. I'd be happy to answer any questions, and will appreciate any help I get.

Thanks,
Anshu

choube...@gmail.com

Aug 23, 2017, 11:57:53 AM
to rabbitmq-users
In case it helps, we are using Celery 3.1.17 and RabbitMQ 3.1.3. I understand these are fairly old versions, but the app is sort of legacy, so I won't be able to update them immediately.

Michael Klishin

Aug 23, 2017, 12:00:47 PM
to rabbitm...@googlegroups.com
Do you collect the metrics provided by the RabbitMQ management UI/HTTP API [1]? I assume
you don't have access to infrastructure metrics; how about Celery logs or something like that?
(We won't be able to help with much of the Celery-specific stuff, but getting logs from all services
involved is a very good idea.)

The first thing I'd check is whether Celery acknowledges delivered messages and what channel QoS is used [2][3].
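
One way to check this from outside the broker host is the management HTTP API. The sketch below is hypothetical - the base URL and credentials are placeholders, and the exact channel fields can vary between RabbitMQ versions - but it shows the general shape of pulling per-channel prefetch and unacked counts:

```python
# Hypothetical sketch: inspect per-channel QoS and unacked counts via
# the RabbitMQ management HTTP API (/api/channels). The URL and
# credentials are placeholders; field names can differ by version.
import json
import urllib.request

def fetch_channels(base_url, user, password):
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, base_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))
    with opener.open(base_url + "/api/channels") as resp:
        return json.load(resp)

def summarize(channels):
    # A channel with a large messages_unacknowledged count and no ack
    # activity suggests a consumer that gets deliveries but never acks.
    return [
        (ch.get("name"), ch.get("prefetch_count"),
         ch.get("messages_unacknowledged"))
        for ch in channels
    ]

# Example (placeholders):
#   summarize(fetch_channels("https://broker.example.com", "guest", "guest"))
```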

Next is whether there are any missed heartbeats [4] (trivial to spot in the RabbitMQ logs), since Celery
uses a spectacularly low-quality client library under the hood; not so long ago it did not support heartbeats at all,
for example.


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

choube...@gmail.com

Aug 23, 2017, 2:24:33 PM
to rabbitmq-users
Thanks for your prompt reply.

To answer your questions,

Yes, we collect the metrics provided by the RabbitMQ management UI using the Bigwig add-on. I'm not sure which exact infra metrics you mean here. We have information like the memory usage per web or worker dyno at every point in time (the metrics given by the Heroku router, as you might be aware).

Yes, Celery uses nack. I didn't understand "what channel QoS is used". If you are talking about prefetch_count, I am not sure about the channels I see in the management plugin, but my consumers (Celery workers) run with a prefetch_count varying between 30 and 50.

Yes, there are worker logs for missed heartbeats from the queue, but not while the problem is occurring - rather when it isn't. So I understand why you say it would be trivial. Still, I'm not sure how I can check for missed heartbeats in a reliable way. When I look at my broker configuration, I don't see anything for heartbeat, and if I remember correctly the broker heartbeat is off by default.

We have Celery logs, and normally we get a log line when a task is received and when it succeeds. But when we have the problem, we don't see such logs. We do see celerybeat scheduling periodic tasks, but none of the workers pick them up.

I would also like to add that we have far too many connections at such times - 20 to 40 times more than we normally have. I also see many channels that have been idle for months now, and ~200 queues when we have only 4 workers with a concurrency of 8 running. Do you have any idea about this?

Not sure if it's useful, but when I try to access an active channel in the management UI at such times, I get the following error from the management plugin:

Got response code 500 with body 

Internal Server Error

The server encountered an error while processing this request:
{error,function_clause,
       [{proplists,get_value,
                   [user,not_found,undefined],
                   [{file,"proplists.erl"},{line,225}]},
        {rabbit_mgmt_util,'-is_authorized_user/3-fun-0-',3,[]},
        {rabbit_mgmt_util,is_authorized,6,[]},
        {webmachine_resource,resource_call,3,[]},
        {webmachine_resource,do,3,[]},
        {webmachine_decision_core,resource_call,1,[]},
        {webmachine_decision_core,decision,1,[]},
        {webmachine_decision_core,handle_request,2,[]}]}

Michael Klishin

Aug 23, 2017, 2:32:40 PM
to rabbitm...@googlegroups.com
OK, the prefetch is fine and avoids trivial issues such as one slow or timed-out task blocking everything.

Missed heartbeats indicate a lost connection (regardless of the exact reason: networking, an overloaded peer, or other things). Check the server logs to see whether Celery recovers those connections (reconnects). If it doesn't, there will be no more deliveries.
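
For what it's worth, Celery 3.1 exposes the heartbeat interval through its BROKER_HEARTBEAT setting. A minimal, illustrative settings fragment follows - the broker URL and the specific values are placeholders, not recommendations:

```python
# Illustrative Celery 3.1 (Django settings style) fragment: enable AMQP
# heartbeats so a dead TCP connection is detected instead of hanging.
BROKER_URL = "amqp://user:password@broker.example.com:5672//"  # placeholder
BROKER_HEARTBEAT = 30             # negotiate a 30s heartbeat with the broker
BROKER_HEARTBEAT_CHECKRATE = 2.0  # check the connection twice per interval
```

Whether the worker honours these reliably on old Celery/amqp versions is exactly the caveat raised above, so treat this as something to verify, not a fix.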

As always, if you can detect the issue early, take
a traffic capture with tcpdump. It will tell you what is really going on on the wire.

I'm afraid our only other piece of advice for those old versions is to upgrade.

choube...@gmail.com

Aug 23, 2017, 2:55:28 PM
to rabbitmq-users
It slipped my mind to add that I see far too many blocking and blocked connections, from both python-amqp and node-amqp.

OK, I shall consider upgrading over the weekend. Meanwhile, do you have any idea how I can run rabbitmqctl for my Heroku apps? Or how I can use the management UI to get any information on this? Or get any logs, since we don't have access to the filesystem?
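
In the meantime, the management HTTP API can stand in for rabbitmqctl on a hosted broker. Below is a hypothetical sketch (the URL and credentials are placeholders) that counts open connections per client library, which would make it easy to see whether the python-amqp or the node-amqp side is leaking:

```python
# Hypothetical sketch: list connections via /api/connections and group
# them by the client library that opened them. URL and credentials are
# placeholders for the values your hosting add-on provides.
import json
import urllib.request
from collections import Counter

def fetch_connections(base_url, user, password):
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, base_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))
    with opener.open(base_url + "/api/connections") as resp:
        return json.load(resp)

def count_by_client(conns):
    # client_properties usually carries a "product" field naming the
    # library, so a leak shows up as one product dominating the counts.
    return Counter(
        c.get("client_properties", {}).get("product", "unknown")
        for c in conns)
```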

Thanks for your help!

Michael Klishin

Aug 23, 2017, 5:51:48 PM
to rabbitm...@googlegroups.com
You can ask the service maintainers for logs. The management UI does not expose logs,
and the HTTP API offers a different set of commands. The 3.1.x API documentation used to be
hosted on hg.rabbitmq.com, but that site was deprecated over 2 years ago and shut down earlier this year.

Take your management UI's endpoint and add /api to it; it should serve a local copy of the API docs for your
version.


choube...@gmail.com

Aug 24, 2017, 2:43:40 AM
to rabbitmq-users
Yes, I have seen the documentation, and I have sought support from both the Bigwig team and the Heroku team on this. Thanks!

Interestingly, working on a local setup, I have some more info on the issue. As background: using WebSockets, we have implemented in-app chat, and when a user sends a message to another person, a notification is sent to the recipient. When that notification is sent, repeated connections are made to Celery, which increases the connection count until the socket descriptors are exhausted. Stopping the node server brings the connection and socket descriptor counts back to their earlier values. So this gives a clue as to how it's related to the node server (as the WebSocket chat runs through it), but no clue as to why publishing stops completely in the Heroku app, or why the connection count keeps increasing even after that point. Let me know if you have any thoughts on this, and thanks again for your help.

Michael Klishin

Aug 24, 2017, 3:56:00 AM
to rabbitm...@googlegroups.com
Extra connections are something to look into, although they cannot be avoided in certain edge cases.

I believe http://www.rabbitmq.com/heartbeats.html largely answers the question about publisher traffic
stopping unexpectedly. At least that's still my best theory: a network connection issue combined
with Celery's imperfect ability to recover, plus possibly file descriptor exhaustion on the node(s).
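
If file descriptor exhaustion is the suspect, the management API's /api/nodes endpoint reports per-node fd_used and fd_total. A hypothetical helper over that decoded JSON (the threshold is an arbitrary illustration):

```python
# Hypothetical helper: given decoded JSON from the management API's
# /api/nodes endpoint, flag broker nodes approaching file descriptor
# exhaustion. The 80% threshold is illustrative, not a standard.
def fd_pressure(nodes, threshold=0.8):
    return [
        n.get("name")
        for n in nodes
        if n.get("fd_total")
        and n.get("fd_used", 0) >= threshold * n["fd_total"]
    ]

# e.g. fd_pressure([{"name": "rabbit@host", "fd_used": 990,
#                    "fd_total": 1024}]) flags that node.
```

Note this only covers the broker side; descriptor exhaustion on the node dynos themselves would have to be checked from the platform's own metrics.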


Anshu Choubey

Aug 24, 2017, 5:34:13 AM
to rabbitm...@googlegroups.com
All right. I'll try to see if I can get any more leads and will plan the upgrade over the weekend. I'll let you know if I have further doubts. Thanks for looking into this.
