RabbitMQ, Celery, and Broken Connection Issues

Ryan Petrello

Aug 24, 2010, 5:13:35 PM
to celery-users
We've got RabbitMQ running in production and several celery workers
that are constantly doing work. Periodically, we notice that our
workers stop working entirely and log the following traceback over
and over:

2010-08-24 19:35:20,549 DEBG 'celery' stderr output:
[2010-08-24 19:35:20,548: ERROR/MainProcess] CarrotListener: Connection to broker lost. Trying to re-establish connection...

2010-08-24 19:35:20,604 DEBG 'celery' stderr output:
[2010-08-24 19:35:20,604: WARNING/MainProcess] Exception in thread Thread-5:
Traceback (most recent call last):
  File "/opt/deps/lib/python2.5/threading.py", line 486, in __bootstrap_inner
    self.run()
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/celery-2.0.2-py2.5.egg/celery/concurrency/processes/pool.py", line 234, in run
    cache[job]._ack(i, time_accepted, pid)
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/celery-2.0.2-py2.5.egg/celery/concurrency/processes/pool.py", line 839, in _ack
    self._accept_callback()
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/celery-2.0.2-py2.5.egg/celery/worker/job.py", line 360, in on_accepted
    self.send_event("task-started", uuid=self.task_id)
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/celery-2.0.2-py2.5.egg/celery/worker/job.py", line 331, in send_event
    self.eventer.send(type, **fields)
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/celery-2.0.2-py2.5.egg/celery/events/__init__.py", line 70, in send
    self.publisher.send(Event(type, hostname=self.hostname, **fields))
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/carrot-0.10.5-py2.5.egg/carrot/messaging.py", line 762, in send
    headers=headers)
  File "/opt/shootq/shootqenv/lib/python2.5/site-packages/carrot-0.10.5-py2.5.egg/carrot/backends/pyamqplib.py", line 333, in publish
    immediate=immediate)
  File "build/bdist.solaris-2.11-i86pc/egg/amqplib/client_0_8/channel.py", line 2223, in basic_publish
    self._send_method((60, 40), args, msg)
  File "build/bdist.solaris-2.11-i86pc/egg/amqplib/client_0_8/abstract_channel.py", line 70, in _send_method
    method_sig, args, content)
  File "build/bdist.solaris-2.11-i86pc/egg/amqplib/client_0_8/method_framing.py", line 233, in write_method
    self.dest.write_frame(1, channel, payload)
  File "build/bdist.solaris-2.11-i86pc/egg/amqplib/client_0_8/transport.py", line 125, in write_frame
    frame_type, channel, size, payload, 0xce))
  File "<string>", line 1, in sendall
error: (32, 'Broken pipe')

We figure these are just transient network errors, but the problem is
that whenever this occurs, our celery workers are completely hosed
until we restart them manually. Does anybody have any sort of
explanation as to what's causing this, and how we can prevent, or at
least detect, it?
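
One cheap stopgap we've considered for detection is a cron probe on the
worker host that simply tries to open a broker connection. A minimal
sketch, assuming amqplib 0.6 and placeholder host/credentials (note it
only proves the broker is reachable, not that a worker is still
consuming):

# watchdog.py - broker-connectivity probe (sketch; placeholder values).
import sys
from amqplib import client_0_8 as amqp

try:
    conn = amqp.Connection(host="blah.example.com:5672",
                           userid="xyz", password="xyz",
                           virtual_host="example")
    # Opening and closing a channel exercises the AMQP handshake.
    conn.channel().close()
    conn.close()
except Exception, exc:  # Python 2.5-era except syntax
    print >> sys.stderr, "broker unreachable: %r" % (exc,)
    sys.exit(1)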

Cal Leeming [Simplicity Media Ltd]

Aug 24, 2010, 6:46:20 PM
to celery...@googlegroups.com


Hi Ryan,

Please provide a full dump of all the variables involved in this setup.

Celery version
Django version
Erlang version
RabbitMQ version
AMQP version
Python version
OS version
Celery config


--

Cal Leeming

Operational Security & Support Team

Out of Hours: +44 (07534) 971120 | Support Tickets: sup...@simplicitymedialtd.co.uk
Fax: +44 (02476) 578987 | Email: cal.l...@simplicitymedialtd.co.uk 

Simplicity Media Ltd. All rights reserved.
Registered company number 7143564

Ryan Petrello

Aug 24, 2010, 7:10:58 PM
to celery-users
Here ya go:

Celery version - 2.0.2
Django version - We're not using Django
Erlang version - R13B04 (erts-5.7.5)
RabbitMQ version - 1.7.0
AMQP version - Assuming you mean amqplib, 0.6.1
Python version - 2.5.4
OS version - Solaris 5.11
Celery config (sensitive data has been replaced):

BROKER_HOST = blah.example.com
BROKER_PORT = 5672
BROKER_USER = xyz
BROKER_PASSWORD = xyz
BROKER_VHOST = example
CELERY_RESULT_BACKEND = database
CELERYD_CONCURRENCY = 4
CELERYD_QUEUES = proofing,calendarfetch
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERY_RESULT_DBURI = mysql://blahblahblah
CELERYD_LOG_LEVEL = INFO
CELERY_SEND_TASK_ERROR_EMAILS = True
ADMINS = (("Errors", "errors...@example.com"),)
SERVER_EMAIL = no-r...@blahblah.com
EMAIL_HOST = localhost
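
In celeryconfig.py form that's roughly the following (same placeholder
values as above; CELERYD_QUEUES written as a list):

# celeryconfig.py (sketch; sensitive values replaced as above)
BROKER_HOST = "blah.example.com"
BROKER_PORT = 5672
BROKER_USER = "xyz"
BROKER_PASSWORD = "xyz"
BROKER_VHOST = "example"

CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = "mysql://blahblahblah"

CELERYD_CONCURRENCY = 4
CELERYD_QUEUES = ["proofing", "calendarfetch"]  # as listed above
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERYD_LOG_LEVEL = "INFO"

CELERY_SEND_TASK_ERROR_EMAILS = True
ADMINS = (("Errors", "errors...@example.com"),)
SERVER_EMAIL = "no-r...@blahblah.com"
EMAIL_HOST = "localhost"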


Ask Solem

Aug 25, 2010, 4:35:00 AM
to celery...@googlegroups.com

On Aug 24, 2010, at 11:13 PM, Ryan Petrello wrote:

> [...]
>
> error: (32, 'Broken pipe')
>
> We figure these are just transient network errors, but the problem is
> that whenever this occurs, our celery workers are completely hosed
> until we restart them manually. Does anybody have any sort of
> explanation as to what's causing this, and how we can prevent, or at
> least detect, it?

Apparently celeryd is trying to send an event while the connection is down.
I created a fix for this just now, backported into the stable branch:

http://github.com/ask/celery/tree/release20-maint

Could you test this? If it seems to be working I'll make a 2.0.3 release.
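
The failure mode, roughly: the pool's result-handler thread fires the
"task-started" event, the event publisher's socket is already dead, and
the resulting EPIPE kills the thread, so nothing gets acked after that.
The guard needed is of this general shape (an illustrative sketch only,
not the actual patch):

import socket
import logging

logger = logging.getLogger("celery.events")

def send_event_safely(eventer, type, **fields):
    # Never let a dead broker connection take down the thread that
    # acks task results; dropping one monitoring event is harmless,
    # and the consumer's reconnect logic will restore the connection.
    try:
        eventer.send(type, **fields)
    except socket.error, exc:
        logger.error("Could not send event %r: %r" % (type, exc))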


Also, could you check your rabbitmq logs? Maybe there are additional
clues there as to why it lost the connection in the first place.

--
{Ask Solem,
+47 98435213 | twitter.com/asksol }.

Ryan Petrello

Aug 25, 2010, 8:59:03 AM
to celery-users
I noticed this in our rabbitmq logs around the time we were having
trouble yesterday. It's the only error I've found, though:

=ERROR REPORT==== 24-Aug-2010::20:58:04 ===
exception on TCP connection <0.274.0> from 10.17.172.226:63549
{bad_payload,<<123,34,114,101,116,114,105,101,115,34,58,
32,48,44,32,34,104,111,115,116,110,97,109,
101,34,58,32,34,102,102,99,103,120,54,97,
98,46,106,111,121,101,110,116,46,117,115,
34,44,32,34,117,117,105,100,34,58,32,34,
57,54,97,49,50,100,50,49,45,48,50,101,98,
45,52,56,51,56,45,97,97,50,53,45,55,98,53,
101,50,49,101,101,50,50,51,99,34,44,32,34,

...this goes on for a while...

98,53,32,92,92,120,57,56>>}
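
Incidentally, the values in that Erlang binary are just the raw bytes of
the message body. Decoding the first few values from the excerpt above
(using only the bytes actually shown in the log) makes it readable:

# Decode the start of the bad_payload binary (bytes copied from the log).
byte_values = [123, 34, 114, 101, 116, 114, 105, 101, 115, 34, 58,
               32, 48, 44, 32, 34, 104, 111, 115, 116, 110, 97, 109,
               101, 34, 58, 32, 34]
print "".join(chr(b) for b in byte_values)
# -> {"retries": 0, "hostname": "

So it looks like a celery task event that RabbitMQ rejected as an
invalid frame, and the escaped garbage near the end of the dump (\\x98
and friends) would fit a frame that was only partially written before
the pipe broke.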

I can try to test your changes, but reproducing this is pretty
difficult (we've experienced it twice in the last week, for a period of
a few minutes each time - if I had to guess, I'd chalk it up to network
issues, as our RabbitMQ machine lives in a different part of the data
center from our workers).

Ask Solem

Aug 27, 2010, 8:27:29 AM
to celery-users


On Aug 25, 2:59 pm, Ryan Petrello <r...@shootq.com> wrote:

> I can try to test your changes, but reproducing this is pretty
> difficult (we've experienced it twice in the last week, for a period of
> a few minutes each time - if I had to guess, I'd chalk it up to network
> issues, as our RabbitMQ machine lives in a different part of the data
> center from our workers).

I did manage to reproduce it by simply shutting rabbitmq down,
and confirmed that my patch fixed it. Just wanted another
confirmation.
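
For anyone who wants to try the same: run celeryd with events enabled
(the -E flag), keep it busy, and stop rabbitmq mid-stream. A minimal
driver, assuming a hypothetical tasks module with a trivial add task:

# repro.py - keep the worker busy while you bounce rabbitmq.
# Assumes something like:
#   from celery.decorators import task
#   @task
#   def add(x, y):
#       return x + y
import time
from tasks import add  # hypothetical task module

while True:
    add.delay(2, 2)
    time.sleep(0.5)

Without the fix, the pool thread dies on the first "task-started" event
it tries to send after the connection drops.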