Celery comes to a halt and stops doing anything

Muharem Hrnjadovic

Oct 26, 2011, 6:28:27 AM
to celery-users
Hello there!

We are running a setup with 4 machines: 1 control node that creates all
the tasks and 3 worker nodes running celeryd that execute the tasks.

I am trying to run a big job that consists of approx. 150,000 tasks and
I am observing that the celeryd workers on all 3 machines stop doing any
work after approx. 8,500 tasks.

If I restart any one of them (kill -SIGHUP <pid>) things return to
normal (i.e. they all continue to pick up and perform work), but that
"trick" does not always work.

We are running celery-2.3.1 with rabbitmq-2.3.1 and our celery config
is as follows:

=========================================================================
CELERY_RESULT_BACKEND = "amqp"
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
=========================================================================

All the worker machines have 32 CPU cores and 64 GB of RAM; the celeryd
workers are running with a concurrency level of 14 (-c 14).

After the second celeryd worker restart I can see a sudden change in the
number of messages in the celery queue:

"rabbitmqctl list_queues name auto_delete messages messages_ready messages_unacknowledged" says:
celery false 3317 3275 42
celery false 252 210 42

Once that queue is drained no work is performed and further celeryd
restarts (on the worker machines) have no effect. It remains empty
like this:

celery false 0 0 0
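
For anyone comparing samples like the ones above, here is a minimal Python sketch that parses the rows and checks them for consistency. The names QueueStats and parse_row are made up for illustration, and the column order is assumed to match the rabbitmqctl command quoted above.

```python
from collections import namedtuple

# Hypothetical helper type; field order follows the list_queues command above:
# name, auto_delete, messages, messages_ready, messages_unacknowledged
QueueStats = namedtuple("QueueStats", "name auto_delete messages ready unacked")


def parse_row(line):
    """Parse one whitespace-separated row of the list_queues output."""
    name, auto_delete, messages, ready, unacked = line.split()
    return QueueStats(name, auto_delete == "true",
                      int(messages), int(ready), int(unacked))


before = parse_row("celery false 3317 3275 42")
after = parse_row("celery false 252 210 42")

for row in (before, after):
    # The total should be the sum of ready and unacknowledged messages.
    assert row.messages == row.ready + row.unacked

# Number of ready messages that disappeared between the two samples,
# while the unacknowledged count stayed the same.
dropped = before.ready - after.ready
print("ready messages gone between samples:", dropped)
```

Both rows are internally consistent, so the sudden drop is entirely in messages_ready; the unacknowledged count does not move.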

Any explanations of this behaviour and/or ideas on how to analyze this
further would be very much appreciated.


Best regards/Mit freundlichen Grüßen

--
Muharem Hrnjadovic <m...@foldr3.com>
Public key id : B2BBFCFC
Key fingerprint : A5A3 CC67 2B87 D641 103F 5602 219F 6B60 B2BB FCFC


Muharem Hrnjadovic

Oct 26, 2011, 8:29:33 AM
to celery...@googlegroups.com
I commented out the options below in the celeryconfig.py files but that
had no effect.

> CELERY_ACKS_LATE = True
> CELERYD_PREFETCH_MULTIPLIER = 1

My impression is that the celeryd worker is not reporting progress
properly. After restarting a celeryd worker for the first time, it would
appear that it got around to reporting a successful task, and that
is what gets the calculation going again.

Please see lines 2790 and 2795 in http://paste.ubuntu.com/719612/


Muharem Hrnjadovic

Oct 26, 2011, 8:46:49 AM
to celery...@googlegroups.com
I forgot to mention the sudden message count decrease in the celery
queue that coincided with the celeryd worker restart:

before restart:
celery false 3225 3057 168

after restart:
celery false 300 132 168

The figures above were obtained via:


rabbitmqctl list_queues name auto_delete messages messages_ready messages_unacknowledged
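
A side note on the unacknowledged counts in this thread (42 earlier, 168 here): they are consistent with every worker process holding a fixed number of reserved messages. The sketch below is only arithmetic under assumptions: that all 3 worker nodes were connected with -c 14, and that commenting out CELERYD_PREFETCH_MULTIPLIER made celeryd fall back to a multiplier of 4, which I believe is the default but have not verified for 2.3.1. The function name expected_unacked is made up.

```python
def expected_unacked(nodes, concurrency, prefetch_multiplier):
    """Messages reserved but not yet acknowledged if every worker process
    on every node holds `prefetch_multiplier` messages (an assumption)."""
    return nodes * concurrency * prefetch_multiplier


# With CELERYD_PREFETCH_MULTIPLIER = 1 as in the original config:
with_multiplier_1 = expected_unacked(nodes=3, concurrency=14, prefetch_multiplier=1)

# With the option commented out, assuming a fallback multiplier of 4:
with_multiplier_4 = expected_unacked(nodes=3, concurrency=14, prefetch_multiplier=4)

print(with_multiplier_1, with_multiplier_4)
```

If that reading is right, the workers had reserved their full prefetch allowance and were simply not acknowledging anything, which points away from the prefetch settings as the cause.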


Muharem Hrnjadovic

Nov 15, 2011, 2:16:22 AM
to celery...@googlegroups.com
RabbitMQ was the "culprit": it would get stuck after a while because of
a low limit on the number of open files (1024 by default).

The following fixes the issue altogether:

# ulimit -n 32768
# /etc/init.d/rabbitmq-server restart
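
To see what limit a process would inherit before starting the broker, a quick check with Python's standard resource module (Unix only) can help. This is a rough sketch: it reports the limit of the Python process itself, not of the running RabbitMQ server, so it is only meaningful when run from the same shell that starts the broker. The helper names and the 32768 threshold (taken from the ulimit call above) are illustrative.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_too_low(soft, wanted=32768):
    """True if the soft limit is below the wanted number of descriptors."""
    if soft == resource.RLIM_INFINITY:
        # No limit configured, so it cannot be too low.
        return False
    return soft < wanted


soft, hard = nofile_limits()
print("soft=%s hard=%s too_low=%s" % (soft, hard, is_too_low(soft)))
```

Note that a ulimit call only affects the current shell and the processes started from it, so the restart has to happen in that same session (or the limit has to be set in the init script or system configuration) for the change to stick.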
