Hello everyone,
I've been using RapidPro in a few different environments for the last couple of years, and I'm now trying to get it up and running on a new VM. This time around, however, I've hit a really weird issue that I'm not sure is directly related to RapidPro itself.
I normally run the celery worker under supervisor as the 'rapidpro' user:
...
command = celery worker --app=temba -l info -Q celery,handler,flows,msgs
...
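For reference, a typical supervisor program block for this looks roughly like the following (the program name, paths and log file here are placeholders, not my exact config):

[program:rapidpro-celery]
; program name, paths and log file below are placeholders
command = celery worker --app=temba -l info -Q celery,handler,flows,msgs
directory = /home/rapidpro/rapidpro
user = rapidpro
autostart = true
autorestart = true
redirect_stderr = true
stdout_logfile = /var/log/rapidpro/celery.log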
On this new installation, however, it fails to run. I spent a few (actually many) hours debugging and tracing, since celery's default logging output isn't very clear even with the log level set to debug. The worker just fails silently and gets restarted by supervisor before it has run a single task.
So I started running celery manually as the same user. With up to 3 queues on the command line it starts and works just fine, but as soon as I add a 4th queue it fails - the order and even the names of the queues don't matter:
$ celery worker --app=temba -l debug -Q celery,msgs,handler,flows
-------------- celery@multicontact v4.0.2 (latentcall)
---- **** -----
--- * *** * -- Linux-4.9.0-6-amd64-x86_64-with-debian-9.4 2018-06-19 02:06:14
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: temba:0x7f6bfb54bef0
- ** ---------- .> transport: redis://localhost:6379/15
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
.> flows exchange=flows(direct) key=flows
.> handler exchange=handler(direct) key=handler
.> msgs exchange=msgs(direct) key=msgs
[tasks]
...
[2018-06-19 02:06:14,972: INFO/MainProcess] Connected to redis://localhost:6379/15
[2018-06-19 02:06:14,984: INFO/MainProcess] mingle: searching for neighbors
[2018-06-19 02:06:16,009: INFO/MainProcess] mingle: all alone
[2018-06-19 02:06:17,039: DEBUG/MainProcess] removing tasks from inqueue until task handler finished
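For comparison, the same command with any three of the queues, for example:

$ celery worker --app=temba -l debug -Q celery,msgs,handler

starts and works just fine.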
I later found an environment variable, MP_LOG, that celery uses to enable more detailed logging from its worker pool, and that gave a bit more information:
$ MP_LOG=1 celery worker --app=temba -l debug -Q celery,msgs,handler,flows
...
[2018-06-19 02:07:51,689: INFO/PoolWorker-1] child process 25631 calling self.run()
[2018-06-19 02:07:51,707: INFO/MainProcess] Connected to redis://localhost:6379/15
[2018-06-19 02:07:51,718: INFO/MainProcess] mingle: searching for neighbors
[2018-06-19 02:07:52,744: INFO/MainProcess] mingle: all alone
[2018-06-19 02:07:52,776: DEBUG/MainProcess] closing pool
[2018-06-19 02:07:52,777: DEBUG/MainProcess] closing pool
[2018-06-19 02:07:52,777: DEBUG/MainProcess] joining worker handler
[2018-06-19 02:07:52,778: DEBUG/MainProcess] joining task handler
[2018-06-19 02:07:52,779: DEBUG/PoolWorker-1] worker got sentinel -- exiting
[2018-06-19 02:07:52,780: DEBUG/MainProcess] joining result handler
[2018-06-19 02:07:52,780: DEBUG/MainProcess] result handler joined
[2018-06-19 02:07:52,780: DEBUG/MainProcess] joining worker 1/1 (<Process(PoolWorker-1, started daemon)>)
[2018-06-19 02:07:53,784: DEBUG/MainProcess] pool join complete
[2018-06-19 02:07:53,788: DEBUG/MainProcess] finalizing pool
[2018-06-19 02:07:53,788: DEBUG/MainProcess] helping task handler/workers to finish
[2018-06-19 02:07:53,788: DEBUG/MainProcess] removing tasks from inqueue until task handler finished
[2018-06-19 02:07:54,289: DEBUG/MainProcess] terminating workers
[2018-06-19 02:07:54,290: DEBUG/MainProcess] joining task handler
[2018-06-19 02:07:54,291: DEBUG/MainProcess] joining result handler
[2018-06-19 02:07:54,291: DEBUG/MainProcess] joining timeout handler
[2018-06-19 02:07:54,291: DEBUG/MainProcess] joining pool workers
[2018-06-19 02:07:54,292: DEBUG/MainProcess] pool workers joined
Finally, I managed to trace the message "worker got sentinel -- exiting" to a celery dependency called billiard, but that's as far as I could get.
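For what it's worth, as far as I can tell pool workers in billiard (a fork of Python's multiprocessing) print that message when the parent process sends them a shutdown sentinel (a None pushed onto the worker's task queue). So, if I'm reading the log right, the worker isn't crashing on its own - the main process is deliberately closing the pool, which also matches the "closing pool" lines above. A minimal, self-contained illustration of that sentinel pattern (my own sketch, not billiard's actual code):

from multiprocessing import Process, Queue


def say(msg):
    print(msg)


def workloop(inqueue):
    # Simplified version of the pattern billiard/multiprocessing pool workers
    # use: the parent puts None ("the sentinel") on a worker's inqueue when it
    # wants that worker to stop.
    while True:
        task = inqueue.get()          # block until the parent sends something
        if task is None:              # the sentinel -> log it and exit cleanly
            print('worker got sentinel -- exiting')
            break
        func, args = task
        func(*args)                   # a real pool would also send a result back


if __name__ == '__main__':
    q = Queue()
    worker = Process(target=workloop, args=(q,))
    worker.start()
    q.put((say, ('hello from the worker',)))
    q.put(None)                       # ask the worker to shut down
    worker.join()

So the real question seems to be why the main process decides to tear the pool down as soon as a 4th queue is added.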
As I said before, this isn't my first RapidPro setup - I have a slightly older one (from early last October, prior to courier) running on the same OS version (Debian 9) with an almost identical configuration and the same celery, billiard and redis versions.
I've tried to provide as much detail as possible, but this is as far as I could get. As I already mentioned, neither the order nor the names of the queues matter - 4 or more queues cause the process to fail.