Hello everyone,
I've been using RapidPro in a few different environments for the last couple of years, and I'm now trying to get it up and running on a new VM. This time around, however, I've hit a really weird issue that I'm not sure is directly related to RapidPro itself.
I normally run the celery worker under supervisor as the 'rapidpro' user:
...
command = celery worker --app=temba -l info -Q celery,handler,flows,msgs
...
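For reference, a typical supervisor program block for this looks roughly like the following (the program name, paths and log file here are placeholders, not my exact config):

[program:rapidpro-celery]
; program name, paths and log file below are placeholders
command = celery worker --app=temba -l info -Q celery,handler,flows,msgs
directory = /home/rapidpro/rapidpro
user = rapidpro
autostart = true
autorestart = true
redirect_stderr = true
stdout_logfile = /var/log/rapidpro/celery.log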
On this new installation, however, it fails to run. I spent a few (actually many) hours debugging and tracing, since celery's default logging output isn't very clear even with the log level set to debug. The worker just fails silently and gets restarted by supervisor before it has run a single task.
So I started running celery manually as the same user. With up to 3 queues on the command line it starts and works just fine, but as soon as I add a 4th queue it fails - the order and even the names of the queues don't matter:
$ celery worker --app=temba -l debug -Q celery,msgs,handler,flows
-------------- celery@multicontact v4.0.2 (latentcall)
---- **** -----
--- * *** * -- Linux-4.9.0-6-amd64-x86_64-with-debian-9.4 2018-06-19 02:06:14
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: temba:0x7f6bfb54bef0
- ** ---------- .> transport: redis://localhost:6379/15
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
.> flows exchange=flows(direct) key=flows
.> handler exchange=handler(direct) key=handler
.> msgs exchange=msgs(direct) key=msgs
[tasks]
...
[2018-06-19 02:06:14,972: INFO/MainProcess] Connected to redis://localhost:6379/15
[2018-06-19 02:06:14,984: INFO/MainProcess] mingle: searching for neighbors
[2018-06-19 02:06:16,009: INFO/MainProcess] mingle: all alone
[2018-06-19 02:06:17,039: DEBUG/MainProcess] removing tasks from inqueue until task handler finished
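For comparison, the same command with any three of the queues, for example:

$ celery worker --app=temba -l debug -Q celery,msgs,handler

starts and works just fine.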
I later found an environment variable, MP_LOG, that celery uses to enable more detailed logging from its worker pool, and that gave a bit more information:
$ MP_LOG=1 celery worker --app=temba -l debug -Q celery,msgs,handler,flows
...
[2018-06-19 02:07:51,689: INFO/PoolWorker-1] child process 25631 calling self.run()
[2018-06-19 02:07:51,707: INFO/MainProcess] Connected to redis://localhost:6379/15
[2018-06-19 02:07:51,718: INFO/MainProcess] mingle: searching for neighbors
[2018-06-19 02:07:52,744: INFO/MainProcess] mingle: all alone
[2018-06-19 02:07:52,776: DEBUG/MainProcess] closing pool
[2018-06-19 02:07:52,777: DEBUG/MainProcess] closing pool
[2018-06-19 02:07:52,777: DEBUG/MainProcess] joining worker handler
[2018-06-19 02:07:52,778: DEBUG/MainProcess] joining task handler
[2018-06-19 02:07:52,779: DEBUG/PoolWorker-1] worker got sentinel -- exiting
[2018-06-19 02:07:52,780: DEBUG/MainProcess] joining result handler
[2018-06-19 02:07:52,780: DEBUG/MainProcess] result handler joined
[2018-06-19 02:07:52,780: DEBUG/MainProcess] joining worker 1/1 (<Process(PoolWorker-1, started daemon)>)
[2018-06-19 02:07:53,784: DEBUG/MainProcess] pool join complete
[2018-06-19 02:07:53,788: DEBUG/MainProcess] finalizing pool
[2018-06-19 02:07:53,788: DEBUG/MainProcess] helping task handler/workers to finish
[2018-06-19 02:07:53,788: DEBUG/MainProcess] removing tasks from inqueue until task handler finished
[2018-06-19 02:07:54,289: DEBUG/MainProcess] terminating workers
[2018-06-19 02:07:54,290: DEBUG/MainProcess] joining task handler
[2018-06-19 02:07:54,291: DEBUG/MainProcess] joining result handler
[2018-06-19 02:07:54,291: DEBUG/MainProcess] joining timeout handler
[2018-06-19 02:07:54,291: DEBUG/MainProcess] joining pool workers
[2018-06-19 02:07:54,292: DEBUG/MainProcess] pool workers joined
Finally, I managed to trace the message "worker got sentinel -- exiting" to a celery dependency called billiard, but that's as far as I could get.
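For what it's worth, as far as I can tell pool workers in billiard (a fork of Python's multiprocessing) print that message when the parent process sends them a shutdown sentinel (a None pushed onto the worker's task queue). So, if I'm reading the log right, the worker isn't crashing on its own - the main process is deliberately closing the pool, which also matches the "closing pool" lines above. A minimal, self-contained illustration of that sentinel pattern (my own sketch, not billiard's actual code):

from multiprocessing import Process, Queue


def say(msg):
    print(msg)


def workloop(inqueue):
    # Simplified version of the pattern billiard/multiprocessing pool workers
    # use: the parent puts None ("the sentinel") on a worker's inqueue when it
    # wants that worker to stop.
    while True:
        task = inqueue.get()          # block until the parent sends something
        if task is None:              # the sentinel -> log it and exit cleanly
            print('worker got sentinel -- exiting')
            break
        func, args = task
        func(*args)                   # a real pool would also send a result back


if __name__ == '__main__':
    q = Queue()
    worker = Process(target=workloop, args=(q,))
    worker.start()
    q.put((say, ('hello from the worker',)))
    q.put(None)                       # ask the worker to shut down
    worker.join()

So the real question seems to be why the main process decides to tear the pool down as soon as a 4th queue is added.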
As I said before, this isn't my first RapidPro setup - I have a slightly older one (from early last October, prior to courier) running on the same OS version (Debian 9) with an almost identical configuration and the same celery, billiard and redis versions.
I've tried to provide as much detail as possible, but this is as far as I could get. As I already mentioned, neither the order nor the names of the queues matter - 4 or more queues cause the process to fail.