Trying to debug a beat error, and need a bit more background information

Jens Troeger

Oct 2, 2019, 1:32:06 AM
to celery-users
Hi,

I am debugging a sudden failure of Celery beat; there was no change in the setup, things just stopped working. Here is the beginning of the slew of errors, on a standard Ubuntu machine:

[2019-10-01 06:40:38,360: INFO/Beat] beat: Starting...
[2019-10-01 06:40:38,367: INFO/MainProcess] Connected to amqp://foobar:**@127.0.0.1:5672//
[2019-10-01 06:40:38,364: ERROR/Beat] Process Beat
Traceback (most recent call last):
  File "/…/python3.7/site-packages/kombu/utils/objects.py", line 42, in __get__
    return obj.__dict__[self.__name__]
KeyError: 'scheduler'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/shelve.py", line 111, in __getitem__
    value = self.cache[key]
KeyError: 'entries'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/…/python3.7/site-packages/celery/beat.py", line 524, in _create_schedule
    self._store[str('entries')]
  File "/usr/lib/python3.7/shelve.py", line 113, in __getitem__
    f = BytesIO(self.dict[key.encode(self.keyencoding)])
KeyError: b'entries'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/…/python3.7/site-packages/billiard/process.py", line 327, in _bootstrap
    self.run()
  File "/…/python3.7/site-packages/celery/beat.py", line 673, in run
    self.service.start(embedded_process=True)
  File "/…/python3.7/site-packages/celery/beat.py", line 588, in start
    humanize_seconds(self.scheduler.max_interval))
  File "/…/python3.7/site-packages/kombu/utils/objects.py", line 44, in __get__
    value = obj.__dict__[self.__name__] = self.__get(obj)
  File "/…/python3.7/site-packages/celery/beat.py", line 632, in scheduler
    return self.get_scheduler()
  File "/…/python3.7/site-packages/celery/beat.py", line 627, in get_scheduler
    lazy=lazy,
  File "/…/python3.7/site-packages/celery/beat.py", line 467, in __init__
    Scheduler.__init__(self, *args, **kwargs)
  File "/…/python3.7/site-packages/celery/beat.py", line 226, in __init__
    self.setup_schedule()
  File "/…/python3.7/site-packages/celery/beat.py", line 495, in setup_schedule
    self._create_schedule()
  File "/…/python3.7/site-packages/celery/beat.py", line 528, in _create_schedule
    self._store[str('entries')] = {}
  File "/usr/lib/python3.7/shelve.py", line 125, in __setitem__
    self.dict[key.encode(self.keyencoding)] = f.getvalue()
_dbm.error: cannot add item to database
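
The shelve and _dbm frames suggest the schedule file itself has become unreadable. A quick sanity check would be something like this (a sketch, using the --schedule path from my worker command further below; depending on the dbm backend the on-disk file may carry a .db suffix):

import dbm
import shelve

path = "celery-beat-localhost.state"  # the file passed via --schedule below

# whichdb() returns None if the file can't be read at all, '' if the format
# is unrecognized (likely corrupt), otherwise the name of the dbm module.
print(dbm.whichdb(path))

# Beat stores its schedule under the 'entries' key of a shelve.
with shelve.open(path, flag="r") as store:
    print(store["entries"])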

After this first traceback, the errors just keep repeating. The current Celery configuration is as follows:

_celery_config = {
    "broker_url": "amqp://user:passwd@localhost:5672//",
    "result_backend": "redis://localhost:6379/0",
    "result_expires": 0,
    "task_serializer": "json",
    "result_serializer": "json",
    "accept_content": ["json"],
    "enable_utc": True,
    "timezone": "UTC",
    "imports": (
        "srv.celery.async_tasks",
        "srv.celery.periodic_tasks",
        ),
    "task_routes": {
        "srv.celery.*_tasks": {
            "queue": "foobar-queue",
            },
        },
    "beat_schedule": {
        **_celery_beat_schedule,
        },
    }

The beat schedule is a dictionary of crontab-based entries.
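
Roughly along these lines (a minimal sketch; the task module comes from the imports above, the task names are hypothetical placeholders):

from celery.schedules import crontab

# A sketch of what _celery_beat_schedule looks like; task names are made up.
_celery_beat_schedule = {
    "nightly-cleanup": {
        "task": "srv.celery.periodic_tasks.nightly_cleanup",
        "schedule": crontab(minute=0, hour=3),  # every day at 03:00 UTC
    },
    "hourly-report": {
        "task": "srv.celery.periodic_tasks.hourly_report",
        "schedule": crontab(minute=0),  # at the top of every hour
    },
}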

I should point out that beat runs as part of the “normal” async workers:

./bin/celery worker --app srv.celery.celery_app --beat --hostname host@%h --loglevel info --queues foobar-queue --statedb celery-localhost.state --schedule celery-beat-localhost.state

However, as per this section, that doesn’t seem to be the recommended way. But it’s been working so far, so I don’t think it’s the cause of the above issue.
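
For reference, running beat as its own process would presumably look something like this (a sketch using the same app and schedule file; not what I currently run):

./bin/celery beat --app srv.celery.celery_app --loglevel info --schedule celery-beat-localhost.state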

As can be seen, Redis serves as the result store and AMQP (RabbitMQ) as the message queue.

Question 1

When a task is scheduled for a future datetime (i.e. not immediate execution), where is that data stored? And which component manages the stored message, i.e. keeps it around and checks its due time?

Question 2

When a task is due to execute, is the message popped off the queue, i.e. removed from it? Does the consumer (the Celery worker) actually consume the message, or does it remain in the queue in some form? Is this configurable in Celery, or in AMQP?
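
My understanding is that the Celery-side knob here is acknowledgement timing: by default a worker acknowledges (and thus removes) the message just before executing the task, and the following setting would defer the acknowledgement until after the task completes (a sketch; I have not changed this from the default). Is that the right mental model?

# Added to the configuration dictionary above; defaults to False, i.e. the
# message is acknowledged (removed from the queue) right before execution.
_celery_config["task_acks_late"] = True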

Question 3

I looked at the AMQP status and am blown away by the data that I see; I have a sneaking suspicion that the message queue has issues which manifest in the above Celery failures.

Status of node 'rabbit@ip-10-0-1-38' ...
[{pid,1452},
 {running_applications,[{rabbit,"RabbitMQ","3.5.7"},
                        {os_mon,"CPO  CXC 138 46","2.4"},
                        {mnesia,"MNESIA  CXC 138 12","4.13.3"},
                        {xmerl,"XML parser","1.3.10"},
                        {sasl,"SASL  CXC 138 11","2.7"},
                        {stdlib,"ERTS  CXC 138 10","2.8"},
                        {kernel,"ERTS  CXC 138 10","4.2"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang/OTP 18 [erts-7.3] [source] [64-bit] [async-threads:64] [kernel-poll:true]\n"},
 {memory,[{total,46067816},
          {connection_readers,30832},
          {connection_writers,14664},
          {connection_channels,45496},
          {connection_other,73800},
          {queue_procs,235368},
          {queue_slave_procs,0},
          {plugins,0},
          {other_proc,13206720},
          {mnesia,76088},
          {mgmt_db,0},
          {msg_index,147680},
          {other_ets,760744},
          {binary,5297784},
          {code,16972203},
          {atom,654217},
          {other_system,8552220}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,415493324},
 {disk_free_limit,50000000},
 {disk_free,4036173824},
 {file_descriptors,[{total_limit,65436},
                    {total_used,7},
                    {sockets_limit,58890},
                    {sockets_used,3}]},
 {processes,[{limit,1048576},{used,164}]},
 {run_queue,0},
 {uptime,5875120}]

This is worrisome data, e.g. queue_procs at 235368 bytes: memory held by queue masters, indices, and messages. Considering there are at most a handful of tasks per hour (often much less), that number suggests something is not quite right.

I am thinking of purging the message queue, but want to make sure that scheduled tasks aren’t affected. (See question 1 above.)
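
If it comes to that, I would presumably inspect and purge along these lines (a sketch; the queue name comes from my task_routes above):

# Show each queue with its current message and consumer counts.
sudo rabbitmqctl list_queues name messages consumers

# Purge the messages in this app's queue (the command asks for confirmation).
./bin/celery purge --app srv.celery.celery_app --queues foobar-queue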

It would help if somebody could shed some light on the above questions, and perhaps also on the error message.

Much thanks,
Jens

Jens Troeger

Oct 7, 2019, 9:30:33 PM
to celery-users
Two things here:
  1. Make sure that RabbitMQ is the most recent version; the default Ubuntu packages are way outdated. Follow these instructions.
  2. Delete the beat’s .db file, and then restart; see the sketch below.
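
For this setup, step 2 would look something like the following (a sketch, assuming the --schedule file from the original worker command; depending on the dbm backend the on-disk file may carry a .db suffix):

# Stop the worker (which also stops the embedded beat), then remove the
# schedule file; the glob matches both the bare file and a .db-suffixed one.
rm celery-beat-localhost.state*

# Restart; beat recreates the schedule file from beat_schedule in the config.
./bin/celery worker --app srv.celery.celery_app --beat --hostname host@%h \
    --loglevel info --queues foobar-queue --statedb celery-localhost.state \
    --schedule celery-beat-localhost.state
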
Good luck…