RabbitMQ 3.8.3 killed by OoM killer on restart while 3.8.2 works


Jeroen Hoekx

Mar 24, 2020, 4:44:16 AM
to rabbitmq-users
Hello,

We are not pushing much data through RabbitMQ, so we have set a relatively small memory limit of 384 MB. After upgrading to 3.8.3, our containers are always killed on restart.

The following docker-compose file illustrates the issue:

version: '2'

networks:
  rabbitnet-1:
  rabbitnet-2:

volumes:
  rabbitvol-1:
  rabbitvol-2:

services:
  rabbitmq-3.8.2:
    image: rabbitmq:3.8.2-alpine
    environment:
      RABBITMQ_VM_MEMORY_HIGH_WATERMARK: 200MiB
      RABBITMQ_NODENAME: "rabbitmq01@localhost"
      RABBITMQ_USE_LONGNAME: "true"
    mem_limit: 402653184 # 384M
    memswap_limit: 402653184
    networks:
      - rabbitnet-1
    volumes:
      - rabbitvol-1:/var/lib/rabbitmq
  rabbitmq-3.8.3:
    image: rabbitmq:3.8.3-alpine
    environment:
      RABBITMQ_VM_MEMORY_HIGH_WATERMARK: 200MiB
      RABBITMQ_NODENAME: "rabbitmq01@localhost"
      RABBITMQ_USE_LONGNAME: "true"
    mem_limit: 402653184 # 384M
    memswap_limit: 402653184
    networks:
      - rabbitnet-2
    volumes:
      - rabbitvol-2:/var/lib/rabbitmq

Starting this works fine.
Stopping and then restarting (maybe one or two times) leads to:

rabbitmq-3.8.3_1  | 2020-03-24 08:35:06.452 [info] <0.320.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbitmq01@localhost/quorum/rabbitmq01@localhost/00000002.wal"]
rabbitmq-alloc_rabbitmq-3.8.3_1 exited with code 0

Here RabbitMQ was killed by the OoM killer.

Looking at the files mentioned:

$ docker-compose exec rabbitmq-3.8.2 ls -lh /var/lib/rabbitmq/mnesia/rabbitmq01@localhost/quorum/rabbitmq01@localhost/
total 20K   
-rw-r--r--    1 rabbitmq rabbitmq       5 Mar 24 08:38 00000001.wal
-rw-r--r--    1 rabbitmq rabbitmq    5.3K Mar 24 08:38 meta.dets
-rw-r--r--    1 rabbitmq rabbitmq    5.3K Mar 24 08:38 names.dets
$ docker-compose exec rabbitmq-3.8.3 ls -lh /var/lib/rabbitmq/mnesia/rabbitmq01@localhost/quorum/rabbitmq01@localhost/
total 451M  
-rw-r--r--    1 rabbitmq rabbitmq  451.2M Mar 24 08:38 00000001.wal
-rw-r--r--    1 rabbitmq rabbitmq    5.3K Mar 24 08:38 meta.dets
-rw-r--r--    1 rabbitmq rabbitmq    5.3K Mar 24 08:38 names.dets

It's clear that this file cannot be read entirely into memory within the container's allocation: the WAL alone is 451.2 MiB, while the container is capped at 384 MiB.

I did not see anything in the release notes that could cause this, and I would not expect such a big change in a minor release. Is there a way to tweak this behavior?

Greetings,

Jeroen

Luke Bakken

Mar 24, 2020, 10:56:24 AM
to rabbitmq-users
Hello -

You're going to have to provide more information about your test scenarios:
  • Exactly how much is "not pushing much data"?
  • I'm assuming you're publishing a certain number of messages but not consuming them, so they create a backlog. Is this backlog exactly the same size in both tests? If it is, I would not expect to see two wildly different sizes for the WAL (write-ahead log) files.
Thanks,
Luke

Jeroen Hoekx

Mar 24, 2020, 11:10:54 AM
to rabbitm...@googlegroups.com
Hello Luke,

The problem that I describe happens in a completely clean setup, as evidenced by the docker-compose file I included. There has not been any usage of RabbitMQ at all at that point. It is enough to run `docker-compose up`, press CTRL-C, and run `docker-compose up` again; repeating this at most two times shows the problem, as spelled out below. The 3.8.3 container will be killed by the OoM killer.
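
Spelled out as a shell session (nothing here beyond the compose file above):

$ docker-compose up   # both containers come up fine the first time
^C                    # stop the stack with CTRL-C
$ docker-compose up   # restart; after one or two repeats,
                      # the 3.8.3 container is OoM-killed during WAL recovery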

We encountered the problem in our build pipeline, which happens to restart RabbitMQ before it has been used for any data.

Greetings,

Jeroen

--
Jeroen Hoekx
IT Engineer
+32 11 26 38 30 | trendminer.com

Luke Bakken

Mar 24, 2020, 11:14:56 AM
to rabbitmq-users
Hi Jeroen,

Thanks for the clarification. I will pass this on to someone on the team who may have ideas.

Luke

Karl Nilsson

Mar 24, 2020, 11:35:03 AM
to rabbitmq-users
Hi,

There are a couple of settings you can use to control the WAL size (and ultimately peak memory use during recovery), but the simplest would be to set raft.wal_max_size_bytes to a small value such as 32 MiB; a sketch of what that could look like is below. This is assuming you are not and will not be using quorum queues or MQTT connection tracking.
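
For example, in rabbitmq.conf (a sketch; 32 MiB written out in bytes):

# /etc/rabbitmq/rabbitmq.conf
# Cap each Raft WAL segment at 32 MiB so a node never has to replay
# a file approaching the ~512 MiB default size during recovery.
raft.wal_max_size_bytes = 33554432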

Cheers
Karl


Jeroen Hoekx

Mar 24, 2020, 2:41:37 PM
to rabbitm...@googlegroups.com
Thanks for the feedback already,

In the documentation, I can see (for raft.wal_max_size_bytes):
## NB: changing these on a node with existing data directory
## can lead to DATA LOSS.
##

Since, once we roll out this change, it will go through our pipeline and target systems with existing data directories, I would like to check with you whether this is a safe thing to do.

To me it feels like decreasing the WAL size is an action that could lead to corruption. Note that I have not tried this yet; that's for tomorrow when I'm working again.

Looking through the release notes, there is one change that looks like it could have caused this: https://github.com/rabbitmq/rabbitmq-server/issues/2222
Since the default WAL max size is 512 MiB, if I read the other docs correctly, maybe in the past the container never actually wrote that file on shutdown?

We went by what the production checklist says:
Nodes hosting RabbitMQ should have at least 256 MiB of memory available at all times.

I would not expect a larger default WAL size to crash the system, given that statement.

Greetings,

Jeroen

Karl Nilsson

Mar 25, 2020, 7:53:05 AM
to rabbitm...@googlegroups.com
Decreasing the WAL file size won't take effect until the next WAL file is created, which is safe and won't cause corruption. Because of that, you will need enough memory for your RabbitMQ nodes to recover the current files; after that is done and smaller WALs have been created, you can reduce the memory again (one way to stage this is sketched below).
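
In docker-compose terms, a temporary override for that one-off recovery restart could look like this (a hypothetical sketch; 768M is simply comfortable headroom above the ~451 MiB WAL seen earlier):

# docker-compose.override.yml -- temporary, for the one-off recovery restart
services:
  rabbitmq-3.8.3:
    mem_limit: 805306368      # 768M: enough to replay the existing large WAL
    memswap_limit: 805306368  # drop back to 384M once smaller WALs are in place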

That said, it should be possible to reduce peak memory usage during WAL recovery. I've opened a GitHub issue on our Ra library to track this: https://github.com/rabbitmq/ra/issues/173

Cheers
Karl

--
Karl Nilsson

Pivotal/RabbitMQ

Jeroen Hoekx

Apr 15, 2020, 3:27:56 AM
to rabbitm...@googlegroups.com
Hello,

I was finally able to test this.

I configured my WAL max size to be 100 MiB:
raft.wal_max_size_bytes = 104857600

With that setting it no longer crashes like it did before, but I do get this warning during recovery:
rabbitmq-3.8.3_1  | 2020-04-15 06:56:37.055 [info] <0.371.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbitmq01@localhost/quorum/rabbitmq01@localhost/00000001.wal"]
rabbitmq-3.8.3_1  | 2020-04-15 06:56:37.106 [warning] <0.371.0> wal: encountered error during recovery: badarg

This is again just starting, stopping and restarting the container, with a fresh data volume and without any data ingested.

It looks like this error is hit during WAL recovery, but I have no idea how to diagnose it further.

On the other hand, I noticed that the issue you created for the Ra library was fixed, so I think I will just try the next minor release of 3.8 with the default settings.

Thanks for your help,

Jeroen

Karl Nilsson

Apr 16, 2020, 9:10:25 AM
to rabbitm...@googlegroups.com
This error can be ignored; it will no longer occur in the forthcoming 3.8 release.

Cheers
Karl



--
Karl Nilsson

Pivotal/RabbitMQ