Disk Space Issues: How to clean msg_store_persistent and msg_store_transient?


Kyle Flavin

Apr 6, 2015, 12:53:44 PM
to rabbitm...@googlegroups.com
We had an issue this weekend where our disk filled up on our RabbitMQ instance, and brought down several other dependent services.  On further investigation, the cause was the msg_store_persistent and msg_store_transient folders (under /var/lib/rabbitmq) filling with many *.rdq files, all ~16MB in size.

I'm looking for the proper way to manage these files.  We do not currently "ack" messages on the client (no_ack=True).  Would that cause the store to fill up?  Are ack'd messages removed?  Is it safe to setup a logrotate job to periodically delete older *.rdq files?

Michael Klishin

Apr 6, 2015, 1:29:42 PM
to Kyle Flavin, rabbitm...@googlegroups.com
On 6 April 2015 at 19:53:47, Kyle Flavin (kyle....@gmail.com) wrote:
> I'm looking for the proper way to manage these files. We do not
> currently "ack" messages on the client (no_ack=True). Would
> that cause the store to fill up? Are ack'd messages removed?

If you have messages piling up in the message store, this means a queue somewhere
(can be in any vhost) is growing out of bounds.

RabbitMQ has a disk free space monitor but it has a low value by default because
certain popular Linux distributions have unreasonably low partition size for /var/*.

You should configure it to a few GBs:
https://github.com/rabbitmq/rabbitmq-server/blob/master/docs/rabbitmq.config.example#L190-194
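For reference, a minimal sketch of that setting in rabbitmq.config (the 2 GB value here is an example, not a recommendation):

```erlang
%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    %% block publishers when free disk space drops below 2 GB
    {disk_free_limit, 2000000000}
    %% alternatively, relative to total RAM:
    %% {disk_free_limit, {mem_relative, 1.0}}
  ]}
].
```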

(directory size monitoring should suggest what your 99th percentile size is)

Note that RabbitMQ performs compaction of on-disk segments; during compaction the store can temporarily grow
to up to 2x its original size. That is a temporary condition.

> Is it safe to setup a logrotate job to periodically delete older
> *.rdq files?

DO NOT do that. This will remove your on-disk data *and* make the node fail in obscure ways.
You probably would not logrotate the data files of PostgreSQL, MySQL, or similar; doing it to the RabbitMQ data store
is an equally terrible idea.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Michael Klishin

Apr 6, 2015, 1:36:50 PM
to Kyle Flavin, rabbitm...@googlegroups.com
On 6 April 2015 at 19:53:47, Kyle Flavin (kyle....@gmail.com) wrote:
> We had an issue this weekend where our disk filled up on our RabbitMQ
> instance, and brought down several other dependent services.
> On further investigation, the cause was the msg_store_persistent
> and msg_store_transient folders (under /var/lib/rabbitmq)
> filling with many *.rdq files, all ~16MB in size.

and to explain the transient part: under memory pressure, messages are moved to disk to free up RAM.

How soon that will begin happening can be configured as well with a setting called `vm_memory_high_watermark_paging_ratio`:
https://github.com/rabbitmq/rabbitmq-server/blob/master/docs/rabbitmq.config.example#L181-188
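A sketch of those two settings in rabbitmq.config (both values shown are the defaults):

```erlang
[
  {rabbit, [
    %% block publishers once RabbitMQ uses 40% of RAM
    {vm_memory_high_watermark, 0.4},
    %% start paging messages to disk at 50% of the watermark,
    %% i.e. at 20% of RAM
    {vm_memory_high_watermark_paging_ratio, 0.5}
  ]}
].
```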

Kyle Flavin

Apr 6, 2015, 3:04:35 PM
to rabbitm...@googlegroups.com, kyle....@gmail.com
Thanks Michael.

We have 4GB on our /var partition.  It sounds like if I set the free space monitor from 50MB to a couple of GBs, it will stop processing messages sooner?  From the disk alarms documentation:

https://www.rabbitmq.com/disk-alarms.html
"RabbitMQ will block producers when free disk space drops below a certain limit."

What I need is the older messages/events to be cleared.  It took about 2 months to fill the 4GB partition; it didn't happen overnight, so I think space-wise, we're okay.  I'm noticing messages in the RDQ files from early March.  The thing is, our consumers need to deal with a message immediately.  If the message isn't handled immediately, it becomes stale (within minutes) as far as our app is concerned, and there isn't any need to process it at a later time.  So ideally, I'd like to not keep around any messages older than a day (or even a few hours).

It looks like I might be able to accomplish this using TTLs on the messages:
https://www.rabbitmq.com/ttl.html

Also, not sure if you saw it before, but I'm not currently ACK'ing messages from the consumer.  I'm trying to determine if this is keeping the messages around longer than if I was ACK'ing them?
 

Michael Klishin

Apr 6, 2015, 3:27:00 PM
to Kyle Flavin, rabbitm...@googlegroups.com
On 6 April 2015 at 22:04:38, Kyle Flavin (kyle....@gmail.com) wrote:
> We have 4GB on our /var partition. It sounds like if I set the free
> space monitor from 50MB to a couple GB's, it will stop processing
> messages sooner? 

Publishers will be blocked sooner, yes.

> From the disk alarms documentation:
>
> https://www.rabbitmq.com/disk-alarms.html
>
> ...
> What I need is the older messages/events to be cleared. It took
> about 2 months to fill the 4GB partition; it didn't happen overnight,
> so I think space-wise, we're okay. I'm noticing messages in the
> RDQ files from early March.

That is very odd. I wonder if there is a queue somewhere accumulating messages
at a very slow rate due to a mistake?

> The thing is, our consumers need to
> deal with a message immediately. If the message isn't handled
> immediately, it becomes stale (within minutes) as far as our
> app is concerned, and there isn't any need to process it at a later
> time. So ideally, I'd like to not keep around any messages older
> than a day (or even a few hours).
> It looks like I might be able to accomplish this using TTL's on
> the messages?:
> https://www.rabbitmq.com/ttl.html

Then TTLs for messages sound like exactly what you're looking for.

See https://www.rabbitmq.com/blog/2014/01/23/preventing-unbounded-buffers-with-rabbitmq/,
too.
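For example, per-queue message TTL can be applied as a policy (queue-name pattern and the 60-second value below are illustrative):

```shell
# expire messages after 60 seconds in all matching queues
rabbitmqctl set_policy TTL ".*" '{"message-ttl":60000}' --apply-to queues
```

The same effect can be had at declaration time with the `x-message-ttl` queue argument.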

> Also, not sure if you saw it before, but I'm not currently ACK'ing
> messages from the consumer. I'm trying to determine if this is
> keeping the messages around longer than if I was ACK'ing them?

The "noack" option should be read as "no manual ack", which means
RabbitMQ considers a message to be acknowledged as soon as it writes it to the socket
(oversimplifying, but you get the idea). So no, that is not the issue.

In summary, I'd bump the free disk space limit to at least a few hundred MB and use TTLs
on messages, since as you say the messages in your system are ephemeral in nature.

Kyle Flavin

Apr 6, 2015, 5:36:56 PM
to rabbitm...@googlegroups.com, kyle....@gmail.com
I found a "test" queue on the rabbit server (rabbitmqctl list_queues) which had 10,000+ messages in it.  I think this may have been part of the problem.  There were no listening consumers, and it was just sitting there; all messages were unacked, and it had been set as "durable".  I deleted it.

I'm going to give the TTL a try.  I think that's exactly what I need.

Thanks again!

Kyle Flavin

Apr 8, 2015, 1:41:06 PM
to rabbitm...@googlegroups.com, kyle....@gmail.com
Even with the TTL set, messages are not being removed:

# rabbitmqctl list_queues name messages messages_unacknowledged
Listing queues ...
amq.gen-I4NaY3EBIXPBRoWY2LWzEA  9832    9832

Here's the basics of the consumer code I'm using (with Python pika).  I've tried setting the TTL to various values, including 60000 and 0.  The messages in my queue above just continue to accumulate, and they all show as unacknowledged:

credentials = pika.PlainCredentials(user, pw)
connection = pika.BlockingConnection(pika.ConnectionParameters(host=host, credentials=credentials))
channel = connection.channel()
#channel.exchange_declare(exchange="cloudstack-events", type='topic', durable=True)
result = channel.queue_declare(exclusive=True, arguments={"x-message-ttl": 0})
queue_name = result.method.queue
#channel.queue_bind(exchange='cloudstack-events', queue=queue_name, routing_key='*.*.*.*.*')
channel.queue_bind(exchange='cloudstack-events', queue=queue_name, routing_key='#')



It's a topic exchange fed by CloudStack.  I tried changing the routing key, thinking perhaps I wasn't capturing all messages, but from what I can tell I am.  That being the case, I'm not sure why these messages aren't being acknowledged?  When I repeat the same exercise using the tutorials, it works.  The only thing I can think of is that maybe CloudStack, as the publisher, is doing something that makes the messages stick around.  I looked at that code on Github, though, and it doesn't appear to be doing anything strange...

Michael Klishin

Apr 8, 2015, 1:44:46 PM
to Kyle Flavin, rabbitm...@googlegroups.com
On 8 April 2015 at 20:41:08, Kyle Flavin (kyle....@gmail.com) wrote:
> # rabbitmqctl list_queues name messages messages_unacknowledged
> Listing queues ...
> amq.gen-I4NaY3EBIXPBRoWY2LWzEA 9832 9832
>
> Here's the basics of the consumer code I'm using (with Python
> pika). I've tried setting the TTL to various values, including
> 60000 and 0. The messages in my queue above just continue to accumulate,
> and they all show as unacknowledged:

This means some of your code uses manual acknowledgement mode but never acknowledges
deliveries. It should either acknowledge them or use automatic ("noack") mode. 
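To illustrate the two modes in pika's (pre-1.0) API, a minimal consumer sketch; the host, queue name, and handle() function are hypothetical, and a running broker is assumed:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

def on_message(ch, method, properties, body):
    handle(body)  # hypothetical application handler
    # manual mode: without this ack, deliveries pile up as "unacknowledged"
    ch.basic_ack(delivery_tag=method.delivery_tag)

# manual acknowledgement mode (no_ack defaults to False)
channel.basic_consume(on_message, queue="my-queue")

# ...or automatic ("noack") mode: the broker considers messages acked on
# delivery, and on_message must not call basic_ack
# channel.basic_consume(on_message, queue="my-queue", no_ack=True)

channel.start_consuming()
```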

Kyle Flavin

Apr 8, 2015, 2:43:46 PM
to rabbitm...@googlegroups.com, kyle....@gmail.com
Yep, that was it.  My apologies, I misunderstood the setting, and I wasn't manually ack'ing.  I flipped on noack and everything looks happy.

Thanks.