Stop/Start rabbitmq may corrupt persistent message store


Kai Krueger

Aug 3, 2021, 5:12:03 AM
to rabbitmq-users
Hi

After several restarts (5-15) of rabbitmq-server, I can reproduce a state where RabbitMQ is unable to recover the message store of the vhost '/'.

My minimal test environment is a single server running RabbitMQ, with a config that uses an upstream federation to receive messages.
In one terminal I start the server; in another I stop it with `rabbitmqctl shutdown`.

So the server should stop gracefully. After 10 to 60 seconds (once the shell prompt in the second window has returned) I start the server again from the command line.

After about 5 to 15 stop/start cycles of the server (versions 3.8.19 and 4.9.1) it is unable to recover the message store and the vhost '/' is stopped.
It is not possible to restart that vhost unless I delete all the queued messages.
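For reference, the stop/start loop described above can be sketched as a small POSIX shell helper. The `restart_cycles` function and its parameters are my own illustration, not anything shipped with RabbitMQ; only `rabbitmq-server` and `rabbitmqctl shutdown` are the real commands used in the reproduction:

```shell
# restart_cycles CYCLES START_CMD STOP_CMD PAUSE
# Hypothetical helper: runs CYCLES start/wait/stop/wait rounds, logging each step.
# Commands are passed in as arguments so the loop can be dry-run without a broker.
restart_cycles() {
    n=$1; start_cmd=$2; stop_cmd=$3; pause=$4
    i=1
    while [ "$i" -le "$n" ]; do
        echo "cycle $i: starting broker"
        $start_cmd
        sleep "$pause"          # give the federation link time to pull messages
        echo "cycle $i: graceful shutdown"
        $stop_cmd               # `rabbitmqctl shutdown` returns once the node is down
        sleep "$pause"          # the 10-60 s gap before the next start, as in the report
        i=$((i + 1))
    done
}

# Actual reproduction (uncomment on a host with RabbitMQ installed):
# restart_cycles 15 "rabbitmq-server -detached" "rabbitmqctl shutdown" 30
```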

Does anyone have an idea what is going wrong?

Here is an extract of the log file:
```
2021-08-03 08:20:28.928814+00:00 [error] <0.223.0>
2021-08-03 08:20:28.928823+00:00 [error] <0.223.0> Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.725243124.2176057346.69472>}} from <0.223.0> to <0.2258.0> in an old incarnation (1
627909770) of this node (1627909771)
2021-08-03 08:20:28.928823+00:00 [error] <0.223.0>
2021-08-03 08:20:28.928823+00:00 [error] <0.223.0>
2021-08-03 08:20:29.015467+00:00 [info] <0.463.0> Started message store of type persistent for vhost '/'
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0> ** Generic server <0.666.0> terminating
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0> ** Last message in was {init,
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                            {<0.463.0>,
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                             [[{segments,[{10,3},{7,3},{6,12},{8,9},{9,3}]},
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                               {persistent_ref,
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                                   <<215,215,217,117,6,39,148,21,167,236,73,51,
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                                     168,44,216,145>>},
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                               {persistent_count,30},
2021-08-03 08:20:29.221685+00:00 [error] <0.666.0>                               {persistent_bytes,11567}]]}}
...
```

Thanks
Kai Krueger
Fraunhofer ITWM


Loïc Hoguin

Aug 3, 2021, 6:09:50 AM
to rabbitm...@googlegroups.com

Hello, could you please provide additional lines from the log so we can see the reason for the error? Everything that refers to pid <0.666.0>. Otherwise, feel free to send the full log to me directly.


-- 

Loïc Hoguin


Loïc Hoguin

Aug 3, 2021, 9:22:33 AM
to rabbitm...@googlegroups.com

Thanks for the logs.

The crash appears to happen because more messages are stored than expected.


The queue is in a very weird state:


  • It has shut down normally and saved its current state about the data on disk
  • The data on disk does not match what the saved state indicates


Subsequent logs show other processes (other queues?) with better-known issues: the segment file has entries that should no longer be in the journal, yet they are still there as well. These hint at a bad shutdown “at the wrong time” that the recovery code doesn’t know how to handle.


But the first one is intriguing. It’s as if the system didn’t flush the most recent changes to disk. I would love to get the data for that queue if possible; you can find its name just below the snippet you initially posted. It should have at least the files 6.idx, 7.idx, 8.idx, 9.idx and 10.idx (perhaps more or fewer, considering it is in a bad state). I can’t promise a quick fix considering the season, but I can take a look as part of my current work.


In any case, there’s currently not much you can do other than wipe the data and restart, I’m afraid.
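For completeness, one way to carry out that “wipe the data” step, assuming losing the queued messages on '/' is acceptable: `delete_vhost` and `add_vhost` are standard `rabbitmqctl` subcommands, while the `run` dry-run wrapper below is only an illustration and not part of RabbitMQ:

```shell
# Hypothetical dry-run wrapper: with the default DRY_RUN=1 it only prints
# the commands instead of executing them, so the sequence can be reviewed first.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# Drop and recreate the vhost, discarding its message store:
run rabbitmqctl delete_vhost /
run rabbitmqctl add_vhost /
# Exchanges, queues, bindings and permissions on '/' are gone afterwards and
# must be re-declared (or re-imported from a definitions export).
```

Run with `DRY_RUN=0` on the broker host to actually execute the commands.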


-- 

Loïc Hoguin

M K

Aug 9, 2021, 6:38:48 AM
to rabbitmq-users