rabbitmq streams and "rabbit_stream_coordinator: failed to get tail of member"

Artur Wroblewski

Apr 17, 2022, 12:28:44 PM
to rabbitmq-users
Hi,

A machine (a VM in my case) died unexpectedly recently. After the restart, RabbitMQ stopped accepting data into a stream.

In the logs I found

    rabbit_stream_coordinator: failed to get tail of member __test_performance_1650114290250848951 on rabbit@ahost in 1 Error: {error,app_not_running}

Removing all data in the streams subdirectory and restarting RabbitMQ helped.

Is there an option to automate this, i.e. for RabbitMQ to skip the corrupted part of a stream and simply move on?

For example, the same machine runs PostgreSQL. After the same incident, it found corrupted records, logged the problem, ignored those records, and started accepting data again. It would be great if this were possible with RabbitMQ as well.

Regards,

Artur


Michal Kuratczyk

Apr 19, 2022, 7:20:08 AM
to rabbitm...@googlegroups.com
Hi,

I've reproduced the situation (I have the "failed to get tail of member" logged) but I can still connect to the stream just fine. Do you have full logs from this incident by any chance?
Also, is "rabbit@ahost" the node that went down or one of the other nodes?

There are multiple places where we ignore/truncate partially written data at the end of the files, exactly for situations like this, but perhaps there is a situation where we don't.
But it's also possible that this log is a red herring and the problem was elsewhere.

Best,

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/97dd8a3e-04c9-46c5-acf3-19c5cf0b6eb5n%40googlegroups.com.


--
Michał
RabbitMQ team

Artur Wroblewski

Apr 21, 2022, 3:57:04 PM
to rabbitmq-users
Hi,

Node "rabbit@ahost" is just a single RabbitMQ instance.

Around the time of the crash, I also see in the logs

    rabbit_stream_coordinator: failed to get tail of member __n23-__test_performance_1650114290250848951 on rabbit@ahost in 5 Error: function_clause

The above warning was raised about every 100 ms (10 times per second), possibly until I cleared the streams directory.

I am unable to replicate the issue with a simple "kill -9". I will try
to put more effort into replicating this, but I will need to prepare
a machine that I am willing to sacrifice by pulling the plug
(literally :)).

Regards,

Artur

Michal Kuratczyk

Apr 21, 2022, 4:01:29 PM
to rabbitm...@googlegroups.com
Hi

Steps to reproduce would be amazing. If you have a full log from the previous incident, please share it. At the very least, please share the complete stack trace (the lines above and below that function_clause).

Thanks,




--
Michał
RabbitMQ team

Artur Wroblewski

Jun 20, 2022, 6:15:30 PM
to rabbitmq-users
Hi,

No steps to reproduce yet, but I had a crash again, and here is the situation.

1. Using RabbitMQ 3.10.5 this time. Single RabbitMQ instance.

2. The logs are flooded with the following messages, at a steady rate of about 20 per second:

    2022-06-20 22:02:57.003759+00:00 [warning] <0.1109.0> rabbit_stream_coordinator: failed to get tail of member __n23-sensor_pressure_1650136879875999476 on rabbit@iotp in 18 Error: function_clause
    2022-06-20 22:02:57.056537+00:00 [warning] <0.1110.0> rabbit_stream_coordinator: failed to get tail of member __n23-sensor_light_1650136878094380461 on rabbit@iotp in 16 Error: function_clause
    2022-06-20 22:02:57.076811+00:00 [warning] <0.1111.0> rabbit_stream_coordinator: failed to get tail of member __n23-sensor_temperature_1650136883291908840 on rabbit@iotp in 17 Error: function_clause
    2022-06-20 22:02:57.097473+00:00 [warning] <0.1112.0> rabbit_stream_coordinator: failed to get tail of member __n23-sensor_humidity_1650136876352549559 on rabbit@iotp in 18 Error: function_clause
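
To see which streams are affected when the log is flooded like this, the warnings can be grouped by member and error. The following is only a minimal sketch: the regular expression is modelled on the lines quoted in this thread (other RabbitMQ versions may word the message differently), the function name is hypothetical, and the meaning of the number after "in" is not stated in the thread, so it is matched but not interpreted.

```python
import re
from collections import Counter

# Pattern modelled on the warning lines quoted in this thread (an assumption;
# the exact wording may differ between RabbitMQ versions).
TAIL_FAILURE = re.compile(
    r"failed to get tail of member (?P<member>\S+) on (?P<node>\S+) "
    r"in (?P<num>\d+) Error: (?P<error>.+)$"
)


def summarize_tail_failures(lines):
    """Count 'failed to get tail of member' warnings per (member, error)."""
    counts = Counter()
    for line in lines:
        match = TAIL_FAILURE.search(line.rstrip("\n"))
        if match:
            counts[(match["member"], match["error"].strip())] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed in directly and the result inspected with `counts.most_common()`.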

3. The list_queues command shows no message count for the affected streams

    # rabbitmqctl list_queues
    Timeout: 60.0 seconds ...
    Listing queues for vhost / ...
    name messages
    n23-sensor/battery_level 131166
    n23-sensor/accelerometer 58993
    test/performance 0
    n23-sensor/switch 99
    n23-sensor/light
    n23-sensor/humidity
    n23-sensor/pressure
    n23-sensor/temperature
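
A quick way to pick out the affected streams from this output is to look for rows that have a name but no count. This is a rough sketch under two assumptions: only the default "name" and "messages" columns are printed, and queue names contain no whitespace. The function name is hypothetical.

```python
def queues_without_count(output):
    """Return queue names whose row has an empty message count column."""
    missing = []
    for line in output.splitlines():
        fields = line.split()
        # The banner and header lines in the quoted output all have two or
        # more fields, so a single-field row is a queue name with no count.
        if len(fields) == 1:
            missing.append(fields[0])
    return missing
```

The text to pass in is the captured standard output of `rabbitmqctl list_queues`.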

4. CPU usage of the beam.smp process is at 368%, with each core at 50-80% utilization.

5. Finally

    # du -sm stream/*
    12 stream/__n23-sensor_accelerometer_1650136610118548000
    24 stream/__n23-sensor_battery_level_1650136874671313741
    3202 stream/__n23-sensor_humidity_1650136876352549559
    2782 stream/__n23-sensor_light_1650136878094380461
    3170 stream/__n23-sensor_pressure_1650136879875999476
    1 stream/__n23-sensor_switch_1650136881596498973
    3258 stream/__n23-sensor_temperature_1650136883291908840
    1 stream/__test_performance_1650130735857311698

Any tips on how I can recover the streams and minimize data loss (my only method so far is to remove all data from the streams' directories)?

I can keep my system in this state for a few days. If, by any chance, you have a patch to try so that RabbitMQ can recover automatically, please let me know.

Regards,

Artur

Michal Kuratczyk

Jun 20, 2022, 6:39:11 PM
to rabbitm...@googlegroups.com
Hi,

I forgot about this thread but we indeed worked on this recently. It's in the PR phase currently: https://github.com/rabbitmq/osiris/pull/87.
It's not easy to try just yet (unless you are comfortable building Erlang projects on your own) but the fix is coming. Based on my tests (hard reset while data is being written), there are three common ways the files get corrupted:
1. They become empty (usually if the reset happened shortly after they were created)
2. Segments have partially written chunks
3. There are zeros written at the end of the index file

The PR should solve all 3 cases. The first one is easy to fix manually - if you have empty files in your stream/__* folders - just delete them.
The other two are harder, unless partial data loss is acceptable, in which case deleting the latest segment and index files for the problematic streams should do the trick.
Unfortunately you can have any combination of those three, so just deleting the empty files, if you have them, won't necessarily solve the problem.
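
For anyone who wants to check a stream directory for cases 1 and 3 before deleting anything, here is a minimal sketch. It makes several assumptions that are not confirmed in this thread: that index files carry a `.index` suffix, and that a run of `tail_bytes` zero bytes at the end of an index file is long enough to be suspicious (a healthy index may legitimately end in a few zero bytes, so the threshold is arbitrary). The function name is hypothetical. It does not detect case 2, which requires understanding the chunk format.

```python
from pathlib import Path


def find_suspect_files(stream_dir, tail_bytes=64):
    """Return (empty_files, zero_tailed_indexes) for one stream directory.

    Read-only: nothing is deleted or modified.
    """
    empty, zero_tailed = [], []
    for path in sorted(Path(stream_dir).iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if size == 0:
            # Case 1: the file was created but nothing reached the disk.
            empty.append(path)
        elif path.suffix == ".index" and size >= tail_bytes:
            # Case 3 (heuristic): the last tail_bytes bytes are all zero.
            with path.open("rb") as fh:
                fh.seek(size - tail_bytes)
                if not any(fh.read(tail_bytes)):
                    zero_tailed.append(path)
    return empty, zero_tailed
```

Run it with RabbitMQ stopped against each stream/__* directory, and treat the result as a hint for manual inspection rather than a verdict.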

If you would like to try the patch, ping me on the RabbitMQ slack tomorrow and I can guide you through the process.

Best,



--
Michał

Artur Wroblewski

Jun 22, 2022, 5:23:51 PM
to rabbitm...@googlegroups.com
On Tue, Jun 21, 2022 at 12:38:50AM +0200, Michal Kuratczyk wrote:
> Hi,
>
> I forgot about this thread but we indeed worked on this recently. It's in
> the PR phase currently: https://github.com/rabbitmq/osiris/pull/87.

Does this patch skip the whole index/segment file, or does it try to recover
the complete chunks within those files?

> It's not easy to try just yet (unless you are comfortable building Erlang
> projects on your own) but the fix is coming. Based on my tests (hard reset
> while data is being written), there are three common ways the files get
> corrupted:
> 1. They become empty (usually if the reset happened shortly after they were
> created)
> 2. Segments have partially written chunks
> 3. There are zeros written at the end of the index file

My problem seems to be (2) above. The last segment and index files for a
stream are not empty, and there are no trailing zeros at the end of either
the segment or the index file.

> The PR should solve all 3 cases. The first one is easy to fix manually - if
> you have empty files in your stream/__* folders - just delete them.
> The other two are harder, unless partial data loss is acceptable, in which
> case deleting the latest segment and index files for the problematic
> streams should do the trick.

I dropped the offending files and RabbitMQ started normally. Also, I took
one set of the corrupted files, copied them to another machine, and
reproduced the problem there.
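
For reference, the workaround of dropping the latest segment and index files can be done less destructively by moving them aside, which also keeps a copy for reproducing the problem elsewhere. This is only a sketch under stated assumptions: RabbitMQ is stopped, the files use `.segment` and `.index` suffixes, a segment and its index share the same base name, and the names sort lexically in write order. None of these details are spelled out in this thread, and the function name is hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_latest_pair(stream_dir, backup_dir):
    """Move the newest segment file and its index file into backup_dir.

    Returns the names of the files that were moved (possibly empty).
    """
    stream_dir, backup_dir = Path(stream_dir), Path(backup_dir)
    # Assumption: lexical order of the file names matches write order.
    segments = sorted(stream_dir.glob("*.segment"))
    if not segments:
        return []
    latest = segments[-1]
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in (latest, latest.with_suffix(".index")):
        if path.exists():
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved
```

Any data in the moved files is lost to the stream, so this only makes sense when partial data loss is acceptable.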

I will contact you privately regarding the compilation.

[...]

Best regards,

Artur

--
https://mortgage.diy-labs.eu/

Michal Kuratczyk

Jun 24, 2022, 3:53:53 AM
to rabbitm...@googlegroups.com
Quick update on this: We tested the PR on Artur's data and it was able to recover most of the data (as expected when a machine crashes while writing to disk - some was unrecoverable).
Once https://github.com/rabbitmq/osiris/pull/87 is merged, RabbitMQ should be able to recover from common Stream file corruptions caused by server outages
(Osiris is a component responsible for maintaining on-disk log for Streams).

Best,



--
Michał
RabbitMQ team

Artur Wroblewski

Aug 25, 2022, 2:15:45 PM
to rabbitm...@googlegroups.com
On Fri, Jun 24, 2022 at 09:53:34AM +0200, Michal Kuratczyk wrote:
> Quick update on this: We tested the PR on Artur's data and it was able to
> recover most of the data (as expected when a machine crashes while writing
> to disk - some was unrecoverable).
> Once https://github.com/rabbitmq/osiris/pull/87 is merged, RabbitMQ should
> be able to recover from common Stream file corruptions caused by server
> outages
> (Osiris is a component responsible for maintaining on-disk log for Streams).

Yesterday, the electricity went out in my house and my three RabbitMQ
instances crashed hard.

This time, all the instances were able to start without any manual
assistance.

Thanks for fixing this.