We've been having issues with RabbitMQ 'soft' crashing over the last couple of days.
It never stops working entirely, but it generates a steady volume of error messages, and some of our customers are unable to initiate new subscriptions.
If we restart RabbitMQ, it appears to recover properly and resumes processing the same client load it was handling before, but without errors.
I've attached a sample of the crash log from around the latest event, and a sample of the RabbitMQ log covering the same period.
We also have a process that captures 'rabbitmqctl status' output (sketched below). I've attached a 'good' snapshot from earlier in the morning, when there were no errors, and several snapshots taken around the time we started having issues.
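For context, the capture process is essentially a periodic loop around 'rabbitmqctl status'; a minimal sketch of that kind of loop is below (the interval and output path are illustrative, not our exact setup):

```python
# Hypothetical sketch of a periodic rabbitmqctl status capture loop.
# The output directory and interval below are placeholders, not our real config.
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

OUTPUT_DIR = Path("/var/log/rabbitmq-status")   # illustrative path
INTERVAL_SECONDS = 60                           # illustrative interval

def capture_status() -> None:
    """Run `rabbitmqctl status` and write its output to a timestamped file."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    result = subprocess.run(
        ["rabbitmqctl", "status"],
        capture_output=True,
        text=True,
    )
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    out_file = OUTPUT_DIR / f"rabbitmqctl-status-{timestamp}.txt"
    out_file.write_text(result.stdout + result.stderr)

if __name__ == "__main__":
    while True:
        capture_status()
        time.sleep(INTERVAL_SECONDS)
```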
Around these events we see a spike in load average, but no significant spike in CPU. One event had a memory spike, but a more recent event did not, so we're not sure whether that's relevant.
This RabbitMQ instance is relatively busy, running at around 1000% CPU on a 48-core machine.
My gut feeling is that we've simply reached a saturation point in message throughput and need to scale this out, but I'd like to validate that this is actually the cause, as the error messages aren't particularly explanatory on their own.
It would also be good to know whether we're somehow misconfigured in a way that limits our throughput. We've been through the RabbitMQ production guide and didn't see anything obvious there.