Not sure if it is related, but we've been having issues with a couple of nodes in the cluster going down and reloading all buckets. It happens around 9pm UTC and makes most requests to the cluster fail for about 2 minutes.
This is what I saw after the last issue:
Here is the cluster info:
Version: 2.2.0
Nodes: 5
Total RAM: 47GB
Total Disk: 1.4TB
Nothing special happens on the client side around that time (that I can see).
Side info:
I've noticed peaks in different charts:
- Minor page faults: 200k
- Disk write queue: 1.25M
- Disk Queue Items going from 550K to 350K
- We rely heavily on document expiry, expiring around 150k to 200k documents per hour (the expiry peak time does not match the time of the incidents).
I'm not sure where to start. We've limited access to this cluster to a minimum, but it's a pretty important piece of the main application, so any kind of help is welcome.
I'm adding screenshots of the charts in case that helps.
Sorry for the long mail and thanks in advance,