Ceph puts worker node under huge load average


Tiago Mendes

unread,
Nov 5, 2020, 8:35:26 AM11/5/20
to rook-dev
Once in a while we run into problems with Ceph OSDs without any apparent reason: pods on the worker node(s) start failing with FailedScheduling, and the nodes need to be drained or the whole cluster becomes stuck.
When this occurs the load average on the machine climbs to around 500, the cluster gets stuck creating new replicas of pods and PVCs, and Rancher becomes slow.
OS: Red Hat Enterprise Linux Server release 7.8 (Maipo), running on physical (bare-metal) nodes
Kubernetes version: v1.17.4
Docker version: 19.3.12 
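
A minimal sketch of the diagnostics we could collect when it happens (assuming kubectl is configured against the cluster; "worker-node-1" is a placeholder for the affected node):

#!/usr/bin/env python3
"""Rough diagnostics sketch; run on (or against) the affected node."""
import subprocess

NODE = "worker-node-1"  # placeholder node name, replace with the affected node

def run(cmd):
    print("$ " + " ".join(cmd))
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

# 1/5/15-minute load averages (run this script on the node itself)
run(["uptime"])

# Node conditions, allocatable resources and recent events
run(["kubectl", "describe", "node", NODE])

# Pods currently failing to schedule anywhere in the cluster
run(["kubectl", "get", "events", "--all-namespaces",
     "--field-selector", "reason=FailedScheduling"])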

Travis Nielsen

unread,
Nov 5, 2020, 2:34:34 PM11/5/20
to Tiago Mendes, rook-dev
It is certainly a concern if you are seeing a load spike like that. Please open a GitHub issue for this with as many details as you have about when you are observing it. OSD logs from around the time of the load spike may help, but we’ll need someone from core Ceph to take a look as well.
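
For example, something along these lines could pull the OSD logs and overall cluster health to attach to the issue (a rough sketch assuming the default rook-ceph namespace, the standard app=rook-ceph-osd pod label, and a deployed rook-ceph-tools toolbox; adjust for your install):

#!/usr/bin/env python3
"""Sketch for collecting OSD logs around the load spike."""
import subprocess

NS = "rook-ceph"  # assumed default Rook namespace

def kubectl(*args):
    return subprocess.run(["kubectl", "-n", NS, *args],
                          capture_output=True, text=True, check=True).stdout

# One log file per OSD pod, limited to the window around the spike
pods = kubectl("get", "pods", "-l", "app=rook-ceph-osd",
               "-o", "jsonpath={.items[*].metadata.name}").split()
for pod in pods:
    with open(pod + ".log", "w") as f:
        f.write(kubectl("logs", pod, "--since=2h"))

# Overall Ceph health from the toolbox pod
print(kubectl("exec", "deploy/rook-ceph-tools", "--", "ceph", "status"))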

Thanks,
Travis Nielsen
Rook Maintainer

