Continual DaemonSetsMissScheduled warnings

ber...@hippware.com

Apr 8, 2018, 9:59:42 PM
to CoreOS User
Hi group,
A few days back our Tectonic cluster started firing DaemonSetsMissScheduled warnings for the kube-state-metrics pod. They fire roughly every half hour, getting marked as resolved about 15-20 minutes later. I've tried killing the pod and having it restart, but the new pod just does the same thing. The whole thing is a bit of a black box to me, so any suggestions as to where to look would be very welcome. The details of the warning:

Labels:
 - alertname = DaemonSetsMissScheduled
 - daemonset = tectonic-torcx-post-update-hook
 - endpoint = https-main
 - instance = 10.2.1.102:8443
 - job = kube-state-metrics
 - namespace = tectonic-system
 - pod = kube-state-metrics-d649fbf5d-tktq2
 - service = kube-state-metrics
 - severity = warning
Annotations:
 - description = A number of daemonsets are running where they are not supposed to run.
 - summary = Daemonsets are not scheduled correctly
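
For anyone else digging into this: the alert is reported by kube-state-metrics, and as far as I can tell it is driven by the per-DaemonSet "misscheduled" gauge that kube-state-metrics exports. Running something like the query below in Prometheus should show which DaemonSet is being counted (metric name is from memory, so double-check it against your own rules file):

    kube_daemonset_status_number_misscheduled > 0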

Cheers,

Bernard

Daniel Norman

May 2, 2018, 4:40:05 AM
to CoreOS User
I'm experiencing the same thing on a Tectonic cluster.

Bernard Duggan

May 3, 2018, 8:14:17 PM
to Daniel Norman, CoreOS User

Hey Daniel (and anyone else hitting this),

We (or rather my colleague) eventually managed to resolve this - I'll post his writeup below and hopefully it will be of some help to you:

--------

First, the DaemonSet scheduler error seems to have been caused by pods stuck in the Terminating phase. Some background on terminology is in order. A DaemonSet (DS) is a workload controller that schedules a pod to run on all, or a selected subset, of the nodes in the cluster. There should be at most one instance of a DS pod on each node. The DS that was causing the error (tectonic-torcx-post-update-hook) is supposed to schedule pods onto cluster nodes that need restarting: if you look at the DS definition, the node selector is container-linux-update.v1.coreos.com/reboot-needed=true, which is set on nodes that have been updated and are waiting to reboot. The error didn't cause any issues for us because none of our nodes needed restarting.
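
For anyone wanting to check this on their own cluster, something along these lines should show the selector and whether any nodes currently match it (namespace and DS name taken from the alert labels above, adjust to suit):

    kubectl -n tectonic-system get daemonset tectonic-torcx-post-update-hook -o yaml | grep -A3 nodeSelector
    kubectl get nodes -l container-linux-update.v1.coreos.com/reboot-needed=true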

When I checked the status of the DS, I saw that it had six pods stuck in the Terminating phase (one for each node). The pods had been started the last time their respective nodes were updated. I saved a copy of the DS spec and then tried to delete the DS. It became stuck in the deleting phase, waiting for its pods to finish shutting down. Eventually I force-deleted the stuck pods, at which point the DS was removed. I then recreated the DS from the saved spec and everything appears to be running smoothly again.
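
Roughly, the commands involved were along these lines (reconstructed after the fact rather than copy-pasted, so treat them as a sketch):

    # find the pods stuck in Terminating
    kubectl -n tectonic-system get pods | grep Terminating

    # save the DS spec so it can be recreated later
    kubectl -n tectonic-system get daemonset tectonic-torcx-post-update-hook -o yaml > torcx-hook-ds.yaml

    # force-delete each stuck pod (skips graceful shutdown, so use with care)
    kubectl -n tectonic-system delete pod <pod-name> --grace-period=0 --force

    # delete the DS, then recreate it from the saved spec
    # (you may need to strip the status/resourceVersion fields from the saved yaml first)
    kubectl -n tectonic-system delete daemonset tectonic-torcx-post-update-hook
    kubectl -n tectonic-system apply -f torcx-hook-ds.yaml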

I don't know what caused the pods to be stuck, so we will need to keep an eye on this DS in the future in case this happens again.
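
If it helps, a quick way to keep an eye on it is the misscheduled count that the DS itself reports (from memory, kubectl describe shows this as "Number of Nodes Misscheduled"):

    kubectl -n tectonic-system describe daemonset tectonic-torcx-post-update-hook | grep -i misscheduled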

--------

Cheers,

Bernard



