Hey Daniel (and anyone else hitting this),
We (or rather my colleague) eventually managed to resolve this - I'll post his writeup below and hopefully it will be of some help to you:
--------
First, the DaemonSet scheduler error seems to have been caused by pods stuck in the terminating phase. Some background on terminology is in order. A DaemonSet (DS) is a type of workload resource that schedules pods to run on all or a subset of the nodes in the cluster. There should only ever be at most one instance of a DS pod on each node. The DS that was causing the error (tectonic-torx-post-update-hook) was supposed to schedule pods to run on cluster nodes that need restarting. If you look at the DS definition, the node selector is container-linux-update.v1.coreos.com/reboot-needed=true, a label that is set on nodes that have been updated and need restarting. The error didn't cause any issues for us because none of our nodes needed restarting.
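To make the node selector behaviour concrete, here is a rough Python sketch of the matching rule: a node is eligible for a DS pod only if it carries every label in the selector. The label key and value come from the DS definition mentioned above; the node names and the helper functions are made up for illustration and are not part of Kubernetes or Tectonic.

```python
# Label taken from the DS definition quoted above.
REBOOT_SELECTOR = {"container-linux-update.v1.coreos.com/reboot-needed": "true"}


def matches_selector(node_labels, selector):
    """Return True if every key/value pair in selector is present in node_labels."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(nodes, selector):
    """Return the names of nodes whose labels satisfy the selector."""
    return [name for name, labels in nodes.items() if matches_selector(labels, selector)]


# Hypothetical nodes: only node-a has been updated and is waiting for a reboot.
nodes = {
    "node-a": {"container-linux-update.v1.coreos.com/reboot-needed": "true"},
    "node-b": {"container-linux-update.v1.coreos.com/reboot-needed": "false"},
    "node-c": {},
}

print(eligible_nodes(nodes, REBOOT_SELECTOR))  # ['node-a']
```

When no node carries the label, the list is empty and the DS has nothing to schedule, which matches the situation described above where the error had no visible effect.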
When I checked the status of the DS I saw that it had six pods stuck in the terminating phase (one for each node). The pods were started the last time their respective nodes were updated. I downloaded the DS spec and tried to delete the DS. The DS then became stuck in the deleting phase as it waited for its pods to finish shutting down. Eventually, I forcibly killed the stuck pods and the DS was deleted. I then recreated the DS from the saved spec and everything appears to be running smoothly again.
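For anyone wanting to check for the same condition, the following is a minimal sketch (not the procedure my colleague used) of how stuck pods could be spotted. It assumes you have the parsed output of `kubectl get pods -o json` for the relevant namespace. A pod shown as terminating has `metadata.deletionTimestamp` set; the 10-minute threshold, the function name, and the pod names are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def stuck_terminating(pod_list, now, max_age=timedelta(minutes=10)):
    """Return names of pods whose deletionTimestamp is more than max_age in the past.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # deletion was never requested for this pod
        # Kubernetes emits RFC 3339 timestamps in UTC, e.g. 2017-10-01T09:00:00Z
        deleted_at = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - deleted_at > max_age:
            stuck.append(meta.get("name"))
    return stuck


# Hypothetical pod list: one long-stuck pod, one healthy pod, one still shutting down.
pods = {
    "items": [
        {"metadata": {"name": "hook-aaaaa", "deletionTimestamp": "2017-10-01T09:00:00Z"}},
        {"metadata": {"name": "hook-bbbbb"}},
        {"metadata": {"name": "hook-ccccc", "deletionTimestamp": "2017-10-02T11:55:00Z"}},
    ]
}
now = datetime(2017, 10, 2, 12, 0, 0, tzinfo=timezone.utc)

print(stuck_terminating(pods, now))  # ['hook-aaaaa']
```

One common way to force-remove such a pod is `kubectl delete pod <name> --grace-period=0 --force`, though the writeup does not record which command was actually used.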
I don't know what caused the pods to be stuck, so we will need to keep an eye on this DS in the future in case this happens again.
--------
Cheers,
Bernard