Kubernetes Cluster Scheduler fails


randal cobb

Aug 1, 2016, 8:41:29 AM
to CoreOS User
Hello, all,

I'm trying to get a Kubernetes cluster set up on 6 CoreOS nodes.  Everything seems to work fine when set up using instructions from here: https://coreos.com/kubernetes/docs/latest/deploy-master.html.  For some reason, either the Calico-node CNI or the Kube-Scheduler itself is failing to detect the proper network interface.  I get the following output from Journalctl:

Jul 31 22:46:33 infra0 kubelet-wrapper[3175]: I0801 02:46:33.761626    3175 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/host-path//usr/share
Jul 31 22:46:33 infra0 kubelet-wrapper[3175]: I0801 02:46:33.761754    3175 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/host-path//usr
Jul 31 22:46:33 infra0 kubelet-wrapper[3175]: I0801 02:46:33.761925    3175 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/host-path//etc
Jul 31 22:46:33 infra0 kubelet-wrapper[3175]: I0801 02:46:33.862395    3175 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/host-path//usr/share
Jul 31 22:46:33 infra0 kubelet-wrapper[3175]: I0801 02:46:33.862631    3175 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/host-path//usr
Jul 31 22:46:34 infra0 kubelet-wrapper[3175]: E0801 02:46:34.032349    3175 docker_manager.go:345] NetworkPlugin cni failed on the status hook for pod 'kube-scheduler-10.1
Jul 31 22:46:34 infra0 kubelet-wrapper[3175]:  with error: exit status 1
Jul 31 22:46:34 infra0 kubelet-wrapper[3175]: W0801 02:46:34.467274    3175 docker_manager.go:1380] No ref for pod '"bd8c9f70dad67a026bcf2a2c98a447730cdaf671daab2716e9947b
Jul 31 22:46:34 infra0 kubelet-wrapper[3175]: W0801 02:46:34.611620    3175 docker_manager.go:1380] No ref for pod '"b3c71a48a1cacd2d914bccfe40593666036b6ef41e678b44866182
Jul 31 22:46:35 infra0 kubelet-wrapper[3175]: E0801 02:46:35.095388    3175 docker_manager.go:345] NetworkPlugin cni failed on the status hook for pod 'kube-scheduler-10.1
Jul 31 22:46:35 infra0 kubelet-wrapper[3175]:  with error: exit status 1
Jul 31 22:46:36 infra0 kubelet-wrapper[3175]: W0801 02:46:36.415003    3175 docker_manager.go:1380] No ref for pod '"bd8c9f70dad67a026bcf2a2c98a447730cdaf671daab2716e9947b
Jul 31 22:46:36 infra0 kubelet-wrapper[3175]: W0801 02:46:36.418328    3175 docker_manager.go:1380] No ref for pod '"b3c71a48a1cacd2d914bccfe40593666036b6ef41e678b44866182
Jul 31 23:33:08 infra0 kubelet-wrapper[3175]: E0801 03:33:08.855740    3175 kubelet.go:1899] Deleting mirror pod "kube-scheduler-10.1.10.110_kube-system(2058ac4a-5792-11e6
Jul 31 23:33:08 infra0 kubelet-wrapper[3175]: W0801 03:33:08.892731    3175 status_manager.go:447] Failed to update status for pod "_()": Operation cannot be fulfilled on
Jul 31 23:33:08 infra0 kubelet-wrapper[3175]: I0801 03:33:08.915472    3175 docker_manager.go:1834] pod "kube-scheduler-10.1.10.110_kube-system(0be27240583c9b189c7a18173fc
Jul 31 23:33:09 infra0 kubelet-wrapper[3175]: E0801 03:33:09.560590    3175 docker_manager.go:345] NetworkPlugin cni failed on the status hook for pod 'kube-scheduler-10.1
Jul 31 23:33:09 infra0 kubelet-wrapper[3175]:  with error: exit status 1

Earlier in the log, I was seeing references to "eth0" not being found (my adapter is named 'ens18'); from Googling, it appears that is more of a warning than an error, but I could be wrong.  Regardless, the scheduler isn't running, and the verification steps give the following output:

ROSELCDV0070481:~ cobbr$ kubectl get pods --namespace=kube-system
NAME                                    READY     STATUS    RESTARTS   AGE
kube-apiserver-10.1.10.110              1/1       Running   1          16h
kube-controller-manager-10.1.10.110     1/1       Running   1          16h
kube-dns-v11-71w7j                      0/4       Pending   0          12h
kube-proxy-10.1.10.110                  1/1       Running   1          16h
kube-scheduler-10.1.10.110              1/1       Running   1          9h
kubernetes-dashboard-3717423461-0pecu   0/1       Pending   0          12h

and 

ROSELCDV0070481:~ cobbr$ kubectl get nodes
NAME          STATUS                     AGE
10.1.10.110   Ready,SchedulingDisabled   16h

I believe the "SchedulingDisabled" and constant "Pending" states of pods are due to the kube-scheduler not running; so, any help in identifying a way to get the scheduler running would be greatly appreciated!

Brandon Philips

Aug 3, 2016, 1:05:44 AM
to randal cobb, CoreOS User, Tom Denham, Stefan Junker
Hey Randal-

Let's see if I can help!

On Mon, Aug 1, 2016 at 5:41 AM randal cobb <rco...@gmail.com> wrote:
I'm trying to get a Kubernetes cluster set up on 6 CoreOS nodes.  Everything seems to work fine when set up using instructions from here: https://coreos.com/kubernetes/docs/latest/deploy-master.html.

What environment are you doing this in?
It looks like the scheduler is in fact running, which is good.

The reason that 10.1.10.110 says "SchedulingDisabled" is because this is the controller node. The controller node is marked as "SchedulingDisabled" so that non-control workloads don't get scheduled to it. The control workloads of apiserver, controller-manager, proxy, and scheduler are all running which is great!

I am confused about the CNI failed hook, so let me cc in a couple of folks who can help. If you could follow up with some details on your configuration and environment so far, that would be helpful.

In the meantime, if you want the "easy mode" cluster, I would suggest:


Thanks!

Brandon

randal cobb

Aug 3, 2016, 2:17:02 PM
to CoreOS User, rco...@gmail.com, t...@tigera.io, stefan...@coreos.com
Brandon,

I have a cluster of 6 CoreOS machines using etcd2, which are clustered and working properly.  The CoreOS version is the latest (as of right now) stable channel; I believe it's 1068.6.0.
I am using the Kubernetes container method (v1.3.3coreos.0) on all nodes, with Calico.  The host hardware is a cluster of 3 physical Proxmox hosts, with the 6 CoreOS machines running as VMs on the cluster.  This is just a prototype environment, so I'm not too worried if I have to scrap it all and reconfigure from scratch; I'm trying to teach myself the concepts and infrastructure of it all.

However, I have made some progress... I found that I had somehow missed the worker-kubeconfig.yaml step, and the file was being created as a directory automatically on worker kubelet launch. Now that I have the proper yaml file in place, the workers are registering with the master as minions and show up in the list properly.

I still see issues deploying the kube-dns and kubernetes-dashboard containers, but I think they may be related to something amiss in my configuration.  The dashboard constantly fails with some sort of certificate issue, and the kube-dns container fails with a missing skydns setting or something.  I'm still floundering trying to solve those two issues at this point.

I also think I have some networking issues to iron out as well... I deployed the guestbook application after I got the minion issues settled, but I couldn't route to the guestbook app properly. I can work on that once I get the Kubernetes issues resolved.

Brandon Philips

Aug 3, 2016, 2:30:13 PM
to randal cobb, CoreOS User, t...@tigera.io, stefan...@coreos.com
On Wed, Aug 3, 2016 at 11:17 AM randal cobb <rco...@gmail.com> wrote:
However, I have made some progress... I found that I somehow missed the worker-kubeconfig.yaml step and it was getting created as a directory automatically on the worker kubelet launch, but now that I have the proper yaml file in place, the workers are registering with the master as minions, and do, now, show up in the list properly.

Great, good catch!
 
I still see the issues deploying the kube-dns and kubernetes-dashboard containers, but I think they may be related to something amiss in my configuration.  Dashboard constantly fails with some sort of certificate issue, and the kube-dns container fails with a missing skydns or something setting.  I'm still floundering trying to solve those 2 issues at this point.

Can you paste in the logs for dashboard and kube-dns so we can try and figure it out?
 
I also think I have some networking issues to iron out, as well... I deployed the guestbook application after I got the minion issues settled, and I couldn't route to the guestbook app properly, but I can work on those once I get the kubernetes issues resolved.

Hrm, what is the problem you encountered? Since you are on bare metal and, I assume, using an overlay network, you will need to expose a NodePort (http://kubernetes.io/docs/user-guide/services/#type-nodeport) and then hit the service on http://$routable-ip-of-any-machine:$node-port
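For reference, a NodePort service manifest looks something like this (a minimal sketch; the name, selector, and port numbers here are hypothetical, not taken from the actual guestbook manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend            # hypothetical service name
spec:
  type: NodePort
  selector:
    app: guestbook          # must match the labels on the frontend pods
    tier: frontend
  ports:
    - port: 80              # service port inside the cluster
      targetPort: 80        # container port on the pods
      nodePort: 30080       # optional; auto-assigned from 30000-32767 if omitted
```

After a "kubectl create -f" on a file like that, the app should answer on http://<any-node-ip>:30080.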

Cheers,

Brandon 

 
--
You received this message because you are subscribed to the Google Groups "CoreOS User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to coreos-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

randal cobb

Aug 3, 2016, 3:55:19 PM
to CoreOS User, rco...@gmail.com, t...@tigera.io, stefan...@coreos.com


On Wednesday, August 3, 2016 at 2:30:13 PM UTC-4, Brandon Philips wrote:
Can you paste in the logs for dashboard and kube-dns so we can try and figure it out?

Here's what I've found for kube-dns-v11...  I used "delete" and re-created it just to get a better feel for what is happening.  I just finished the "create" from the .yml file with the adjustment for my DNS IP address (10.1.12.10), and it sits in this state for about 5-10 minutes:
ROSELCDV0070481:.kubernetes cobbr$ kubectl get pods --namespace=kube-system
NAME                                  READY     STATUS    RESTARTS   AGE
kube-apiserver-10.1.10.110            1/1       Running   0          17h
kube-controller-manager-10.1.10.110   1/1       Running   9          1d
kube-dns-v11-6qoh5                    3/4       Running   2          3m
kube-proxy-10.1.10.110                1/1       Running   7          1d
kube-proxy-10.1.10.111                1/1       Running   6          19h
kube-proxy-10.1.10.112                1/1       Running   26         1d
kube-proxy-10.1.10.113                1/1       Running   1          15h
kube-proxy-10.1.10.114                1/1       Running   289        1d
kube-proxy-10.1.10.115                1/1       Running   289        1d
kube-scheduler-10.1.10.110            1/1       Running   9          1d

...notice the ready state of 3/4...  It then goes into CrashLoopBackOff state.  The logs are:

ROSELCDV0070481:.kubernetes cobbr$ kubectl logs kube-dns-v11-6qoh5 --namespace=kube-system
Error from server: a container name must be specified for pod kube-dns-v11-6qoh5, choose one of: [etcd kube2sky skydns healthz]

For dashboard, it goes almost directly to CrashLoopBackOff state:
ROSELCDV0070481:.kubernetes cobbr$ kubectl create -f kubernetes-dashboard.yaml
deployment "kubernetes-dashboard" created
You have exposed your service on an external port on all nodes in your
cluster.  If you want to expose this service to the external internet, you may
need to set up firewall rules for the service port(s) (tcp:32560) to serve traffic.

See http://releases.k8s.io/release-1.3/docs/user-guide/services-firewalls.md for more details.
service "kubernetes-dashboard" created
ROSELCDV0070481:.kubernetes cobbr$ kubectl get pods --namespace=kube-system
NAME                                    READY     STATUS             RESTARTS   AGE
kube-apiserver-10.1.10.110              1/1       Running            0          18h
kube-controller-manager-10.1.10.110     1/1       Running            9          1d
kube-dns-v11-6qoh5                      3/4       CrashLoopBackOff   7          15m
kube-proxy-10.1.10.110                  1/1       Running            7          1d
kube-proxy-10.1.10.111                  1/1       Running            6          19h
kube-proxy-10.1.10.112                  1/1       Running            26         1d
kube-proxy-10.1.10.113                  1/1       Running            1          16h
kube-proxy-10.1.10.114                  1/1       Running            289        1d
kube-proxy-10.1.10.115                  1/1       Running            289        1d
kube-scheduler-10.1.10.110              1/1       Running            9          1d
kubernetes-dashboard-3954469829-nyvie   1/1       Running            1          10s
ROSELCDV0070481:.kubernetes cobbr$ kubectl get pods --namespace=kube-system
NAME                                    READY     STATUS             RESTARTS   AGE
kube-apiserver-10.1.10.110              1/1       Running            0          18h
kube-controller-manager-10.1.10.110     1/1       Running            9          1d
kube-dns-v11-6qoh5                      3/4       CrashLoopBackOff   7          15m
kube-proxy-10.1.10.110                  1/1       Running            7          1d
kube-proxy-10.1.10.111                  1/1       Running            6          19h
kube-proxy-10.1.10.112                  1/1       Running            26         1d
kube-proxy-10.1.10.113                  1/1       Running            1          16h
kube-proxy-10.1.10.114                  1/1       Running            289        1d
kube-proxy-10.1.10.115                  1/1       Running            289        1d
kube-scheduler-10.1.10.110              1/1       Running            9          1d
kubernetes-dashboard-3954469829-nyvie   0/1       CrashLoopBackOff   1          20s

... with these logs:
ROSELCDV0070481:.kubernetes cobbr$ kubectl logs kubernetes-dashboard-3954469829-nyvie --namespace=kube-system
Starting HTTP server on port 9090
Creating API server client for http://10.1.12.1:8080
Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get http://10.1.12.1:8080/version: dial tcp 10.1.12.1:8080: getsockopt: no route to host

My routable network is 10.1.10.0/20, with the POD network being 10.1.11.0/20 and the SERVICE network being 10.1.12.0/20; so I generated all certs using these, where appropriate (and in all config values for services and manifests).
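One thing worth checking in those ranges: as written, the three /20s are not actually disjoint. None of them is aligned on a /20 boundary, and masking off the host bits collapses all three to the very same network, 10.1.0.0/20, so the node, pod, and service ranges fully overlap, which may well be part of the trouble here. A quick check with Python's standard library, using the exact strings quoted above:

```python
import ipaddress

# The three ranges as quoted above.  Note that none of these addresses
# is aligned on a /20 boundary, which is itself a hint something is off.
ranges = ["10.1.10.0/20", "10.1.11.0/20", "10.1.12.0/20"]

# strict=False masks off the host bits to reveal the actual network.
nets = [ipaddress.ip_network(r, strict=False) for r in ranges]
for r, n in zip(ranges, nets):
    print(f"{r} is really {n}")
# All three print "... is really 10.1.0.0/20": they are the same network.

# Confirm every pair of ranges overlaps.
for i in range(len(nets)):
    for j in range(i + 1, len(nets)):
        assert nets[i].overlaps(nets[j])
```

(The renumbering later in the thread to 10.1.x.x / 10.2.x.x / 10.3.x.x removes exactly this overlap.)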

Brandon Philips

Aug 3, 2016, 4:00:59 PM
to randal cobb, CoreOS User, t...@tigera.io, stefan...@coreos.com
Hello Randal-

My hunch is that the location of your API server, http://10.1.12.1:8080, is not actually correct. This is why kube2sky and the dashboard aren't working.

Can you SSH into one of the hosts and confirm that http://10.1.12.1:8080 works?

Thank You,

Brandon

randal cobb

Aug 3, 2016, 4:34:14 PM
to CoreOS User, rco...@gmail.com, t...@tigera.io, stefan...@coreos.com
Yeah, this is also why I think I have a network config issue... the 10.1.12.x/20 network isn't routable.  I get this from any of my machines:

core@kube-worke-5 ~ $ curl http://10.1.12.1:8080
curl: (7) Failed to connect to 10.1.12.1 port 8080: No route to host

randal cobb

Aug 3, 2016, 4:43:03 PM
to CoreOS User, rco...@gmail.com, t...@tigera.io, stefan...@coreos.com
This is also interesting...  on the controller node, I can curl to port 8080 on the lo interface, but not the public routable interface:
core@kubernetes ~ $ curl http://127.0.0.1:8080
{
  "paths": [
    "/api",
    "/api/v1",
    "/apis",
    "/apis/apps",
    "/apis/apps/v1alpha1",
    "/apis/autoscaling",
    "/apis/autoscaling/v1",
    "/apis/batch",
    "/apis/batch/v1",
    "/apis/batch/v2alpha1",
    "/apis/extensions",
    "/apis/extensions/v1beta1",
    "/apis/policy",
    "/apis/policy/v1alpha1",
    "/apis/rbac.authorization.k8s.io",
    "/apis/rbac.authorization.k8s.io/v1alpha1",
    "/healthz",
    "/healthz/ping",
    "/logs/",
    "/metrics",
    "/swaggerapi/",
    "/ui/",
    "/version"
  ]
}core@kubernetes ~ $ curl http://10.1.10.110:8080
curl: (7) Failed to connect to 10.1.10.110 port 8080: Connection refused
core@kubernetes ~ $

is that by design?
Brandon


Rob Szumski

Aug 3, 2016, 4:54:35 PM
to randal cobb, CoreOS User, t...@tigera.io, stefan...@coreos.com
Yes, Kubernetes listens insecurely only on the box, but uses 443/TLS for communicating between machines.

randal cobb

Aug 3, 2016, 4:59:01 PM
to CoreOS User, rco...@gmail.com, t...@tigera.io, stefan...@coreos.com
Ah... ok.  Expected behavior, then.

randal cobb

Aug 3, 2016, 6:26:00 PM
to CoreOS User, rco...@gmail.com, t...@tigera.io, stefan...@coreos.com
OK, I tried this... I changed my subnet mask to 255.252.0.0 (and updated all nodes) and changed all network configs back to the default values: 10.1.x.x publicly routable, 10.2.x.x for the pod network, and 10.3.x.x for the service network. I still can't curl to 10.3.0.1:8080 from any of the minions (or the controller, for that matter).  I'm NOT a network specialist, but do I need to add a route to the 10.3.x.x subnet?  My routing table on the 6 nodes is:

kubernetes ~ # route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.1.10.1       0.0.0.0         UG    0      0        0 ens18
10.0.0.0        0.0.0.0         255.252.0.0     U     0      0        0 ens18
10.2.0.0        0.0.0.0         255.255.0.0     U     0      0        0 flannel.1
10.2.58.0       0.0.0.0         255.255.255.0   U     0      0        0 docker0


...and is the same on all 6 nodes in the cluster.
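For what it's worth, this table seems to explain the exact symptom: 10.3.0.1 falls inside the on-link 10.0.0.0/255.252.0.0 route, so the kernel tries to ARP for it directly on ens18, nothing answers, and curl reports "No route to host". Service IPs like 10.3.0.1 are virtual; they only work where kube-proxy's iptables rules rewrite them, and the apiserver's service IP is exposed on its secure port 443, not 8080. A small sketch of the kernel's longest-prefix match over the routes quoted above (Python stdlib; the route entries are copied from the output, everything else is illustrative):

```python
import ipaddress

# Routing table quoted above, as (destination, genmask, where-it-goes).
routes = [
    ("0.0.0.0",   "0.0.0.0",       "default via 10.1.10.1 on ens18"),
    ("10.0.0.0",  "255.252.0.0",   "on-link ens18"),
    ("10.2.0.0",  "255.255.0.0",   "flannel.1"),
    ("10.2.58.0", "255.255.255.0", "docker0"),
]

def best_route(ip):
    """Longest-prefix match, the way the kernel picks a route."""
    addr = ipaddress.ip_address(ip)
    candidates = []
    for dest, mask, via in routes:
        net = ipaddress.ip_network(f"{dest}/{mask}")
        if addr in net:
            candidates.append((net.prefixlen, via))
    return max(candidates)[1]

# 10.3.0.1 matches the /14 on-link route, so the kernel ARPs for it
# on ens18 and nothing answers -- hence "No route to host".
print(best_route("10.3.0.1"))   # -> on-link ens18
print(best_route("10.2.58.7"))  # -> docker0
print(best_route("8.8.8.8"))    # -> default via 10.1.10.1 on ens18
```

So no extra route is needed for 10.3.x.x; reaching service IPs is kube-proxy's job, not the routing table's.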


randal cobb

Aug 4, 2016, 9:57:24 AM
to CoreOS User
OK... I finally figured it all out on my own...  Here's the poop:

1) It finally clicked what Kubernetes was telling me when I was trying to get the logs for the dns-addon.  When it said:
ROSELCDV0070481:.kubernetes cobbr$ kubectl logs kube-dns-v11-o5f7i --namespace=kube-system
Error from server: a container name must be specified for pod kube-dns-v11-o5f7i, choose one of: [etcd kube2sky skydns healthz]
it was asking me to specify a given container name, such as 'etcd', 'kube2sky', 'skydns', or 'healthz', to see the logs for.  Googling this led me to look at the 'kube2sky' container using this command to actually see the relevant log information I needed:  'kubectl logs kube-dns-v11-o5f7i kube2sky --namespace=kube-system'

Those logs finally led me to the fact that the kube2sky container wasn't able to properly authenticate to the kube-apiserver.  More Googling led me to an identical situation that was listed as a "bug" in an earlier version of Kubernetes (1.1.0, I believe), where the generated auth token was being cached by etcd and the old cached token was being used instead of the current one.  So, I basically shut down the entire cluster, purged all entries from etcd, re-created the namespaces (because I stoopidly purged them, too), restarted the cluster, and re-deployed the dns-addon.  Voila!  It came up on the first try!  So, I rolled the dice and redeployed the dashboard, and BOOM!  It came up on the first try, too!

I think I now have a stable environment to play with!

Thanks, all, for the help!

John Griessen

Aug 4, 2016, 2:13:56 PM
to coreo...@googlegroups.com
On 08/04/2016 08:57 AM, randal cobb wrote:
> I think I now have a stable environment to play with!

If you make a howto for this, I'd like to try it on my 6 bare metal machines too.
It's not urgent -- they heat up my office when they're all on 24 hours a day, so
they're just for testing and learning.

Rob Szumski

Aug 4, 2016, 2:46:38 PM
to John Griessen, coreo...@googlegroups.com
@john I think Randal ended up following https://coreos.com/kubernetes/docs/latest/deploy-master.html and his error was randomly transient. Check out that guide and let the group know how it goes.

@randal Glad you got it working!

- Rob

Brandon Philips

Aug 4, 2016, 3:02:24 PM
to randal cobb, CoreOS User
On Thu, Aug 4, 2016 at 6:57 AM randal cobb <rco...@gmail.com> wrote:
Those logs finally led me to the fact that the kube2sky container wasn't able to properly authenticate to the kube-apiserver.  More Googling led me to an identical situation that was listed as a "bug" in an earlier version of Kubernetes (1.1.0, I believe), where the generated auth token was being cached by etcd and the old cached token was being used instead of the current one.  So, I basically shut down the entire cluster, purged all entries from etcd, re-created the namespaces (because I stoopidly purged them, too), restarted the cluster, and re-deployed the dns-addon.  Voila!  It came up on the first try!  So, I rolled the dice and redeployed the dashboard, and BOOM!  It came up on the first try, too!

Ah, yes, this has caught many users in the past. We need to fix it.
 
I think I now have a stable environment to play with!

Awesome! Let us know how it goes.

Cheers,

Brandon 

randal cobb

Aug 4, 2016, 11:14:43 PM
to Brandon Philips, CoreOS User

I'll let you know how it goes.
