Hello,
I have installed Seldon on my Linux machine to explore the recommendation example (MovieLens 100K), and
the "ml100k-import-6wt31" pod is not starting. It shows a "FailedScheduling" warning
with the message "no nodes available to schedule pods".
I do not think it is a memory problem, as the system has sufficient free memory. I suspect that the
pods and services inside the minikube node are not able to communicate with each other due to a DNS or IP
issue.
I would appreciate your help fixing this. Below are the steps I took and the logs.
Logs attached:
- kubectl describe pod ("ml100k-import-6wt31")
- minikube nodes
- minikube pods
- minikube services
- minikube jobs
- minikube service url
- curl response from seldon server
- nslookup output
- Linux system information
- seldon Server log
What I have done so far:
1) For single-machine exploration, I installed minikube.
2) Started minikube with 12GB of memory
minikube start --memory=12000
3) Created the Kubernetes job to download the MovieLens data
kubectl create -f ml100k-import.json
4) Tried searching the internet for a fix, but no luck.
------------------------------- ISSUE -------------------------------
When I create the Kubernetes job (kubectl create -f ml100k-import.json) to download the MovieLens data, I see a "FailedScheduling" warning with the message "no nodes available to schedule pods", even though the minikube node is available. The ml100k pod has been in the Pending state for 20h.
ml100k-import-6wt31 0/1 Pending 0 20h
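For anyone reproducing this, a minimal sketch of a filter that pulls the Pending pods out of the pod listing. The function name pending_pods is made up for illustration, and the sketch assumes the default "kubectl get pods" column order (NAME READY STATUS RESTARTS AGE).

```shell
# Read `kubectl get pods` output on stdin and print the names of Pending pods.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
# pending_pods is a hypothetical helper name, not part of kubectl or Seldon.
pending_pods() {
  # Skip the header row, then match on the STATUS column.
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Against a live cluster (not run here):
#   kubectl get pods | pending_pods
```

With the listing shown in this report, the only name it should print is ml100k-import-6wt31.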
---------- kubectl describe pod ml100k-import-6wt31 ------------
root@sprod:~# kubectl describe pod ml100k-import-6wt31
Name: ml100k-import-6wt31
Namespace: default
Node: <none>
Labels: controller-uid=5c869f84-61e4-11e7-bc4b-525400d2a1fd
job-name=ml100k-import
Annotations:
kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"Job","namespace":"default","name":"ml100k-import","uid":"5c869f84-61e4-11e7-bc4b-525400d2a1fd","apiVersion...
Status: Pending
IP:
Created By: Job/ml100k-import
Containers:
ml100k-create:
Image: seldonio/examples-ml100k:2.2.5_v2
Port: <none>
Command:
/create_ml100k_recommender.sh
Environment:
GRAFANA_ADMIN_PASSWORD: <set to the key 'grafana-admin-password.txt' in secret 'grafana-admin-password'> Optional: false
Mounts:
/seldon-data from data-volume (rw)
Conditions:
Type Status
PodScheduled False
Volumes:
data-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: seldon-claim
ReadOnly: false
default-token-jh0v8:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-jh0v8
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
20h 9s 4175 default-scheduler Warning FailedScheduling no nodes available to schedule pods
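The describe output above shows that the pod mounts a PersistentVolumeClaim (seldon-claim). Whether the claim has anything to do with the scheduling failure is only an assumption, but its state is cheap to rule out. A small sketch, where claim_names is a hypothetical helper name:

```shell
# Read `kubectl describe pod` output on stdin and print every
# PersistentVolumeClaim name the pod references.
# claim_names is a hypothetical helper name used only for this sketch.
claim_names() {
  awk '$1 == "ClaimName:" { print $2 }'
}

# Against a live cluster (not run here), show the status of each claim:
#   kubectl describe pod ml100k-import-6wt31 | claim_names | xargs -n1 kubectl get pvc
```

A claim that is in use normally shows STATUS Bound in the "kubectl get pvc" output.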
---------- minikube nodes ------------
root@sprod:~# kubectl get nodes --output=wide
NAME STATUS AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION
minikube Ready 21h v1.6.4 <none> Buildroot 2017.02 4.9.13
---------- minikube pods ------------
root@sprod:~# kubectl get pods --output=wide
NAME READY STATUS RESTARTS AGE IP NODE
influxdb-grafana-842592602-mqznp 2/2 Running 0 21h 172.17.0.12 minikube
kafka-controller-1424591021-jqxpk 1/1 Running 0 21h 172.17.0.13 minikube
kafka-stream-impressions-433453233-4vg67 1/1 Running 0 21h 172.17.0.20 minikube
kafka-stream-predictions-398714225-v0m8m 1/1 Running 0 21h 172.17.0.21 minikube
memcached1-2136693305-hpjrh 1/1 Running 0 21h 172.17.0.5 minikube
memcached2-2533120572-q04wm 1/1 Running 0 21h 172.17.0.6 minikube
ml100k-import-6wt31 0/1 Pending 0 21h <none> <none>
mysql-3966129362-jh1fr 1/1 Running 0 21h 172.17.0.4 minikube
redis-1963070708-qtf5w 1/1 Running 0 21h 172.17.0.7 minikube
seldon-control-2707388371-j3xxb 1/1 Running 0 21h 172.17.0.11 minikube
seldon-server-3494098190-tn6kh 3/3 Running 0 21h 172.17.0.19 minikube
spark-master-controller-3720462731-rbcqd 1/1 Running 0 21h 172.17.0.15 minikube
spark-ui-proxy-controller-1688034969-v0q8w 2/2 Running 0 21h 172.17.0.18 minikube
spark-worker-controller-3381690000-5jvqq 1/1 Running 0 21h 172.17.0.17 minikube
spark-worker-controller-3381690000-5l27w 1/1 Running 0 21h 172.17.0.16 minikube
td-agent-server-3988194731-c3j6k 1/1 Running 0 21h 172.17.0.14 minikube
zookeeper1-467704625-gk8xf 1/1 Running 0 21h 172.17.0.9 minikube
zookeeper2-1006738229-c35f9 1/1 Running 0 21h 172.17.0.8 minikube
zookeeper3-1545771833-8p7lr 1/1 Running 0 21h 172.17.0.10 minikube
---------- minikube services ------------
root@sprod:~# kubectl get services --output=wide
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kafka-service 10.0.0.149 <nodes> 9092:30010/TCP 21h app=kafka
kubernetes 10.0.0.1 <none> 443/TCP 21h <none>
memcached1 10.0.0.197 <none> 11211/TCP 21h name=memcached1
memcached2 10.0.0.191 <none> 11211/TCP 21h name=memcached2
monitoring-grafana 10.0.0.189 <pending> 80:30002/TCP 21h name=influxGrafana
monitoring-influxdb 10.0.0.221 <none> 8083/TCP,8086/TCP 21h name=influxGrafana
mysql 10.0.0.21 <none> 3306/TCP 21h name=mysql
redis 10.0.0.207 <none> 6379/TCP 21h name=redis
seldon-server 10.0.0.254 <nodes> 80:30015/TCP,5000:30017/TCP 21h name=seldon-server
spark-master 10.0.0.102 <none> 7077/TCP 21h component=spark-master
spark-ui-proxy 10.0.0.22 <pending> 8000:30005/TCP 21h component=spark-ui-proxy
spark-webui 10.0.0.83 <none> 8080/TCP 21h component=spark-master
td-agent-server 10.0.0.93 <none> 24224/TCP,24224/UDP 21h name=td-agent-server
zookeeper-1 10.0.0.223 <none> 2181/TCP,2888/TCP,3888/TCP 21h server-id=1
zookeeper-2 10.0.0.68 <none> 2181/TCP,2888/TCP,3888/TCP 21h server-id=2
zookeeper-3 10.0.0.134 <none> 2181/TCP,2888/TCP,3888/TCP 21h server-id=3
---------- minikube jobs ------------
root@sprod:~# kubectl get jobs
NAME DESIRED SUCCESSFUL AGE
ml100k-import 1 0 20h
---------- minikube service url ------------
No URL is returned for any service except seldon-server:
root@sprod:~# minikube service --url seldon-server
minikube service --url spark-master
-- no output
---------- curl response from seldon server ------------
When I try to curl the Spark master or ZooKeeper (using their endpoint addresses) from the seldon server,
I get an empty reply from the server.
root@sprod:~# kubectl exec -ti seldon-control-2707388371-j3xxb -- /bin/bash
curl: (52) Empty reply from server
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
</BODY></HTML>
---------- nslookup output ------------
root@sprod:~# kubectl exec -ti seldon-control-2707388371-j3xxb -- /bin/bash
root@seldon-control-2707388371-j3xxb:/home/seldon# nslookup zookeeper-1
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: zookeeper-1.default.svc.cluster.local
Address: 10.0.0.223
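In the output above, zookeeper-1 resolves to 10.0.0.223, which is the same CLUSTER-IP listed for zookeeper-1 in the services table, so name resolution inside the cluster appears to work at least for this service. A rough sketch for repeating that comparison on other services; resolved_address is a hypothetical helper name, and it assumes the nslookup output format shown here.

```shell
# Read nslookup output on stdin and print the resolved address.
# The first "Address:" line is the DNS server itself (10.0.0.10#53 above),
# so keep the last one seen. resolved_address is a hypothetical helper name.
resolved_address() {
  awk '$1 == "Address:" { addr = $2 } END { if (addr != "") print addr }'
}

# Against a live cluster (not run here), compare with the service's cluster IP:
#   kubectl get service zookeeper-1 -o jsonpath='{.spec.clusterIP}'
```

If the two addresses differ for some service, that would point at DNS; if they match, the problem is more likely elsewhere.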
---------- Linux system information ------------
root@sprod:~# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.10
DISTRIB_CODENAME=yakkety
DISTRIB_DESCRIPTION="Ubuntu 16.10"
NAME="Ubuntu"
VERSION="16.10 (Yakkety Yak)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.10"
VERSION_ID="16.10"
VERSION_CODENAME=yakkety
UBUNTU_CODENAME=yakkety
root@sprod:~# uname -a
Linux sprod 4.8.0-58-generic #63-Ubuntu SMP Mon Jun 26 17:08:21 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@sprod:~# free -m
total used free shared buff/cache available
Mem: 64379 13037 29760 664 21581 50025
Swap: 65488 0 65488
---------- seldon Server log ------------
root@sprod:~# kubectl logs seldon-server-3494098190-tn6kh seldon-server
Jul 06, 2017 12:12:16 AM org.apache.catalina.startup.SetAllPropertiesRule begin
WARNING: [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'maxSpareThreads' to '100' did not find a matching property.
Jul 06, 2017 12:12:16 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
Jul 06, 2017 12:12:17 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-8080"]
Jul 06, 2017 12:12:17 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-8009"]
Jul 06, 2017 12:12:17 AM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 1772 ms
Jul 06, 2017 12:12:17 AM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
Jul 06, 2017 12:12:17 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.78
Jul 06, 2017 12:12:17 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /usr/local/tomcat/webapps/examples
Jul 06, 2017 12:12:18 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory /usr/local/tomcat/webapps/examples has finished in 910 ms
Jul 06, 2017 12:12:18 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /usr/local/tomcat/webapps/ROOT
Jul 06, 2017 12:12:18 AM org.apache.catalina.loader.WebappClassLoaderBase validateJarFile
INFO: validateJarFile(/usr/local/tomcat/webapps/ROOT/WEB-INF/lib/servlet-api-2.5-20081211.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
Jul 06, 2017 12:12:18 AM org.apache.catalina.loader.WebappClassLoaderBase validateJarFile
INFO: validateJarFile(/usr/local/tomcat/webapps/ROOT/WEB-INF/lib/servlet-api-2.5.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
Jul 06, 2017 12:12:31 AM org.apache.catalina.startup.TldConfig execute
INFO: At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/seldon/ROOT/WEB-INF/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/seldon/ROOT/WEB-INF/lib/slf4j-log4j12-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2017-07-06 00:12:45.546 INFO net.spy.memcached.MemcachedConnection: Added {QA sa=memcached1/10.0.0.197:11211, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0} to connect queue
2017-07-06 00:12:45.547 INFO net.spy.memcached.MemcachedConnection: Added {QA sa=memcached2/10.0.0.191:11211, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0} to connect queue
2017-07-06 00:12:45.551 INFO net.spy.memcached.MemcachedConnection: Added {QA sa=memcached1/10.0.0.197:11211, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0} to connect queue
2017-07-06 00:12:45.552 INFO net.spy.memcached.MemcachedConnection: Added {QA sa=memcached2/10.0.0.191:11211, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0} to connect queue
2017-07-06 00:12:45.555 INFO net.spy.memcached.MemcachedConnection: Connection state changed for sun.nio.ch.SelectionKeyImpl@43ce1ee7
2017-07-06 00:12:45.556 INFO net.spy.memcached.MemcachedConnection: Connection state changed for sun.nio.ch.SelectionKeyImpl@7e231328
2017-07-06 00:12:45.561 INFO net.spy.memcached.MemcachedConnection: Connection state changed for sun.nio.ch.SelectionKeyImpl@63971701
2017-07-06 00:12:45.564 INFO net.spy.memcached.MemcachedConnection: Connection state changed for sun.nio.ch.SelectionKeyImpl@38fb0801
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory /usr/local/tomcat/webapps/ROOT has finished in 32,324 ms
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /usr/local/tomcat/webapps/host-manager
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory /usr/local/tomcat/webapps/host-manager has finished in 176 ms
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /usr/local/tomcat/webapps/manager
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory /usr/local/tomcat/webapps/manager has finished in 98 ms
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /usr/local/tomcat/webapps/docs
Jul 06, 2017 12:12:50 AM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory /usr/local/tomcat/webapps/docs has finished in 61 ms
Jul 06, 2017 12:12:50 AM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-8080"]
Jul 06, 2017 12:12:51 AM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-8009"]
Jul 06, 2017 12:12:51 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 33878 ms
Thanks,
Bhabani