Hi,
I am getting unusual timeouts with the 'LoadBalancer' service type on Kubernetes 1.3.3 running on GCE, and I don't know where to start troubleshooting.
My environment:
KUBERNETES_VERSION=1.3.3
KUBERNETES_PROVIDER=gce
KUBE_GCE_ZONE=us-central1-b
NODE_SIZE=n1-standard-4
Cluster created via the `cluster/kube-up.sh` script.
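Roughly how the cluster was brought up from the 1.3.3 release (just a sketch of the config above applied to the stock script):

export KUBERNETES_PROVIDER=gce
export KUBE_GCE_ZONE=us-central1-b
export NODE_SIZE=n1-standard-4
cluster/kube-up.sh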
Here are some repro steps to illustrate what I'm seeing.
1. Create two simple nginx RCs and two LoadBalancer services.
2. Curl the first nginx's LoadBalancer IP.
3. Scale replicas for the second nginx RC.
4. Watch the curl requests to the first nginx time out for roughly 2 minutes.
--------------------
1. Create two simple nginx RCs and two LoadBalancer services.
---------------------
kubectl create -f - <<- EOF
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-alpine
spec:
  replicas: 1
  selector:
    name: nginx-alpine
  template:
    metadata:
      labels:
        name: nginx-alpine
    spec:
      containers:
      - name: nginx-alpine
        image: rohan/nginx-alpine
        ports:
        - containerPort: 80
EOF
kubectl create -f - <<- EOF
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-alpine2
spec:
  replicas: 1
  selector:
    name: nginx-alpine2
  template:
    metadata:
      labels:
        name: nginx-alpine2
    spec:
      containers:
      - name: nginx-alpine2
        image: rohan/nginx-alpine
        ports:
        - containerPort: 80
EOF
kubectl create -f - <<- EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-alpine
  labels:
    app: nginx-alpine
spec:
  type: LoadBalancer
  ports:
  - port: 80
    protocol: TCP
    name: http
  selector:
    name: nginx-alpine
EOF
kubectl create -f - <<- EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-alpine2
  labels:
    app: nginx-alpine2
spec:
  type: LoadBalancer
  ports:
  - port: 80
    protocol: TCP
    name: http
  selector:
    name: nginx-alpine2
EOF
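After creating the services, I wait for both to get an external IP (104.155.142.86 below is the one assigned to nginx-alpine):

kubectl get svc nginx-alpine nginx-alpine2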
------
2. Curl the first nginx's LoadBalancer IP 10 times/sec, piping through `ts` for timestamps.
------
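The loop is roughly this (a sketch; `ts` is from moreutils, and the IP is the external IP of the nginx-alpine service):

EXTERNAL_IP=104.155.142.86
while true; do
  curl -D - -o /dev/null "http://${EXTERNAL_IP}/" | ts
  sleep 0.1
done

(`-D -` dumps the response headers to stdout, which is what gets timestamped; curl's progress meter goes to stderr, which is why those lines below are not timestamped.)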
------
3. Scale replicas for the second nginx RC.
------
kubectl scale rc nginx-alpine2 --replicas 4
--------
4. Watch the curl requests to the first nginx time out for roughly 2 minutes.
-----
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 612 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Jul 26 21:09:50 HTTP/1.1 200 OK
Jul 26 21:09:50 Server: nginx/1.6.2
Jul 26 21:09:50 Date: Wed, 27 Jul 2016 04:09:50 GMT
Jul 26 21:09:50 Content-Type: text/html
Jul 26 21:09:50 Content-Length: 612
Jul 26 21:09:50 Last-Modified: Mon, 17 Nov 2014 14:48:17 GMT
Jul 26 21:09:50 Connection: keep-alive
Jul 26 21:09:50 ETag: "546a0ab1-264"
Jul 26 21:09:50 Accept-Ranges: bytes
Jul 26 21:09:50
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
...
0 0 0 0 0 0 0 0 --:--:-- 0:02:04 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:02:05 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:02:06 --:--:-- 0
curl: (7) Failed to connect to 104.155.142.86 port 80: Connection timed out
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 612 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Jul 26 21:11:58 HTTP/1.1 200 OK
Jul 26 21:11:58 Server: nginx/1.6.2
Jul 26 21:11:58 Date: Wed, 27 Jul 2016 04:11:58 GMT
Jul 26 21:11:58 Content-Type: text/html
Jul 26 21:11:58 Content-Length: 612
Jul 26 21:11:58 Last-Modified: Mon, 17 Nov 2014 14:48:17 GMT
Jul 26 21:11:58 Connection: keep-alive
Jul 26 21:11:58 ETag: "546a0ab1-264"
Jul 26 21:11:58 Accept-Ranges: bytes
Jul 26 21:11:58
During the timeout, netstat says:
Interestingly, a GCE Kubernetes cluster running 1.2.5 does not exhibit this timeout.
Is this normal or am I doing something wrong?
I should also point out that communication with the service from inside the cluster via k8s DNS works perfectly.
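For example, something like this succeeds every time (just a sketch; the pod name is a placeholder, and it assumes an image with wget):

kubectl exec -it <some-pod> -- wget -qO- http://nginx-alpine.default.svc.cluster.local/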
Any help would be greatly appreciated.
Christopher McKenzie