False positive "HighNumberOfFailedGRPCRequests Watch" alarms?


peter schulten

Feb 9, 2018, 5:11:55 AM
to etcd-dev
Hi,

I installed the *awesome* Prometheus Operator in my Kubernetes cluster and configured a ServiceMonitor for my external etcd members.
I added your suggested alerting rules from https://github.com/coreos/etcd/blob/f5d02f02791716ed99c489ffa2441b8cc7925457/Documentation/op-guide/etcd3_alert.rules.yml#L68 and now get an endless loop of
[FIRING:2]  (HighNumberOfFailedGRPCRequests Watch etcdserverpb.Watch)
and
[RESOLVED]  (HighNumberOfFailedGRPCRequests Watch etcdserverpb.Watch)
notifications, alternating every 5 minutes.

Running this query in the Prometheus console:
sum(grpc_server_handled_total{job="etcd"}) BY (grpc_code, grpc_method) > 0

I see:
Element                                                 Value
{grpc_code="Unavailable",grpc_method="LeaseKeepAlive"}  226
{grpc_code="OK",grpc_method="MemberRemove"}             1
{grpc_code="OK",grpc_method="Txn"}                      167011
{grpc_code="OK",grpc_method="LeaseGrant"}               1555
{grpc_code="OK",grpc_method="Range"}                    752031
{grpc_code="Unavailable",grpc_method="Watch"}           1270
{grpc_code="Unavailable",grpc_method="Txn"}             2
{grpc_code="OK",grpc_method="Compact"}                  291
{grpc_code="Unavailable",grpc_method="Range"}           2
{grpc_code="OK",grpc_method="MemberList"}               1
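
To see how these counters look to a ratio-based alert, they can be turned into a per-method failure ratio. The following is a minimal Python sketch, not the rule from the linked file: it assumes the alert compares non-OK requests to all handled requests per method, and it works on the cumulative values posted above rather than on a `rate()` over a time window. The function name `failure_ratios` is made up for illustration.

```python
from collections import defaultdict

# Counter values copied from the console output above:
# (grpc_code, grpc_method) -> grpc_server_handled_total
COUNTERS = {
    ("Unavailable", "LeaseKeepAlive"): 226,
    ("OK", "MemberRemove"): 1,
    ("OK", "Txn"): 167011,
    ("OK", "LeaseGrant"): 1555,
    ("OK", "Range"): 752031,
    ("Unavailable", "Watch"): 1270,
    ("Unavailable", "Txn"): 2,
    ("OK", "Compact"): 291,
    ("Unavailable", "Range"): 2,
    ("OK", "MemberList"): 1,
}


def failure_ratios(counters):
    """Return {method: non-OK requests / all requests} from cumulative counters."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for (code, method), value in counters.items():
        total[method] += value
        if code != "OK":
            failed[method] += value
    return {method: failed[method] / total[method] for method in total}


ratios = failure_ratios(COUNTERS)
# Print methods with the highest failure ratio first.
for method, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{method:15s} {ratio:.6f}")
```

With the posted values, Watch and LeaseKeepAlive have no OK samples at all, so their ratio is 1.0, while Txn and Range stay far below 1%.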

My etcd members (v3.3.0) are run with this systemd unit:

# /etc/systemd/system/etcd.service
#v7
[Unit]
Description=etcd
After=network.target r53update.service rejoin.service
Requires=r53update.service

[Service]
Type=notify
EnvironmentFile=/etc/metadata.env
ExecStart=/usr/bin/etcd \
  --discovery-srv=${kube_domain_name} \
  --cert-file=/etc/ssl/COMPANY/etcd_peer_cert.pem \
  --client-cert-auth \
  --peer-client-cert-auth \
  --key-file=/etc/ssl/COMPANY/etcd_peer_key.pem \
  --peer-cert-file=/etc/ssl/COMPANY/etcd_peer_cert.pem \
  --peer-key-file=/etc/ssl/COMPANY/etcd_peer_key.pem \
  --trusted-ca-file=/etc/ssl/COMPANY/ca_cert.pem \
  --peer-trusted-ca-file=/etc/ssl/COMPANY/ca_cert.pem \
  --initial-cluster-token=${cluster_name}-etcd-token \
  --initial-advertise-peer-urls=https://etcd-${AVAILABILITY_ZONE}.${kube_domain_name}:2380 \
  --advertise-client-urls=https://etcd-${AVAILABILITY_ZONE}.${kube_domain_name}:2379 \
  --listen-client-urls=https://${DEFAULT_IPV4}:2379,http://127.0.0.1:2379 \
  --listen-peer-urls=https://0.0.0.0:2380 \
  --data-dir=/var/lib/etcd \
  --name=%H

#Restart=always
#RestartSec=10
#StartLimitInterval=200s
#StartLimitBurst=10
#TimeoutStartSec=0
LimitNOFILE=40000

[Install]
WantedBy=multi-user.target


Is something wrong with my setup, given that so many requests are counted with the "Unavailable" gRPC code?

/peter


Gyuho Lee

Feb 15, 2018, 5:05:35 PM
to etcd-dev
Can you share client or server logs when you get Unavailable request errors?

peter schulten

Feb 16, 2018, 8:34:03 AM
to Gyuho Lee, etcd-dev
Hi @gyuho
client logs:
Time      COMM            aws_private_ip  aws_az         MESSAGE
14:05:36  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:36.739670 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:36361: EOF
14:05:36  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:36.907637 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:39500: EOF
14:05:36  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:36.373433 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:52670: EOF
14:05:36  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:36.639663 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:1623: EOF
14:05:36  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:36.762449 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:65045: EOF
14:05:36  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:36.065624 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:2872: EOF
14:05:35  kube-apiserver  10.52.187.85    eu-central-1b  I0216 13:05:35.718033 1536 logs.go:41] http: TLS handshake error from 10.52.186.213:55316: EOF
14:05:35  kube-apiserver  10.52.187.105   eu-central-1c  I0216 13:05:35.341891 1528 logs.go:41] http: TLS handshake error from 10.52.186.233:62144: EOF
14:05:34  kube-apiserver  10.52.187.85    eu-central-1b  I0216 13:05:34.447191 1536 logs.go:41] http: TLS handshake error from 10.52.186.213:10759: EOF
14:05:34  kube-apiserver  10.52.187.85    eu-central-1b  I0216 13:05:34.977787 1536 logs.go:41] http: TLS handshake error from 10.52.186.213:54188: EOF
14:05:34  kube-apiserver  10.52.187.25    eu-central-1a  I0216 13:05:34.137333 1530 logs.go:41] http: TLS handshake error from 10.52.186.179:31175: EOF
14:05:34  kube-apiserver  10.52.187.25    eu-central-1a  I0216 13:05:34.418580 1530 logs.go:41] http: TLS handshake error from 10.52.186.179:32852: EOF
14:05:34  kube-apiserver  10.52.187.25    eu-central-1a  I0216 13:05:34.428388 1530 logs.go:41] http: TLS handshake error from 10.52.186.179:63490: EOF
14:05:34  kube-apiserver  10.52.187.25    eu-central-1a  I0216 13:05:34.469058 1530 logs.go:41] http: TLS handshake error from 10.52.186.179:46948: EOF

server logs:

Time      COMM  aws_private_ip  aws_az         MESSAGE
14:05:34  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48238" (error "EOF", ServerName "")
14:05:34  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48236" (error "EOF", ServerName "")
14:05:34  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48240" (error "EOF", ServerName "")
14:05:33  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:42124" (error "EOF", ServerName "")
14:05:33  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:42126" (error "EOF", ServerName "")
14:05:33  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:42122" (error "EOF", ServerName "")
14:05:32  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42720" (error "EOF", ServerName "")
14:05:32  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42718" (error "EOF", ServerName "")
14:05:29  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48202" (error "EOF", ServerName "")
14:05:29  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48204" (error "EOF", ServerName "")
14:05:29  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48200" (error "EOF", ServerName "")
14:05:28  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:42070" (error "EOF", ServerName "")
14:05:28  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:42068" (error "EOF", ServerName "")
14:05:28  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:42066" (error "EOF", ServerName "")
14:05:27  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42680" (error "EOF", ServerName "")
14:05:27  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42678" (error "EOF", ServerName "")
14:05:24  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48148" (error "EOF", ServerName "")
14:05:24  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48144" (error "EOF", ServerName "")
14:05:24  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48146" (error "EOF", ServerName "")
14:05:23  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:41994" (error "EOF", ServerName "")
14:05:23  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:41996" (error "EOF", ServerName "")
14:05:23  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:41998" (error "EOF", ServerName "")
14:05:22  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42638" (error "EOF", ServerName "")
14:05:22  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42640" (error "EOF", ServerName "")
14:05:19  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48062" (error "EOF", ServerName "")
14:05:19  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48064" (error "EOF", ServerName "")
14:05:19  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:48060" (error "EOF", ServerName "")
14:05:18  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:41888" (error "EOF", ServerName "")
14:05:18  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:41892" (error "EOF", ServerName "")
14:05:18  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.105:41890" (error "EOF", ServerName "")
14:05:17  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42520" (error "EOF", ServerName "")
14:05:17  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.25:42522" (error "EOF", ServerName "")
14:05:14  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:47936" (error "EOF", ServerName "")
14:05:14  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:47938" (error "EOF", ServerName "")
14:05:14  etcd  10.52.187.26    eu-central-1a  rejected connection from "10.52.187.85:47940" (error "EOF", ServerName "")
14:05:13  etcd  10.52.187.118   eu-central-1c  wrote database snapshot out [total bytes: 8122368]
14:05:13  etcd  10.52.187.118   eu-central-1c  database snapshot [index: 1071338, to: 6313fc348c41fe92] sent out successfully
14:05:13  etcd  10.52.187.118   eu-central-1c  lost the TCP streaming connection with peer 6313fc348c41fe92 (stream MsgApp v2 reader)
14:05:13  etcd  10.52.187.118   eu-central-1c  failed to dial 6313fc348c41fe92 on stream MsgApp v2 (peer 6313fc348c41fe92 failed to find local node 724eda5ae89be8b7)
14:05:13  etcd  10.52.187.118   eu-central-1c  peer 6313fc348c41fe92 became active

I rebooted the server in eu-central-1a shortly before these logs were taken.
The Kubernetes cluster runs fine; I haven't noticed any problems.

(bibi:kube-system) pschu@ip-192-168-178-115 ~> kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-2               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
etcd-0               Healthy   {"health":"true"} 
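
The rejected connections in the server log further up arrive in bursts, and grouping them by source address makes the spacing easier to see. The following Python sketch is only an illustration: it assumes each log entry has been joined onto a single line that starts with the time, the sample lines are a subset copied from the log above, and the helper names `seconds_by_source` and `intervals` are made up.

```python
import re
from collections import defaultdict

# A subset of the etcd server log entries above, one entry per line.
SAMPLE = """\
14:05:34  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.85:48238" (error "EOF", ServerName "")
14:05:33  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.105:42124" (error "EOF", ServerName "")
14:05:32  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.25:42720" (error "EOF", ServerName "")
14:05:29  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.85:48202" (error "EOF", ServerName "")
14:05:28  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.105:42070" (error "EOF", ServerName "")
14:05:27  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.25:42680" (error "EOF", ServerName "")
14:05:24  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.85:48148" (error "EOF", ServerName "")
14:05:23  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.105:41994" (error "EOF", ServerName "")
14:05:22  etcd  10.52.187.26  eu-central-1a  rejected connection from "10.52.187.25:42638" (error "EOF", ServerName "")
"""

# Time at the start of the line, source IP inside the quoted "ip:port".
LINE = re.compile(r'^(\d\d):(\d\d):(\d\d)\b.*rejected connection from "([\d.]+):\d+"')


def seconds_by_source(text):
    """Map each source IP to the sorted seconds-of-day at which it was rejected."""
    seen = defaultdict(set)
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            hours, minutes, seconds, ip = match.groups()
            seen[ip].add(int(hours) * 3600 + int(minutes) * 60 + int(seconds))
    return {ip: sorted(stamps) for ip, stamps in seen.items()}


def intervals(stamps):
    """Return the gaps in seconds between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


by_source = seconds_by_source(SAMPLE)
for ip, stamps in sorted(by_source.items()):
    print(ip, intervals(stamps))
```

On this sample, each of the three source addresses shows up once every 5 seconds.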


/peter


peter schulten

Mar 5, 2018, 2:18:09 PM
to etcd-dev
Hey,

Running curl with the certificates and keys that the API server node uses works fine. Is this a client, server, or setup problem? Can it be ignored?

Thanks,
Peter
/peter


peter schulten

May 1, 2018, 4:18:54 AM
to etcd-dev
Hi all,

The problem seems to be gone after upgrading to etcd v3.3.4 and Kubernetes v1.10.2.

/peter

Denis Arslanbekov

Nov 29, 2018, 11:08:48 AM
to etcd-dev