metadata-agent keeps restarting in my GKE cluster


ji...@comware.com.au

Oct 31, 2018, 4:40:16 AM
to Google Stackdriver Discussion Forum
My metadata-agent pod keeps restarting. Here is the log:

I1031 07:53:24 7f5fbd183740 updater.cc:40 Not starting DockerUpdater
E1031 07:53:34 7f5fbab5d700 environment.cc:102 Exception: Host not found (non-authoritative), try again later: 'http://metadata.google.internal./computeMetadata/v1/instance/id'
E1031 07:53:44 7f5fbd183740 kubernetes.cc:697 Failed to query https://kubernetes.default.svc/api/v1/nodes?limit=1: Host not found (non-authoritative), try again later
E1031 07:53:44 7f5fbab5d700 environment.cc:102 Exception: Host not found (non-authoritative), try again later: 'http://metadata.google.internal./computeMetadata/v1/instance/zone'
E1031 07:53:54 7f5fbab5d700 environment.cc:102 Exception: Host not found (non-authoritative), try again later: 'http://metadata.google.internal./computeMetadata/v1/instance/id'
E1031 07:54:00 7f5fbd183740 kubernetes.cc:697 Failed to query https://kubernetes.default.svc/api/v1/nodes?limit=1: Host not found (authoritative)
I1031 07:54:01 7f5fbd183740 kubernetes.cc:1322 Watching for cluster-level metadata
I1031 07:54:01 7f5fb17fa700 reporter.cc:46 Metadata reporter started
I1031 07:54:01 7f5fb8d26700 kubernetes.cc:1273 Watch thread started for endpoints
I1031 07:54:01 7f5fb3fff700 kubernetes.cc:1239 Watch thread started for services
I1031 07:54:01 7f5fba35c700 kubernetes.cc:1163 Watch thread (pods) started for node <unscheduled>
I1031 07:54:01 7f5fb9b5b700 kubernetes.cc:1203 Watch thread (node) started for node <all>
I1031 07:54:04 7f5fb17fa700 environment.cc:270 No credentials found at /etc/google/auth/application_default_credentials.json
I1031 07:54:04 7f5fb17fa700 environment.cc:146 Got project id from metadata server: 206352482676
I1031 07:54:04 7f5fb17fa700 oauth2.cc:283 Getting auth token from metadata server
W1031 07:59:24 7f5fb37fe700 api_server.cc:183 /healthz returning 500; unhealthy components: Service
Caught SIGTERM; shutting down
Stopping server
I1031 07:59:24 7f5fbd183740 api_server.cc:102 API server stopped
Stopping updaters
Exiting



It looks like the service account may not be set up correctly.

Any ideas?


Additionally, what permissions does the service account need in order to interact with Stackdriver?

cheers
</jima>

Igor Peshansky

Oct 31, 2018, 1:50:54 PM
to ji...@comware.com.au, google-stackdr...@googlegroups.com
These errors show two symptoms:

E1031 07:53:34 7f5fbab5d700 environment.cc:102 Exception: Host not found (non-authoritative), try again later: 'http://metadata.google.internal./computeMetadata/v1/instance/id'
E1031 07:53:44 7f5fbd183740 kubernetes.cc:697 Failed to query https://kubernetes.default.svc/api/v1/nodes?limit=1: Host not found (non-authoritative), try again later

The first one is caused by kube-dns not resolving metadata.google.internal (you might need to add a mapping in the kube-dns config from metadata.google.internal to 169.254.169.254).
The second is because the master is not accessible from the metadata agent pod. Since the master service should be resolvable by kube-dns, this is probably a restriction on the pod — you might need to change the RBAC to allow the pod to talk to the master (https://kubernetes.io/docs/reference/access-authn-authz/rbac/#role-and-clusterrole).
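
To confirm which of the two names actually fails to resolve, a small check along these lines could be run from a pod on the same cluster network. This is only a sketch: it assumes a Python interpreter is available in that pod (the metadata agent image itself may not ship one), and `check_dns` is an illustrative helper, not part of any Stackdriver tooling. The hostnames are the ones from the error log.

```python
import socket

# Hostnames taken from the agent's error log above.
HOSTS = ["metadata.google.internal", "kubernetes.default.svc"]


def check_dns(hosts, resolve=socket.gethostbyname):
    """Return {host: resolved IPv4 address, or None if the lookup failed}.

    `resolve` is injectable so the function can be exercised without
    a real DNS server; by default it uses the system resolver.
    """
    results = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror and socket.herror subclass OSError
            results[host] = None
    return results
```

Calling `check_dns(HOSTS)` inside the pod returns a dictionary; a `None` value means the cluster DNS could not resolve that name, which would match the "Host not found" errors in the log.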

Additionally, what permissions does the service account need in order to interact with Stackdriver?

To write metrics, you need the monitoring.metricWriter role (https://cloud.google.com/monitoring/access-control#roles). To write logs, you need the logging.logWriter role (https://cloud.google.com/logging/docs/access-control#permissions_and_roles). Either of those should be sufficient for the metadata agent, but if you use a shared service account for all three agents, it's best to give it both roles (or the associated permissions).
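
As a rough illustration of granting both roles, the sketch below builds the corresponding `gcloud projects add-iam-policy-binding` commands. The project ID and service account email are hypothetical placeholders, and `binding_commands` is an illustrative helper; the role IDs are the full names of the two roles mentioned above.

```python
import shlex

# Full IDs of the two roles named in the reply above.
ROLES = ["roles/monitoring.metricWriter", "roles/logging.logWriter"]


def binding_commands(project_id, service_account_email, roles=ROLES):
    """Build one `gcloud projects add-iam-policy-binding` argv per role."""
    member = f"serviceAccount:{service_account_email}"
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         "--member", member, "--role", role]
        for role in roles
    ]


def render(commands):
    """Join each argv into a shell-safe command line, one per line."""
    return "\n".join(shlex.join(argv) for argv in commands)
```

For example, `print(render(binding_commands("my-project", "agents@my-project.iam.gserviceaccount.com")))` prints two commands that can be reviewed before being run by someone with permission to change the project's IAM policy.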
        Igor

--
© 2016 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
 
Email preferences: You received this email because you signed up for the Google Stackdriver Discussion Google Group (google-stackdr...@googlegroups.com) to participate in discussions with other members of the GoogleStackdriver community.
---
You received this message because you are subscribed to the Google Groups "Google Stackdriver Discussion Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-stackdriver-d...@googlegroups.com.
To post to this group, send email to google-stackdr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-stackdriver-discussion/81c3c783-9d29-45b8-a822-1a04ca2e3031%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ji...@comware.com.au

Oct 31, 2018, 8:10:09 PM
to Google Stackdriver Discussion Forum
Igor,

I am using "gcloud beta container clusters create", so I would have thought the DNS entries etc. would already be in place. This is the command I use to create the cluster:

gcloud beta container clusters create xxx-staging \
--zone=us-west1-a \
--cluster-version=1.11.2-gke.9 \
--node-version=1.11.2-gke.9 \
--disk-size=100 \
--disk-type=pd-ssd \
--enable-autorepair \
--enable-autoupgrade \
--enable-stackdriver-kubernetes \
--labels=project=xxxx,environment=staging \
--machine-type=n1-standard-2 \
--node-labels=project=xxxx,environment=staging \
--tags=orion,staging \
--node-locations=us-west1-a,us-west1-b,us-west1-c \
--num-nodes=1 \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=5 \
--username=orionadmin \
--password='xxxxxx' \
--addons=HttpLoadBalancing,HorizontalPodAutoscaling,Istio

The cluster comes up okay, but the pods keep restarting:

metadata-agent-cluster-level-54bf88c8dc-bbhtn             1/1       Running   4          25m
metadata-agent-crg8h                                      1/1       Running   4          25m
metadata-agent-wkqx7                                      1/1       Running   4          25m
metadata-agent-xdc6m                                      1/1       Running   3          25m
metrics-server-v0.2.1-fd596d746-7qk4z                     2/2       Running   0          25m

prabir...@bnc.ca

Oct 31, 2018, 8:25:17 PM
to Google Stackdriver Discussion Forum
That is exactly the error that I am getting too. 

prabir...@bnc.ca

Oct 31, 2018, 8:49:34 PM
to Google Stackdriver Discussion Forum
I see that it is documented here. Apparently it is working as designed. 

Igor Peshansky

Oct 31, 2018, 10:42:03 PM
to prabir...@bnc.ca, Google Stackdriver Discussion Forum
A few restarts per hour are expected in beta (https://cloud.google.com/monitoring/kubernetes-engine/release-guide#b1_10_6). As long as your metadata is ingested properly (the UI shows your workloads), it's working fine.
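
To put a number on "restarts per hour", the RESTARTS and AGE columns of a pod listing can be converted into a rate. The following is a minimal sketch, assuming rows in the `NAME READY STATUS RESTARTS AGE` layout that `kubectl get pods` prints; both function names are illustrative.

```python
import re

# Minutes per unit for AGE strings such as "25m", "2h" or "1d3h".
_UNIT_MINUTES = {"s": 1 / 60, "m": 1, "h": 60, "d": 1440}


def parse_age_minutes(age):
    """Convert an AGE string such as '25m' or '1d3h' to minutes."""
    parts = re.findall(r"(\d+)([smhd])", age)
    if not parts:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(number) * _UNIT_MINUTES[unit] for number, unit in parts)


def restarts_per_hour(row):
    """Return (pod name, restarts per hour) for one pod listing row."""
    fields = row.split()
    name, restarts, age = fields[0], int(fields[3]), fields[-1]
    return name, restarts * 60 / parse_age_minutes(age)
```

For a row such as `metadata-agent-crg8h 1/1 Running 4 25m`, this gives 4 restarts in 25 minutes, or 9.6 restarts per hour. The rate is only meaningful once the pod has been up for a while, since a short AGE inflates it.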

If you use a custom networking setup (e.g., a severely restricted VPC), the agent will likely not be able to contact the necessary hosts, and metadata won't be ingested.
        Igor
-- sent from a mobile device, please excuse tyops and omissns
