Hello,
we have a GKE cluster which has been running for years. It has only one node pool. The cluster's pods were monitored by Stackdriver and their logs were pushed there as well. To upgrade our nodes from 1.14.8-gke.12 to 1.15.12-gke.2 we created a new node pool with the new version, migrated all pods to it and then deleted the old pool. We have used this method several times in the past to upgrade our nodes and have never had any issues.
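For reference, the migration was done roughly like this (pool names, cluster name, zone and node count are placeholders, not our real values):

# Create the replacement pool at the target version
gcloud container node-pools create new-pool \
    --cluster=my-cluster --zone=europe-west1-b \
    --node-version=1.15.12-gke.2 --num-nodes=3

# Cordon and drain the old pool's nodes so workloads move to the new pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

# Remove the old pool once everything has been rescheduled
gcloud container node-pools delete old-pool --cluster=my-cluster --zone=europe-west1-b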
Since this upgrade, no pod logs or monitoring metrics have appeared in Stackdriver. At first I thought maybe something went wrong with the creation of the new pool, so I tried creating new pools with the same node version (1.15.12-gke.2) and also with the previous version (1.14.8-gke.12), then created pods on these new pools, but without success. The pod logs can be viewed with kubectl logs, but they never show up in Stackdriver.
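For reference, a query along these lines should list the pod logs if they were arriving in Stackdriver (cluster name is a placeholder; with "Cloud operations for GKE" the container logs should use the k8s_container resource type):

gcloud logging read \
    'resource.type="k8s_container" AND resource.labels.cluster_name="my-cluster"' \
    --limit=10 --freshness=1h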
The cluster's configuration was not changed. It has "Cloud operations for GKE" set to "System and workload logging and monitoring".
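The setting can also be verified with something like the following (cluster name and zone are placeholders); as far as I know, "System and workload logging and monitoring" should correspond to these two service values:

gcloud container clusters describe my-cluster --zone=europe-west1-b \
    --format="value(loggingService,monitoringService)"
# expected for this setting:
# logging.googleapis.com/kubernetes  monitoring.googleapis.com/kubernetes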
The fluentd-gcp pods are running on the Kubernetes nodes. In their prometheus-to-sd-exporter container I see error messages like this:
I0904 09:00:33.605455 1 main.go:134] Running prometheus-to-sd, monitored target is fluentd localhost:24231
E0904 09:00:33.605672 1 main.go:90] listen tcp :6061: bind: address already in use
And another example from a different node (project name redacted):
I0902 10:25:56.301201 1 main.go:134] Running prometheus-to-sd, monitored target is fluentd localhost:24231
E0902 10:25:56.301467 1 main.go:90] listen tcp :6061: bind: address already in use
E0902 18:37:06.323094 1 stackdriver.go:58] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
I'm not sure if these error messages are related to the problem, but it's something I noticed.
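For completeness, the agent pods and the main fluentd-gcp container (not just prometheus-to-sd-exporter) can be checked with something like this (assuming the default kube-system namespace and the k8s-app=fluentd-gcp label; the pod name is a placeholder):

# List the Stackdriver logging agent pods
kubectl get pods -n kube-system -l k8s-app=fluentd-gcp -o wide

# Recent output of the main fluentd container on one of them
kubectl logs -n kube-system <fluentd-gcp-pod> -c fluentd-gcp --tail=100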
I'm out of ideas at this point, so I'd be grateful for any suggestions or advice.
Regards
Frank