Hello,
we have a GKE cluster which has been running for years. It has only one node pool. The cluster's pods were monitored by Stackdriver and their logs were pushed there as well. To upgrade our nodes from 1.14.8-gke.12 to 1.15.12-gke.2 we created a new node pool with the new version, migrated all pods to it and then deleted the old pool. We have used this method several times in the past to upgrade our nodes and have never had any issues.
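For reference, the migration was done roughly like this (pool names, cluster name, zone and node count are placeholders, not our real values):

# Create the replacement pool at the target version
gcloud container node-pools create new-pool \
    --cluster=my-cluster --zone=europe-west1-b \
    --node-version=1.15.12-gke.2 --num-nodes=3

# Cordon and drain the old pool's nodes so workloads move to the new pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

# Remove the old pool once everything has been rescheduled
gcloud container node-pools delete old-pool --cluster=my-cluster --zone=europe-west1-b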
Since this upgrade, no pod logs or monitoring metrics have appeared in Stackdriver. At first I thought maybe something went wrong with the creation of the new pool, so I tried creating new pools with the same node version (1.15.12-gke.2) and also with the previous version (1.14.8-gke.12), then created pods on these new pools, but without success. The pod logs can be viewed with kubectl logs, but they never show up in Stackdriver.
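For reference, a query along these lines should list the pod logs if they were arriving in Stackdriver (cluster name is a placeholder; with "Cloud operations for GKE" the container logs should use the k8s_container resource type):

gcloud logging read \
    'resource.type="k8s_container" AND resource.labels.cluster_name="my-cluster"' \
    --limit=10 --freshness=1h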
The cluster's configuration was not changed. It has "Cloud operations for GKE" set to "System and workload logging and monitoring".
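The setting can also be verified with something like the following (cluster name and zone are placeholders); as far as I know, "System and workload logging and monitoring" should correspond to these two service values:

gcloud container clusters describe my-cluster --zone=europe-west1-b \
    --format="value(loggingService,monitoringService)"
# expected for this setting:
# logging.googleapis.com/kubernetes  monitoring.googleapis.com/kubernetes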
The fluentd-gcp pods are running on the Kubernetes nodes. In their prometheus-to-sd-exporter container I see error messages like this:
I0904 09:00:33.605455 1 main.go:134] Running prometheus-to-sd, monitored target is fluentd localhost:24231
E0904 09:00:33.605672 1 main.go:90] listen tcp :6061: bind: address already in use
And another example from a different node (project name redacted):
I0902 10:25:56.301201 1 main.go:134] Running prometheus-to-sd, monitored target is fluentd localhost:24231
E0902 10:25:56.301467 1 main.go:90] listen tcp :6061: bind: address already in use
E0902 18:37:06.323094 1 stackdriver.go:58] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
I'm not sure if these error messages are related to the problem, but it's something I noticed.
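For completeness, the agent pods and the main fluentd-gcp container (not just prometheus-to-sd-exporter) can be checked with something like this (assuming the default kube-system namespace and the k8s-app=fluentd-gcp label; the pod name is a placeholder):

# List the Stackdriver logging agent pods
kubectl get pods -n kube-system -l k8s-app=fluentd-gcp -o wide

# Recent output of the main fluentd container on one of them
kubectl logs -n kube-system <fluentd-gcp-pod> -c fluentd-gcp --tail=100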
I'm out of ideas at this point, so I'd be grateful for any suggestions or advice.
Regards
Frank