Kubernetes Elastic Agents - Pods can no longer be created


Kim Pham

Oct 26, 2023, 10:56:58 AM
to go-cd
Hi All,

We recently started seeing an issue where elastic agent pods can no longer be created.  Nothing has changed in the GoCD server, agent, or Kubernetes elastic agent plugin versions.  However, we did notice that the cluster went through an automatic upgrade, which brought the GKE version to 1.24.14.  GoCD is able to see the node pools through the 'Status Report' button.

When attempting to create an agent on those node pools, I see a 500 error in both the plugin logs and the gocd-server logs.  Logs are attached.

I've tried updating GoCD and the plugins to the latest release versions.  Our static agents running on older GKE versions aren't having any issues.

Has anyone encountered this?

Thanks in advance.

gocd-server-log.txt
elasticagent-plugin-log.txt

Ashwanth Kumar

Oct 26, 2023, 11:21:46 AM
to go...@googlegroups.com
A wild guess: did anything change on the service account side, or was a custom role added as part of the upgrade, that might be preventing the GoCD plugin from creating the pod?

Thanks,


--

Ashwanth Kumar / ashwanthkumar.in

Kim Pham

Oct 26, 2023, 11:33:41 AM
to go...@googlegroups.com
Hi Ashwanth,

I checked the ClusterRole of the service account it's using, and it basically has full access at the moment.

PolicyRule:
  Resources   Non-Resource URLs  Resource Names  Verbs
  ---------   -----------------  --------------  -----
  events      []                 []              [*]
  namespaces  []                 []              [*]
  nodes       []                 []              [*]
  pods/log    []                 []              [*]
  pods        []                 []              [*]
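
For readers repeating this check, here is a minimal sketch that parses a PolicyRule table like the one above and answers "is this verb allowed on this resource?". It assumes the text comes from `kubectl describe clusterrole` and has the same column layout; the function names `parse_policy_rules` and `allows` are illustrative, not part of any GoCD or Kubernetes API.

```python
import re

# Sample input copied from the PolicyRule table in this message.
SAMPLE_OUTPUT = """\
PolicyRule:
  Resources   Non-Resource URLs  Resource Names  Verbs
  ---------   -----------------  --------------  -----
  events      []                 []              [*]
  namespaces  []                 []              [*]
  nodes       []                 []              [*]
  pods/log    []                 []              [*]
  pods        []                 []              [*]
"""


def parse_policy_rules(text):
    """Map each resource in a PolicyRule table to its list of verbs."""
    rules = {}
    for line in text.splitlines():
        # Data rows carry exactly three bracketed columns; the title,
        # header, and divider rows carry none and are skipped.
        groups = re.findall(r"\[([^\]]*)\]", line)
        if len(groups) != 3:
            continue
        resource = line.split()[0]
        # Verbs are space-separated inside the last bracket pair.
        rules[resource] = groups[2].split()
    return rules


def allows(rules, resource, verb):
    """Return True if the verb (or the '*' wildcard) is listed for the resource."""
    verbs = rules.get(resource, [])
    return "*" in verbs or verb in verbs


rules = parse_policy_rules(SAMPLE_OUTPUT)
print("create pods allowed:", allows(rules, "pods", "create"))
```

This only reads the role definition. It does not confirm that the role is actually bound to the service account the plugin uses; `kubectl auth can-i create pods --as=system:serviceaccount:<namespace>:<name>` asks the API server directly (the namespace and name are placeholders).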

Chad Wilson

Oct 26, 2023, 11:37:29 AM
to go...@googlegroups.com
Just curious - were the errors/stack traces on failure essentially identical before and after you upgraded your gocd and elastic agent plugin versions?

Chad Wilson

Oct 26, 2023, 11:46:51 AM
to go...@googlegroups.com
Unfortunately the error message is a bit mysterious and not very useful. Which agent image are you using? Is there anything special in the elastic agent pod spec that might no longer work as expected on Kubernetes 1.24 (e.g. use of Docker DinD images)?

Does the pod get created (and fail) if you look at the events on the kubernetes side, or does it never get that far?

-Chad

Kim Pham

Oct 26, 2023, 11:55:25 AM
to go...@googlegroups.com
Hi Chad,

I was just reading up on the changes in Kubernetes.  It looks like 1.24 moves to containerd runtime images, and we are still using DinD for our elastic agents.  That seems like it could be the culprit.  I'll do some testing and change our elastic agent images.  Thanks for pointing that out.
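
Some background on the suspicion above: Kubernetes 1.24 removed dockershim, so nodes running containerd do not provide a host Docker daemon. A pod that mounts the node's `/var/run/docker.sock` depends on that daemon, whereas an image that starts its own Docker daemon in a privileged container does not. The sketch below, assuming the elastic agent pod spec has been loaded into a Python dict (for example from its YAML), flags both patterns. The function names and the sample manifest are hypothetical, and whether either pattern applies to the setup in this thread is not known.

```python
DOCKER_SOCKET = "/var/run/docker.sock"


def docker_socket_volumes(pod):
    """Names of volumes that hostPath-mount the node's Docker socket."""
    volumes = pod.get("spec", {}).get("volumes") or []
    return [
        v.get("name", "<unnamed>")
        for v in volumes
        if (v.get("hostPath") or {}).get("path") == DOCKER_SOCKET
    ]


def privileged_containers(pod):
    """Names of containers that request privileged mode (typical for DinD)."""
    containers = pod.get("spec", {}).get("containers") or []
    return [
        c.get("name", "<unnamed>")
        for c in containers
        if (c.get("securityContext") or {}).get("privileged") is True
    ]


# Hypothetical manifest for illustration only; not taken from the thread.
example_pod = {
    "spec": {
        "containers": [
            {"name": "agent", "securityContext": {"privileged": True}},
        ],
        "volumes": [
            {"name": "docker-sock", "hostPath": {"path": DOCKER_SOCKET}},
            {"name": "scratch", "emptyDir": {}},
        ],
    }
}

print("host socket volumes:", docker_socket_volumes(example_pod))
print("privileged containers:", privileged_containers(example_pod))
```

A non-empty first list is the pattern most likely to break on a containerd node. A privileged container on its own is not a problem introduced by 1.24.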

Kim Pham

Oct 27, 2023, 5:16:44 PM
to go...@googlegroups.com
Hi,

I just wanted to give an update.  We're still unsure what's preventing agents from spinning up on our old cluster.  The pods aren't getting created at all, and there are no logs on the Kubernetes side indicating any activity.  I've tried with a regular Debian agent and the logs are the same.  I also rolled back to GoCD 23.1.0 to see what the log output was, and it shows a SocketTimeoutException.  We're still investigating, but it's possible that some configuration on that cluster changed without our being aware of it.

I was able to spin up a new cluster and it works fine there.  That cluster is running:
  • GoCD Server: 23.3.0
  • GoCD DinD Agent: 23.3.0
  • Elastic Agent Node Pools: 1.25.10-gk3.2700
I'll update if there's anything relevant to share.  Thanks for all the help!

Chad Wilson

Oct 27, 2023, 10:26:06 PM
to go...@googlegroups.com
Thanks for sharing. Sounds like it could be a network policy thing or something like that? Did you double check the cluster profile config to see if anything is amiss there?

I can't see why you'd get a different error message with 23.1.0 (and the same plugin version) unless something was different about the cluster profile config (API server URL, etc.), so that might be an avenue for investigation.

-Chad
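
One way to follow up on the network angle: a timeout (as opposed to a refused connection or an HTTP error) usually means packets are being dropped somewhere between the GoCD server and the API server. The sketch below classifies a plain TCP connection attempt to the host and port of a URL. It is an assumption-laden illustration: `probe_api_server` is a made-up helper, it does no TLS or authentication, and it should be run from the same host or pod as the GoCD server to be meaningful.

```python
import socket
from urllib.parse import urlparse


def probe_api_server(url, timeout=5.0):
    """Classify a TCP connection attempt to the host and port in `url`."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        # No answer at all: often a firewall rule or network policy.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


# Usage (placeholder URL; substitute the one from the cluster profile):
#   probe_api_server("https://<api-server-host>")
```

A "timeout" result would be consistent with the SocketTimeoutException reported above, while "reachable" would shift suspicion back to the cluster profile settings or credentials.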
