Issue with gocd-agent-dind image

Ashwanth Kumar

Dec 23, 2024, 9:45:35 AM
to go...@googlegroups.com
Hello,

I'm running GoCD 24.3.0 with an elastic agent profile on an AWS EKS cluster. My pipelines run properly most of the time, but occasionally a run gets stuck in limbo, with the agent pod logging the output below. When that happens, the pipeline just sits waiting for an agent to be assigned.

Is there anything obvious here to anyone who has seen this error before? Should I just upgrade to 24.5.0 and see whether it goes away? The only way I have found to recover is to forcefully terminate the node on EC2. Killing the pod doesn't help, because all newer pods hit the same issue.

$ sudo /run-docker-daemon.sh
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
/run-docker-daemon.sh: line 23:     9 Killed                  $(which dind) dockerd --host=unix:///var/run/docker.sock ${DOCKERD_ADDITIONAL_ARGS:-'--host=tcp://localhost:2375'} > /var/log/dockerd.log 2>&1
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
dockerd startup failed...
time="2024-12-23T14:35:58.923528879Z" level=info msg="Starting up"
time="2024-12-23T14:35:58.924393339Z" level=warning msg="Binding to IP address without --tlsverify is insecure and gives root access on this machine to everyone who has access to your network." host="tcp://localhost:2375"
time="2024-12-23T14:35:58.924414660Z" level=warning msg="Binding to an IP address, even on localhost, can also give access to scripts run in a browser. Be safe out there!" host="tcp://localhost:2375"
time="2024-12-23T14:35:58.924442010Z" level=warning msg="[DEPRECATION NOTICE] In future versions this will be a hard failure preventing the daemon from starting! Learn more at: https://docs.docker.com/go/api-security/" host="tcp://localhost:2375"
time="2024-12-23T14:35:59.925995985Z" level=info msg="containerd not running, starting managed containerd"
time="2024-12-23T14:35:59.928570903Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=141
time="2024-12-23T14:35:59.956936540Z" level=info msg="starting containerd" revision=8fc6bcff51318944179630522a095cc9dbf9f353 version=v1.7.20
time="2024-12-23T14:36:00.000700964Z" level=info msg="loading plugin \"io.containerd.event.v1.exchange\"..." type=io.containerd.event.v1
time="2024-12-23T14:36:00.000778006Z" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
time="2024-12-23T14:36:00.001422560Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
time="2024-12-23T14:36:00.001454791Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.001564513Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.001590374Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.001613715Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." error="devmapper not configured: skip plugin" type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.001631625Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.001791238Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.002375252Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.018509514Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"ip: can't find device 'aufs'\\nmodprobe: can't change directory to '/lib/modules': No such file or directory\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.018690338Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.019063176Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2024-12-23T14:36:00.019096327Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
time="2024-12-23T14:36:00.019363883Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
time="2024-12-23T14:36:00.019483946Z" level=info msg="metadata content store policy set" policy=shared
time="2024-12-23T14:36:00.025371189Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
time="2024-12-23T14:36:00.025466381Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
time="2024-12-23T14:36:00.025513112Z" level=info msg="loading plugin \"io.containerd.lease.v1.manager\"..." type=io.containerd.lease.v1
time="2024-12-23T14:36:00.025568054Z" level=info msg="loading plugin \"io.containerd.streaming.v1.manager\"..." type=io.containerd.streaming.v1
time="2024-12-23T14:36:00.025598394Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
time="2024-12-23T14:36:00.025975922Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
time="2024-12-23T14:36:00.026418103Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
time="2024-12-23T14:36:00.026879433Z" level=info msg="loading plugin \"io.containerd.runtime.v2.shim\"..." type=io.containerd.runtime.v2
time="2024-12-23T14:36:00.026911204Z" level=info msg="loading plugin \"io.containerd.sandbox.store.v1.local\"..." type=io.containerd.sandbox.store.v1
time="2024-12-23T14:36:00.026934074Z" level=info msg="loading plugin \"io.containerd.sandbox.controller.v1.local\"..." type=io.containerd.sandbox.controller.v1
time="2024-12-23T14:36:00.026959075Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.026981415Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027001746Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027134208Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027172850Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027207840Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027228851Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027250042Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
time="2024-12-23T14:36:00.027280312Z" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027302052Z" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027322173Z" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027343804Z" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027364884Z" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027385374Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027486097Z" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027533828Z" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027563219Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandbox-controllers\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027596579Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandboxes\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027616510Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027644821Z" level=info msg="loading plugin \"io.containerd.grpc.v1.streaming\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027668330Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027701042Z" level=info msg="loading plugin \"io.containerd.transfer.v1.local\"..." type=io.containerd.transfer.v1
time="2024-12-23T14:36:00.027736933Z" level=info msg="loading plugin \"io.containerd.grpc.v1.transfer\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027756793Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.027778393Z" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
time="2024-12-23T14:36:00.027988048Z" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
time="2024-12-23T14:36:00.028029879Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="skip plugin: tracing endpoint not configured" type=io.containerd.tracing.processor.v1
time="2024-12-23T14:36:00.028055519Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
time="2024-12-23T14:36:00.028086000Z" level=info msg="skip loading plugin \"io.containerd.internal.v1.tracing\"..." error="skip plugin: tracing endpoint not configured" type=io.containerd.internal.v1
time="2024-12-23T14:36:00.028110501Z" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
time="2024-12-23T14:36:00.028163852Z" level=info msg="loading plugin \"io.containerd.nri.v1.nri\"..." type=io.containerd.nri.v1
time="2024-12-23T14:36:00.028196603Z" level=info msg="NRI interface is disabled by configuration."
time="2024-12-23T14:36:00.028682933Z" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
time="2024-12-23T14:36:00.028929999Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
time="2024-12-23T14:36:00.029152474Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
time="2024-12-23T14:36:00.029203025Z" level=info msg="containerd successfully booted in 0.073659s"
time="2024-12-23T14:36:01.003481197Z" level=info msg="Loading containers: start."
time="2024-12-23T14:36:02.301133968Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2024-12-23T14:36:02.600676172Z" level=info msg="Loading containers: done."
time="2024-12-23T14:36:02.635377282Z" level=warning msg="[DEPRECATION NOTICE]: API is accessible on http://localhost:2375 without encryption.\n         Access to the remote API is equivalent to root access on the host. Refer\n         to the 'Docker daemon attack surface' section in the documentation for\n         more information: https://docs.docker.com/go/attack-surface/\nIn future versions this will be a hard failure preventing the daemon from starting! Learn more at: https://docs.docker.com/go/api-security/"
time="2024-12-23T14:36:02.635459914Z" level=info msg="Docker daemon" commit=cc13f95 containerd-snapshotter=false storage-driver=overlay2 version=27.1.1
time="2024-12-23T14:36:02.635735160Z" level=info msg="Daemon has completed initialization"
/docker-entrypoint.sh: cannot sudo /run-docker-daemon.sh


--

Ashwanth Kumar / ashwanthkumar.in

Sriram Narayanan

Dec 23, 2024, 10:30:14 AM
to go...@googlegroups.com
Hi Ashwanth,

We have seen this with an older version of EKS; we were on 1.23 for too long.

We faced this issue along with a loss of connectivity.

We manually tested connections to various endpoints other than the GoCD server and saw that the node would lose connectivity to several of them. We connected to the node over AWS Session Manager rather than SSH.
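
A manual check like this can also be scripted. The following is a minimal Python sketch, not what was actually run on the node; the function names `can_connect` and `check_endpoints` and any endpoints passed in are assumptions for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_endpoints(endpoints, timeout: float = 3.0) -> dict:
    """Map "host:port" to a reachability flag for each (host, port) pair."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in endpoints
    }
```

Running `check_endpoints` from the node with a list of (host, port) pairs that the agents depend on, for example a hypothetical `("gocd.example.internal", 8154)`, shows which endpoints are unreachable at that moment.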

Restarting the EC2 worker node was the only way to get connectivity back.

We were able to use these two issues to urge that the EKS upgrade be prioritised. (We are helping the client on their journey to Continuous Delivery and have added the IaC for EKS to our work backlog.)

— Sriram

--
You received this message because you are subscribed to the Google Groups "go-cd" group.
To unsubscribe from this group and stop receiving emails from it, send an email to go-cd+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/go-cd/CAD9m7Cz8R3sZtuNL_T49g_SW%2BqjdGjt087c-6NrbbiqSgh4QYg%40mail.gmail.com.

Ashwanth Kumar

Dec 23, 2024, 10:34:30 AM
to go...@googlegroups.com
Thanks Sriram,

I'm currently on EKS version 1.31 though (that's the latest version).

Ashwanth Kumar

Dec 23, 2024, 11:14:08 AM
to go...@googlegroups.com
A quick update, folks. We recently integrated CrowdStrike Falcon agents into our EKS cluster and noticed that Falcon has a feature called Drift Detection: if a new executable is created and then executed inside a container, Falcon kills or blocks it. In our setup, an executable called "/check" was being created and executed. Falcon killed this process under a drift indicator called "RecentlyModifiedFileExecutedInContainer". I had to disable the "Container drift prevention" policy check so that the GoCD agents no longer hit this issue.

After disabling it, new pods (agents) assigned to the underlying host started working just fine.

Sharing it here in the hope that someone on the internet finds it useful and doesn't have to spend 5+ hours of their life figuring out why a DinD setup is likely to fail in a Falcon-protected environment.

Thanks,

Sriram Narayanan

Dec 23, 2024, 12:22:32 PM
to go...@googlegroups.com
Thanks for sharing this.

It might be worthwhile to understand the relationship between /check and the Docker daemon being unreachable.

For compliance reasons, this particular Falcon setting could get reapplied someday and reintroduce this failure.

— Sriram

Chad Wilson

Dec 23, 2024, 10:50:14 PM
to go...@googlegroups.com
In any case, the log seems to imply the Docker daemon is being forcibly killed before completing startup.
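
To pull the relevant signals out of a pod log like the one posted above, a small script can look for the shell's "Killed" job-status line and for dockerd's "Daemon has completed initialization" message. This is a minimal Python sketch; the function name `summarize` and the shape of the returned dictionary are assumptions, and the pattern is modelled only on the line that appears in the log in this thread.

```python
import re

# Job-status line printed by the shell when a child process dies from SIGKILL,
# as seen in the posted log:
#   /run-docker-daemon.sh: line 23:     9 Killed    $(which dind) dockerd ...
KILLED_RE = re.compile(
    r"^(?P<script>\S+): line (?P<line>\d+):\s+(?P<pid>\d+) Killed\s+(?P<cmd>.*)$"
)
# Message dockerd logs once its startup has finished.
READY_MSG = "Daemon has completed initialization"


def summarize(log_text: str) -> dict:
    """Report killed jobs and whether dockerd logged a completed startup."""
    killed = []
    daemon_initialized = False
    for line in log_text.splitlines():
        match = KILLED_RE.match(line)
        if match:
            killed.append(
                {
                    "script": match.group("script"),
                    "script_line": int(match.group("line")),
                    "pid": int(match.group("pid")),
                    "command": match.group("cmd"),
                }
            )
        elif READY_MSG in line:
            daemon_initialized = True
    return {"killed": killed, "daemon_initialized": daemon_initialized}
```

In the log posted above, both signals appear: the shell reports the dockerd job (PID 9) as killed, and the dumped dockerd.log also contains the "Daemon has completed initialization" message.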

I'm not aware of the Docker daemon creating an executable file like "/check" that it then runs as an important part of its startup. It seems possible that some context is missing here, or that this is coming from something else specific to your nodes/containers?

Nevertheless, I can imagine that a DIND setup is, in a sense, the exact opposite of what "container drift prevention" expects to see. Docker by design downloads arbitrary executables into layered filesystems, writes them and then executes them. If you are mounting a host socket into these pods, it is even harder for something like CrowdStrike Falcon to make sense of.

-Chad

Chad Wilson

Dec 24, 2024, 12:20:01 AM
to go...@googlegroups.com
Nevertheless, you may want to upgrade to 24.5.0 (for the agent base image) to pick up the fix in https://github.com/gocd/gocd/pull/13321 if you are running the dind images on AL2023 EKS nodes, even though the "fix" itself is not 100% reliable until the race condition can be addressed upstream within moby/Docker.

I don't see the telltale signs of that particular issue in your setup, but when you add privileged tools like Falcon to nodes, who knows what's going on :-)

-Chad