GoCD AWS ECS Elastic Agent allocation is failing

pradeep devaraj

Sep 2, 2024, 12:25:48 PM

to go-cd

We are using the GoCD AWS ECS elastic agent plugin.

GoCD version: 23.4.0

GoCD Elastic Agent Plugin for Amazon ECS version: 7.3.0-416

AMI ID: ami-0ba9fb6bc8faf1fe0

The elastic agent instance comes up but does not register with the ECS cluster. We logged in to the server and found the error below.

[root@ip-******* ~]# systemctl restart docker
Job for docker.service failed because start of the service was attempted too often. See "systemctl status docker.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed docker.service" followed by "systemctl start docker.service" again.
[root@ip- ******* ~]# journalctl -xe
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ecs.service has finished shutting down.
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: start request repeated too quickly for docker.service
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has failed.
--
-- The result is failed.
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: docker.service failed.
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: Starting Amazon Elastic Container Service - container agent...
-- Subject: Unit ecs.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ecs.service has begun starting up.
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: ecs.service: control process exited, code=exited status=1
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon amazon-ecs-init[6236]: level=info time=2024-09-02T16:03:20Z msg="post-stop"
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon amazon-ecs-init[6236]: level=info time=2024-09-02T16:03:20Z msg="Cleaning up the credentials endpoint setup for Amazon El
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon amazon-ecs-init[6236]: level=error time=2024-09-02T16:03:20Z msg="Error performing action 'delete' for iptables route: ex
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon amazon-ecs-init[6236]: level=error time=2024-09-02T16:03:20Z msg="Error performing action 'delete' for iptables route: ex
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon amazon-ecs-init[6236]: level=error time=2024-09-02T16:03:20Z msg="Error performing action 'delete' for iptables route: ex
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon amazon-ecs-init[6236]: level=error time=2024-09-02T16:03:20Z msg="Error performing action 'delete' for iptables route: ex
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: Failed to start Amazon Elastic Container Service - container agent.
-- Subject: Unit ecs.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ecs.service has failed.
--
-- The result is failed.
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: Unit ecs.service entered failed state.
Sep 02 16:03:20 ip-10-226-11-63.aws.cloud.epsilon systemd[1]: ecs.service failed.

[root@ipXXXX ~]# df -hT
Filesystem     Type      Size Used Avail Use% Mounted on
devtmpfs       devtmpfs 7.7G     0 7.7G   0% /dev
tmpfs          tmpfs     7.7G     0 7.7G   0% /dev/shm
tmpfs          tmpfs     7.7G 376K 7.7G   1% /run
tmpfs          tmpfs     7.7G     0 7.7G   0% /sys/fs/cgroup
/dev/nvme0n1p1 xfs       100G 2.4G   98G   3% /
tmpfs          tmpfs     1.6G     0 1.6G   0% /run/user/0
[root@ip-10-226-11-63 ~]# docker --version
Docker version 25.0.5, build 5dc9bcc

Below is the user data script we are using; it exits with an error while the instance is spinning up.

"ECS_INSTANCE_ATTRIBUTES={"server-id":"31e424ad-e242-45d2-a5bb-0ef7be0d8306"}
EOT
echo 'File /etc/ecs/ecs.config successfully created.'
log "Finished executing GoCD's user data script, now executing custom user data script from use, if present."
#!/bin/bash
echo "ECS_CLUSTER=GoCD-ECS-UAT" >> /etc/ecs/ecs.config
log "Finished executing user specified user data script."
--//
#cloud-config
cloud_final_modules:
- [scripts-user, always]
--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="initialize_instance_store"
#!/bin/bash
exec > >(tee /var/log/initialize_instance_store.log | logger -t user-data -s 2>/dev/console) 2>&1
function log() {
  echo "[$(date "+%Y-%m-%d %H:%M:%S")] - $1" >> /var/log/initialize_instance_store.log
}
function try() {
  $@
  return 0
}
log "Starting to setup instance store for the docker."
INSTANCE_STORES=$(ls /dev/disk/by-id/*EC2_NVMe_Instance_Storage*-ns-1)
if [ -z "${INSTANCE_STORES}" ]; then
  log "No instance store detected."
fi
VOLUMES="$INSTANCE_STORES"
if [ -e "/dev/xvdcz" ]; then
  log "Instance has /dev/xvdcz EBS volume. Using it for docker logical volume group."
  VOLUMES="$VOLUMES /dev/xvdcz"
fi
if [ -z "${VOLUMES}" ]; then
  log "No addition volumes. Using box standard docker setup."
else
  log "Available instance stores: ${VOLUMES}."
  log "Setting up the docker logical volume group."
  service docker stop
  rm -rf /var/lib/docker/*
  dmsetup remove_all
  VOLUME_GROUP=docker
  LOGICAL_VOLUME=docker-pool
  try vgremove -y "${VOLUME_GROUP}"
  try lvremove -y "${LOGICAL_VOLUME}"
  vgcreate -y "${VOLUME_GROUP}" ${VOLUMES}
  sleep 2
  lvcreate -y -l 5%VG -n ${LOGICAL_VOLUME}\meta ${VOLUME_GROUP}
  lvcreate -y -l 90%VG -n ${LOGICAL_VOLUME} ${VOLUME_GROUP}
  sleep 2
  lvconvert -y --zero n --thinpool ${VOLUME_GROUP}/${LOGICAL_VOLUME} --poolmetadata ${VOLUME_GROUP}/${LOGICAL_VOLUME}\meta
  echo 'DOCKER_STORAGE_OPTIONS=" --storage-driver devicemapper --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true --storage-opt dm.fs=ext4 --storage-opt dm.use_deferred_deletion=true"' > /etc/sysconfig/docker-storage
  test -f /bin/systemctl && systemctl reset-failed docker.service
  service docker restart
  test -f /bin/systemctl && systemctl enable --no-block --now ecs
fi
log "Setup completed."
--//"
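
One detail in the script above worth checking is the storage driver it configures. The sketch below pulls the `--storage-driver` value out of an options string like the one written to /etc/sysconfig/docker-storage. The helper name `storage_driver` and the shortened sample string are illustrative assumptions, not part of the plugin. Docker Engine 25.0 is documented as having removed the deprecated devicemapper driver, so a `devicemapper` result on a Docker 25 host would be one plausible reason for docker.service failing to start; treat that as a hypothesis to confirm against the Docker logs.

```shell
#!/bin/sh
# Hypothetical helper: print the value that follows --storage-driver
# in a Docker options string. Prints nothing if the flag is absent.
storage_driver() {
  printf '%s\n' "$1" |
    sed -n 's/.*--storage-driver[ =]\([A-Za-z0-9_-]*\).*/\1/p'
}

# Shortened sample of the line the user data script writes (assumption:
# trimmed here for readability; the real line carries more --storage-opt flags).
opts='DOCKER_STORAGE_OPTIONS=" --storage-driver devicemapper --storage-opt dm.fs=ext4"'

storage_driver "$opts"   # prints: devicemapper
```

On an affected instance, `storage_driver "$(cat /etc/sysconfig/docker-storage)"` would apply the same check to the real file.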

pradeep devaraj

Sep 2, 2024, 1:51:06 PM

to go-cd

Adding++

We are seeing agent creation and deletion in a loop:

[go] Received a request to create an agent for the job: [SpecOps_UAT_Elastic_Img_crt/6/test/1/test]

[go] No running instance(s) found to build the ECS Task to perform current job.

[go] Creating a new container instance to schedule ECS Task.

[go] Waiting for instance(s) ([i-061187c3d2ea07317]) to register with cluster.

[go] Received a request to create an agent for the job: [SpecOps_UAT_Elastic_Img_crt/6/test/1/test]

[go] No running instance(s) found to build the ECS Task to perform current job.

[go] Creating a new container instance to schedule ECS Task.

[go] Waiting for instance(s) ([i-00bb68d594121ab15]) to register with cluster.

[go] Received a request to create an agent for the job: [SpecOps_UAT_Elastic_Img_crt/6/test/1/test]

[go] No running instance(s) found to build the ECS Task to perform current job.

[go] Creating a new container instance to schedule ECS Task.

pradeep devaraj

Sep 3, 2024, 7:23:52 AM

to go-cd

Hi Team / Chad Wilson.

The Docker service and the ECS service are failing when a new server comes up. AMI ID: ami-0a5f593ecaa0f722d (a community AMI). When we manually spin up the server and attach it via the ASG, it registers with the cluster. When we try the same from the GoCD ECS cluster profile (AWS ECS Elastic Agent plugin), it does not work, and the Docker and ECS services fail.

Sriram Narayanan

Sep 3, 2024, 8:36:02 AM

to go...@googlegroups.com

(I am ill, so please excuse the limited questions.)

- does the ECS consumer get created and registered if you remove the user data script?

- what changed between when this ECS used to work vs now?

— Sriram

--
You received this message because you are subscribed to the Google Groups "go-cd" group.
To unsubscribe from this group and stop receiving emails from it, send an email to go-cd+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/go-cd/763a2904-4962-4c8b-ae2a-b8bf72701e5bn%40googlegroups.com.

pradeep devaraj

Sep 3, 2024, 9:55:00 AM

to go-cd

Hi Sriram,

- does the ECS consumer get created and registered if you remove the user data script? Yes.

We used the marketplace AMI ami-0a5f593ecaa0f722d. If we create the server manually via the launch template and add it to the ECS cluster, it works. If we do the same via GoCD (GoCD Elastic Agent Plugin for Amazon ECS), it fails, and Docker and ECS are not running.
Docker version: Docker version 25.0.5, build 5dc9bcc

- what changed between when this ECS used to work vs now?

Nothing has changed; it was working until last Thursday night.

Chad Wilson

Sep 4, 2024, 4:02:47 AM

to go...@googlegroups.com

Something must have changed, e.g. you changed the AMI, or instances now upgrade pre-installed software to different versions during cloud-init when they start. In future, you need to share the specific name of the AMI, the release date, the region, etc. An AMI ID on its own is not useful to look up.

The plugin doesn't work with Docker 25, so I doubt it was using the same AMI before - did you see https://github.com/gocd/gocd-ecs-elastic-agent/issues/345 ? You'll have to find/use an Amazon Linux 2 (not 2023) AMI which still has Docker 20.10 pre-installed until the plugin can be modified to support Docker 25.
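
As a rough guard, the Docker major version can be read from `docker --version` output before an AMI is used with the plugin. This is a minimal sketch: the function names are hypothetical, and the cutoff of 25 is an assumption drawn from this thread (20.10 reported working, 25 reported failing with plugin 7.3.0-416); versions in between are not covered here.

```shell
#!/bin/sh
# Hypothetical helper: extract the major version from `docker --version` output.
docker_major() {
  printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9]*\)\..*/\1/p'
}

# Succeeds only for a non-empty major version below 25.
# Assumption: 25 is the first major version that plugin 7.3.0-416 cannot use.
below_docker_25() {
  [ -n "$1" ] && [ "$1" -lt 25 ]
}

ver='Docker version 25.0.5, build 5dc9bcc'   # sample line from this thread
major=$(docker_major "$ver")

if below_docker_25 "$major"; then
  echo "Docker $major: below the assumed cutoff"
else
  echo "Docker $major: at or above the assumed cutoff"
fi
```

On a live instance, replacing the sample with `ver=$(docker --version)` would check the installed engine instead.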

According to https://alas.aws.amazon.com/announcements/2024-009.html, as of September 3 a yum upgrade --security on AL2 will upgrade Docker to Docker 25, which would break the plugin. If you are using a new ECS AMI, it is likely pre-upgraded. In addition, the last AL2 AMI that will work is https://github.com/aws/amazon-ecs-ami/releases/tag/20240625

Any Amazon Linux 2 ECS AMIs newer than 2024-06-25 will not work, as Docker has been upgraded to v25.
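
Because the AL2 ECS AMI names in this thread embed a release date, the cutoff can be checked mechanically. The sketch below extracts the eight-digit stamp from a name and compares it with 20240625, the release tag linked above. The function names are hypothetical, and the pattern assumes names of the form `...-2.0.YYYYMMDD-...`, which is what the names quoted in this thread look like.

```shell
#!/bin/sh
# Hypothetical helper: pull the 8-digit date stamp out of an AL2 ECS AMI name,
# e.g. amzn2-ami-ecs-hvm-2.0.20240821-x86_64-ebs -> 20240821
ami_datestamp() {
  printf '%s\n' "$1" | sed -n 's/.*-2\.0\.\([0-9]\{8\}\)-.*/\1/p'
}

# Succeeds if the AMI's date stamp is at or before the 20240625 release.
# Assumption: the cutoff comes from this thread, not from AWS documentation.
ami_within_cutoff() {
  stamp=$(ami_datestamp "$1")
  [ -n "$stamp" ] && [ "$stamp" -le 20240625 ]
}

for name in \
  amzn2-ami-ecs-hvm-2.0.20240201-x86_64-ebs \
  amzn2-ami-ecs-kernel-5.10-hvm-2.0.20240625-x86_64-ebs \
  amzn2-ami-ecs-hvm-2.0.20240821-x86_64-ebs
do
  if ami_within_cutoff "$name"; then
    echo "$name: at or before cutoff"
  else
    echo "$name: newer than cutoff"
  fi
done
```

Passing this check only means the AMI predates the Docker 25 upgrade; it does not by itself confirm that the AMI works with the plugin.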

Since the plugin is still working for https://build.gocd.org which uses the ECS plugin, it's definitely possible to have it work - but it does mean using an unpatched ECS image, or managing the patching yourself to upgrade everything except Docker.
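
For the "patch everything except Docker" option, one possible approach is yum's `--exclude` flag. The sketch below only builds and prints the command so it can be reviewed first. The function name is hypothetical, and excluding just `docker*` is an assumption: related packages such as containerd may also need to be held back, which is not verified here.

```shell
#!/bin/sh
# Hypothetical helper: build a security-only upgrade command that skips the
# given package globs. It prints the command rather than running it.
security_upgrade_cmd() {
  cmd="yum upgrade -y --security"
  for pkg in "$@"; do
    # Single quotes keep the glob from being expanded by the shell
    # when the printed command is pasted and run.
    cmd="$cmd --exclude='$pkg'"
  done
  printf '%s\n' "$cmd"
}

security_upgrade_cmd 'docker*'
# prints: yum upgrade -y --security --exclude='docker*'
```

Running the printed command on a test instance and re-checking `docker --version` afterwards would show whether the exclusion held.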

-Chad

Chad Wilson

Sep 4, 2024, 4:16:32 AM

to go...@googlegroups.com

With some trial and error, it seems these are us-east-1 AMIs. The last one you shared (2024-08-21) is indeed too new; it won't work.

{
  "body": {
    "VirtualizationType": "hvm",
    "Description": "Amazon Linux AMI 2.0.20240821 x86_64 ECS HVM GP2",
    "Hypervisor": "xen",
    "ImageOwnerAlias": "amazon",
    "EnaSupport": true,
    "SriovNetSupport": "simple",
    "ImageId": "ami-0a5f593ecaa0f722d",
    "State": "available",
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "DeleteOnTermination": true,
          "SnapshotId": "snap-0dc7b37b7792952a7",
          "VolumeSize": 30,
          "VolumeType": "gp2",
          "Encrypted": false
        }
      }
    ],
    "Architecture": "x86_64",
    "ImageLocation": "amazon/amzn2-ami-ecs-hvm-2.0.20240821-x86_64-ebs",
    "RootDeviceType": "ebs",
    "OwnerId": "591542846629",
    "RootDeviceName": "/dev/xvda",
    "CreationDate": "2024-08-22T20:53:11.000Z",
    "Public": true,
    "ImageType": "machine",
    "Name": "amzn2-ami-ecs-hvm-2.0.20240821-x86_64-ebs"
  }
}

The earlier one you shared is this one: (2024-02-01)

{
  "body": {
    "VirtualizationType": "hvm",
    "Description": "Amazon Linux AMI 2.0.20240201 x86_64 ECS HVM GP2",
    "Hypervisor": "xen",
    "ImageOwnerAlias": "amazon",
    "EnaSupport": true,
    "SriovNetSupport": "simple",
    "ImageId": "ami-0ba9fb6bc8faf1fe0",
    "State": "available",
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "DeleteOnTermination": true,
          "SnapshotId": "snap-0ca36cd61121c93d2",
          "VolumeSize": 30,
          "VolumeType": "gp2",
          "Encrypted": false
        }
      }
    ],
    "Architecture": "x86_64",
    "ImageLocation": "amazon/amzn2-ami-ecs-hvm-2.0.20240201-x86_64-ebs",
    "RootDeviceType": "ebs",
    "OwnerId": "591542846629",
    "RootDeviceName": "/dev/xvda",
    "CreationDate": "2024-02-03T00:52:53.000Z",
    "Public": true,
    "ImageType": "machine",
    "Name": "amzn2-ami-ecs-hvm-2.0.20240201-x86_64-ebs"
  }
}

This second one might work, as it at least had Docker 20.10 on it, but since you have shared two different AMIs and I'm not sure which log is from which, I don't know what the problem is here.

https://build.gocd.org is using amzn2-ami-ecs-kernel-5.10-hvm-2.0.20240625-x86_64-ebs so this one definitely works. Find the AMI ID for your region (us-east-1 it seems) and try that?
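
Looking up the AMI ID for a given name can be done with the AWS CLI's `describe-images` call filtered by name. This is a minimal sketch: the wrapper function and its `DRY_RUN` switch are hypothetical conveniences, and `--owners amazon` is an assumption based on the `ImageOwnerAlias` shown in the JSON above. The real call needs the AWS CLI installed and credentials configured.

```shell
#!/bin/sh
# Hypothetical helper: print the AMI ID for an AMI name in one region.
# With DRY_RUN=1 it prints the command instead of calling AWS.
find_ami_id() {
  ami_region="$1"
  ami_name="$2"
  set -- aws ec2 describe-images \
    --region "$ami_region" \
    --owners amazon \
    --filters "Name=name,Values=$ami_name" \
    --query 'Images[0].ImageId' \
    --output text
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # For review only: the printed form drops the original quoting.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Dry run: show the command that would be executed.
DRY_RUN=1 find_ami_id us-east-1 amzn2-ami-ecs-kernel-5.10-hvm-2.0.20240625-x86_64-ebs
```

Without `DRY_RUN=1`, the same call prints the AMI ID, which can then be set in the GoCD cluster profile.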

-Chad

Chad Wilson

Dec 9, 2024, 10:30:25 AM

to go...@googlegroups.com

This problem should be fixed now with the new plugin version here (or a later release) https://github.com/gocd/gocd-ecs-elastic-agent/releases/tag/v8.0.0-775

Validated with Amazon Linux 2023 / Docker 25.0.6 via al2023-ami-ecs-hvm-2023.0.20241115-kernel-6.1-x86_64 (and the arm64 version).

-Chad
