Elastic agents plugin usage


Satya Elipe

Apr 25, 2024, 10:01:48 AM
to go...@googlegroups.com
Hi All

I'm encountering some issues with the way Elastic agents are launched, assigned, and terminated. Despite setting the maximum agent count to two, both agents launch sequentially, with only the first being assigned to the job.


Here's where it gets tricky: when the staging job completes and triggers the production job, I expect one of the active agents to take over. Instead, the production job attempts to launch new agents, fails due to the max count limit, and runs without any agents, leading to failure.


Additionally, some agent instances remain active for an extended period, requiring manual termination. This disrupts the workflow significantly.


Has anyone experienced similar issues, or does anyone have suggestions for a workaround?


Thanks in advance!

Sriram Narayanan

Apr 25, 2024, 10:21:56 AM
to go...@googlegroups.com
On Thu, Apr 25, 2024 at 10:01 PM Satya Elipe <satya...@gmail.com> wrote:
Hi All

I'm encountering some issues with the way Elastic agents are launched, assigned, and terminated. Despite setting the maximum agent count to two, both agents launch sequentially, with only the first being assigned to the job.


Do you want the job to run on both the agents? If so, then these instructions will help you: https://docs.gocd.org/current/advanced_usage/admin_spawn_multiple_jobs.html
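If the goal is simply for both agents to do work in parallel, running multiple instances of the job (as described on that page) is the usual approach. A minimal sketch in gocd-yaml-config-plugin syntax, with all pipeline, profile and script names made up:

```
format_version: 10
pipelines:
  example-pipeline:
    group: examples
    materials:
      repo:
        git: https://example.com/repo.git
    stages:
      - test-stage:
          jobs:
            test-job:
              # spawn two instances of this job; each instance can be assigned its own (elastic) agent
              run_instance_count: 2
              elastic_profile_id: staging-ec2-profile
              tasks:
                - exec:
                    command: ./run-tests.sh
```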
 


Here's where it gets tricky: when the staging job completes and triggers the production job, I expect one of the active agents to take over. Instead, the production job attempts to launch new agents, fails due to the max count limit, and runs without any agents, leading to failure.



Do the various jobs have an elastic profile ID set?

What is the error that you see due to the max count limit? 

When you say "staging job", do you have a stage in a pipeline called "staging" with one job in it? Or do you have a stage in a pipeline with one job called "staging" and the other called "prod"?

Could you share how your pipelines are composed? I'm especially asking this since many new users tend to use GoCD after using other tools and carry over some of the terminology but also the constraints. If you share your pipeline structure and what you want to achieve, then we can design something together.
 

Additionally, some agent instances remain active for an extended period, requiring manual termination. This disrupts the workflow significantly.



On our cluster, we see the pods being activated upon need, then the relevant job runs in the pod, and the pod is then deactivated. We are sticking to the default of "10 pods" right now, and will be increasing the limit after certain parallel-load reviews. 

Could you share your Cluster Profile and the Elastic Profile? Please take care to obfuscate any org-specific information such as IP addresses, hostnames, AWS ARNs, URLs, etc.
 

Has anyone experienced similar issues, or does anyone have suggestions for a workaround?


Thanks in advance!


Chad Wilson

Apr 25, 2024, 10:33:06 AM
to go...@googlegroups.com
Can you be specific about the type of elastic agents you are creating and the plugin you are using? Kubernetes? Docker? Something else? There are many elastic agent plugins.



Here's where it gets tricky: when the staging job completes and triggers the production job, I expect one of the active agents to take over. Instead, the production job attempts to launch new agents, fails due to the max count limit, and runs without any agents, leading to failure.


I believe elastic agents are generally launched sequentially - i.e. a new one won't be launched until there are no pending-launch ones - but this depends on the specific elastic agent type.

If you are new to elastic agents, you'll want to be aware that in almost all elastic agent plugin variants the elastic agents are single-shot (single-job) and are not re-used. The specific type of elastic agent and its plugin implementation defines how it handles such things, though, so we'd need specifics to guess.

Look at the specific elastic agent plugin's log on the server to see what it is doing. Perhaps your elastic agents are not shutting down automatically due to a configuration issue or a problem with the jobs you are running?
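If it helps, on a Linux package install of the server the plugin logs usually live under the server's log directory, one file per plugin (the exact file name depends on the plugin id); a quick way to find and follow the right one:

```
# On the GoCD server host (paths assume a Linux package install; adjust for your setup).
# Each plugin writes its own log named plugin-<plugin.id>.log.
ls /var/log/go-server/plugin-*.log
# Then follow the file belonging to your elastic agent plugin, for example:
tail -f /var/log/go-server/plugin-*elastic*.log
```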

-Chad


Satya Elipe

Apr 25, 2024, 10:49:16 AM
to go...@googlegroups.com
Thank you, Sriram.
Please find my comments below. 

>Do the various jobs have an elastic profile ID set?
Yes, we have two environments, staging and prod, so we have a separate profile set for each.

Here is pretty much what each profile has:
  1. ec2_ami
  2. ec2_instance_profile
  3. ec2_subnets
  4. ec2_instance_type
  5. ec2_key
  6. ec2_user_data (what this produces on the agent is sketched just after this list)
    echo "agent.auto.register.environments=staging,sandbox" | sudo tee -a /var/lib/go-agent/config/autoregister.properties > /dev/null
  7. ec2_sg
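
For clarity, here is roughly what the agent's autoregister.properties ends up containing after that user_data runs. Only the environments line comes from the echo above; the other keys are standard autoregister properties shown with placeholder values, and what the plugin actually writes may differ:
```
agent.auto.register.key=<autoregister key>
agent.auto.register.environments=staging,sandbox
agent.auto.register.elasticAgent.pluginId=<ec2 elastic agent plugin id>
agent.auto.register.elasticAgent.agentId=<ec2 instance id>
```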

>What is the error that you see due to the max count limit? 
```
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:21:38 +00:00
[go] Successfully created new instance i-093b44f70992505cc in subnet-555bba0d
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:23:38 +00:00
[go] The number of instances currently running is currently at the maximum permissible limit, "2". Not creating more instances for jobs: brxt-core-service-deploy-staging/86/prepare-for-deploy-stage/1/prepare-for-deploy-job, brxt-core-service-deploy-staging/86/deploy-stage/1/deploy-job, brxt-core-service-deploy-staging/86/verify-stage/1/verify-job, brxt-config-service-deploy-staging/18/deploy-stage/1/deploy-job, brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:25:39 +00:00
[go] The number of instances currently running is currently at the maximum permissible limit, "2". Not creating more instances for jobs: brxt-core-service-deploy-staging/86/prepare-for-deploy-stage/1/prepare-for-deploy-job, brxt-core-service-deploy-staging/86/deploy-stage/1/deploy-job, brxt-core-service-deploy-staging/86/verify-stage/1/verify-job, brxt-config-service-deploy-staging/18/deploy-stage/1/deploy-job, brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:27:39 +00:00
[go] The number of instances currently running is currently at the maximum permissible limit, "2". Not creating more instances for jobs: brxt-core-service-deploy-staging/86/prepare-for-deploy-stage/1/prepare-for-deploy-job, brxt-core-service-deploy-staging/86/deploy-stage/1/deploy-job, brxt-core-service-deploy-staging/86/verify-stage/1/verify-job, brxt-config-service-deploy-staging/18/deploy-stage/1/deploy-job, brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:39:58 +00:00
[go] The number of instances currently running is currently at the maximum permissible limit, "2". Not creating more instances for jobs: brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:41:56 +00:00
[go] Successfully created new instance i-0ca1b2dc4996c210b in subnet-555bba0d
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:43:56 +00:00
[go] Successfully created new instance i-0bc0bf6e763b6ebf0 in subnet-555bba0d
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:45:56 +00:00
[go] The number of instances currently running is currently at the maximum permissible limit, "2". Not creating more instances for jobs: brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job at 2024-04-09 11:47:56 +00:00
[go] The number of instances currently running is currently at the maximum permissible limit, "2". Not creating more instances for jobs: brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
Go cancelled this job as it has not been assigned an agent for more than 10 minute(s)
```

That all happened as you can see in the log: we ended up with two instances running, but neither was assigned to the job, and the job eventually failed.

>When you say "staging job", do you have a stage in a pipeline called "staging" with one job in it? Or do you have a stage in a pipeline with one job called "staging" and the other called "prod"?
Attached is a snippet from the dashboard showing one of our pipelines: triggering the build job in turn triggers the second, and the second triggers the third.

Please let me know if I have missed anything or if you need more detail on anything specific. Thank you.

Regards
Satya


Screenshot 2024-04-25 at 15.43.02.png

Satya Elipe

Apr 25, 2024, 11:23:55 AM
to go...@googlegroups.com
Hi Chad

We use the EC2 elastic agent plugin; attached is a screenshot from the server.
It works well in general, but with the caveats mentioned.

Thanks
Satya



Screenshot 2024-04-25 at 15.51.28.png

Chad Wilson

Apr 25, 2024, 11:42:18 AM
to go...@googlegroups.com
It seems you are using the third-party EC2 Elastic Agent Plugin. This plugin does not implement the plugin API correctly and does not support environments correctly, which is presumably why you have that user_data hack.

From a quick look at the code, it seems this plugin only runs a single job on an EC2 instance before terminating it, so you shouldn't expect re-use across multiple jobs. If you want to know more about how it is designed, you are better off asking on its GitHub repo.

I don't know 100% why your agents aren't shutting down correctly; you probably need to look at the plugin logs (on both the server and the agent itself) to investigate.

However, since it looks like you have an ec2_user_data hack in place to get some environment support with the plugin, you need to manually make sure that the environments in the config agent.auto.register.environments=staging,sandbox exactly match the possible pipeline environments for all possible jobs you assign to this elastic agent profile ID.

I also think having multiple environments registered here will possibly cause chaos, because that is not how elastic agents manage environments normally. They normally register only a single environment.

The problem is that if any single job on your GoCD is assigned to, say, profile "elastic_profile_staging" with the autoregister config like you have below, but that job is configured for a pipeline inside a GoCD environment called "other_env", an elastic agent will start but never get assigned the job. This is because it has registered only for "staging,sandbox" via your hardcoded user_data, NOT for "other_env".
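
To make the coupling concrete, here is a rough config-as-code sketch (gocd-yaml-config-plugin syntax, all names made up) of a consistent setup: the pipeline sits in the "staging" GoCD environment, and its job uses an elastic profile whose user_data registers the agent for "staging", so the launched agent can actually be assigned the job:

```
format_version: 10
environments:
  staging:
    pipelines:
      - example-deploy-staging
pipelines:
  example-deploy-staging:
    group: deployments
    materials:
      repo:
        git: https://example.com/repo.git
    stages:
      - deploy-stage:
          jobs:
            deploy-job:
              # this profile's user_data must register agent.auto.register.environments=staging
              elastic_profile_id: elastic_profile_staging
              tasks:
                - exec:
                    command: ./deploy.sh
```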

This breaks the elastic agent plugin contract - GoCD thinks it has already told the plugin to create an agent for "other_env", but it never does. Now GoCD is confused as to what is happening with the agents. Thus the job will likely never get assigned, the plugin will never complete a job, and it will never shut down the EC2 instance. Perhaps this is what is happening to you? You might want to check whether you have EC2 instances whose agent logs show them doing no work, or whether you have mismatched environments and elastic profiles.

With a correctly behaving plugin, GoCD tells the plugin a single environment to register for (the one it needs a new agent to run a job on), and expects the plugin to register for that environment. This EC2 plugin breaks that contract, which makes it very easy to misconfigure things and create all sorts of problems. Personally I wouldn't use it if I were using GoCD environments, but that's your decision to make.

-Chad
 