How to cancel pipeline scheduling for non-responsive agents

Carl Reid

Oct 7, 2014, 10:03:36 AM
to go...@googlegroups.com
We use workstations as additional build and test agents due to them having the correct software installed (Visual Studio) and being quite powerful machines.

However since these are workstations users often turn them off! 

When a pipeline job is set to run on all agents the pipeline schedule seems to include the agents of the machines that are turned off and never completes until the machine is brought back online. The problem with this is that a new execution of the pipeline will not start until the previous one has completed. This requires us to manually go into each pipeline and cancel the execution. Not a great situation especially when there are a large number of pipelines that this occurs on.

My questions are:

  1. Is there a way of preventing GO from scheduling an "all agents" job on a non-responsive or missing agent?
  2. Is there a way of getting GO to time out and cancel the pipeline scheduling for non-responsive or missing agents?

I have tried setting the job timeout, however this only seems to apply once a job has started.


Thanks in advance.

Carl



Carl Reid

Oct 13, 2014, 1:13:27 PM
to go...@googlegroups.com
Anyone got any thoughts on this issue?

srinivas upadhya

Oct 13, 2014, 1:47:09 PM
to Carl Reid, go...@googlegroups.com
> We use workstations as additional build and test agents due to them having the correct software installed (Visual Studio) and being quite powerful machines.

That's neat.

> However since these are workstations users often turn them off!

Can they disable the agent before bringing the machine down? Maybe someone can write a script that disables the agent using the Agent API and then shuts the machine down.

> When a pipeline job is set to run on all agents the pipeline schedule seems to include the agents of the machines that are turned off and never completes until the machine is brought back online. The problem with this is that a new execution of the pipeline will not start until the previous one has completed. This requires us to manually go into each pipeline and cancel the execution. Not a great situation especially when there are a large number of pipelines that this occurs on.

Maybe you can use the Pipeline Groups API (>14.3) and the Stage API to find and cancel them?

> My questions are:
>
>   1. Is there a way of preventing GO from scheduling an "all agents" job on a non-responsive or missing agent?

It's by design. The Run-On-All-Agents feature was developed with parallel production deploys in mind. So if you have 20 nodes with 20 agents managing deployments to them, but one of them is down, Go should still schedule the job for it; otherwise the deployment would happen on only 19 nodes and still go green!

>   2. Is there a way of getting GO to time out and cancel the pipeline scheduling for non-responsive or missing agents?

That's a bug :-| This should be fixed. I think it's logged. @arika?

> I have tried setting the job timeout, however this only seems to apply once a job has started.



--
You received this message because you are subscribed to the Google Groups "go-cd" group.
To unsubscribe from this group and stop receiving emails from it, send an email to go-cd+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Carl Reid

Mar 17, 2015, 11:18:00 AM
to go...@googlegroups.com
I finally started looking at this today because the situation is getting worse. We are constantly having to cancel stages manually because agents are unavailable.

We have a large number of agents on machines that are turned off at night or that get rebuilt often and end up as "lost contact". It is not really possible to script the agents being uninstalled or removed, so I need to work out a solution using the APIs. I thought about looking through all the agents that are "lost contact" and disabling them; however, if these machines are then brought back online they will still be disabled! I would then need to somehow work out which agents are disabled but "responsive".

If I instead look at an approach based on pipelines, stages and jobs, I need to work out which jobs are waiting for an agent that is "unresponsive" or "lost contact" and cancel that stage. Looking through the APIs I don't see an easy way of doing this.

According to the documentation, the jobs API is for jobs which are scheduled but have not been assigned. In my case the job has been assigned but the agent is unresponsive.

Are there any API calls that I can use to find these stages and cancel them?

Thanks

Carl





Zabil C M

Mar 18, 2015, 1:22:30 AM
to Carl Reid, go...@googlegroups.com
Hi Carl,

There's an unsupported API (will be deprecated/deleted 3 or 4 releases from now) to cancel stages. 

$ curl -d id=[stageId] http://[user]:[pass]@[goserver]/go/admin/stage/cancel.json


If you are running Go 14.3+ you can get the stage id and status from the pipeline history API:

$ curl http://[user]:[pass]@[goserver]/go/api/pipelines/[pipelineName]/history
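
For anyone scripting this rather than using curl, the two calls above could be wrapped roughly as follows. This is only a sketch in Python: the helper names, the server address and the stage id are made up for illustration, authentication is left out, and the only endpoints used are the two quoted in this post. It builds the requests without sending them.

```python
import urllib.parse
import urllib.request


def history_request(server, pipeline_name):
    """Build a GET request for the pipeline history endpoint quoted above."""
    quoted = urllib.parse.quote(pipeline_name, safe="")
    return urllib.request.Request(f"{server}/go/api/pipelines/{quoted}/history")


def cancel_request(server, stage_id):
    """Build the POST that `curl -d id=[stageId]` sends to cancel.json."""
    body = urllib.parse.urlencode({"id": stage_id}).encode("ascii")
    return urllib.request.Request(
        f"{server}/go/admin/stage/cancel.json", data=body, method="POST"
    )


if __name__ == "__main__":
    # Hypothetical server address and stage id, for illustration only.
    req = cancel_request("http://goserver:8153", 42)
    print(req.get_method(), req.full_url, req.data)
```

Passing either request to `urllib.request.urlopen` would actually send it; credentials would need to be added first, for example through an `Authorization` header.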

Hope that helps.


Carl Reid

Mar 18, 2015, 8:00:17 AM
to go...@googlegroups.com, carland...@gmail.com
Thank you for the reply.

Whilst I am keen to use an API approach to do this I am less keen on using an API that is going to be deprecated for obvious reasons.
Is there going to be a replacement for this? Are there new APIs in the works?

And also, how would I find out which stages I should cancel? Is there an API that can show me stages that contain jobs that are assigned but not started (i.e. assigned to agents that are non-responsive?)

Another thought I had to prevent this from happening is to create a scheduled task that runs frequently, looking for agents that have been "lost contact" for more than, say, 10 minutes, and disables them.
I can then add a scheduled task to each of our servers and workstations to "enable" themselves if they are not currently enabled. This way agents on machines that are available and have connectivity to the GO Server will always be enabled and those that are turned off or lose connectivity will be disabled. It won't help pipelines that have already been scheduled however it will make the problem smaller.
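
The "lost contact for more than 10 minutes" part of this idea could be sketched as below. This is an illustration in Python under stated assumptions, not a tested solution: the class name, the agent ids and the exact status string "lost contact" are hypothetical, and how the statuses are fetched from the server (and how an agent is then disabled) is deliberately left out. The tracker only remembers when each agent was first seen as lost and reports the ones that have exceeded the grace period.

```python
from datetime import datetime, timedelta


class LostContactTracker:
    """Remember when each agent was first seen in the lost-contact state."""

    def __init__(self, grace=timedelta(minutes=10)):
        self.grace = grace
        self._first_seen = {}  # agent id -> time it was first seen as lost

    def agents_to_disable(self, statuses, now):
        """Return ids of agents that have been lost for longer than `grace`.

        `statuses` maps an agent id to its status string, as collected by
        whatever agent listing the caller uses (not shown here).
        """
        overdue = []
        for agent_id, status in statuses.items():
            if status != "lost contact":  # assumed status string
                # Agent is responsive again: forget any earlier sighting.
                self._first_seen.pop(agent_id, None)
                continue
            first = self._first_seen.setdefault(agent_id, now)
            if now - first > self.grace:
                overdue.append(agent_id)
        return sorted(overdue)


if __name__ == "__main__":
    tracker = LostContactTracker()
    start = datetime(2015, 3, 18, 8, 0)
    seen = {"ws-01": "lost contact", "ws-02": "idle"}  # made-up agents
    print(tracker.agents_to_disable(seen, start))
    print(tracker.agents_to_disable(seen, start + timedelta(minutes=11)))
```

Each run of the scheduled task would call `agents_to_disable` once with the current statuses, so the tracker state has to survive between runs (in a long-lived process or a small file).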

Any input on these ideas appreciated.

I have also submitted an issue on GitHub for the fact that the job timeout does not cover this scenario: https://github.com/gocd/gocd/issues/956

Zabil C M

Mar 19, 2015, 5:52:36 AM
to Carl Reid, go...@googlegroups.com
On Wed, Mar 18, 2015 at 5:30 PM, Carl Reid <carland...@gmail.com> wrote:
> Thank you for the reply.
>
> Whilst I am keen to use an API approach to do this I am less keen on using an API that is going to be deprecated for obvious reasons.
> Is there going to be a replacement for this? Are there new APIs in the works?

We'll be disabling this only after its alternative is in place. We are putting a few APIs in place to shift to a RESTful, API-centric app and UI. It's on our roadmap.

> And also, how would I find out which stages I should cancel? Is there an API that can show me stages that contain jobs that are assigned but not started (i.e. assigned to agents that are non-responsive?)

Unfortunately there's no API to figure this out. The pipeline history API reports the job/stage status and state as unknown. There's a schedule date field which may be used to assume that the job is stuck, but I'm not sure that's the right way to go about it.

> Another thought I had to prevent this from happening is to create a scheduled task that runs frequently, looking for agents that have been "lost contact" for more than, say, 10 minutes, and disables them.
> I can then add a scheduled task to each of our servers and workstations to "enable" themselves if they are not currently enabled. This way agents on machines that are available and have connectivity to the GO Server will always be enabled and those that are turned off or lose connectivity will be disabled. It won't help pipelines that have already been scheduled however it will make the problem smaller.

Go reschedules the job when the agents are back online. Is that not happening in your case?

Carl Reid

Mar 19, 2015, 9:14:03 AM
to go...@googlegroups.com, carland...@gmail.com
> Go reschedules the job when the agents are back online. Is that not happening in your case?

The jobs will run when the agents are back online; however, in some cases the agents never come back online. This is particularly the case when we provision machines in AWS that are then destroyed.

Also, subsequent pipeline executions will wait until the previous one has completed, which can hold things up.

I feel the whole area of agent registration and agent management needs some reconsideration in light of new ways of working and focusing on the need for full automation of the life cycle of an agent.

Thanks for your replies.

Aravind SV

Mar 19, 2015, 9:26:56 AM
to Carl Reid, go...@googlegroups.com
Shameless plug: You might want to consider replying to this thread. :)

Carl Reid

Apr 2, 2015, 11:03:01 AM
to go...@googlegroups.com, carland...@gmail.com
Can you please shed some light on what format the "scheduled_date" value is in for the following API: http://www.go.cd/documentation/user/current/api/stages_api.html

I have tried parsing it multiple ways with the System.DateTime type in .NET, however this is not giving me a valid date.

How do I convert this into a date time?

Thanks

Aravind SV

Apr 2, 2015, 11:38:24 AM
to Carl Reid, go...@googlegroups.com
On Thu, Apr 2, 2015 at 11:03 AM, Carl Reid <carland...@gmail.com> wrote:
> Can you please shed some light on what format the "scheduled_date" value is in for the following API: http://www.go.cd/documentation/user/current/api/stages_api.html
>
> I have tried parsing it multiple ways with the System.DateTime type in .NET, however this is not giving me a valid date.
>
> How do I convert this into a date time?

Strange that it is that, and not a proper time stamp, as it should be. It looks like milliseconds after epoch (Unix epoch, Jan 1, 1970). You probably need something like this, after dividing that value by 1000, to get it down to seconds, instead of milliseconds.
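
As an illustration of that conversion, here is a minimal sketch in Python (rather than the .NET code the question was about). It assumes the value really is milliseconds since the Unix epoch, which is a guess in this thread rather than documented behaviour; the function name and the sample value are made up.

```python
from datetime import datetime, timezone


def from_epoch_millis(millis):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    # Dividing by 1000 brings the value down from milliseconds to seconds.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Sample value chosen for illustration; not taken from a real response.
    print(from_epoch_millis(1427986981000).isoformat())
```

A value of 0 converts to 1 January 1970, so a date that always shows as 1/1/1970 suggests the field was never populated rather than a parsing problem.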

If that works, can you create an issue or pull request in the documentation?

- Aravind

Carl Reid

Apr 7, 2015, 9:10:58 AM
to go...@googlegroups.com, carland...@gmail.com
> Unfortunately there's no API to figure this out. The pipeline history API reports the job/stage status and state as unknown. There's a schedule date field which may be used to assume that the job is stuck, but I'm not sure that's the right way to go about it.

I have put together a script that uses the Pipeline groups, stage history and stage controller APIs to try to deal with this problem.

I had assumed that the scheduled_date column would have a date in it but the agent (being unresponsive) would never carry out the job so the "result" would be unknown and the scheduled_date would be sometime in the past.
I could then use some logic to say that jobs with a result of "unknown" and a scheduled_date of (for example) 1 hour previous are "stuck" and their stage should be cancelled.

However it seems that the scheduled_date is always set to 1/1/1970 for these jobs (and others).

When does the scheduled_date field become populated? Is there some other way I can detect "stuck" jobs for non-responsive agents?
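
For reference, the "unknown result and scheduled more than an hour ago" rule described above could be written roughly like this. It is a sketch in Python under assumptions: each job is taken to be a dict with a `result` string and a `scheduled_date` in milliseconds since the Unix epoch, and the function name is hypothetical. Because the date comes back as 1/1/1970 for the jobs in question, the sketch treats a zero or missing date as "not populated" and declines to call the job stuck instead of guessing.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(job, now, max_age=timedelta(hours=1)):
    """Apply the heuristic above to one job record (a dict).

    Assumes `result` is a string and `scheduled_date` is milliseconds since
    the Unix epoch; a missing or zero date is treated as "not populated".
    """
    if str(job.get("result", "")).lower() != "unknown":
        return False  # the job finished one way or another
    millis = job.get("scheduled_date") or 0
    if millis <= 0:
        return False  # date reads as 1/1/1970, so its age cannot be judged
    scheduled = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return now - scheduled > max_age
```

A caller would pass `datetime.now(timezone.utc)` as `now` and cancel the stage of any job for which this returns True.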

Thanks
 

Carl Reid

Apr 17, 2015, 8:37:34 AM
to go...@googlegroups.com, carland...@gmail.com
Can anyone help me with trying to resolve this issue?
Otherwise I have a job of manually cancelling pipelines from the console ahead of me!

Aravind SV

Apr 17, 2015, 9:08:18 AM
to Carl Reid, go...@googlegroups.com
I have kind of lost context of this thread, and don't know if I have suggested this, but have you seen the scheduled jobs API (/go/api/jobs/scheduled.xml)? It lists all jobs which are scheduled but not assigned. I talked about job states here. To me, it looks like you're looking for jobs which are not assigned and hence "stuck".


Carl Reid

Apr 27, 2015, 8:44:33 AM
to go...@googlegroups.com
Thanks to everyone who has helped me with this.

To recap, I needed a way of automatically cancelling pipeline stages which contained jobs scheduled for agents that were not responding (because they had been turned off or in the case of AWS hosted agents, terminated).

Using a combination of API calls and some logic in PowerShell I now have a task that periodically checks for "stuck" stages and cancels them, allowing new pipeline runs.
If anyone else is interested in how to do this let me know.

Cheers

Carl





Mimansha Bhargav

Jun 16, 2017, 3:44:33 AM
to go-cd
Hi Carl,

I am working on the same problem you had. We have agents that go into the lost_contact state, either because they weren't bootstrapped properly or for some similar reason. Jobs running on these agents get stuck and never give other jobs a chance to run. We manually disable the host, but jobs that have already been scheduled on an agent in the lost_contact state are still affected. I am trying to use the APIs to check the agents whenever I schedule a job; if any agent has that status, I want to disable it and restart all the running and scheduled jobs.
Could you please let me know what you did?

Thanks
Mimansha
