Agents failing to start after server forced shut down (Windows)

Carl Reid

Jun 20, 2014, 5:39:28 AM
to go...@googlegroups.com

We have had some issues on our VMware platform which have meant that some machines become unresponsive and have to be forcibly rebooted from the VMware console host software.

When this has happened we have noticed that in all cases the GO Agent does not start correctly. It starts then stops immediately.

Looking in the go-agent-bootstrapper.log file we can see that the bootstrapper finds an existing lock file and terminates its process. I assume this is because the lock file was not cleanly removed from the previous shutdown.

We can start the service manually after the first failure which appears to fix it.

Is this normal behaviour? Can we do something to fix it? I am considering setting Service recovery options so the service will restart on failure.


This seems quite an odd design choice considering how commonly this scenario will occur.


Thanks


Carl


OS: Windows 2012 Server R2


Relevant log file entries are:

STATUS | wrapper | 2014/06/19 11:16:04 | --> Wrapper Started as Service 
STATUS | wrapper | 2014/06/19 11:16:04 | Java Service Wrapper Standard Edition 3.3.3 
STATUS | wrapper | 2014/06/19 11:16:04 | Copyright (C) 1999-2009 Tanuki Software, Ltd. All Rights Reserved. 
STATUS | wrapper | 2014/06/19 11:16:04 | http://wrapper.tanukisoftware.org 
STATUS | wrapper | 2014/06/19 11:16:04 | Licensed to ThoughtWorks for Cruise Agent 
STATUS | wrapper | 2014/06/19 11:16:04 | 
STATUS | wrapper | 2014/06/19 11:16:04 | Launching a JVM... 
INFO | jvm 1 | 2014/06/19 11:16:13 | WrapperManager: Initializing... 
INFO | jvm 1 | 2014/06/19 11:16:15 | logFile Environment Variable= null 
INFO | jvm 1 | 2014/06/19 11:16:15 | Logging to go-agent-bootstrapper.log 
2014-06-19 11:16:16,859 [WrapperSimpleAppMain] INFO agent.common.util.JarUtil:68 - Attempting to load Go-Agent-Launcher-Class from aedd84b2-bda7-4f4c-b77f-9553dfc2ccd8agent-launcher.jar File: 
2014-06-19 11:16:16,890 [WrapperSimpleAppMain] INFO agent.common.util.JarUtil:77 - manifestLibDirKey: Go-Agent-Launcher-Lib-Dir: libs 
2014-06-19 11:16:17,202 [WrapperSimpleAppMain] INFO agent.common.util.JarUtil:83 - manifestClassKey: Go-Agent-Launcher-Class: com.thoughtworks.cruise.agent.launcher.AgentLauncherImpl 
2014-06-19 11:16:17,218 [WrapperSimpleAppMain] INFO agent.common.util.ParentClassAccessFilteringClassloader:43 - Loading com.thoughtworks.cruise.agent.common.launcher.AgentLauncher using com.simontuffs.onejar.JarClassLoader 
2014-06-19 11:16:17,343 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.AgentBootstrapper:72 - Attempting create and start launcher... 
INFO | jvm 1 | 2014/06/19 11:16:17 | 0 [WrapperSimpleAppMain] INFO com.thoughtworks.go.agent.launcher.Lockfile - Sleeping for 10000 secs to before 'last modified check'... 
2014-06-19 11:16:27,484 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.AgentBootstrapper:76 - Launcher returned with code 12245933(0xBADBAD)
2014-06-19 11:16:27,531 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.AgentBootstrapper:116 - Destroying launcher creator 
INFO | jvm 1 | 2014/06/19 11:16:27 | Already running agent launcher in this folder. 
INFO | jvm 1 | 2014/06/19 11:16:27 | 10000 [WrapperSimpleAppMain] ERROR com.thoughtworks.go.agent.launcher.Lockfile - Already running agent launcher in this folder. 
2014-06-19 11:16:28,593 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.DefaultAgentLauncherCreatorImpl:90 - Attempt No: 1 to cleanup launcher temp files 
2014-06-19 11:16:28,609 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.AgentBootstrapper:99 - Waiting for 10000 ms before re-launch.... 
2014-06-19 11:16:38,718 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.AgentBootstrapper:116 - Destroying launcher creator 
2014-06-19 11:16:38,968 [WrapperSimpleAppMain] INFO go.agent.bootstrapper.AgentBootstrapper:93 - Agent Bootstrapper stopped

Jyoti Singh

Jun 23, 2014, 1:46:21 AM
to go...@googlegroups.com
Multiple Go agents can be installed and run on a given machine concurrently.
The '.agent-bootstrapper.running' lock file prevents launching multiple agents from a given working directory, as each agent is associated with a unique GUID (available in the config folder) that it cannot share with other agents. When an agent-bootstrapper process starts, it creates this lock file, which gets deleted on a graceful shutdown of the agent.

When the agent process starts, it checks for the existence of this file and the timestamp of its last update. If the lock file does not exist, or if it has been more than 10 minutes since the last update, the agent process starts successfully; otherwise it throws the 'Already running agent launcher in this folder' error and stops the process.
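
For anyone following along, the check described above can be sketched roughly as below. This is only an illustration in Python, not the actual launcher code (which is Java); the function name `may_start` and the use of the file's modification time as the "last update" are assumptions.

```python
import os
import time

LOCK_FILE_NAME = ".agent-bootstrapper.running"  # lock file name quoted above
STALE_AFTER_SECONDS = 10 * 60  # the 10-minute window described above


def may_start(working_dir, now=None):
    """Return True if a new agent launcher may start in working_dir.

    Hypothetical helper: mirrors the described rule, not the real implementation.
    """
    lock_path = os.path.join(working_dir, LOCK_FILE_NAME)
    now = time.time() if now is None else now
    try:
        last_update = os.path.getmtime(lock_path)
    except FileNotFoundError:
        return True  # no lock file, so nothing else claims this folder
    # A lock that has not been updated for more than 10 minutes counts as stale.
    return (now - last_update) > STALE_AFTER_SECONDS
```

This also shows why a forced shutdown followed by a quick reboot fails: the lock file is still there and is less than 10 minutes old, so the check returns False.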

Obviously the above solution does not work very well when the agent is forcibly shut down and brought up again too quickly.
Given that we now know why it was implemented this way, are there any suggestions on handling it in a better manner?

Jason D

Jun 23, 2014, 11:58:28 AM
to go...@googlegroups.com
Carl - we have experienced similar issues.  Unfortunately, a Windows service cannot be set to delay upon restart.  We have been dealing with it manually (due to time, more than anything) thus far, which comes to about 3-4 restarts in our environment (110 agents on VMs).  One of the work-arounds proposed elsewhere is to create a Windows scheduled task instead of a service.  This has more flexibility and would allow a delay before restarting, giving the lock time to clear.  It feels like a hack and something that GO should be able to handle with a different scheme, but I have no insight into what might work better.
 
If you discover something that works, please respond to the thread as we would be interested as well.

Carl Reid

Aug 18, 2014, 11:15:20 AM
to go...@googlegroups.com
I have now modified all our Windows servers that run the GO-Agent to use the DelayedAutoStart feature that came in Windows 2008. This has helped, although it has not solved the problem completely.

You can script this (as we do) using SC.EXE http://technet.microsoft.com/en-us/library/cc990290.aspx
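
As a rough sketch of what such scripting could look like (written in Python here purely for illustration, and not the script used above): `sc.exe config <service> start= delayed-auto` is the standard way to switch a service to delayed automatic start. The service name "Go Agent" in the comment is an assumption; check the name your installation actually registered.

```python
import subprocess


def delayed_auto_start_command(service_name):
    """Build the sc.exe argument list that sets a service to delayed auto-start.

    sc.exe expects 'start=' and its value as separate tokens.
    """
    return ["sc.exe", "config", service_name, "start=", "delayed-auto"]


def set_delayed_auto_start(service_name):
    """Run the command; needs an elevated prompt on the Windows machine."""
    subprocess.run(delayed_auto_start_command(service_name), check=True)


# Example (service name is an assumption):
# set_delayed_auto_start("Go Agent")
```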

Hope this helps

carl

Carl Reid

Mar 20, 2015, 6:40:47 AM
to go...@googlegroups.com
This problem hit us hard this week.

We had a power outage and UPS failure, and a number of machines were powered down and then restarted within the 10 minute period.
This of course caused the agents to fail to start and required a manual intervention on a large number of machines.

I am trying to think of alternatives to manually starting the service. For now the best idea I have is a start-up script that looks for the presence of a lock file and a non-running GO Agent service. If it finds this condition it will delete the lock file and then start the service.
Any thoughts on this appreciated.

Carl

Carl Reid

Apr 2, 2015, 8:53:18 AM
to go...@googlegroups.com
I fixed this by setting up a Powershell script to run on agent start-up. The startup task itself was pushed out to all the agents using GO so it was a very quick fix which seems to have done the trick.

Aravind SV

Apr 2, 2015, 9:41:33 AM
to Carl Reid, go...@googlegroups.com
Hello!

On Thu, Apr 2, 2015 at 8:53 AM, Carl Reid <carland...@gmail.com> wrote:
I fixed this by setting up a Powershell script to run on agent start-up. The startup task itself was pushed out to all the agents using GO so it was a very quick fix which seems to have done the trick. 

If that script is share-able, please share it somewhere (maybe here), so that others might benefit. I'm particularly interested in the delivery mechanism of that as well ("pushed out to all the agents using GO"), which seems like a nice solution! If you can be persuaded to write a blog post (either on your own blog, or by submitting a PR to put it on go.cd/blog.html), that would be brilliant! :)

Cheers,
Aravind

Jason D

Apr 2, 2015, 2:20:01 PM
to go...@googlegroups.com
We too would be interested in this Carl.  Thanks.

Carl Reid

Apr 7, 2015, 7:47:03 AM
to go...@googlegroups.com
If I find time to blog this I will; in the meantime, this is what I did. It's not particularly clever, just a small fix to a problem we had.

  • The problem is that agents fail to start due to the presence of the lock file.
  • The lock file is not removed when the agent host is not shut down gracefully, so we need a way of checking for the presence of a lock file on the agent machines together with a non-running service.
  • If we find both of these then we need to remove the lock file and start the service.

The idea is therefore to have a script pushed out to each agent that runs on Start-Up and does the above tasks. 
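
The three steps above can be sketched as follows. This is a minimal Python illustration of the logic only, not the attached Powershell script; the helper names, the agent folder and the service name are all assumptions. It relies on `sc.exe query` reporting a `RUNNING` state and on `sc.exe start` to start the service.

```python
import os
import subprocess

LOCK_FILE_NAME = ".agent-bootstrapper.running"  # lock file named earlier in the thread


def service_is_running(service_name):
    """True only if 'sc.exe query' reports the service state as RUNNING."""
    result = subprocess.run(
        ["sc.exe", "query", service_name], capture_output=True, text=True
    )
    return result.returncode == 0 and "RUNNING" in result.stdout


def start_service(service_name):
    subprocess.run(["sc.exe", "start", service_name], check=True)


def ensure_agent_started(agent_dir, service_name,
                         is_running=service_is_running, start=start_service):
    """Delete a leftover lock file and start the service.

    Acts only when BOTH conditions hold: the lock file exists and the
    service is not running. Returns True if it intervened.
    """
    lock_path = os.path.join(agent_dir, LOCK_FILE_NAME)
    if not os.path.exists(lock_path) or is_running(service_name):
        return False
    os.remove(lock_path)
    start(service_name)
    return True
```

Checking both conditions matters: deleting the lock file while an agent is genuinely running would defeat the protection the lock file is there to provide. The `is_running` and `start` parameters are only there so the logic can be exercised without touching a real service.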

The question is how to do this without getting IT involved?

Some background....
As I have said previously, we run GO agents on our workstations and we have a number of what I call infrastructure Pipelines that run on these machines to set them up for use by GO. This is not really what GO is all about however it is something we can do ourselves and it works so we use this mechanism.

We typically run three agents per workstation. The installation of the agents onto the workstation and registration to the GO Server is fully automated using Powershell scripts.

Agent 1 runs as LOCAL SYSTEM and is used for the majority of jobs.
Agent 2 runs as a Windows Domain account and is used to run integration tests that need Windows authentication. I played with getting the LOCAL SYSTEM account to impersonate a domain user but was unsuccessful.
Agent 3 is similar to agent 2 and is used to maximise the power of the machines. The workstations we use are very powerful machines and therefore we try to maximise their use by running as many agent jobs on them as possible, including those jobs that require Visual Studio, which is of course available on the workstations.

The agent setup infrastructure pipelines are used to do the following:
  • Install required software for agent job runs (mostly concerned with testing)
  • Unblock Windows Firewall for test software to run
  • Grant permission to files and folders needed to run tests
  • Setup an environment variable to control the DNS entry that the GO Agents use for the GO Server
  • Install client certificates needed for client authentication in systems that use x509 based authentication schemes
It therefore seemed sensible to add another task to this pipeline, namely "ensure agent startup", which is set up like this:

Command: %POWERSHELL% 
Arguments: Register-StartupScript -Name 'Ensure Go Agents Started' -SourcePath 'Scripts\Ensure-AgentStartUp.ps1' -DestinationFolderPath $([System.Environment]::GetFolderPath('UserProfile')) -Force -Verbose

(%POWERSHELL% is an environment variable mapped to 'C:\Windows\sysnative\WindowsPowerShell\v1.0\powershell.exe')


The script called "Ensure-AgentStartup" takes care of deleting the lock file and starting the agents.

  • A function called "Register-StartupScript" is used to ensure the "Ensure-AgentStartup" script is copied locally to the machine and runs when the machine starts.
  • This is quite easy to achieve in Powershell 4 because there are good cmdlets in place for controlling scheduled tasks. However, I hit a problem with running this through GO, because the Powershell cmdlets that control access to scheduled tasks do not work under the LOCAL SYSTEM account, which is of course the account the GO Service runs as.
  • Each time they run, an ACCESS DENIED error is thrown. It seems the cmdlets require Administrator account access, which LOCAL SYSTEM does not have (although it has the privileges, it appears the cmdlets specifically check for Admin rights).
  • I therefore modified the "Register-StartupScript" to use the schtasks.exe command when running as LOCAL SYSTEM. This command achieves the same thing but does not have the same restrictions as the Powershell cmdlets in terms of Administrator rights. (It is much messier to work with however....)
Hopefully that all makes some kind of sense for anyone interested.
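
To give a feel for the schtasks.exe route, here is a hedged sketch of the kind of command line involved, built in Python for illustration. It is not the "Register-StartupScript" function from the attachment; the function name and the script path in the usage comment are hypothetical. The switches used (`/Create`, `/TN`, `/TR`, `/SC ONSTART`, `/RU SYSTEM`, `/F`) are standard schtasks.exe options.

```python
import subprocess


def startup_task_command(task_name, script_path):
    """Build a schtasks.exe command that runs a Powershell script at system start.

    Hypothetical helper for illustration only.
    """
    # The action the scheduled task runs; the script path is quoted in case it has spaces.
    action = f'powershell.exe -NoProfile -ExecutionPolicy Bypass -File "{script_path}"'
    return [
        "schtasks.exe", "/Create",
        "/TN", task_name,   # task name
        "/TR", action,      # command the task runs
        "/SC", "ONSTART",   # trigger: at system start-up
        "/RU", "SYSTEM",    # run as LOCAL SYSTEM
        "/F",               # overwrite the task if it already exists
    ]


# Example (path is an assumption):
# subprocess.run(startup_task_command(
#     "Ensure Go Agents Started", r"C:\Scripts\Ensure-AgentStartUp.ps1"), check=True)
```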

I have attached the script used and the supporting functions that are needed. We have the supporting functions pushed out in Windows Powershell Modules therefore they are always available to each machine. If you do not use custom modules then you will need to incorporate these into your scripts.

Carl
supporting functions.txt
Ensure-AgentStartUp.ps1