Rundeck job fails after 60 min


Gururaja Bhat

Jul 29, 2015, 10:46:43 AM
to rundeck-discuss
Hi,

I have a job that has 5 workflow steps in sequence. The second step in the workflow is a script that takes a lot of time, possibly more than 1 hour depending on the data it has to process. If I run that script directly on my batch machine, it works.
However, the Rundeck job fails at exactly 1h 1m with the following message.

Remote command failed with exit status -1

I tried most of the options -
1. By increasing ssh timeouts
2. By specifying timeouts at job level
3. By specifying framework.ssh.timeout at framework level

Nothing seems to be working. Can someone point me in the right direction?

Thank you!

Scott Chapman

Jul 29, 2015, 3:17:22 PM
to rundeck-discuss, gurubha...@gmail.com
There's a Timeout value on the edit job page, with the following description:
The maximum time for an execution to run. Time in seconds, or specify time units: "120m", "2h", "3d". Use blank or 0 to indicate no timeout. Can include option value references like "${option.timeout}".

I don't know what the default is though. Maybe ~61 minutes?
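
To make the quoted timeout format concrete, here is a minimal Python sketch of how such values map to seconds. It is not Rundeck's own parser; the function name `timeout_to_seconds` and the extra "s" suffix are assumptions made for illustration, and option references such as "${option.timeout}" are not resolved.

```python
import re

# Seconds per unit for the suffixes shown in the job-page help text.
# The "s" suffix is an assumption added for convenience.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeout_to_seconds(value: str) -> int:
    """Convert "120m", "2h", "3d" or a bare number of seconds to seconds.

    Returns 0 for a blank value or "0", which the help text describes
    as "no timeout".
    """
    text = value.strip()
    if not text:
        return 0
    match = re.fullmatch(r"(\d+)([smhd]?)", text)
    if match is None:
        raise ValueError(f"unrecognised timeout value: {value!r}")
    amount, unit = match.groups()
    # A bare number has an empty unit and is treated as seconds.
    return int(amount) * _UNIT_SECONDS.get(unit, 1)
```

With this sketch, "120m" and "2h" both come out as 7200 seconds, so a job that dies at about 61 minutes with either value set is probably not being stopped by the job-level timeout.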

Gururaja Bhat

Jul 30, 2015, 5:46:18 AM
to rundeck-discuss, sc...@we3chapmans.com
Thank you Scott for checking this.

As I mentioned in the post, I have tried setting the job level time-out value with no positive result.
I don't see anything indicative in the logs either. I am just trying to work out the various places where a timeout could be coming from.

--Guru

Mathieu Chateau

Jul 30, 2015, 5:52:10 AM
to rundeck...@googlegroups.com, sc...@we3chapmans.com
Hello,

Is there any firewall in the middle that may drop the ssh connection after 1 hour without data exchange? This setting is called the TCP timeout in the firewall configuration. By default it is set to 1 hour on Checkpoint firewalls.

I have remote jobs that run for hours, even days, without issue.
Check the maximum time on both the main job and the child job.



Cordialement,
Mathieu CHATEAU
http://www.lotp.fr


Scott Chapman

Jul 30, 2015, 11:48:06 AM
to rundeck-discuss, gurubha...@gmail.com
ack. Sorry--I missed that you checked that in your original post. 

Gururaja Bhat

Jul 31, 2015, 4:45:14 AM
to rundeck-discuss, gurubha...@gmail.com, sc...@we3chapmans.com
Absolutely no problem, Scott! I appreciate that you are trying to help me.

Following Mathieu's pointer about other components, it looks to me like the Quest VAS service that we use for authentication has some time-out setting. I am getting it fixed today and I believe that it should resolve this problem. If not, I will let you all know.

Thank you!
--Guru

santhoshkuma...@gmail.com

Jul 25, 2018, 10:23:01 AM
to rundeck-discuss
Hi Team,

I have the same issue. I have also checked all of the options below, but the error still exists. Is there any other solution to fix the issue?

1. By increasing ssh timeouts

Paul M. Lambert

Jul 25, 2018, 11:53:09 AM
to rundeck-discuss
The solution for the original author was to change or remove the timeouts being enforced by the network devices, firewalls, routers, etc. in the path between Rundeck and the remote node.

An easy way to tell if the problem is outside of Rundeck is to ssh from a command line shell on the Rundeck server to a host where the problem is occurring, and run “sleep 86400”. Look at it in 61 minutes, and hit control-C to interrupt the sleep command. If it immediately says the connection was dropped, then it’s not Rundeck.
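
The manual check above can also be scripted. The following Python sketch is only one way to do it and rests on several assumptions: the helper names are made up, the 65-minute sleep and 2-minute grace period are arbitrary values chosen to outlast a 60-minute cutoff, and key-based login is assumed (hence `BatchMode=yes`). Because a firewall may drop the connection silently, the script treats "ssh never returned" as a probable drop rather than waiting for an error message.

```python
import subprocess
import time
from typing import List, Optional

# ssh exits with 255 when ssh itself hits an error, such as a lost connection.
SSH_ERROR_STATUS = 255


def build_probe_command(host: str, sleep_seconds: int) -> List[str]:
    """argv for `ssh <host> sleep <n>`; BatchMode avoids hanging on a password prompt."""
    return ["ssh", "-o", "BatchMode=yes", host, "sleep", str(sleep_seconds)]


def describe_result(returncode: Optional[int], elapsed_seconds: float) -> str:
    """Interpret the probe; returncode is None when ssh never returned in time."""
    minutes = elapsed_seconds / 60
    if returncode is None:
        return f"no response after {minutes:.0f} min: connection was probably dropped silently"
    if returncode == SSH_ERROR_STATUS:
        return f"ssh reported a connection error after {minutes:.0f} min: look outside Rundeck"
    if returncode == 0:
        return f"sleep finished after {minutes:.0f} min: the plain ssh path survived"
    return f"remote command exited with status {returncode} after {minutes:.0f} min"


def run_probe(host: str, sleep_seconds: int = 3900, grace_seconds: int = 120) -> str:
    """Run the probe from the Rundeck server and summarise what happened."""
    started = time.monotonic()
    try:
        completed = subprocess.run(
            build_probe_command(host, sleep_seconds),
            timeout=sleep_seconds + grace_seconds,
        )
        returncode: Optional[int] = completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the ssh child before raising this exception.
        returncode = None
    return describe_result(returncode, time.monotonic() - started)
```

Run on the Rundeck server as the Rundeck user, `print(run_probe("node1.example.com"))` (a placeholder host name) would report after roughly 65 minutes whether a plain ssh session outside Rundeck survives the hour.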

Paul M. Lambert

edu...@rundeck.com

Jul 25, 2018, 12:38:40 PM
to rundeck-discuss
Hi Santosh,

To further expand on Paul's recommendation: if ssh is used, it's also recommended to verify your ssh and sshd configurations, as the ServerAliveCountMax, ClientAliveCountMax and TCPKeepAlive settings can also trigger a disconnect if a temporary network event causes the server or client to become unresponsive. A good test would be to run ssh with the -v option for verbosity. If you suspect the system-wide configuration, the -F option can be used to ignore it and specify a per-user config file.
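
As a rough illustration of that test, the Python sketch below assembles an ssh command line with -v, an optional -F file, and explicit keepalive options. The helper names and default values are assumptions, and ServerAliveInterval is an OpenSSH client option added here because ServerAliveCountMax only has an effect alongside it. This applies to the command-line ssh client used for a manual test; whether Rundeck's own SSH executor honours these settings is not covered in this thread.

```python
from typing import List, Optional


def build_debug_ssh_command(
    host: str,
    remote_command: str,
    alive_interval: int = 60,
    alive_count_max: int = 3,
    config_file: Optional[str] = None,
) -> List[str]:
    """Build an ssh argv with verbose output and explicit keepalive options."""
    argv = ["ssh", "-v"]
    if config_file is not None:
        # -F makes ssh read this file and ignore the system-wide configuration.
        argv += ["-F", config_file]
    argv += [
        "-o", f"ServerAliveInterval={alive_interval}",
        "-o", f"ServerAliveCountMax={alive_count_max}",
        "-o", "TCPKeepAlive=yes",
        host,
        remote_command,
    ]
    return argv


def seconds_until_disconnect(alive_interval: int, alive_count_max: int) -> int:
    """Approximate time ssh waits on an unresponsive server before giving up."""
    return alive_interval * alive_count_max
```

With the defaults used here, the client sends an application-level keepalive every 60 seconds, which also generates traffic that may stop an idle-connection timer on a firewall from expiring, and it gives up after about 180 seconds of silence from the server.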

Hope this helps!

santhoshkuma...@gmail.com

Jul 27, 2018, 9:33:54 AM
to rundeck-discuss
Thanks for the reply, Paul. Yes, the sleep job fails automatically after 61 minutes. The ssh timeout is set to 3 hours, yet the job still fails at 60 minutes.

santhosh

santhoshkuma...@gmail.com

Aug 8, 2018, 1:44:23 AM
to rundeck-discuss
Hi Team,
The issue was with the firewall connection timeout (the firewall times out the connection if it is older than 60 minutes). I am working with the network team to disable this.

Thanks for all your time and support.

Santhosh  

edu...@rundeck.com

Aug 8, 2018, 10:06:16 AM
to rundeck-discuss
Hi Santosh,

Great news!

Glad that you managed to identify the cause.

Cheers!

Eduardo.