Any ideas how to fix JENKINS-12235

hajush

Apr 17, 2013, 5:41:51 PM4/17/13
to jenkin...@googlegroups.com
The intermittent failure of slave jobs due to issue 12235
<https://issues.jenkins-ci.org/browse/JENKINS-12235> looks like it might
start undoing progress in getting my work teams to adopt Jenkins.

Has anyone given any thought to the issue and how to address it? Some folks
have had luck increasing the ClientInterval on Unix masters, but others did
not.

I see that late last month Kohsuke increased the pipe window size in
hudson.remoting.Channel, though I'm not sure that would address this, and
since the failure is intermittent it's hard to test. Here's what the stack
trace looks like when it fails:

FATAL: Unable to delete script file c:\temp\hudson985794291407431615.bat
hudson.util.IOException2: remote file operation failed:
c:\temp\hudson985794291407431615.bat at
hudson.remoting.Channel@e553b0:vcvmwin061
at hudson.FilePath.act(FilePath.java:848)
at hudson.FilePath.act(FilePath.java:825)
at hudson.FilePath.delete(FilePath.java:1202)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:101)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:60)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:810)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:160)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:592)
at hudson.model.Run.execute(Run.java:1543)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:236)
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:494)
at hudson.remoting.Request.call(Request.java:129)
at hudson.remoting.Channel.call(Channel.java:672)
at hudson.FilePath.act(FilePath.java:841)




--
View this message in context: http://jenkins.361315.n4.nabble.com/Any-ideas-how-to-fix-JENKINS-12235-tp4663279.html
Sent from the Jenkins dev mailing list archive at Nabble.com.

Kohsuke Kawaguchi

Apr 17, 2013, 7:49:17 PM4/17/13
to jenkin...@googlegroups.com, hajush

"hudson.remoting.ChannelClosedException: channel is already closed"
indicates an unexpected loss of connection to the slave. The nested
"Caused by: java.io.EOFException" indicates that the slave side has shut
down communication with the master.

The thing is, the communication to the slave (the InputStream that Channel
reads) is tunneled over several layers, and this part of the code only
discovers the problem when InputStream.read() returns -1.

This design of InputStream does not allow us to attach the underlying
communication failure as a chained exception, so we really can't report
the root cause properly.
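
To illustrate the limitation, here is a rough sketch (the class and method
names are made up, not the actual remoting code): the layer that actually
knows why the stream died would have to record that cause out of band,
because read() itself can only ever answer -1.

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only -- not the real remoting classes. Whichever layer notices
// the connection dying stashes the cause here, so the reader can chain it
// instead of silently seeing end-of-stream.
class DiagnosingInputStream extends FilterInputStream {
    private volatile Throwable terminationCause;

    DiagnosingInputStream(InputStream in) { super(in); }

    // Called by the transport layer when it knows why the stream ended.
    void recordTerminationCause(Throwable cause) { this.terminationCause = cause; }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b == -1 && terminationCause != null) {
            // Without a hook like this, the reader only ever sees -1
            // and has nothing to chain as the root cause.
            throw new IOException("stream closed: " + terminationCause, terminationCause);
        }
        return b;
    }
}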

The slave console log does normally capture the last dying message from
the slave JVM or a transport-level error, but it gets rotated as soon as
the next connection attempt starts, and while the file is still available
in $JENKINS_HOME, there's no way to look at it from the web UI. Jenkins
auto-reconnects failed slaves pretty aggressively, and it takes a while
before someone notices a build failing with ChannelClosedException and
tries to understand what's going on, which makes the trouble-shooting
even trickier.

I was just sweeping the ssh-slaves plugin ticket backlog, and there are
many reports of this same issue, so this clearly is a gap in the
diagnosability of the slave connectivity.

If anyone has a good idea of how to capture the errors, that'd be
greatly appreciated.


One approach I'm thinking about is to introduce a proper log rotation
mechanism (one that handles LargeText.doProgressText() correctly), and
somehow use that to let people scroll back through the slave console log.

Perhaps another possibility is to let the ComputerLauncher record a
connection loss as an Exception on a failing Channel.
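
Very roughly, I'm imagining something along these lines (the names here
are hypothetical, not the current remoting API):

import java.io.IOException;

// Hypothetical sketch: the launcher side, which is the first to see the
// connection die, records the cause on the channel so a later send() can
// chain it instead of throwing a bare "channel is already closed".
class FailingChannelSketch {
    private volatile IOException closeCause;

    // e.g. called from the ComputerLauncher when it observes the slave
    // process or connection going away
    void recordConnectionLoss(IOException cause) {
        this.closeCause = cause;
    }

    void send(Object request) throws IOException {
        if (closeCause != null) {
            IOException e = new IOException("channel is already closed");
            e.initCause(closeCause);
            throw e;
        }
        // ... normal send path elided ...
    }
}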
--
Kohsuke Kawaguchi | CloudBees, Inc. | http://cloudbees.com/
Try Nectar, our professional version of Jenkins

Sandell, Robert

Apr 19, 2013, 3:40:31 AM4/19/13
to jenkin...@googlegroups.com, hajush
We get a lot of these ChannelClosedExceptions as well, and also have problems reproducing them. So far "it just happens".
What we have seen is that it happens a lot less (in incidents per build) on masters with <240 slaves connected at once, but there is also less building going on in general on those masters, so the number-of-slaves theory is not fully supported by any evidence yet.
We are starting by splitting the cluster up into one more master, and are also working towards automatically disconnecting idle slaves and connecting them when needed (that setting is very easy to do in Jenkins already, but we have some auto-maintenance scripts that need to adapt to that kind of setup first). We'll see if that helps.
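
For reference, the setting I mean is the per-node availability (retention
strategy). As a rough sketch of what our scripts will eventually do (the
delay values are made up, and it assumes Slave.setRetentionStrategy is
available to code running inside Jenkins):

import hudson.model.Node;
import hudson.model.Slave;
import hudson.slaves.RetentionStrategy;
import jenkins.model.Jenkins;

// Sketch: switch every slave to the built-in on-demand strategy so it is
// launched only when builds are waiting and disconnected after idling.
public class OnDemandSlaves {
    public static void apply() throws java.io.IOException {
        for (Node n : Jenkins.getInstance().getNodes()) {
            if (n instanceof Slave) {
                // Demand(inDemandDelay, idleDelay), both in minutes
                ((Slave) n).setRetentionStrategy(new RetentionStrategy.Demand(0, 10));
            }
        }
        Jenkins.getInstance().save();
    }
}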


Robert Sandell
Software Tools Engineer - SW Environment and Product Configuration
Sony Mobile Communications
