Just for the record: on the same server, we experienced a deadlock situation today which could be related to this issue.

Today the server was very sluggish (long response latencies). Checking, we found several hanging inbound GET requests that took extremely long (>1,100,000 ms) to complete. A thread dump showed that several threads were blocked on a lock in jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:369), indirectly triggered by hudson.plugins.performance.actions.PerformanceProjectAction.doRespondingTimeGraph. Note that this was not limited to the Performance plugin: other GET requests, such as /job/.../test/trend and /job/.../jacoco/graph, were blocked in jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber as well.

After a little analysis, we found one /job/.../performance/throughputGraph request still running (state "running") that was apparently stuck in an endless loop. It also held the critical monitor, which blocked all the other requests (a short sketch of how to pull this lock-ownership information programmatically follows the stack trace below). The interesting (triggering) part of that thread's stack trace, to me, was:
...
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1189)
at hudson.util.XStream2.unmarshal(XStream2.java:114)
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1173)
at hudson.XmlFile.unmarshal(XmlFile.java:160)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.reload(WorkflowRun.java:603)
at hudson.model.Run.<init>(Run.java:325)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.<init>(WorkflowRun.java:209)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at jenkins.model.lazy.LazyBuildMixIn.loadBuild(LazyBuildMixIn.java:165)
at jenkins.model.lazy.LazyBuildMixIn$1.create(LazyBuildMixIn.java:142)
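For reference, a minimal sketch (plain java.lang.management, nothing Jenkins-specific; the output format is illustrative) of how the lock-ownership information we read off the thread dump can also be pulled programmatically from inside the JVM:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadReport {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // dumpAllThreads(true, true): include locked monitors and ownable synchronizers
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            if (ti.getThreadState() == Thread.State.BLOCKED && ti.getLockInfo() != null) {
                // For every blocked thread, print the monitor it waits on and which thread holds it
                System.out.printf("%s BLOCKED on %s, held by %s (id %d)%n",
                        ti.getThreadName(), ti.getLockInfo(),
                        ti.getLockOwnerName(), ti.getLockOwnerId());
            }
        }
    }
}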
Killing the thread did the trick - the remaining requests started to work again. Afterwards we had to restart the server, but that was due to another problem which is unrelated to this one.

However, today's situation differed from the earlier one: today we saw a significant load average / CPU load during the event, whereas in the previous situation the load average / CPU load had been perfectly normal, even for hours before the blocking event. Given the different symptoms, I am currently not sure whether we just saw the "early stage" of yet another occurrence of this issue, which we could cure with a courageous thread kill, or whether this was something totally different. In any case, it makes sense to look closely at the list of pending locks if the issue reappears.
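A possible way to do such a thread kill next time, assuming the stuck request thread can be identified by its name (the "throughputGraph" filter below is only an illustration and needs to be checked against the actual thread names in the dump); this is plain Java syntax, which the Jenkins script console (Groovy) should also accept:

// Sketch only: find the stuck request thread by name and interrupt it.
// interrupt() is the gentler option than stop(); whether it suffices depends on where the thread is stuck.
for (Thread t : Thread.getAllStackTraces().keySet()) {
    if (t.getName().contains("throughputGraph")) {
        System.out.println("Interrupting " + t.getName());
        t.interrupt();
    }
}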