| I've run into this kind of issue where I work. In our case, the host ran out of memory, which triggered the oom-killer, and the oom-killer decided that the Jenkins "java -jar slave.jar" process (the one responsible for keeping the slave connected to the master) was the least important process and killed it. The result was that, when things got busy, slaves died at random despite doing nothing wrong themselves. A major contributor was our use of certain software packages that decide how much memory to allocate to themselves based on the amount of memory available, and that look at the whole host's memory instead of the container's fair share of it. It doesn't take many processes each allocating themselves half of the host's entire RAM before things get tight and the oom-killer gets invoked. Try turning off memory overcommit on your Docker host, limiting the amount of memory available to each container, and limiting the number of containers you run concurrently. |
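
To check whether a container really does see the whole host's memory, something like the following could be run inside it. This is only a rough sketch, not what we actually used: it assumes a Linux container where the cgroup memory limit is exposed at the usual cgroup v2 (`/sys/fs/cgroup/memory.max`) or cgroup v1 (`/sys/fs/cgroup/memory/memory.limit_in_bytes`) path, and all of the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumed locations; they can differ depending on how cgroups are mounted.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
MEMINFO = Path("/proc/meminfo")


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if it is unlimited."""
    value = raw.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    limit = int(value)
    # cgroup v1 reports a huge number (close to 2**63) when no limit is set.
    if limit >= 1 << 60:
        return None
    return limit


def parse_mem_total(meminfo_text: str) -> int:
    """Extract MemTotal from /proc/meminfo text and return it in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024  # the value is reported in kB
    raise ValueError("MemTotal not found")


def effective_memory(host_total: int, cgroup_limit: Optional[int]) -> int:
    """Memory a process in this container should size itself against."""
    if cgroup_limit is None:
        return host_total
    return min(host_total, cgroup_limit)


def read_cgroup_limit() -> Optional[int]:
    """Try the v2 path first, then v1; None means no limit could be found."""
    for path in (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT):
        try:
            return parse_cgroup_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


def main() -> None:
    try:
        host = parse_mem_total(MEMINFO.read_text())
    except OSError:
        print("could not read /proc/meminfo (not a Linux host?)")
        return
    limit = read_cgroup_limit()
    limit_text = "none" if limit is None else f"{limit / 2**30:.1f} GiB"
    print(f"host MemTotal:   {host / 2**30:.1f} GiB")
    print(f"container limit: {limit_text}")
    print(f"size against:    {effective_memory(host, limit) / 2**30:.1f} GiB")


if __name__ == "__main__":
    main()
```

If "container limit" comes out as "none", or a package sizes itself from MemTotal rather than from the limit, that matches the behaviour described above. Starting the container with a memory limit (for example Docker's `--memory` option) should make the limit show up here, although software that only reads `/proc/meminfo` will still see the host total.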