Hello,
I am Jochen and I am a software developer at 1&1. We (Stefan Brausch's team) have already developed some plugins (like the JobConfigHistory). In my master's thesis I examined how to utilize Jenkins more efficiently and how to extend it dynamically with more build capacity. I cannot post the thesis here (it is under a confidentiality lock, and it is written in German), but I will post a short summary of the relevant content. Maybe there are some points worth discussing.
Reading load information from agents:
At the moment it is pretty difficult to get the various load values of an agent (system load, free/used memory, disk I/O, network traffic, etc.). You can run commands on an agent via a remote procedure call, but that does not change how you obtain these values. Java itself only provides a method to read the system load (and even that only works on some operating systems). To get further values, I accessed procfs on the agents and parsed the values myself. Of course this is very complex and error-prone (and only possible on systems that have procfs).
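A minimal sketch of the two cases in Java: the load average comes from the standard MXBean, while anything beyond that means parsing procfs by hand. The meminfo parsing below is illustrative and assumes a Linux agent; it is not code from the thesis.

```java
import java.lang.management.ManagementFactory;

public class AgentLoad {

    // The only (semi-)portable value Java offers: the 1-minute load
    // average. Returns -1.0 on platforms where it is unavailable
    // (e.g. Windows).
    public static double systemLoadAverage() {
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    // Everything else means parsing procfs by hand. Example: extract
    // the "MemAvailable" value (in kB) from the text of /proc/meminfo;
    // returns -1 if the field is missing.
    public static long availableMemoryKb(String meminfo) {
        for (String line : meminfo.split("\n")) {
            if (line.startsWith("MemAvailable:")) {
                return Long.parseLong(line.replaceAll("\\D+", ""));
            }
        }
        return -1;
    }
}
```

The string-based parser also makes the fragility visible: the field names and units differ between kernels and operating systems, which is exactly the error-proneness described above.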
Getting load values of a build:
It would be interesting to get the load values of a build (used CPU time, used memory (peak, average), etc.) and use them for statistical purposes (see load balancing below). That is also not very simple. Java itself only provides a way to get the process ID. With this ID I have to access procfs again and look up the values or calculate them myself. In addition, there is the problem that after a build step has finished, the directory of the process in procfs no longer exists; you cannot read the values any longer. So you have to access the directory periodically while the process is running. Java 9 brings an enhanced process API (ProcessHandle) that exposes more process information, but I think that in the end the principle is the same.
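The per-build sampling can be sketched like this, again assuming Linux: VmPeak is a real field of /proc/&lt;pid&gt;/status, but the code is an illustration of the polling approach, not the thesis implementation.

```java
import java.util.List;

public class BuildProcessStats {

    // Parse "VmPeak" (peak virtual memory, in kB) out of the text of
    // /proc/<pid>/status; returns -1 if the field is missing. Linux-
    // specific; the file vanishes the moment the process exits, which
    // is why it must be read while the build step is still running.
    public static long peakMemoryKb(String procStatus) {
        for (String line : procStatus.split("\n")) {
            if (line.startsWith("VmPeak:")) {
                return Long.parseLong(line.replaceAll("\\D+", ""));
            }
        }
        return -1;
    }

    // For values that procfs does not aggregate itself, periodic
    // sampling is the only option: keep the maximum seen so far.
    public static long peakOfSamples(List<Long> samples) {
        return samples.stream().mapToLong(Long::longValue).max().orElse(-1);
    }
}
```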
Load balancing:
Jenkins uses consistent hashing to distribute the builds across the different agents. That is good from the checkout point of view (existing workspace, local Maven repo, etc.), but not from the agents' load point of view (CPU load, memory usage), in case the agents are dedicated machines. In the worst case, with 4 executors per agent, 4 builds can run on the same machine, leaving that machine under heavy load while the other 3 machines idle. Wouldn't it be wise to optimize the load balancing not just for the checkout stage, but also for the build stage, taking system load into account? To test this I implemented a custom load balancer that reads the system load of all agents and always dispatches builds to the node with the lowest load average. And if we had load values from previous builds of the same job, we could determine statistically which node would be the best to run the current build on.
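Stripped of the Jenkins API (where this would replace the default consistent-hashing LoadBalancer), the core decision of such a least-loaded strategy reduces to picking the minimum; the node names here are hypothetical.

```java
import java.util.Map;

public class LeastLoadedBalancer {

    // Given each node's current load average, pick the node with the
    // smallest value. A real implementation inside Jenkins would also
    // have to honor label expressions, executor availability, etc.
    public static String pickNode(Map<String, Double> loadByNode) {
        return loadByNode.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```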
Dynamic executors management:
At the moment you can set a fixed number of executors per agent. However, it could be wise to increase the number of executors temporarily as needed ("overclocking") to drain the queue faster. You could set the maximum manually or based on the system load of the agent, e.g.: "If the system load average exceeds 30.0, do not increase the number of executors." We had some days with over 1,000 tasks in the queue, and we increased the number of executors manually until the queue was drained. So I built a prototype that increments and decrements the number of executors per node dynamically. Using this (in combination with the optimized load balancing from above), we roughly halved the average waiting time of all jobs and also utilized the different nodes more evenly. The average build time did not increase significantly. From a psychological point of view, a reduced waiting time matters more to the user than a reduced build time - the user does not get impatient wondering "why doesn't my build start?".
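The scaling rule from above can be sketched as a pure function: grow while there is a queue and the node's load stays under the configured ceiling, otherwise fall back to the base value. The numbers are illustrative, not the prototype's actual parameters.

```java
public class ExecutorScaler {

    // Decide how many executors a node should currently have.
    // base:        the statically configured executor count
    // max:         hard upper bound for "overclocking"
    // queueLength: number of buildable tasks waiting
    // loadCeiling: e.g. 30.0 - above this, never scale up
    public static int targetExecutors(int base, int max, int queueLength,
                                      double loadAvg, double loadCeiling) {
        if (queueLength > 0 && loadAvg < loadCeiling) {
            return Math.min(max, base + queueLength);
        }
        return base;
    }
}
```

A periodic task would call this per node and apply the result, which is where the decrement side of the prototype comes in: once the queue empties or the load exceeds the ceiling, the count returns to the base value.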
Parallelization and clustering:
Jenkins can distribute builds across the different agents, but when a single build needs massive resources, it still runs on one single machine. You can instruct build tools like Maven or make to use more than one thread, but this only brings an advantage if the project has submodules and these submodules do not depend on each other. Theoretically: I have a make project with 128 independent submodules. When I run this build with just the "make" command, it takes very long to finish. So I use a machine with an 8-core CPU with Hyper-Threading -> 16 virtual cores. Then I can run the build with "make -j 16" to compile 16 submodules at the same time. But if I want to build all 128 submodules concurrently, I have to scale. Either I use a bigger CPU with more cores (very limited), or I use several physical machines at the same time. For make or Maven to take advantage of that, the underlying operating system has to span the physical machines (see vSMP).
Further topics of the thesis were the connection to our internal cloud and VM infrastructure, but that is not very relevant here.
What are your opinions?
Thank you!
Jochen