Re: GSOC - Further details on "Dynamic thread pool sizing"


Philipp Wollermann

Mar 8, 2017, 6:56:25 AM
to Jakob Buchgraber, Jeanderson, bazel-discuss
Hi,

I'm cc'ing bazel-discuss@ on this as it might be interesting for others, too.

On Tue, Mar 7, 2017 at 5:05 AM, Jakob Buchgraber <buc...@google.com> wrote:
As Philipp recently mentioned to me, one problem with parallel builds is that one might do too much work. If targets A and B both depend on target C, then evaluating A and B in parallel might result in C being built twice and thus the whole build might be slower. Maybe we could add the ability to cancel the slower build of C after the faster one finished and use the cached result.

In general, Bazel should never build the same target twice due to parallelism. If that happens, it's a P0 (priority: "launch blocker") correctness bug. :)
Sorry if I explained this wrong!

The details of how this all works are more complex:

In general, Bazel builds an action graph with dependencies between actions and uses it to run things in the correct order, with the maximum parallelism possible without violating dependency constraints.

However, we limit the parallelism by a number of means:

1) Bazel runs each action (e.g. "execute gcc to compile foo.c into foo.o") in its own Java thread. The number of threads is configured via --jobs and defaults to 200.
You can't run more actions in parallel than you have threads.

Obviously, if we ran 200 instances of gcc in parallel on the average single-CPU, quad-core workstation, it would be slightly overwhelmed. The reason why 200 threads still make sense in some scenarios is explained below.
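A minimal sketch of what 1) boils down to, assuming a hypothetical Action type (this is not Bazel's actual code):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class JobsDemo {
      // Hypothetical stand-in for a Bazel action, e.g. "compile foo.c into foo.o".
      interface Action {
        void execute();
      }

      public static void main(String[] args) {
        int jobs = 200;  // corresponds to --jobs
        ExecutorService pool = Executors.newFixedThreadPool(jobs);
        List<Action> readyActions = List.of();  // actions whose dependencies are already built
        for (Action action : readyActions) {
          pool.submit(action::execute);  // at most `jobs` actions run concurrently
        }
        pool.shutdown();
      }
    }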

2) Thus, Bazel has a mechanism called the "ResourceManager". Every thread that wants to run an external program on your local workstation must also acquire a set of "resources", where resources are modeled like: "gcc needs 1 CPU core, 200 MB of RAM and a bit of I/O".
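The acquire/run/release pattern could be sketched roughly like this (hypothetical code, not Bazel's actual ResourceManager API):

    // Blocks callers until the resources they ask for are free, then reserves them.
    final class SimpleResourceManager {
      private double freeCpu;
      private double freeRamMb;

      SimpleResourceManager(double cpu, double ramMb) {
        this.freeCpu = cpu;
        this.freeRamMb = ramMb;
      }

      synchronized void acquire(double cpu, double ramMb) throws InterruptedException {
        while (freeCpu < cpu || freeRamMb < ramMb) {
          wait();  // another action has to finish and release resources first
        }
        freeCpu -= cpu;
        freeRamMb -= ramMb;
      }

      synchronized void release(double cpu, double ramMb) {
        freeCpu += cpu;
        freeRamMb += ramMb;
        notifyAll();
      }
    }

An action thread would call acquire(1.0, 200) before spawning gcc and release(1.0, 200) once the compiler exits.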

Two problems exist in this area:

- How many resources does your local machine have? Let's say your machine has a quad-core CPU, 16 GB of RAM and a 7200 RPM hard-disk. But what if those cores use hyper-threading, so your OS reports them as 8 cores? 16 GB of RAM is nice, but what if you're also running Google Chrome, which takes half of it for web browsing? Should Bazel just claim all the RAM for itself, or be conservative and only use what's free (and what if the free memory changes during the build, because you quit Chrome)? And how do you even measure how fast your hard-disk is (and then also make that work for SSDs)? Bazel already has some sort of "guess the available resources for this computer" system; it lives in the LocalHostCapacity / LocalHostResourceManagerDarwin / LocalHostResourceManagerLinux classes.
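For illustration, here is a rough sketch of guessing capacity from the JVM alone (this is not what the LocalHostCapacity classes actually do; they parse /proc on Linux and use system APIs on macOS, and the physical-memory calls below require a HotSpot/OpenJDK JVM):

    import com.sun.management.OperatingSystemMXBean;
    import java.lang.management.ManagementFactory;

    public class HostCapacityGuess {
      public static void main(String[] args) {
        // Counts hyper-threads as full cores -- one of the open questions above.
        int logicalCores = Runtime.getRuntime().availableProcessors();
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        long totalRamMb = os.getTotalPhysicalMemorySize() / (1024 * 1024);
        long freeRamMb = os.getFreePhysicalMemorySize() / (1024 * 1024);
        System.out.printf("cores=%d totalRam=%dMB freeRam=%dMB%n",
            logicalCores, totalRamMb, freeRamMb);
        // Total or free RAM? And there is no comparably portable number for disk I/O.
      }
    }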

- The other side of the problem is knowing how many resources an action needs. Does the Java compiler need 1 CPU core? Or can it use 2, because it is multi-threaded? How much RAM does it need (and does that depend on the size of the source code it is asked to compile)? Let's not even talk about how to model I/O. The current way this is handled is by guessing and hoping for the best. For example, the Java compiler supposedly only uses 0.5 CPU cores (even though it's more like 2 when you look at it via "top", which makes sense, considering that the JVM also does GC and JIT in separate threads). Nevertheless, the current system works pretty well (which is also a bit of a mystery ;)).
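The "guess and hope for the best" approach amounts to something like a static table of per-action-type estimates. The numbers below are made up for illustration, not the values Bazel actually uses:

    import java.util.Map;

    public class ActionResourceEstimates {
      // cpu is in cores, ram in MB, io is an abstract fraction of disk bandwidth.
      record ResourceSet(double cpu, double ramMb, double io) {}

      // Hypothetical estimates, keyed by action type.
      static final Map<String, ResourceSet> ESTIMATES = Map.of(
          "CppCompile", new ResourceSet(1.0, 200, 0.05),
          "Javac", new ResourceSet(0.5, 300, 0.01),
          "CppLink", new ResourceSet(1.0, 100, 0.3));

      public static void main(String[] args) {
        System.out.println(ESTIMATES.get("Javac"));
      }
    }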

3) Individual execution strategies (classes implementing SpawnActionContext) might ignore the ResourceManager entirely or impose additional limits. For example, our remote execution strategy does not try to acquire local resources first, because it runs everything remotely (this is also why --jobs defaults to 200 threads). The WorkerSpawnStrategy, on the other hand, imposes an additional limit of never spawning more than 4 workers (e.g. Java compilers) of one kind (by default; it's configurable), because workers keep running in the background and I thought that spawning e.g. 12 memory-hungry persistent Java compilers that never quit would be bad for system performance.
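That per-kind worker cap could be sketched with one semaphore per worker mnemonic (hypothetical code, not the actual WorkerSpawnStrategy):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;

    final class WorkerLimiter {
      private final int maxWorkersPerKind;
      private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

      WorkerLimiter(int maxWorkersPerKind) {
        this.maxWorkersPerKind = maxWorkersPerKind;
      }

      void runWithWorker(String mnemonic, Runnable work) throws InterruptedException {
        Semaphore slots =
            permits.computeIfAbsent(mnemonic, k -> new Semaphore(maxWorkersPerKind));
        slots.acquire();  // wait if the cap for this kind of worker is already reached
        try {
          work.run();
        } finally {
          slots.release();
        }
      }
    }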

Personally, I think the current system is *way* overengineered. Removing all this "my process needs 0.73 CPUs, 379.4 MB of RAM and a quarter of an MLC-SSD in terms of I/O" accounting and instead just limiting the maximum number of parallel local jobs to the number of CPU cores (by default; make it configurable) would probably work just as well and give more predictable results. So, one project could look like this: make the ResourceManager used by Bazel configurable via a flag, then implement a simple job-counting manager and benchmark it against the existing ResourceManager.
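A minimal sketch of such a job-counting manager (hypothetical code; the names are made up):

    import java.util.concurrent.Semaphore;

    // Ignores CPU/RAM/IO modeling and simply caps the number of concurrently
    // running local processes.
    final class SimpleJobCounter {
      private final Semaphore slots;

      SimpleJobCounter(int maxLocalJobs) {
        this.slots = new Semaphore(maxLocalJobs);
      }

      static SimpleJobCounter forThisMachine() {
        return new SimpleJobCounter(Runtime.getRuntime().availableProcessors());
      }

      void runLocally(Runnable spawn) throws InterruptedException {
        slots.acquire();
        try {
          spawn.run();
        } finally {
          slots.release();
        }
      }
    }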

But I also wouldn't oppose an experimental project where you try to use some cool algorithms to improve the resource utilization of Bazel. Maybe track resource usage during process lifetime, then use the knowledge to improve the estimate for the next time the action runs. :)
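One way to interpret that idea, as a rough sketch (entirely hypothetical, not an existing Bazel class): keep an exponentially weighted moving average of the observed peak RAM per action type and use it as the estimate for the next run.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class LearnedResourceEstimator {
      private static final double ALPHA = 0.3;  // weight given to the newest observation
      private final Map<String, Double> ramEstimateMb = new ConcurrentHashMap<>();

      // Estimate to hand to the ResourceManager before the action runs.
      double estimateRamMb(String mnemonic, double defaultMb) {
        return ramEstimateMb.getOrDefault(mnemonic, defaultMb);
      }

      // Call after the action finished, with the measured peak RSS of its process.
      void recordObservation(String mnemonic, double observedPeakMb) {
        ramEstimateMb.merge(mnemonic, observedPeakMb,
            (old, fresh) -> (1 - ALPHA) * old + ALPHA * fresh);
      }
    }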

Ideas are very welcome on this topic.

Cheers,
Philipp

Jeanderson

Mar 8, 2017, 7:18:07 AM
to Philipp Wollermann, Jakob Buchgraber, bazel-discuss
Awesome!
Thanks for the detailed information, Philipp!

I'm doing some research and I will ping you guys back to discuss more details and ask more specific questions if necessary :)

Thank you very much for your time.

Regards
--
Jeanderson Candido
Graduate Student in Computer Science 
Federal University of Pernambuco, Brazil

Johnson Li

Mar 21, 2017, 2:18:58 AM
to bazel-discuss, buc...@google.com, jeande...@gmail.com
Hi Philipp,

I am applying for GSoC 2017 and have some questions about "Dynamic thread pool sizing".

1. Multithreading is always appealing, but I am not sure about its effect on compilation. When compiling Bazel with default parameters, my laptop reports `Elapsed time: 183.156s, Critical Path: 178.70s` and my PC server reports `Elapsed time: 77.687s, Critical Path: 62.96s`. I guess the critical path cannot be shortened by more parallelism, and it already accounts for most of the elapsed time, so I wonder how much build time "Dynamic thread pool sizing" can actually save. Is that the goal of the project?

2. I thought compilation was CPU-intensive work, but you mentioned that other resources like RAM and I/O also matter, and that modeling these resource requirements is the core problem. You suggested two approaches: make the model configurable, or design an algorithm that predicts resource usage intelligently. The latter seems challenging and is what I prefer; do you have any suggestions on it? Is there any prior work on resource management in other build tools?

3. You discussed whether Bazel should be conservative when claiming resources, but in my opinion the build tool should always be aggressive in competing for them. What is your reasoning here?

4. Finally, do you have real-world scenarios for measuring Bazel's performance, or is the existing benchmark sufficient?


Thanks, 
Johnson