Ulf,

We have a primarily C++ code base, about 750k lines, which relies heavily on templates and other constructs that cause the compiler to use a lot of memory. We are using machines spanning laptops with 16G of RAM to workstations with 64 cores and 128G. Actions can use anywhere from 100MB of RAM to 6GB and up. I am 99.5% confident these particular problems are memory related. We have side-by-side runs, identical in everything except bazel version, between 0.9 and 0.11, and 0.11 consistently OOMs our cloud build machines. It also makes a wide variety of our other machines OOM or show symptoms of extreme memory pressure. With this hack... ahem, patch, all memory-related symptoms are resolved.

I don't think that whatever changed in bazel is at fault, of course; it merely further exposes the fact that bazel does a poor job of scheduling memory-heavy workloads.
I have no problem with this being merged under an opt-in flag. For the moment, we're just going to run a patched bazel version so we can make progress. We have one other potential sandbox-related bug to fix before we flip to 0.11, though, which I haven't fully triaged yet.
Also, the approach I've outlined in this patch basically ignores the memory estimates anyway and would work just fine if you had no memory estimates at all. The logic boils down to: don't start a job if there is insufficient remaining usable memory, with the exception of always allowing at least one job to run.

Josh
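In rough terms, that admission rule might look like the standalone sketch below. The class and method names (MemoryGate, tryAcquire) are purely illustrative; this is not the actual patch or Bazel's real resource-acquisition API.

    // Hypothetical, standalone sketch of the admission rule described above;
    // MemoryGate and its methods are illustrative names, not Bazel's actual API.
    final class MemoryGate {
      private int runningJobs = 0;

      /**
       * Returns true if a new job may start. A job is admitted when enough
       * memory is available, or unconditionally when nothing else is running,
       * so the build can always make progress on a memory-starved machine.
       */
      synchronized boolean tryAcquire(long availableBytes, long requiredFreeBytes) {
        if (runningJobs == 0 || availableBytes >= requiredFreeBytes) {
          runningJobs++;
          return true;
        }
        return false;
      }

      /** Called when a job finishes, freeing its admission slot. */
      synchronized void release() {
        runningJobs--;
      }
    }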
On Mon, Mar 26, 2018 at 1:30 PM, Ulf Adams <ulf...@google.com> wrote:
Hi Josh,

We're certainly aware that the scheduling isn't very good, although off the top of my head I am not aware of any changes in 0.11. The memory 'estimates' in Bazel are basically just random numbers; we were actually thinking of removing them entirely.

Can you say more about the situations where you're seeing problems? What kinds of actions are you running? What kind of machine do you have? How certain are you of your diagnosis that it's related to memory?

We could add something like your change, although we'd likely require it to be opt-in behind a flag (--experimental_local_memory_estimates?). We're a bit reluctant to add more complexity here given that the code isn't very good to start with. It's impressive that you got it to work!

Thanks,
-- Ulf
On Mon, Mar 26, 2018 at 9:38 AM Jingwen Chen <jin...@google.com> wrote:
+Ulf, Philipp
On Tue, Mar 20, 2018 at 5:12 PM Josh Pieper <josh....@tri.global> wrote:
As I've mentioned in previous threads, we've had challenges with bazel's estimation of the memory that actions require. With bazel 0.11, something in the scheduling has changed that makes it trigger the Linux out-of-memory killer reliably for nearly all our builds. There are enough different machine configurations and enough variation in memory usage across actions that it isn't feasible to just limit the number of jobs.

I made a very simple patch as a proof of concept to see if we could make bazel respect the machine's resources a bit better. It just reads /proc/meminfo each time a query for resources is made and allows work to proceed only if a sufficient amount of system memory is available. It has worked locally for us.

I recognize this patch is not mergeable as is, for many reasons (it is Linux-only, probably violates a number of design principles, etc.). Is there anyone on the bazel team who could help work with me to figure out how to integrate something that achieves these goals in an acceptable way?

Regards,
Josh
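A minimal sketch of such a probe follows, assuming it keys off the MemAvailable field of /proc/meminfo (reported in kB, available since Linux 3.14). This is illustrative only and not the actual patch.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Minimal sketch of a /proc/meminfo probe (Linux-only), not the actual
    // patch. It parses the MemAvailable field, which the kernel reports in kB.
    final class MemInfo {
      static long availableBytes() throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
          if (line.startsWith("MemAvailable:")) {
            String[] parts = line.trim().split("\\s+");
            return Long.parseLong(parts[1]) * 1024L;  // kB -> bytes
          }
        }
        return Long.MAX_VALUE;  // field missing (very old kernel): don't throttle
      }
    }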
I am fairly certain the problem is (2), or rather just that it happens to schedule more of the worst memory-consuming jobs at the same time. I did notice that when switching to 0.11 we had to work around a temporary-file creation problem with NVIDIA's nvcc: it creates temporary files whose names are derived from the input filename with no randomness. In 0.11, bazel either started scheduling the PIC and non-PIC versions of those actions simultaneously, or I suppose it might have let them start sharing the same temporary directory, which gave nvcc problems.
I ran a full build of our tree with --batch at 0.9.0, 0.11.0, and 0.11.1 with the above patch. All runs were done on a 24-core machine with 128G of RAM, `-j 200` (because 0.9.0 doesn't use all the processors otherwise), and --ram_utilization_factor=50. While each build was running, I used smem (https://www.selenic.com/smem/) every 10s to sum up the RSS usage of bazel plus all the active compilation processes. bazel's maximum memory usage was about the same in all three cases, although the child usage did vary significantly:

                    bazel RSS    RSS (bazel + everything)
    0.9.0           5.7G         84G
    0.11.0          5.8G         110G
    0.11.1+patch    6.4G         79G

Our build machines have less RAM per CPU core, which ultimately causes the OOM events under 0.11.0. It is hard to represent succinctly in email form, but the 0.11.0 plots had extensive periods at or near 110G utilization, whereas with 0.9.0 and the patched version the peak was reached only briefly.
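For anyone wanting to reproduce this kind of measurement, a rough sketch follows. Josh used smem; this version instead walks /proc directly every 10 seconds and sums VmRSS (reported by the kernel in kB) for bazel and compiler processes. The process-name filters are assumptions and should be adjusted to the toolchain in use.

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public final class RssSampler {
      // Process names to include in the total; adjust for your toolchain.
      private static final List<String> NAMES =
          List.of("bazel", "cc1plus", "clang", "nvcc");

      public static void main(String[] args) throws Exception {
        while (true) {
          long totalKb = 0;
          // Every numeric directory under /proc is a running process.
          try (DirectoryStream<Path> pids =
              Files.newDirectoryStream(Paths.get("/proc"), "[0-9]*")) {
            for (Path pid : pids) {
              try {
                String comm = Files.readString(pid.resolve("comm")).trim();
                if (NAMES.stream().noneMatch(comm::contains)) {
                  continue;
                }
                for (String line : Files.readAllLines(pid.resolve("status"))) {
                  if (line.startsWith("VmRSS:")) {  // e.g. "VmRSS:  123456 kB"
                    totalKb += Long.parseLong(line.replaceAll("[^0-9]", ""));
                    break;
                  }
                }
              } catch (IOException e) {
                // The process exited between listing and reading; skip it.
              }
            }
          }
          System.out.printf("total RSS: %.1f GiB%n", totalKb / (1024.0 * 1024.0));
          Thread.sleep(10_000);  // sample every 10 seconds
        }
      }
    }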
So I don't believe there was a regression in bazel's capability between 0.9.0 and 0.11.0; it just exacerbated, in our case, bazel's already poor handling of memory-heavy workloads.
Josh
On Tue, 27 Mar 2018 at 21:43, Josh Pieper <josh....@tri.global> wrote:

> I did notice that when switching to 0.11 we had to work around a
> temporary-file creation problem with NVIDIA's nvcc. [...] In 0.11, bazel
> either started scheduling the PIC and non-PIC versions of those actions
> simultaneously, or I suppose it might have let them start sharing the same
> temporary directory, which gave nvcc problems.

Interesting. I'm not aware of any relevant changes in picness handling that could have caused this. Marcel, maybe?

> I ran a full build of our tree with --batch at 0.9.0, 0.11.0, and 0.11.1
> with the above patch. [...] The 0.11.0 plots had extensive periods at or
> near 110G utilization, whereas with 0.9.0 and the patched version the peak
> was reached only briefly.

Would a smaller --ram_utilization_factor or --jobs help?

> So I don't believe there was a regression in bazel's capability between
> 0.9.0 and 0.11.0; it just exacerbated, in our case, bazel's already poor
> handling of memory-heavy workloads.

Well, nevertheless, we should figure something out... Sadly, I don't think we can do a very good job of estimating the RAM use of a C++ compilation action, so it's mostly groping in the dark :(