Bazel and the Linux OOM killer


iulian dragos

Mar 29, 2022, 5:26:06 AM
to bazel-discuss
Hi,

We're seeing a growing number of CI jobs fail because they are killed by the Linux OOM killer. We're using the largest EC2 instances available (96 cores, 384 GB of RAM). Most often Bazel gets killed while running large tests.

I know we can limit parallelism by passing `--jobs`, but that seems very crude (most of our tests are small). Relying on the scheduler and tagging tests with their size doesn't seem to be a good option either, since the largest test size assumes a RAM usage of 800 MB, which is an order of magnitude below the real number (JVM tests).

Googling turns up a few discussions and StackOverflow answers, so this seems to happen to others too.

How do others deal with this issue?

thanks,
iulian

Lars Clausen

Mar 29, 2022, 10:58:35 AM
to iulian dragos, William Saakyan, bazel-discuss
One workaround would be to tag the (presumably) few very large tests with `exclusive` (https://docs.bazel.build/versions/main/test-encyclopedia.html#tag-conventions). Then they're run alone, which should help with the memory issues, but is likely to make the CI take longer.

There doesn't seem to be an equivalent to the `cpu:N` tag for memory. Since fairly recently (https://github.com/bazelbuild/bazel/commit/d7f0724b6b91b6c57039a1634ff00ccebd872714), there has been support for specifying expected resource usage for a Starlark-defined rule, but that handle is not surfaced in the test rules. 

I take it you've already spent some time trying to reduce the size of the tests themselves.

If you use workers, consider reducing `--worker_max_instances` from the default 4, if you see workers sitting idle holding on to a lot of memory. We're looking into better management of worker memory usage.

-Lars
--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/CAMNsu3nZ7_Bxa%3DSdg4iv7%3D729rLcmXvJtJz-dUci1fJ0k-%2BWrg%40mail.gmail.com.


iulian dragos

Mar 30, 2022, 4:09:56 AM
to Lars Clausen, William Saakyan, bazel-discuss
On Tue, Mar 29, 2022 at 4:58 PM Lars Clausen <lar...@google.com> wrote:
> One workaround would be to tag the (presumably) few very large tests with `exclusive` (https://docs.bazel.build/versions/main/test-encyclopedia.html#tag-conventions). Then they're run alone, which should help on the memory issues, but is likely to make the CI take longer.
>
> There doesn't seem to be an equivalent to the `cpu:N` tag for memory. Since fairly recently (https://github.com/bazelbuild/bazel/commit/d7f0724b6b91b6c57039a1634ff00ccebd872714), there has been support for specifying expected resource usage for a Starlark-defined rule, but that handle is not surfaced in the test rules.
>
> I take it you've already spent some time trying to reduce the size of the tests themselves.

Probably not enough, but the JVM footprint is pretty high, and most tests would qualify as enormous. I guess we could use `--local_ram_resources` to reduce the total memory Bazel assumes is available (equivalent to scaling up the logical per-test memory), but we haven't tried that yet.

> If you use workers, consider reducing `--worker_max_instances` from the default 4, if you see workers sitting idle holding on to a lot of memory. We're looking into better management of worker memory usage.

We're using multiplex workers, so my understanding is that we should already have only one instance per mnemonic.

The alternatives we're looking at:
- reduce parallelism globally (`--jobs`). Cons: conservative and slower, but most likely to solve the issue. Requires tweaking.
- reduce memory consumption (`--local_ram_resources`). Cons: needs tweaking, and we're unsure how the scheduler works.
- reduce the OOM score of the Bazel server process. This may still kill individual tests, which is somewhat better but not a solution (the CI run would still fail, but it would partially populate the cache, so the next run would be more likely to succeed).
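
For the third option, here is a minimal sketch of lowering a process's OOM score on Linux by writing to `/proc/<pid>/oom_score_adj`. The helper name `set_oom_score_adj` and its `proc_root` parameter are made up for illustration. The accepted range of -1000 to 1000 comes from the kernel's procfs interface, lowering the value normally requires root privileges, and the server PID is assumed to come from something like `bazel info server_pid`.

```python
from pathlib import Path


def set_oom_score_adj(pid: int, adj: int, proc_root: str = "/proc") -> Path:
    """Write an OOM score adjustment for `pid` and return the file written.

    Hypothetical helper: `proc_root` exists only so the function can be
    exercised against a scratch directory instead of the real /proc.
    """
    # The kernel accepts values from -1000 (never kill) to 1000 (kill first).
    if not -1000 <= adj <= 1000:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {adj}")
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{adj}\n")
    return path


# Example (needs sufficient privileges; 12345 is a placeholder PID):
# set_oom_score_adj(12345, -500)
```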

Can anyone share their experience with any of these?

iulian

Lars Clausen

Mar 30, 2022, 5:05:22 AM
to iulian dragos, William Saakyan, bazel-discuss
On Wed, 30 Mar 2022 at 10:09, iulian dragos <iulian...@databricks.com> wrote:


> On Tue, Mar 29, 2022 at 4:58 PM Lars Clausen <lar...@google.com> wrote:
>> One workaround would be to tag the (presumably) few very large tests with `exclusive` (https://docs.bazel.build/versions/main/test-encyclopedia.html#tag-conventions). Then they're run alone, which should help on the memory issues, but is likely to make the CI take longer.
>>
>> There doesn't seem to be an equivalent to the `cpu:N` tag for memory. Since fairly recently (https://github.com/bazelbuild/bazel/commit/d7f0724b6b91b6c57039a1634ff00ccebd872714), there has been support for specifying expected resource usage for a Starlark-defined rule, but that handle is not surfaced in the test rules.
>>
>> I take it you've already spent some time trying to reduce the size of the tests themselves.

> Probably not enough, but the JVM footprint is pretty high, and most tests would qualify as enormous. I guess we could use `--local_ram_resources` to reduce the total memory consumption (so, equivalently scaling up the logical per-test memory) but didn't try that yet.

The JVM footprint, as in what's used by the JVM per se, is not going to be gigabytes. You may want to use YourKit or a similar profiler to see what's actually taking all that space. There may be a legit reason, or you may be accidentally loading a lot of unnecessary things, or you may be holding on to useless objects.
 
>> If you use workers, consider reducing `--worker_max_instances` from the default 4, if you see workers sitting idle holding on to a lot of memory. We're looking into better management of worker memory usage.
>
> We're using multiplex workers, so my understanding is that we should already have only one instance per mnemonic.

Given the size of your tests, the workers are probably not your main concern, then.
 
> Our alternatives we're looking at:
> - reduce parallelism globally (--jobs). Cons: conservative and slower but most likely to solve the issue. Requires tweaking

That's a very brutal solution indeed. Since it's only tests causing this problem, --local_test_jobs would be more suitable.
 
> - reduce memory consumption (--local_ram_resources): Cons: needs tweaking, unsure how the scheduler works

This is probably your best handle. The scheduler estimates the memory usage of each action and doesn't schedule actions if there isn't enough memory available (except it'll always schedule at least one, to make progress). The actions may end up using more memory, but lowering this is a good start for sure. It defaults to 2/3 of the host memory, so with your giant tests it's almost guaranteed to end up using much more. Try aggressively reducing this until you start seeing the CI runs taking longer. 
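
A toy model of the admission rule described above may help when picking a value. It is only a sketch, not Bazel's actual scheduler code: `admit` is a hypothetical name, `max_jobs` stands in for the job limit, the host RAM is taken as 384,000 MB, and the 800 MB estimate is the figure quoted earlier in the thread for the largest test size.

```python
def admit(estimates_mb, ram_budget_mb, max_jobs):
    """Greedily admit queued actions while their RAM estimates fit the budget."""
    admitted = []
    used_mb = 0.0
    for estimate in estimates_mb:
        if len(admitted) >= max_jobs:
            break
        # The first action is always admitted so the build can make progress.
        if not admitted or used_mb + estimate <= ram_budget_mb:
            admitted.append(estimate)
            used_mb += estimate
    return admitted


HOST_RAM_MB = 384_000      # assumed: 384 GB taken as decimal megabytes
queue = [800] * 200        # 200 queued tests, each estimated at 800 MB

default_run = admit(queue, 0.67 * HOST_RAM_MB, max_jobs=96)
reduced_run = admit(queue, HOST_RAM_MB // 10, max_jobs=96)
print(len(default_run))    # 96: the RAM budget never binds, only the job cap
print(len(reduced_run))    # 48: a 38,400 MB budget holds 48 estimates of 800 MB
```

In this model the default budget never limits anything on a 96-core host, so if each test really uses several gigabytes, all 96 slots can fill up and overshoot physical memory. Lowering the budget is what reduces concurrency.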
 
> - reduce OOM score for the Bazel server process. This may still kill individual tests, which is somewhat better but not a solution (the CI run would still fail but partially populate the cache so the next run would be more likely to succeed)

Not a great solution indeed. Your CI would still be flaky.

-Lars

iulian dragos

Mar 30, 2022, 5:24:00 AM
to Lars Clausen, William Saakyan, bazel-discuss
On Wed, Mar 30, 2022 at 11:05 AM Lars Clausen <lar...@google.com> wrote:

>> Our alternatives we're looking at:
>> - reduce parallelism globally (--jobs). Cons: conservative and slower but most likely to solve the issue. Requires tweaking
>
> That's a very brutal solution indeed. Since it's only tests causing this problem, --local_test_jobs would be more suitable.
>
>> - reduce memory consumption (--local_ram_resources): Cons: needs tweaking, unsure how the scheduler works
>
> This is probably your best handle. The scheduler estimates the memory usage of each action and doesn't schedule actions if there isn't enough memory available (except it'll always schedule at least one, to make progress). The actions may end up using more memory, but lowering this is a good start for sure. It defaults to 2/3 of the host memory, so with your giant tests it's almost guaranteed to end up using much more. Try aggressively reducing this until you start seeing the CI runs taking longer.

Thanks a lot for weighing in.

Just to make sure I understand: setting `--local_ram_resources=HOST_RAM*0.66` would be a no-op, so we need to start below that? Or is HOST_RAM itself set to 2/3 of total host memory?

thanks again,
iulian

Jared Neil

Mar 31, 2022, 1:51:14 AM
to bazel-discuss
> Just to make sure I understand. Setting --local_ram_resources=HOST_RAM*0.66 would be a no-op, so we need to start below? Or HOST_RAM itself is set to 2/3 of total host memory?

Your understanding is correct. The default is literally `HOST_RAM*.67`, so you'll need to start below that. `HOST_RAM*.33` might be a good place to start so you can binary-search your way to the best value for your situation.
https://bazel.build/reference/command-line-reference#flag--local_ram_resources
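
The binary search could be sketched roughly as follows. This is an illustration under assumptions, not a tested CI recipe: `survives` is a hypothetical callback that runs the build with the given multiplier and reports whether it finished without an OOM kill, and the search assumes that if one multiplier survives, every smaller one does too.

```python
def bisect_ram_fraction(survives, lo, hi, tol=0.01):
    """Find a large HOST_RAM multiplier that still survives.

    `lo` is assumed safe and `hi` is assumed to fail (for example the
    default 0.67 that currently triggers the OOM killer). Returns the
    largest surviving multiplier probed, or `lo` if no probe survived.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if survives(mid):
            lo = mid   # mid is fine, try giving the scheduler more RAM
        else:
            hi = mid   # mid still gets killed, go lower
    return lo


def ram_flag(fraction):
    """Format a multiplier as a --local_ram_resources argument."""
    return f"--local_ram_resources=HOST_RAM*{fraction:.2f}"


# Stand-in for real CI runs: pretend anything above 0.2 gets OOM-killed.
best = bisect_ram_fraction(lambda f: f <= 0.2, lo=0.0, hi=0.67)
print(ram_flag(best))
```

With `lo=0.0` and `hi=0.67` the first probe lands at about 0.33, which matches the suggested starting point.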

Lars Clausen

Apr 1, 2022, 4:21:08 AM
to Jared Neil, bazel-discuss
Given that the default resource usage estimate is 1 CPU and 250 MB of RAM (src/main/java/com/google/devtools/build/lib/actions/AbstractAction.java:82), you can happily go even lower and see what happens. I would suggest a lower bound of 96 * 250 MB for bisecting, which comes out to HOST_RAM * 0.0625 (1/16th of your RAM).
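
The arithmetic behind that lower bound, as a small sketch. The function name is made up for illustration, and the 0.0625 figure assumes the 384 GB host is counted as 384,000 MB; counting it as 384 * 1024 MB gives a slightly smaller fraction.

```python
def ram_fraction_floor(cores, per_action_mb, host_ram_mb):
    """Fraction of host RAM needed to run one default-sized action per core."""
    return (cores * per_action_mb) / host_ram_mb


CORES = 96
DEFAULT_ACTION_RAM_MB = 250    # default per-action estimate cited above

floor_mb = CORES * DEFAULT_ACTION_RAM_MB
decimal_fraction = ram_fraction_floor(CORES, DEFAULT_ACTION_RAM_MB, 384 * 1000)
binary_fraction = ram_fraction_floor(CORES, DEFAULT_ACTION_RAM_MB, 384 * 1024)

print(floor_mb)            # 24000
print(decimal_fraction)    # 0.0625
print(binary_fraction)     # 0.06103515625
```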

-Lars

