bazel brings my system to its knees... how to troubleshoot?

mpi...@gmail.com

Dec 25, 2017, 10:45:31 PM

to bazel-discuss

Hi,

I'm fiddling with Bazel in the hope that my company can use it in place of its existing rake-based solution. I was making progress until I made a seemingly minor change to a BUILD file; I'm not even sure exactly what I changed.

After that change, Bazel is bringing my system to its knees; I've run it three times with that result, with two Bazel versions and two kernel versions [1]. While the last two times I recovered without resorting to the reset button, I'm hesitant to try again unless I know what to do to get some useful troubleshooting information out of the affair.

When run, Bazel seems to run correctly for a bit but eventually gets "stuck" on some set of files; each file's compile timer counts up but IO and CPU activity drop to zero. When I've checked, the compiler is not running and I only see one Bazel process. Pressing Ctrl-C causes Bazel to print out a message about waiting for a Bazel server process - listing a PID that doesn't exist.

Additional symptoms include:

* GUI locking up, only fixed by reboot (only on the first run, with old Bazel and old kernel); symptoms were consistent with OOM, but this doesn't appear to have happened on subsequent runs so I'm not sure it ran out of memory the first time either.

* processes that access /proc freezing for a while - ps, htop, pgrep, etc

* once the offending bazel process is gone, either via reboot or `kill -9`, attempting to remove the cache dir seems unreasonably slow [2].
** `bazel clean --expunge` seemed to be doing nothing, so I killed it.
** `rm -fr /path/to/cache/uuid/` was also quite slow but did succeed eventually. Before it succeeded, I attached strace and observed frequent multi-second delays between completion of one syscall and the next.
** After another run, executing `strace -o rm-cache -t rm ... ` and inspecting its output (`grep -c unlinkat`) indicated only ~112k files deleted.
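For anyone wanting to quantify those multi-second delays rather than eyeball them, here is a minimal sketch that scans `strace -t` output for slow steps. It assumes the plain `HH:MM:SS syscall(...)` line layout that `strace -t -o FILE` writes without `-f`; the function name `find_stalls` and the 2-second default are illustrative choices, not anything from strace or Bazel. Since `-t` stamps each line when the syscall starts, a gap covers the previous syscall plus whatever happened before the next one began.

```python
import re
from datetime import datetime

# Matches the wall-clock prefix written by `strace -t`,
# e.g. '22:45:31 unlinkat(4, "foo", 0) = 0'.
_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(\w+)\(")


def find_stalls(lines, min_gap_seconds=2):
    """Return (gap_seconds, previous_syscall, next_syscall) for each slow step.

    Hypothetical helper: the name and threshold are assumptions for
    illustration only.
    """
    stalls = []
    prev_time = prev_call = None
    for line in lines:
        match = _LINE.match(line)
        if not match:
            continue  # skip exit notices, signals and other non-syscall lines
        now = datetime.strptime(match.group(1), "%H:%M:%S")
        call = match.group(2)
        if prev_time is not None:
            gap = (now - prev_time).total_seconds()
            if gap < 0:
                gap += 24 * 60 * 60  # the trace crossed midnight
            if gap >= min_gap_seconds:
                stalls.append((int(gap), prev_call, call))
        prev_time, prev_call = now, call
    return stalls
```

With the trace file from above, something like `find_stalls(open("rm-cache"))` would list each pause and the syscalls on either side of it. With `-t` the resolution is one second, so shorter stalls are invisible to this approach.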

I'm not sure how Bazel could create files/symlinks that would cause problems but it _is_ Bazel's cache that is exhibiting the symptoms...

-------

[1]
Run | Bazel | Linux kernel
#1 | 0.8.1 | 4.9.6 (+gentoo patches)
#2 | 0.8.1 | 4.9.60
#3 | 0.9.0 | 4.9.60

OS: Gentoo
Bazel is built from the gentoo ebuild https://github.com/gentoo/gentoo/blob/master/dev-util/bazel/bazel-0.7.0.ebuild , copied & renamed to build the newer versions above.

[2] This is ext4 (with discard) on an SSD. It's a consumer-grade SSD but I'd still expect it to be far faster.

Regards
Mark

Austin Schuh

Dec 26, 2017, 3:04:51 AM

to mpi...@gmail.com, bazel-discuss

Does iostat -dx 1 tell you anything useful during the build?

I had a system that was brought to its knees as well, and it was the filesystem. I had some consumer-grade NVMe drive IIRC. I ended up debugging it with iostat. After a period of high sustained writes, the drive would slow down significantly and only process a couple hundred IOPS.
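If watching iostat scroll by is impractical while the machine is struggling, the output can be captured and checked afterwards. The sketch below is one possible way to pick out saturated devices from captured `iostat -dx` text. It assumes each report has a header row starting with `Device` and containing a `%util` column, and it locates that column by name because the column set differs between sysstat versions. The function name `saturated_devices` and the 90% threshold are made up for illustration.

```python
def saturated_devices(iostat_text, util_threshold=90.0):
    """Return (device, util_percent) for rows at or above the threshold.

    Hypothetical helper for captured `iostat -dx` output; the threshold
    is an arbitrary assumption, not a documented cut-off.
    """
    header = None
    hits = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields  # remember the column names of this report
            continue
        if header is None or "%util" not in header:
            continue  # banner lines before the first header
        if len(fields) != len(header):
            continue  # not a data row that lines up with the header
        try:
            util = float(fields[header.index("%util")])
        except ValueError:
            continue
        if util >= util_threshold:
            hits.append((fields[0], util))
    return hits
```

A possible use is to save the output of `iostat -dx 1` to a file during the build and pass the file's contents to the function. The first report iostat prints covers the time since boot, so it is usually the later reports that matter.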

2 things fixed it for me.

1) I started using the --experimental_sandbox_base=/dev/shm/ flag. This moves all the symlink creations for each sandbox to /dev/shm/, which is a massive speedup. For me, that was a 10-100x speedup for builds with lots of files and simple compilations.

2) I upgraded to a datacenter-grade SSD. I benchmarked this without the flag from 1) and saw massive sustained IOPS and reasonable build times. The sandbox flag still helped, but not by as much.
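For builds launched from a wrapper script, the flag from 1) could be added conditionally. This is only a sketch: `bazel_build_args` is a hypothetical helper, and the check that the directory exists and is writable is an added precaution, not something Bazel requires. Keep in mind that /dev/shm is normally a RAM-backed tmpfs, so sandbox contents placed there consume memory.

```python
import os


def bazel_build_args(targets, sandbox_base="/dev/shm"):
    """Build an argv list for `bazel build`, adding the sandbox-base flag
    only when the directory looks usable.

    Hypothetical wrapper helper; only the flag name comes from this thread.
    """
    args = ["bazel", "build"]
    if os.path.isdir(sandbox_base) and os.access(sandbox_base, os.W_OK):
        args.append("--experimental_sandbox_base=" + sandbox_base)
    return args + list(targets)
```

The resulting list can be handed to `subprocess.run`. The same flag can also be passed directly on the command line.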

Good luck, and post back what you find!

Austin

--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/34bde632-1133-4048-bf61-79f42381d7e3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

George Gensure

Dec 26, 2017, 5:23:19 AM

to Austin Schuh, mpi...@gmail.com, bazel-discuss

Bazel runs an undocumented check to determine whether sandboxing is supported, and this check is invoked even if sandboxing is not used. It invokes the sandbox wrapper and enumerates mounts in order to remount them read-only, under a cloned (not forked) PID that will 'not exist' as a standalone process. The filesystem privatization and read-only remount can hang if certain mounted filesystems are stuck or unavailable (e.g. NFS).

This is a diagnosis of the environment only, since you cannot pass debug flags to this invocation in vanilla Bazel. I have a patch that allows some further control, which I will try to release soon; check your mounts in the meantime.
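As a starting point for checking mounts, a rough sketch like the following lists network-backed entries from /proc/mounts. It assumes the usual whitespace-separated layout (device, mount point, filesystem type, options, ...). The set of filesystem types treated as "network" is a guess for illustration, and the name `network_mounts` is hypothetical. It only lists candidates and deliberately does not touch the mount points, since accessing a hung mount may itself block.

```python
# Filesystem types treated as network-backed; this set is an assumption
# and may need extending for a given system.
NETWORK_FS_TYPES = frozenset({"nfs", "nfs4", "cifs", "fuse.sshfs"})


def network_mounts(mounts_text, fs_types=NETWORK_FS_TYPES):
    """Return (mount_point, fs_type, device) for each network-backed mount.

    Hypothetical helper that parses text in /proc/mounts format.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        device, mount_point, fs_type = fields[:3]
        if fs_type in fs_types:
            found.append((mount_point, fs_type, device))
    return found
```

On Linux this could be called as `network_mounts(open("/proc/mounts").read())`.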

-George

mpi...@gmail.com

Dec 26, 2017, 7:11:03 PM

to bazel-discuss

`iostat -dx 1` shows 98-100% utilization, and `--experimental_sandbox_base=/dev/shm/` fixes that - so this one can be marked as solved.

Thanks!
