bad interpreter: Text file busy


mmanu.ch...@gmail.com

Jul 2, 2018, 4:42:07 PM
to bazel-discuss
Hi,

I'm taking a shot in the dark here. I apologize in advance for being vague. We are getting the following error:

/usr/bin/env: bad interpreter: Text file busy

when some cpplint tests are run using `py_test` [1].

One way this can happen is when an executable file is still open for writing at the moment it is executed; the exec then fails with ETXTBSY ("Text file busy").
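To make that failure mode concrete, here is a minimal sketch that tries to run a script while a write handle to it is still open, and again after the handle is closed. The function names (`exec_errno`, `demo`) and the script contents are made up for illustration; the behaviour described is what Linux normally does, and other platforms or kernels may differ.

```python
import errno
import os
import subprocess
import tempfile


def exec_errno(path):
    """Run path as a program; return 0 if it could be executed, else the errno."""
    try:
        subprocess.run([path], check=False)
        return 0
    except OSError as e:
        return e.errno


def demo():
    """Return (errno while a writer is open, errno after the writer is closed)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.sh")
        with open(path, "w") as writer:
            writer.write("#!/bin/sh\nexit 0\n")
            writer.flush()
            os.chmod(path, 0o755)
            # The file is still open for writing here, so on Linux the exec
            # is normally refused with errno.ETXTBSY.
            while_open = exec_errno(path)
        # The writer is closed now, so the same exec is expected to succeed.
        after_close = exec_errno(path)
        return while_open, after_close


if __name__ == "__main__":
    while_open, after_close = demo()
    print("while open:", errno.errorcode.get(while_open, while_open))
    print("after close:", errno.errorcode.get(after_close, after_close))
```

In the Bazel case the writer and the exec would be in different threads or processes, which is why the failure only shows up occasionally.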

Bazel seems to be using a template to create a Python executable script which is run in order to run Python tests [2] [3].

I suspect that the Python script might be getting executed before it is closed and causing the error.

The error is very infrequent and non-deterministic. I have not been able to reproduce it by repeatedly running the related tests on my machine from our source code, or by creating a separate example with 10000 `py_test`s.

Any help would be appreciated.

[1] https://github.com/RobotLocomotion/drake/issues/8651

[2] https://github.com/bazelbuild/bazel/blob/a610a2b77893ed9edd3038cffe803bce68f83a80/src/main/java/com/google/devtools/build/lib/bazel/rules/python/python_stub_template.txt

[3] https://github.com/bazelbuild/bazel/blob/8ae7a3ba8c68278b362f6c0e5aa127c96f8f6025/src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java#L62

Brian Silverman

Jul 2, 2018, 4:50:15 PM
to Mmanu Chaturvedi, bazel-discuss
Hi,

I occasionally see what looks like the same problem in my CI builds too. However, I see it more commonly with Skylark actions writing shell scripts, both in rules_go and some custom rules. I don't think it's related to Python in any way.

Unfortunately I don't think there's an easy fix... It looks like a JDK bug that nobody cares to fix; https://bugs.openjdk.java.net/browse/JDK-8068370 is the still-open bug report I found when I went looking.

If it is the same problem, you'll need to delete the output files to reproduce it. `bazel clean` will do that, or you might try `bazel shutdown` and then manually deleting just the problematic ones to get more tries in faster and have a better chance of reproducing it.

--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/4d388fad-fed2-4f27-bda6-c353623b8013%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

George Gensure

Jul 2, 2018, 5:36:41 PM
to Brian Silverman, Mmanu Chaturvedi, bazel-discuss
I encountered this repeatedly internally and fixed it with this change: https://github.com/werkt/bazel/commit/f4d1016b0fdf4deb163522a21cbb0758d8e36c6a

It's in a fairly low-traffic section of the code and should apply pretty cleanly to older versions, but I just rebased it onto master. I had believed it to be a fix for a different problem, and my pull request got rejected. Thanks for reminding me to try to put it up again.

-George

Brian Silverman

Jul 2, 2018, 6:10:46 PM
to ggen...@uber.com, Mmanu Chaturvedi, bazel-discuss, buc...@google.com, Ulf Adams
I guess doing a lock there might solve the problem. Even if it doesn't fully fix it, making it happen less often would be good. I had given up on doing much with it, so that's pretty cool, especially given that it's so simple.
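For readers following along, the general idea of "doing a lock" can be sketched as below. This is not George's patch (which is Java, inside Bazel); it is only an illustration in Python, assuming a single process-wide lock that is held both while an executable is being written and while a child is being spawned. All names here (`_spawn_lock`, `write_executable`, `spawn_and_wait`, `run_many`) are hypothetical.

```python
import os
import subprocess
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical process-wide lock: held both while an executable is open for
# writing and while a child is being spawned, so the two can never overlap.
_spawn_lock = threading.Lock()


def write_executable(path, text):
    """Write an executable file while holding the lock."""
    with _spawn_lock:
        with open(path, "w") as f:
            f.write(text)
        os.chmod(path, 0o755)


def spawn_and_wait(argv):
    """Spawn a child under the lock, then wait for it outside the lock."""
    with _spawn_lock:
        # In CPython on POSIX, Popen() returns only after the child has
        # exec'd (or failed to), so no fork is pending once the lock is released.
        child = subprocess.Popen(argv)
    return child.wait()


def run_many(count, workers=8):
    """Write and run `count` small scripts from several threads; return exit codes."""
    with tempfile.TemporaryDirectory() as tmp:
        def job(i):
            path = os.path.join(tmp, f"script_{i}.sh")
            write_executable(path, "#!/bin/sh\nexit 0\n")
            return spawn_and_wait([path])

        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(job, range(count)))
```

The cost is that every write and every spawn in the process is serialized behind one lock, which is where a performance question naturally comes in.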

I'm surprised that ProcessBuilder#start actually waits for the child process to somehow signal that it has closed all the inherited FDs, vs just returning as soon as the fork() does. I'm curious: did you, +bu...@google.com, or +ulf...@google.com find any documentation stating that (or look at the implementation and see it)?

Have you looked for any performance impact? With a bunch of small actions, even on a laptop with only 4 cores (8 hyperthreads), I already see the bazel server process bottlenecked and unable to keep all the CPUs busy, and this seems like it'd make that worse. I guess if it's already in the remote worker code, that's probably not too big of a deal either way.

George Gensure

Jul 3, 2018, 11:06:33 AM
to Brian Silverman, Mmanu Chaturvedi, bazel-discuss, Jakob Buchgraber, Ulf Adams
Jakob's comment there was retained due to its relevance. It is endemic to multithreaded process execution with that interface.

If you take a look at the interface that ProcessBuilder provides, it really cannot be any other way: aside from guaranteeing that the exec completed successfully, it allows significant control over file descriptors and environment behavior, all of which must be synchronized. To boot, it uses vfork in my environment (Oracle Java 8), further enforcing the race mitigation, and a trace of a simple execution of a "sleep 10" invocation shows that synchronization does occur (my notes inline):

[pid 13070] pipe([9, 10])               = 0

[pid 13070] pipe([11, 12])              = 0 << exec status pipe created in parent

[pid 13070] vfork(Process 13094 attached
 <unfinished ...>
[pid 13094] close(9)                    = 0
[pid 13094] close(10)                   = 0
[pid 13094] close(11)                   = 0

[pid 13094] dup2(12, 3)                 = 3 << pipe duplicated to fd 3. this may indicate a preferred order for the close, lowest will probably be last closed

[pid 13094] close(12)                   = 0
[pid 13094] close(4)                    = 0
[pid 13094] close(5)                    = 0
[pid 13094] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 13094] mprotect(0x7f3b241be000, 32768, PROT_READ|PROT_WRITE) = 0
[pid 13094] getdents(4, /* 11 entries */, 32768) = 264
[pid 13094] close(6)                    = 0
[pid 13094] close(7)                    = 0
[pid 13094] close(8)                    = 0
[pid 13094] close(35)                   = 0
[pid 13094] getdents(4, /* 0 entries */, 32768) = 0
[pid 13094] close(4)                    = 0

[pid 13094] fcntl(3, F_SETFD, FD_CLOEXEC) = 0 << fd 3 will be closed when the exec has occurred

[pid 13094] execve("/bin/sleep", ["sleep", "10"], [/* 74 vars */] <unfinished ...>
[pid 13070] <... vfork resumed> )       = 13094
[pid 13070] close(12)                   = 0

[pid 13070] read(11, "", 4)             = 0 << read control message. If the exec does not succeed, errno is marshalled into the pipe as a 32bit int.
[pid 13070] close(11 <unfinished ...>       << the read completion here also offers synchronization with the exec.

[pid 13094] <... execve resumed> )      = 0
[pid 13070] <... close resumed> )       = 0
[pid 13070] close(9 <unfinished ...>
[pid 13094] brk(0 <unfinished ...>
[pid 13070] <... close resumed> )       = 0
[pid 13094] <... brk resumed> )         = 0x171d000
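The exec status pipe in the trace above can be imitated in a few lines. The following is a rough sketch of the same handshake in Python, not the JDK's implementation: the child keeps a close-on-exec write end, the parent blocks on the read end, and an empty read means the exec succeeded. The function name `exec_status` is made up, and the child is simply reaped rather than managed.

```python
import errno
import os
import struct


def exec_status(argv):
    """Fork and exec argv; return 0 if the exec succeeded, else the exec errno."""
    # In Python 3.4+ os.pipe() returns close-on-exec descriptors, which plays
    # the role of the fcntl(3, F_SETFD, FD_CLOEXEC) call in the trace.
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: on a successful exec the kernel closes write_end for us.
        os.close(read_end)
        try:
            os.execvp(argv[0], argv)
        except OSError as e:
            # Exec failed: marshal errno into the pipe as a native int.
            os.write(write_end, struct.pack("i", e.errno))
        finally:
            os._exit(127)
    # Parent: close our copy of the write end, then block until the child
    # either execs (read returns b"") or reports an errno.
    os.close(write_end)
    data = os.read(read_end, struct.calcsize("i"))
    os.close(read_end)
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return struct.unpack("i", data)[0] if data else 0


if __name__ == "__main__":
    print(exec_status(["sh", "-c", "exit 0"]))
    missing = exec_status(["/nonexistent/program"])
    print(errno.errorcode.get(missing, missing))
```

The blocking read is the synchronization point: the parent cannot return from the spawn until the child has either exec'd or failed to.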

The performance is what the performance will be for something being correctly synchronized. The race is very specific: a child forked while a file is open for writing has not yet had time to perform its exec (which would close the inherited fd) when another thread gets through its own vfork and exec of that file, so the operating system still sees the file as open.

-George



George Gensure

Jul 9, 2018, 5:14:23 PM
to Brian Silverman, Mmanu Chaturvedi, bazel-discuss, Jakob Buchgraber, Ulf Adams
Issued a pull request here: https://github.com/bazelbuild/bazel/pull/5556, please comment with any +1s and excitement for the fix.

-George

mmanu.ch...@gmail.com

Jul 19, 2018, 1:13:54 PM
to bazel-discuss
> I'm surprised that ProcessBuilder#start actually waits for the child process to somehow signal that it has closed all the inherited FDs, vs just returning as soon as the fork() does. I'm curious: did you, +buc...@google.com, or +ulf...@google.com find any documentation stating that (or look at the implementation and see it)?
Thanks a lot, George and Brian!