Mark Waite: this bug cannot be understood by thinking about the behavior of one git client thread in isolation. The straight-line code in the git client is correct: it always closes the unique temporary file before calling command-line git. The problem is that other threads in the JVM may also run commands and create subprocesses, and creating a subprocess duplicates the parent's open file descriptors. It is one of those duplicates that is still open, not the descriptor opened by the git client code. To be more explicit, imagine that I have two build jobs running: job 1 needs to do a git checkout, and job 2 needs to run make. Say that Jenkins is running as process ID 1000, thread 1 is running the git checkout, and thread 2 is running the make. Here's a thread/process execution interleaving in which the bug manifests:
- Process 1000, thread 1: open ssh123456.sh for writing as file descriptor 4.
- Process 1000, thread 2: fork() in preparation to run make, creating process 1001. Process 1001 inherits file descriptor 4, open for writing to ssh123456.sh.
- Process 1001: exec() make.
- Process 1000, thread 1: write contents of ssh123456.sh.
- Process 1000, thread 1: close ssh123456.sh. Process 1000 no longer has ssh123456.sh open for writing. However, this does not close file descriptor 4 in process 1001 (running make), hence ssh123456.sh is still open somewhere on the system for writing.
- Process 1000, thread 1: fork() in preparation to run git, creating process 1002.
- Process 1002: exec() git.
- Process 1002: fork() in preparation to run SSH_AGENT script, creating process 1003.
- Process 1003: exec() ssh123456.sh --> ETXTBSY. ssh123456.sh is open for writing as file descriptor 4 in process 1001 (make).
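To make the interleaving above concrete, here is a toy model of per-process descriptor tables. It makes no real system calls. The class FdTableModel and its methods are made up for illustration, and it assumes close-on-exec is not set, so exec() keeps inherited descriptors. It is only a sketch of the bookkeeping, not of how the kernel or the JVM implements any of this.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of per-process descriptor tables; no real system calls are made. */
final class FdTableModel {
    // pid -> paths that the process currently holds open for writing
    private final Map<Integer, Set<String>> writers = new HashMap<>();

    FdTableModel(int rootPid) {
        writers.put(rootPid, new HashSet<>());
    }

    void openForWrite(int pid, String path) {
        writers.get(pid).add(path);
    }

    void close(int pid, String path) {
        // Closing affects only the calling process's table.
        writers.get(pid).remove(path);
    }

    void fork(int parentPid, int childPid) {
        // The child receives a copy of every descriptor the parent has open.
        writers.put(childPid, new HashSet<>(writers.get(parentPid)));
    }

    /** Returns "ETXTBSY" if any process holds the path open for writing, else "OK". */
    String exec(String path) {
        for (Set<String> open : writers.values()) {
            if (open.contains(path)) {
                return "ETXTBSY";
            }
        }
        return "OK";
    }

    /** Replays the interleaving from the list above. */
    static String replayInterleaving() {
        String script = "ssh123456.sh";
        FdTableModel os = new FdTableModel(1000);
        os.openForWrite(1000, script); // thread 1 opens the script
        os.fork(1000, 1001);           // thread 2 forks for make; 1001 inherits the descriptor
        os.close(1000, script);        // thread 1 closes its own descriptor only
        os.fork(1000, 1002);           // thread 1 forks for git
        os.fork(1002, 1003);           // git forks to run the script
        return os.exec(script);        // 1001 (make) still holds a writer
    }
}
```

In this model, replayInterleaving() ends in ETXTBSY because process 1001 still holds the copy, whereas an open and close with no fork in between ends in OK.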
So the script file is not open in the Jenkins process, but it is nonetheless open somewhere on the system, hence ETXTBSY. And the fact that some other, totally unrelated code can make a copy of the file descriptor and mess things up is why it's a Java runtime bug. A combination of vfork() and the close-on-exec flag would ensure that file descriptor 4 in process 1001 is closed at the exec() in step 3, thus closing the copy. That's what's being contemplated as the fix in the JVM. One workaround is what's in PR313: copy the script using cp, which doesn't create children, so it can't have stranded an open file descriptor to its destination. Another is what I proposed, which is to use a lock to ensure that steps 2 and 3 above cannot happen between steps 1 and 5.
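For what it's worth, the lock idea could look roughly like the sketch below. ForkGuard, writeScript and launch are hypothetical names, not anything in the git client plugin, and I'm using a single plain mutex for simplicity. It only helps if every subprocess launch in the JVM goes through the same guard, which is a big assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical guard that serializes script writes against subprocess launches. */
final class ForkGuard {
    // One JVM-wide lock shared by script writers and process launchers.
    private static final ReentrantLock LOCK = new ReentrantLock();

    /** Creates, writes, and closes the script while no guarded launch can fork. */
    static Path writeScript(String contents) throws IOException {
        LOCK.lock();
        try {
            Path script = Files.createTempFile("ssh", ".sh");
            // Files.write opens the file, writes the bytes, and closes it before returning.
            Files.write(script, contents.getBytes(StandardCharsets.UTF_8));
            script.toFile().setExecutable(true);
            return script;
        } finally {
            LOCK.unlock();
        }
    }

    /** Starts a subprocess only while no guarded script write is in progress. */
    static Process launch(List<String> command) throws IOException {
        LOCK.lock();
        try {
            return new ProcessBuilder(command).start();
        } finally {
            LOCK.unlock();
        }
    }
}
```

The git thread would call writeScript() and then launch() for git, and the make thread would call launch() too, so a fork can never land while the script is open for writing. Any code that starts processes without going through launch() would reopen the race.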