Summary:
Running multiple instances of Git (or other msys tools) concurrently, especially non-interactively from a service or scheduled task, causes paths in some environment variables to lose their leading slash and thus become "corrupt".
Background:
We use Jenkins to perform continuous integration builds on Windows 7 nodes. Our source is stored in Git repositories, so we use msysgit 1.8.3 to check out the code to build. We run Jenkins as a Windows service (with a specific user account) on these build nodes. Jenkins allows multiple builds to execute concurrently on a single node, which is a feature we make use of. This all worked fairly well until the security updates listed at the end of this post were installed.
Following the updates, when multiple jobs were started concurrently, ssh would occasionally hang after displaying the following error message: "Could not create directory 'c/Users/buildbot/.ssh'". Note that $HOME is set to %USERPROFILE% as a user environment variable (HKCU\Environment\HOME), and %USERPROFILE% evaluates to C:\Users\buildbot in this case. I've also tried setting $HOME to an absolute path instead of %USERPROFILE%, and setting it as a system environment variable. After digging a bit deeper, I found that $HOME, $TEMP, $TMP, $TMPDIR, and parts of $PATH were missing their leading slash or missing portions of the path.
Working environment:
HOME=/c/Users/buildbot
TEMP=/tmp
TMP=/tmp
TMPDIR=/tmp
PATH=/usr/libexec/git-core:/usr/bin:/usr/bin:/usr/mingw/bin:/c/Perl/bin:/c/Python27:/c/Windows/system32:/c/Windows:/usr/cmd
Broken environment:
HOME=c/Users/buildbot
TEMP=c/Users/BUILDB~1/AppData/Local/Temp
TMP=c/Users/BUILDB~1/AppData/Local/Temp
TMPDIR=c/Users/BUILDB~1/AppData/Local/Temp
PATH=/libexec/git-core:/bin:/bin:/mingw/bin:c/Perl/bin:c/Python27:c/Windows/system32:c/Windows:/cmd
However, if I run Jenkins interactively as a logged-in user instead of as a service, this problem does not occur at all.
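The difference between the two listings above can be checked mechanically. The following is a minimal sketch, not part of the attached script: the function names `parse_env_output` and `find_corrupt_paths` are made up for illustration, and the only rule applied is "every path entry should start with a slash". It will therefore catch entries such as c/Perl/bin, but not the entries in $PATH that merely lost their /usr prefix.

```python
def parse_env_output(text):
    """Parse NAME=value lines (as printed by `env` inside bash) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            env[name] = value
    return env


def find_corrupt_paths(env, names=("HOME", "TEMP", "TMP", "TMPDIR", "PATH")):
    """Return {name: [entries]} for path entries that lack a leading slash.

    Assumption: values are msys-style paths, and PATH is colon-separated.
    """
    corrupt = {}
    for name in names:
        value = env.get(name)
        if value is None:
            continue
        entries = value.split(":") if name == "PATH" else [value]
        # Empty entries are ignored; anything else must start with '/'.
        bad = [entry for entry in entries if entry and not entry.startswith("/")]
        if bad:
            corrupt[name] = bad
    return corrupt
```

To use it, capture the output of `env` from inside a bash process, pass the text through `parse_env_output`, and hand the result to `find_corrupt_paths`. An empty dict means no entry is missing its leading slash.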
I wrote a Python script (attached) to roughly simulate what Jenkins does and demonstrate the problem. It spawns two bash processes per second for 10 seconds that simply check whether $HOME is readable. Any failures get logged to stdout (redirected to msys_test.stdout.log).
This script always completes perfectly fine when run interactively from a command prompt. However, when run non-interactively from a service (using the Jenkins service launcher or nssm), 30-40% of the processes have a corrupt environment. The Task Scheduler is a much easier way to run processes non-interactively, and it triggers the same issue:
1. Create a task from the command line (username and paths will need to be changed):
schtasks /create /tn "MsysTest" /sc once /st 00:00 /ru buildbot /rp /tr "C:\Python27\python.exe C:\msys_test.py"
2. Run the task now. When it's complete, look for msys_test.stdout.log in the same directory as msys_test.py. The log files will start out empty and flush when the script finishes.
schtasks /run /tn "MsysTest"
3. Delete the task when you're done so it doesn't run automatically by mistake:
schtasks /delete /f /tn "MsysTest"
Interestingly, if I bump it up to 8 threads at a time, typically only one of them fails (two at most), and many batches have no failures at all.
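For anyone reading without the attachment, a test of the kind described above can be sketched roughly as follows. This is not the attached script, just an approximation of its behavior. It assumes that `bash` from msysgit is on the PATH (otherwise substitute the full path to bash.exe), and the names `BASH_COMMAND`, `run_batch`, and `main` are made up for illustration.

```python
import subprocess
import sys
import time

# Assumption: "bash" resolves to the msysgit bash. The command exits non-zero
# and prints the value of $HOME when $HOME is not readable.
BASH_COMMAND = ["bash", "-c", 'test -r "$HOME" || { echo "HOME=$HOME"; exit 1; }']


def run_batch(command, per_batch=2, batches=10, interval=1.0):
    """Start `per_batch` copies of `command` every `interval` seconds.

    Returns a list of (batch, index, returncode, output) tuples, one for
    each process that exited with a non-zero status.
    """
    running = []
    for batch in range(batches):
        for index in range(per_batch):
            proc = subprocess.Popen(
                command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
            )
            running.append((batch, index, proc))
        time.sleep(interval)

    failures = []
    for batch, index, proc in running:
        output, _ = proc.communicate()  # waits for the process to exit
        if proc.returncode != 0:
            text = output.decode(errors="replace").strip()
            failures.append((batch, index, proc.returncode, text))
    return failures


def main(per_batch=2):
    """Run the test and report failures on stdout; returns the failure count."""
    failures = run_batch(BASH_COMMAND, per_batch=per_batch)
    for batch, index, returncode, output in failures:
        print("batch %d, process %d: exit %d, %s" % (batch, index, returncode, output))
    print("%d failure(s)" % len(failures))
    sys.stdout.flush()
    return len(failures)
```

Calling `main()` corresponds to the default case of two processes per second for 10 seconds, and `main(per_batch=8)` to the larger batches mentioned above. Redirecting stdout to a log file is left to the caller.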
My only guess is that there is some kind of race condition or reentrancy issue in the msys DLL (?). I can't think of anything else that could cause this kind of behavior. Any ideas for tracking this down further?
Here's the list of security updates, one of which appears to have broken the ability to run multiple concurrent instances of msys non-interactively:
MS13-046: Description of the security update for Windows Kernel-Mode drivers: May 14, 2013
MS13-049: Vulnerability in kernel-mode driver could allow denial of service: June 11, 2013
MS13-040: Description of the security update for the .NET Framework 4 on Windows XP, Windows Server 2003, Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2: May 14, 2013
MS13-040: Description of the security update for the .NET Framework 3.5.1 on Windows 7 Service Pack 1 and Windows Server 2008 R2 Service Pack 1: May 14, 2013
MS13-050: Vulnerability in Windows print spooler components could allow elevation of privilege: June 11, 2013
MS13-042: Description of the security update for Publisher 2007 Service Pack 3: May 14, 2013