Concurrent instances of msys have corrupt environments

vync...@gmail.com

Nov 6, 2013, 5:34:41 PM
to msy...@googlegroups.com
Summary:
Running multiple instances of git (or msys tools) concurrently (especially non-interactively via a service or scheduled task) causes paths in some environment variables to lose their leading slash and thus become "corrupt".

Background:
We use Jenkins to perform continuous integration builds on Windows 7 nodes. Our source is stored in Git repositories, so we use msysgit 1.8.3 to check out the code to build. We run Jenkins as a Windows service (with a specific user account) on these build nodes. Jenkins allows multiple builds to execute concurrently on a single node, which is a feature we make use of. This all worked fairly well until the security updates listed at the end of this post were installed.

Following the updates, when multiple jobs were started concurrently, ssh would occasionally hang after displaying the following error message: "Could not create directory 'c/Users/buildbot/.ssh'". Note that $HOME is set to %USERPROFILE% as a user environment variable (HKCU\Environment\HOME). %USERPROFILE% evaluates to C:\Users\buildbot in this case. I've tried setting $HOME to an absolute path instead of %USERPROFILE% and also setting it as a system environment variable. After digging a bit deeper, I found that $HOME, $TEMP, $TMP, $TMPDIR, and parts of $PATH were missing the leading slash or missing portions.

Working environment:

  HOME=/c/Users/buildbot
  TEMP=/tmp
  TMP=/tmp
  TMPDIR=/tmp
  PATH=/usr/libexec/git-core:/usr/bin:/usr/bin:/usr/mingw/bin:/c/Perl/bin:/c/Python27:/c/Windows/system32:/c/Windows:/usr/cmd

Broken environment:

  HOME=c/Users/buildbot
  TEMP=c/Users/BUILDB~1/AppData/Local/Temp
  TMP=c/Users/BUILDB~1/AppData/Local/Temp
  TMPDIR=c/Users/BUILDB~1/AppData/Local/Temp
  PATH=/libexec/git-core:/bin:/bin:/mingw/bin:c/Perl/bin:c/Python27:c/Windows/system32:c/Windows:/cmd
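
A rough way to flag the missing-slash case automatically, sketched in Python. The function name `find_corrupt_entries` and the regular expression are illustrative assumptions, not part of msysgit or of the attached script. The heuristic only catches drive paths such as `c/Users/...` that lost their leading slash; it does not detect the shortened entries such as `/libexec/git-core`.

```python
import re

# Assumption: a "corrupt" entry is an MSYS-style drive path that lost its
# leading slash, e.g. "c/Users/buildbot" instead of "/c/Users/buildbot".
_MISSING_SLASH = re.compile(r"^[A-Za-z]/")


def find_corrupt_entries(env, names=("HOME", "TEMP", "TMP", "TMPDIR", "PATH")):
    """Return {variable: [suspicious entries]} for an environment mapping."""
    corrupt = {}
    for name in names:
        value = env.get(name, "")
        # PATH is colon-separated inside MSYS; the other variables hold one path.
        entries = value.split(":") if name == "PATH" else [value]
        bad = [entry for entry in entries if _MISSING_SLASH.match(entry)]
        if bad:
            corrupt[name] = bad
    return corrupt
```

Passing a dictionary built from the output of `env` inside bash should return an empty dictionary for the working environment above and list `HOME`, the temp variables, and the `c/...` entries of `PATH` for the broken one.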

However, if I run Jenkins interactively as a logged-in user, instead of as a service, this problem does not occur at all.

I wrote a Python script (attached) to roughly simulate what Jenkins does and demonstrate the problem. It spawns two bash processes per second for 10 seconds that simply check whether $HOME is readable. Any failures get logged to stdout (redirected to msys_test.stdout.log).
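
For readers without the attachment, here is a minimal sketch of the same idea. It is not the attached msys_test.py: the names `CHECK_HOME`, `run_check`, and `spawn_batches` are made up for illustration, it assumes msysgit's `bash` is on `PATH`, and it targets Python 3.7+ rather than the Python 2.7 install referenced in this post.

```python
import subprocess
import threading
import time

# Assumption: "bash" resolves to msysgit's bash.exe. The child prints the
# offending value and exits non-zero when $HOME is not readable.
CHECK_HOME = ["bash", "-c", 'test -r "$HOME" || { echo "HOME=$HOME"; exit 1; }']


def run_check(cmd, failures, lock):
    """Run one child process and record its output if it exits non-zero."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if proc.returncode != 0:
        with lock:
            failures.append(proc.stdout.strip())


def spawn_batches(cmd=CHECK_HOME, per_batch=2, batches=10, interval=1.0):
    """Start `per_batch` concurrent children every `interval` seconds."""
    failures = []
    lock = threading.Lock()
    threads = []
    for _ in range(batches):
        for _ in range(per_batch):
            thread = threading.Thread(target=run_check, args=(cmd, failures, lock))
            thread.start()
            threads.append(thread)
        time.sleep(interval)
    # Wait for every child before reporting.
    for thread in threads:
        thread.join()
    return failures
```

Calling `spawn_batches()` with the defaults mirrors the two-per-second, ten-second run; `per_batch=8` corresponds to the 8-at-a-time variant. Printing the returned list to stdout gives roughly the same kind of failure log.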

This script always completes perfectly fine when run interactively from a command prompt. However, when run non-interactively from a service (using the Jenkins service launcher or nssm), 30~40% of the processes have a corrupt environment. The Task Scheduler is a much easier way to run processes non-interactively and also causes the same issue:

1. Create a task from the command line (username and paths will need to be changed):

  schtasks /create /tn "MsysTest" /sc once /st 00:00 /ru buildbot /rp /tr "C:\Python27\python.exe C:\msys_test.py"

2. Run the task now. When it's complete, look for msys_test.stdout.log in the same directory as msys_test.py. The log files will start out empty and flush when the script finishes.

  schtasks /run /tn "MsysTest"

3. Delete the task when you're done so it doesn't run automatically by mistake:

  schtasks /delete /f /tn "MsysTest"

Interestingly, if I bump it up to 8 threads at a time, typically only one of them fails (at most 2) and many batches have no failures at all.

My only thought is that there must be some kind of race condition or reentrancy issue in the msys dll (?). I can't think of anything else that could cause this kind of behavior. Any ideas for tracking this down further?


Here's the list of security updates, one of which appears to have broken the ability for multiple concurrent instances of msys to run non-interactively.

MS13-046: Description of the security update for Windows Kernel-Mode drivers: May 14, 2013
MS13-049: Vulnerability in kernel-mode driver could allow denial of service: June 11, 2013
MS13-040: Description of the security update for the .NET Framework 4 on Windows XP, Windows Server 2003, Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2: May 14, 2013
MS13-040: Description of the security update for the .NET Framework 3.5.1 on Windows 7 Service Pack 1 and Windows Server 2008 R2 Service Pack 1: May 14, 2013
MS13-050: Vulnerability in Windows print spooler components could allow elevation of privilege: June 11, 2013
MS13-042: Description of the security update for Publisher 2007 Service Pack 3: May 14, 2013
msys_test.py

Johannes Schindelin

Nov 7, 2013, 10:10:37 AM
to vync...@gmail.com, msy...@googlegroups.com
Hi Vynce Box,

On Wed, 6 Nov 2013, vync...@gmail.com wrote:

> *Summary:*
> Running multiple instances of git (or msys tools) concurrently (especially
> non-interactively via a service or scheduled task) causes paths in some
> environment variables to lose their leading slash and thus become "corrupt".

Interesting.

> *Background:*
> We use Jenkins to perform continuous integration builds on Windows 7 nodes.
> Our source is stored in Git repositories, so we use msysgit 1.8.3 to check
> out the code to build. We run Jenkins as a Windows service (with a specific
> user account) on these build nodes. Jenkins allows multiple builds to
> execute concurrently on a single node, which is a feature we make use of.
> This all worked fairly well until the security updates listed at the end of
> this post were installed.

I use msysGit in a Jenkins node, too, but only one, and this one started
semi-interactively so that the console can be inspected easily.

Windows was never fast enough for me to launch more than one instance
concurrently.

> I wrote a Python script (attached) to roughly simulate what Jenkins does
> and demonstrate the problem. It spawns two bash processes per second for 10
> seconds that simply check whether $HOME is readable. Any failures get
> logged to stdout (redirected to msys_test.stdout.log).
>
> This script always completes perfectly fine when run interactively from a
> command prompt. However, when run non-interactively from a service (using
> the Jenkins service launcher or nssm), 30~40% of the processes have a
> corrupt environment.

Okay, so something is seriously wrong with the MSys/new Windows scheduler
combination.

> My only thought is that there must be some kind of race condition or
> reentrancy issue in the msys dll (?). I can't think of anything else that
> could cause this kind of behavior. Any ideas for tracking this down further?

Unless the Python script itself already has the corrupted environment, I
concur, it is a bug in MSys. Or not exactly a bug in MSys but in Windows
that unfortunately only triggers with MSys.

There are several ways to go:

- if you have a service contract with Microsoft, it is their
responsibility to hunt this bug down (they caused it)

- the MSys project itself might have encountered the problem and fixed it
(or at least started to)

- we have an 'msys' branch which lets you build the dll via
/src/rt/release.sh (it is a bit unintuitive that you have to switch
branches with a *different* git.exe so that you can restart msysGit in
MSys mode rather than MinGW mode before this works)

This will clone our fork of MSys and hopefully still work :-) I haven't
done it in ages, and MSys itself switched from a CVS repository to a Git
one, so the exact commits from which I built msysGit's msys-1.0.dll might
not even be reachable at the moment. I'll try to find some time to test this.

You could use the last of these methods to debug this further.

Good luck!
Johannes