Re: Push over git protocol hangs in msysGit

Erik Faye-Lund

unread,

Jan 14, 2011, 6:01:09 PM1/14/11

to mcal...@gmail.com, msysGit

I'm moving this to the mailing list, as the issue tracker has been closed.

mcal...@gmail.com wrote:
> I played around with this a bit myself and am able to recreate this when
> changing the child process (upload-pack) to use pipe for StdOut instead
> of the socket to communicate with the parent (send-pack) a little better.
> The child actually finishes all its work fine. When I went to write the
> data from the child to the socket file descriptor I ended up in a
> deadlock.
>
> Tracing this into the MS CRT I found that both write() and read() lock
> the file handle with _lock_fh().
>
> With the addition of the multiplexing there is a second thread reading
> from the socket before the client is spawned and it continues until after
> all the data from the child is supposed to be written. This causes a
> deadlock.
>

Are you sure this is the problem? The multiplexing puts a pipe in
between, which should have prevented this, no?

That is:
fd[0] is copied to 'in' at line 242, but it is overwritten at line 335
by demux.out before 'in' is ever used. fd[0] is only read through
sideband_demux.
fd[1] is not touched by the thread at all. It's written to both before
and after the thread starts.
demux.out is copied to 'in' at line 335, and gets closed at line 362.
It's 'in'-copy potentially gets read on line 351.

I don't see how any of this should cause a dead-lock.

> I was able to see similar behaviors when the child process was using the
> socket as StdOut but it was a lot harder to actually see this going on.
>

Hmm, this is very interesting because if it's true (I'm not saying it
isn't, I just haven't seen it myself), it discredits theories about
WSAInit() etc.

> There are non locking functions in the CRT (e.g. _write_nolock) but those
> might not be a great approach either to get around this.
>
> Fundamentally it appears that a few things will need to change as the
> code uses core safe_write/read functions and those all use write/read
> internally causing these issues.
>

I'm not convinced about that this is the culprit.

Erik Faye-Lund

unread,

Jan 15, 2011, 7:05:01 AM1/15/11

to Marc Calahan, msysGit

Please don't cull the CC-list in the future, that is hit "reply all"
instead of "reply".

On Sat, Jan 15, 2011 at 9:05 AM, Marc Calahan <mcal...@gmail.com> wrote:

> On Fri, Jan 14, 2011 at 6:01 PM, Erik Faye-Lund <kusm...@gmail.com> wrote:
>> I'm moving this to the mailing list, as the issue tracker has been closed.
>>
>

> How are bugs tracked?
>

We don't explicitly track bugs, but we do have mailing list archives.
This is the same way as mainline Git works.

>> mcal...@gmail.com wrote:
>>> I played around with this a bit myself and am able to recreate this when
>>> changing the child process (upload-pack) to use pipe for StdOut instead
>>> of the socket to communicate with the parent (send-pack) a little better.
>>> The child actually finishes all its work fine. When I went to write the
>>> data from the child to the socket file descriptor I ended up in a
>>> deadlock.
>>>
>>> Tracing this into the MS CRT I found that both write() and read() lock
>>> the file handle with _lock_fh().
>>>
>>> With the addition of the multiplexing there is a second thread reading
>>> from the socket before the client is spawned and it continues until after
>>> all the data from the child is supposed to be written. This causes a
>>> deadlock.
>>>
>>
>> Are you sure this is the problem? The multiplexing puts a pipe in
>> between, which should have prevented this, no?
>>
>> That is:
>> fd[0] is copied to 'in' at line 242, but it is overwritten at line 335
>> by demux.out before 'in' is ever used. fd[0] is only read through
>> sideband_demux.
>> fd[1] is not touched by the thread at all. It's written to both before
>> and after the thread starts.
>> demux.out is copied to 'in' at line 335, and gets closed at line 362.
>> It's 'in'-copy potentially gets read on line 351.
>>
>> I don't see how any of this should cause a dead-lock.
>

> sideband_demux is reading from the socket, not a pipe:
> int *fd = data; <-- data object passed from send_pack
> int ret = recv_sideband("send-pack", fd[0], out); <-- using fd[0]
>
> send_pack:
> demux.proc = sideband_demux;
> demux.data = fd; <-- just assigning fd from function parameter below
> demux.out = -1;
> if (start_async(&demux))
>
> int send_pack(struct send_pack_args *args,
> int fd[], <-- here we is
> struct child_process *conn,
> struct ref *remote_refs,
> struct extra_have_objects *extra_have)
>

Yes, but it's writing to a pipe, and send_pack reads from that pipe.

> What is fd? Great question:
>
> connect.c : git_tcp_connect:
> static void git_tcp_connect(int fd[2], char *host, int flags)
> {
> int sockfd = git_tcp_connect_sock(host, flags);
>
> fd[0] = sockfd; <-- fd[0] is the socket
> fd[1] = dup(sockfd); <-- fd[1] is the same socket (duplicated, but
> the CRT uses the same IO stream under the covers so this is
> effectively the same for some operations).
> }
>

_lock_fh locks file-descriptors, so fd[0] and fd[1] should have separate locks.

> So that's the read side, what about the write side:
> in pack_objects:
> memset(&po, 0, sizeof(po));
> po.argv = argv;
> po.in = -1;
> po.out = args->stateless_rpc ? -1 : fd; <-- stateless_rpc is false
> so we will pass fd.
> po.git_cmd = 1;
>
> I'll leave out the details but fd is passed in as a parameter which
> happens to be out in send_pack. out is defined as fd[1].
>

Yes, and fd[0] (and it's shallow copy out) only gets written to from
the main-tread.

I still don't see how this should lead to a dead-lock.

>>> There are non locking functions in the CRT (e.g. _write_nolock) but those
>>> might not be a great approach either to get around this.
>>>
>>> Fundamentally it appears that a few things will need to change as the
>>> code uses core safe_write/read functions and those all use write/read
>>> internally causing these issues.
>>>
>>
>> I'm not convinced about that this is the culprit.
>>
>

> Just try writing to the socket while the demux thread is reading.... I
> dare you....
>

Again, dup() creates a new file-descriptor, so they should have
separate locks (I've stepped through the source code of dup and write
to verify this).

What am I missing?

Marc Calahan

unread,

Jan 15, 2011, 2:11:03 PM1/15/11

to kusm...@gmail.com, msysGit

On Sat, Jan 15, 2011 at 7:05 AM, Erik Faye-Lund <kusm...@gmail.com> wrote:
>> How are bugs tracked?
>>
>
> We don't explicitly track bugs, but we do have mailing list archives.
> This is the same way as mainline Git works.
>

Which explains why the git protocol hasn't worked for so long. The
disappointing thing is we are spending 2 months at my company
investigating open and closed-source source control options to replace
our existing one. This bug is a showstopper and the fact it has been
open for so long is basically counting GIT out of the running.

>>> mcal...@gmail.com wrote:
>> I'll leave out the details but fd is passed in as a parameter which
>> happens to be out in send_pack. out is defined as fd[1].
>>
>
> Yes, and fd[0] (and it's shallow copy out) only gets written to from
> the main-tread.
>
> I still don't see how this should lead to a dead-lock.
>

That is fine that you can't see it. It is still the shown theory
which is backed up by examples both given by me and earlier on the
ticket.

Have you accepted there is a dead lock between writing and reading?

>>
>> Just try writing to the socket while the demux thread is reading.... I
>> dare you....
>>
>
> Again, dup() creates a new file-descriptor, so they should have
> separate locks (I've stepped through the source code of dup and write
> to verify this).
>
> What am I missing?
>

The whole thing about accepting behaviors and looking into the actual problem.

I've attached a 'patch' which will clearly show a deadlock in writing
(in the same process, no subprocess) when the demux is running. This
was what I expected you to try to at least believe when other people
pointed it out.

Attached diff to add a simple 1 byte xwrite (which is what is used
internally) immediately after the demux is started. There is a
sleep(1) to make sure the demux is running and reading prior to the
write.

Marc

deadlock.diff

Johannes Sixt

unread,

Jan 16, 2011, 4:23:05 AM1/16/11

to Marc Calahan, msy...@googlegroups.com, kusm...@gmail.com

On Samstag, 15. Januar 2011, Marc Calahan wrote:
> Attached diff to add a simple 1 byte xwrite (which is what is used
> internally) immediately after the demux is started. There is a
> sleep(1) to make sure the demux is running and reading prior to the
> write.

Thanks for the patch, but I wonder in which way this is relevant? Your patch
uses xwrite(out,...), but the actual code path does not do write(out,...)
anywhere after demux is started; rather, it passes 'out' to the sub-process.
What are you trying to show?

-- Hannes

Marc Calahan

unread,

Jan 16, 2011, 7:39:51 PM1/16/11

to Johannes Sixt, msy...@googlegroups.com, kusm...@gmail.com

Forgot to add cc-parties..

You are correct about handling it off.

The reason why the subprocess can't run/write is because of this
deadlock which I am trying to show.

I am pinpointing the problem hoping that someone will help look at it
as reading and writing to a socket at the same time should NOT
deadlock.

If you read the previous you can see that the subprocess works
correctly if the parent is not receiving data. As soon as the parent
starts reading (in demux) no one can write to the socket.

Thanks,
Marc

Johannes Sixt

unread,

Jan 17, 2011, 3:03:39 PM1/17/11

to Marc Calahan, msy...@googlegroups.com, kusm...@gmail.com

On Montag, 17. Januar 2011, Marc Calahan wrote:
> The reason why the subprocess can't run/write is because of this
> deadlock which I am trying to show.
>
> I am pinpointing the problem hoping that someone will help look at it
> as reading and writing to a socket at the same time should NOT
> deadlock.

Did you discover an instance where the *exact* *same* file descriptor is read
from and written to at the same time? And I do mean file descriptor, not OS
object.

Note that the dup() in git_tcp_connect does not count, because it allocates
different file descriptors, hence, the CRT uses different locks. Even though
the same OS socket object might be referenced by the fds, no deadlock should
happen.

If OTOH you are talking about a socket object, then simultaneous reads and
writes that deadlock would be a serious bug in the OS, not in the CRT nor
git. Sockets are bidirectional, and simultaneous reads and writes from
different threads should work by design.

-- Hannes

Erik Faye-Lund

unread,

Jan 17, 2011, 3:33:46 PM1/17/11

to Marc Calahan, msysGit

On Sat, Jan 15, 2011 at 8:11 PM, Marc Calahan <mcal...@gmail.com> wrote:
> On Sat, Jan 15, 2011 at 7:05 AM, Erik Faye-Lund <kusm...@gmail.com> wrote:
>>> How are bugs tracked?
>>>
>>
>> We don't explicitly track bugs, but we do have mailing list archives.
>> This is the same way as mainline Git works.
>>
>
> Which explains why the git protocol hasn't worked for so long.

Having an issue-tracker didn't help us to fix the issue for 9 months,
I don't think you can blame it not being fixed on the two weeks the

issue tracker has been closed.

> The

> disappointing thing is we are spending 2 months at my company
> investigating open and closed-source source control options to replace
> our existing one. This bug is a showstopper and the fact it has been
> open for so long is basically counting GIT out of the running.
>

Feel free to make your own installer with the workaround applied.
Building a custom installer is not difficult.

>>>> mcal...@gmail.com wrote:
>>> I'll leave out the details but fd is passed in as a parameter which
>>> happens to be out in send_pack. out is defined as fd[1].
>>>
>>
>> Yes, and fd[0] (and it's shallow copy out) only gets written to from
>> the main-tread.
>>
>> I still don't see how this should lead to a dead-lock.
>>
>
> That is fine that you can't see it. It is still the shown theory
> which is backed up by examples both given by me and earlier on the
> ticket.
>

In case you haven't noticed, I'm one of those people who have posted
in-depth information based on debugging. But I do not agree that what
you're trying to show here is evidence of a dead-lock on the socket.

> Have you accepted there is a dead lock between writing and reading?
>

I accept that the process seems to dead-lock, but again: I have not
yet found compelling evidence that this comes from the reading and
writing that you point to.

>>>
>>> Just try writing to the socket while the demux thread is reading.... I
>>> dare you....
>>>
>>
>> Again, dup() creates a new file-descriptor, so they should have
>> separate locks (I've stepped through the source code of dup and write
>> to verify this).
>>
>> What am I missing?
>>
>
> The whole thing about accepting behaviors and looking into the actual problem.
>

Again, I have already spent hours and hours in the debugger trying to
figure this out. Stop with the attitude, please.

> I've attached a 'patch' which will clearly show a deadlock in writing
> (in the same process, no subprocess) when the demux is running. This
> was what I expected you to try to at least believe when other people
> pointed it out.
>
> Attached diff to add a simple 1 byte xwrite (which is what is used
> internally) immediately after the demux is started. There is a
> sleep(1) to make sure the demux is running and reading prior to the
> write.
>

As Johannes Sixt pointed out already, this patch doesn't do the same
thing as the code does, so it's not really relevant. The only
difference with that patch is that "out" gets written to a little bit
earlier (or later due to the sleep). So it doesn't really show a
dead-lock any more than without it.

Marc Calahan

unread,

Jan 17, 2011, 3:49:17 PM1/17/11

to kusm...@gmail.com, msysGit

>> The
>> disappointing thing is we are spending 2 months at my company
>> investigating open and closed-source source control options to replace
>> our existing one. This bug is a showstopper and the fact it has been
>> open for so long is basically counting GIT out of the running.
>>
>
> Feel free to make your own installer with the workaround applied.
> Building a custom installer is not difficult.
>

I have already done this, but that is not acceptable as a solution.
Only stable-released binaries are accepted.

>>>>> mcal...@gmail.com wrote:
>>>> I'll leave out the details but fd is passed in as a parameter which
>>>> happens to be out in send_pack. out is defined as fd[1].
>>>>
>>>
>>> Yes, and fd[0] (and it's shallow copy out) only gets written to from
>>> the main-tread.
>>>
>>> I still don't see how this should lead to a dead-lock.
>>>
>>
>> That is fine that you can't see it. It is still the shown theory
>> which is backed up by examples both given by me and earlier on the
>> ticket.
>>
>
> In case you haven't noticed, I'm one of those people who have posted
> in-depth information based on debugging. But I do not agree that what
> you're trying to show here is evidence of a dead-lock on the socket.
>
>> Have you accepted there is a dead lock between writing and reading?
>>
>
> I accept that the process seems to dead-lock, but again: I have not
> yet found compelling evidence that this comes from the reading and
> writing that you point to.
>

Well, it is very easy to see. Read from a socket in one thread and
then try to write from it in another. I'm not sure how much easier it
can be. Whether or not you want to believe what I am saying, the
behaviors I see point to exactly that.

Other IO functions also block which the receive is happening, like
getsockname(SOCKET,...) when there is a receive on the file descriptor
OR a duplicated file descriptor.

Also, DuplicateHandle is used in a few places and I can't remember if
dup calls it directly or not (lazy) but DuplicateHandle is not
acceptable for sockets specifically when certain protocol filters are
installed, QOS is enabled and probably other reasons: (see 'You should
not use DuplicateHandle to duplicate handles to the following
objects')
http://msdn.microsoft.com/en-us/library/ms724251%28v=vs.85%29.aspx

>>>>
>>>> Just try writing to the socket while the demux thread is reading.... I
>>>> dare you....
>>>>
>>>
>>> Again, dup() creates a new file-descriptor, so they should have
>>> separate locks (I've stepped through the source code of dup and write
>>> to verify this).
>>>
>>> What am I missing?
>>>
>>
>> The whole thing about accepting behaviors and looking into the actual problem.
>>
>
> Again, I have already spent hours and hours in the debugger trying to
> figure this out. Stop with the attitude, please.
>

Ack. I have run 3 scenarios now that show exactly that write blocks
which read is being called in another process. See above it is also
other functions. It would be nice to move on towards figuring out why
socket calls are blocking each other.

I am very familiar with sockets, file descriptors, io streams and the
normal windows CRT. These do not seem to be correct behaviors, but
they are happening.

>> I've attached a 'patch' which will clearly show a deadlock in writing
>> (in the same process, no subprocess) when the demux is running. This
>> was what I expected you to try to at least believe when other people
>> pointed it out.
>>
>> Attached diff to add a simple 1 byte xwrite (which is what is used
>> internally) immediately after the demux is started. There is a
>> sleep(1) to make sure the demux is running and reading prior to the
>> write.
>>
>
> As Johannes Sixt pointed out already, this patch doesn't do the same
> thing as the code does, so it's not really relevant. The only
> difference with that patch is that "out" gets written to a little bit
> earlier (or later due to the sleep). So it doesn't really show a
> dead-lock any more than without it.
>

That's fine. the sub-process is trying to write to the socket in a
similar manner and is either blocking in write or creation based on
timing of the demux thread. I assume the creation part is because of
the above blocking on calling other functions.

Either way, it is very easy to show things work correctly if demux is
not reading from the socket. This doesn't seem to be accepted so I am
trying to show other ways to point this as the culprit.

The patch only makes it easier to see in a single process.

Marc

Erik Faye-Lund

unread,

Jan 17, 2011, 4:22:55 PM1/17/11

to Marc Calahan, msysGit, pe...@peter.is-a-geek.org, Johannes Sixt

On Mon, Jan 17, 2011 at 9:49 PM, Marc Calahan <mcal...@gmail.com> wrote:
>>> The
>>> disappointing thing is we are spending 2 months at my company
>>> investigating open and closed-source source control options to replace
>>> our existing one. This bug is a showstopper and the fact it has been
>>> open for so long is basically counting GIT out of the running.
>>>
>>
>> Feel free to make your own installer with the workaround applied.
>> Building a custom installer is not difficult.
>>
>
> I have already done this, but that is not acceptable as a solution.
> Only stable-released binaries are accepted.
>

There are no stable releases of Git for Windows, so it sounds like
you're out of luck no matter what.

> Also, DuplicateHandle is used in a few places and I can't remember if
> dup calls it directly or not (lazy) but DuplicateHandle is not
> acceptable for sockets specifically when certain protocol filters are
> installed, QOS is enabled and probably other reasons: (see 'You should
> not use DuplicateHandle to duplicate handles to the following
> objects')
> http://msdn.microsoft.com/en-us/library/ms724251%28v=vs.85%29.aspx
>

dup calls DuplicateHandle, and that might very well be the culprit -
MSDN mentions that it messes with the internal reference counting.
Perhaps the underlying socket has somehow died?

But if that was the case, I don't see why it would only happen when we
introduce a thread.

>>>>>
>>>>> Just try writing to the socket while the demux thread is reading.... I
>>>>> dare you....
>>>>>
>>>>
>>>> Again, dup() creates a new file-descriptor, so they should have
>>>> separate locks (I've stepped through the source code of dup and write
>>>> to verify this).
>>>>
>>>> What am I missing?
>>>>
>>>
>>> The whole thing about accepting behaviors and looking into the actual problem.
>>>
>>
>> Again, I have already spent hours and hours in the debugger trying to
>> figure this out. Stop with the attitude, please.
>>
>
> Ack. I have run 3 scenarios now that show exactly that write blocks
> which read is being called in another process.

I never claimed that read didn't block, that is not what has been
debated. In fact, if you go back at the issue-tracker that's exactly
what I posted a month ago.

> See above it is also
> other functions. It would be nice to move on towards figuring out why
> socket calls are blocking each other.
>

Yes, and that's exactly what I've been fishing for :)

> I am very familiar with sockets, file descriptors, io streams and the
> normal windows CRT.

Good to have you on board, then!

> These do not seem to be correct behaviors, but
> they are happening.
>

No, something is going very wrong here. It would be interesting to
figure out if DuplicateHandle was the culprit, indeed.

But discussions between Peter Harris and Johannes Sixth seems to
indicate it's not:
http://groups.google.com/group/msysgit/browse_thread/thread/881831c465f6d01a/89f1ebfed6d8b340

Erik Faye-Lund

unread,

Jan 17, 2011, 5:18:05 PM1/17/11

to Marc Calahan, msysGit, pe...@peter.is-a-geek.org, Johannes Sixt

On Mon, Jan 17, 2011 at 10:22 PM, Erik Faye-Lund <kusm...@gmail.com> wrote:
> On Mon, Jan 17, 2011 at 9:49 PM, Marc Calahan <mcal...@gmail.com> wrote:
>>>> The
>>>> disappointing thing is we are spending 2 months at my company
>>>> investigating open and closed-source source control options to replace
>>>> our existing one. This bug is a showstopper and the fact it has been
>>>> open for so long is basically counting GIT out of the running.
>>>>
>>>
>>> Feel free to make your own installer with the workaround applied.
>>> Building a custom installer is not difficult.
>>>
>>
>> I have already done this, but that is not acceptable as a solution.
>> Only stable-released binaries are accepted.
>>
>
> There are no stable releases of Git for Windows, so it sounds like
> you're out of luck no matter what.
>
>> Also, DuplicateHandle is used in a few places and I can't remember if
>> dup calls it directly or not (lazy) but DuplicateHandle is not
>> acceptable for sockets specifically when certain protocol filters are
>> installed, QOS is enabled and probably other reasons: (see 'You should
>> not use DuplicateHandle to duplicate handles to the following
>> objects')
>> http://msdn.microsoft.com/en-us/library/ms724251%28v=vs.85%29.aspx
>>
>
> dup calls DuplicateHandle, and that might very well be the culprit -
> MSDN mentions that it messes with the internal reference counting.
> Perhaps the underlying socket has somehow died?
>

>> These do not seem to be correct behaviors, but
>> they are happening.
>>
>
> No, something is going very wrong here. It would be interesting to
> figure out if DuplicateHandle was the culprit, indeed.
>
> But discussions between Peter Harris and Johannes Sixth seems to
> indicate it's not:
> http://groups.google.com/group/msysgit/browse_thread/thread/881831c465f6d01a/89f1ebfed6d8b340
>

It doesn't seem like that was the culprit. The following patch didn't
help for me... :(

IsSocketHandle (and IsConsoleHandle) is copied verbatim from
compat/win32/sys/poll.c, the gnulib implementation of poll that we
use.

I'm not an expert on socket programming, so there might be something
I'm doing wrong, though.

diff --git a/compat/mingw.c b/compat/mingw.c
index e4b6b6f..5466ce1 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -370,6 +370,44 @@ FILE *mingw_freopen (const char *filename, const
char *otype, FILE *stream)
return file;
}

+#define IsConsoleHandle(h) (((long) (h) & 3) == 3)
+
+static BOOL
+IsSocketHandle (HANDLE h)
+{
+ WSANETWORKEVENTS ev;
+
+ if (IsConsoleHandle (h))
+ return FALSE;
+
+ /* Under Wine, it seems that getsockopt returns 0 for pipes too.
+ WSAEnumNetworkEvents instead distinguishes the two correctly. */
+ ev.lNetworkEvents = 0xDEADBEEF;
+ WSAEnumNetworkEvents ((SOCKET) h, NULL, &ev);
+ return ev.lNetworkEvents != 0xDEADBEEF;
+}
+
+#undef dup
+int mingw_dup(int fd)
+{
+ WSAPROTOCOL_INFO pi;
+ HANDLE hproc = GetCurrentProcess();
+ HANDLE fh = (HANDLE)_get_osfhandle(fd);
+ HANDLE dh;
+ if (IsSocketHandle(fh)) {
+ if (WSADuplicateSocket((SOCKET)fh, GetCurrentProcessId(), &pi))
+ die("failed to duplicate socket, error %d",
+ WSAGetLastError());
+ dh = (HANDLE)WSASocket(AF_UNSPEC, SOCK_STREAM, IPPROTO_TCP,
+ &pi, 0, WSA_FLAG_OVERLAPPED);
+ } else
+ if (!DuplicateHandle(hproc, fh, hproc, &dh, 0, TRUE,
+ DUPLICATE_SAME_ACCESS))
+ die("failed to duplicate handle, error %d",
+ GetLastError());
+ return _open_osfhandle((int)dh, O_RDWR|O_BINARY);
+}
+
/*
* The unit of FILETIME is 100-nanoseconds since January 1, 1601, UTC.
* Returns the 100-nanoseconds ("hekto nanoseconds") since the epoch.
diff --git a/compat/mingw.h b/compat/mingw.h
index 9c00e75..b08c9ac 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -187,6 +187,9 @@ FILE *mingw_fopen (const char *filename, const char *otype);
FILE *mingw_freopen (const char *filename, const char *otype, FILE *stream);
#define freopen mingw_freopen

+int mingw_dup(int fd);
+#define dup mingw_dup
+
char *mingw_getcwd(char *pointer, int len);
#define getcwd mingw_getcwd

Johannes Schindelin

unread,

Jan 17, 2011, 5:26:20 PM1/17/11

to Johannes Sixt, Marc Calahan, msy...@googlegroups.com, kusm...@gmail.com

Hi,

Actually, just being bidirectional does not make sockets necessarily be
able to be accesses via read() and write() at _one_ side at the same time.

Maybe there is some equivalent to select() which you could use to test
whether there is something read()able, and Sleep() and loop if there is
not? Just to test whether this is _really_ the culprit?

Ciao,
Dscho

Johannes Schindelin

unread,

Jan 17, 2011, 5:38:02 PM1/17/11

to Marc Calahan, kusm...@gmail.com, msysGit

Hi,

On Mon, 17 Jan 2011, Marc Calahan wrote:

> > Feel free to make your own installer with the workaround applied.
> > Building a custom installer is not difficult.
>
> I have already done this, but that is not acceptable as a solution. Only
> stable-released binaries are accepted.

Being pushy does not help. Really.

Note that msysGit is a volunteers' effort. You could also say that it is
an experiment to find out what the ratio is between the people who
complain and those who fix things (there is a very small overlap, which
seems to grow since the former group is so vocal).

That is the reason why there was not a single stable version of Git for
Windows, and it is unlikely that there will ever be (unless the users'
attitude changes -- highly improbable).

> Read from a socket in one thread and then try to write from it in
> another.

If that is really the culprit, then it should be relatively easy to show
the issue by demonstrating that things go further -- in Git -- when
prohibiting one thread from blocking the other one.

It would have to be the read() that blocks, because the write() should
return successfully, right?

> the sub-process is trying to write to the socket in a similar manner and
> is either blocking in write or creation based on timing of the demux
> thread.

Is it really the write() that is blocking the other thread's read()? If
so, why is it not succeeding and finishing its business, relieving the
socket for the read()?

A demonstration that would convince me could be a patch that introduces a
small test script in t/ (see
https://git.wiki.kernel.org/index.php/HowToWriteTests for details) showing
the block (it does not have to fail or succeed, it just needs to block),
and a patch littering the code with debug statements so one can see
whether read() is executed first or write(), and from which thread, and
from which end of the socket (if the test script starts the server, which
it should do to be easy to run).

Ciao,
Johannes

Marc Calahan

unread,

Jan 17, 2011, 7:02:32 PM1/17/11

to Johannes Schindelin, kusm...@gmail.com, msysGit

On Mon, Jan 17, 2011 at 5:38 PM, Johannes Schindelin
<Johannes....@gmx.de> wrote:
> Hi,
>
> On Mon, 17 Jan 2011, Marc Calahan wrote:
>
>> > Feel free to make your own installer with the workaround applied.
>> > Building a custom installer is not difficult.
>>
>> I have already done this, but that is not acceptable as a solution. Only
>> stable-released binaries are accepted.
>
> Being pushy does not help. Really.
>

I completely understand this which is why I have invested my own time
in looking into the problem and trying to fix it.

I would have expected a work-around to be used as this makes the git
protocol unusable for writing. The workaround has been known and
simple.

People voicing their opinions is also a healthy way for you to know
the product is being used and how it is being used.

> Note that msysGit is a volunteers' effort. You could also say that it is
> an experiment to find out what the ratio is between the people who
> complain and those who fix things (there is a very small overlap, which
> seems to grow since the former group is so vocal).
>

As this is true with any open source project.

> That is the reason why there was not a single stable version of Git for
> Windows, and it is unlikely that there will ever be (unless the users'
> attitude changes -- highly improbable).
>

I would venture to guess a good number of the user's won't ever get
into the code or even try to help because they are looking for a git
client for window. A chunk of them don't even program in c, posix or
anything close to useful for this project.

>> Read from a socket in one thread and then try to write from it in
>> another.
>
> If that is really the culprit, then it should be relatively easy to show
> the issue by demonstrating that things go further -- in Git -- when
> prohibiting one thread from blocking the other one.
>
> It would have to be the read() that blocks, because the write() should
> return successfully, right?
>
>> the sub-process is trying to write to the socket in a similar manner and
>> is either blocking in write or creation based on timing of the demux
>> thread.
>
> Is it really the write() that is blocking the other thread's read()? If
> so, why is it not succeeding and finishing its business, relieving the
> socket for the read()?

No, the read is blocking the write. The read comes first, almost
always as it is a spawned thread and not a spawned process as the
writer is.

>
> A demonstration that would convince me could be a patch that introduces a
> small test script in t/ (see
> https://git.wiki.kernel.org/index.php/HowToWriteTests for details) showing
> the block (it does not have to fail or succeed, it just needs to block),
> and a patch littering the code with debug statements so one can see
> whether read() is executed first or write(), and from which thread, and
> from which end of the socket (if the test script starts the server, which
> it should do to be easy to run).
>

Scripting and this test environment is not my strong suite. I
provided multiple example and even a diff that would show the behavior
with just running the 'git push' with a commit. Whether you want to
believe me or not is entirely up to you. I have put in the effort to
show this. If you want clarification on something I have said, I
would be happy to oblige.

You can also see it not dead-lock by commenting out the read in the demux.

Marc

Marc Calahan

unread,

Jan 17, 2011, 6:47:51 PM1/17/11

to Johannes Schindelin, Johannes Sixt, msy...@googlegroups.com, kusm...@gmail.com

On Mon, Jan 17, 2011 at 5:26 PM, Johannes Schindelin
<Johannes....@gmx.de> wrote:
> Hi,
>
> On Mon, 17 Jan 2011, Johannes Sixt wrote:
>
>> On Montag, 17. Januar 2011, Marc Calahan wrote:
>> > The reason why the subprocess can't run/write is because of this
>> > deadlock which I am trying to show.
>> >
>> > I am pinpointing the problem hoping that someone will help look at it
>> > as reading and writing to a socket at the same time should NOT
>> > deadlock.
>>
>> Did you discover an instance where the *exact* *same* file descriptor is
>> read from and written to at the same time? And I do mean file
>> descriptor, not OS object.
>>
>> Note that the dup() in git_tcp_connect does not count, because it
>> allocates different file descriptors, hence, the CRT uses different
>> locks. Even though the same OS socket object might be referenced by the
>> fds, no deadlock should happen.
>>
>> If OTOH you are talking about a socket object, then simultaneous reads
>> and writes that deadlock would be a serious bug in the OS, not in the
>> CRT nor git. Sockets are bidirectional, and simultaneous reads and
>> writes from different threads should work by design.
>
> Actually, just being bidirectional does not make sockets necessarily be
> able to be accesses via read() and write() at _one_ side at the same time.
>

Sockets are fine with read/write from the same time. On windows you
would typically use winsock or BSD functions (WSARecv/recv and
WSASend/send) instead of read/write and there are some subtle
differences from the two but I have not run into this before.

I tried using send/recv instead of read/write and that also deadlocks.

> Maybe there is some equivalent to select() which you could use to test
> whether there is something read()able, and Sleep() and loop if there is
> not? Just to test whether this is _really_ the culprit?
>

This isn't necessary with sockets which is why I didn't want to try
it. I expected it to work and it does. Replacing read with the
attached removes the deadlocks and makes everything work. I really
don't like the approach though as we shouldn't need to add polling in
the first place.

> Ciao,
> Dscho
>
>

mingw.h.diff

mingw.c.diff

Johannes Schindelin

unread,

Jan 17, 2011, 8:38:58 PM1/17/11

to Marc Calahan, Johannes Sixt, msy...@googlegroups.com, kusm...@gmail.com

Hi,

LOL, obviously not, otherwise you would not have had to select() in your
code! :-D

> > Maybe there is some equivalent to select() which you could use to test
> > whether there is something read()able, and Sleep() and loop if there
> > is not? Just to test whether this is _really_ the culprit?
>
> This isn't necessary with sockets which is why I didn't want to try
> it. I expected it to work and it does. Replacing read with the
> attached removes the deadlocks and makes everything work.

Render me puzzled. The "isn't necessary" part suggests that you think
your patch should not be needed, but the second part says that your patch
fixes the issue?

I must be very slow tonight, because I cannot resolve that conundrum.

> I really don't like the approach though as we shouldn't need to add
> polling in the first place.

select() != poll(). And you should not need to set the timeout to
something as small as 50 milliseconds unless the select() itself blocks
the socket as much as read() does.

Ciao,
Johannes

P.S.: I still do not have time to follow all the steps to reproduce the
issue here manually, sorry!

Marc Calahan

unread,

Jan 17, 2011, 7:13:52 PM1/17/11

to kusm...@gmail.com, msysGit, pe...@peter.is-a-geek.org, Johannes Sixt

On Mon, Jan 17, 2011 at 5:18 PM, Erik Faye-Lund <kusm...@gmail.com> wrote:
> On Mon, Jan 17, 2011 at 10:22 PM, Erik Faye-Lund <kusm...@gmail.com> wrote:
>> No, something is going very wrong here. It would be interesting to
>> figure out if DuplicateHandle was the culprit, indeed.
>>
>> But discussions between Peter Harris and Johannes Sixth seems to
>> indicate it's not:
>> http://groups.google.com/group/msysgit/browse_thread/thread/881831c465f6d01a/89f1ebfed6d8b340
>>
>
> It doesn't seem like that was the culprit. The following patch didn't
> help for me... :(
>
> IsSocketHandle (and IsConsoleHandle) is copied verbatim from
> compat/win32/sys/poll.c, the gnulib implementation of poll that we
> use.
>
> I'm not an expert on socket programming, so there might be something
> I'm doing wrong, though.
>

Everything in the patch seems fine. The message thread above was
interesting reading and I had not seen this before. I do have
different results then the message thread though.

1) Using pipes with the subprocess works fine for me. It only
deadlocks when I try to write the data the parent process reads from
the pipes to the socket.
2) Using select in an over-written read gets around the deadlock as
the write gets a change to actually happen in between read selects.
(See my other mail with diffs.) This works for both the SOCKET as the
STDOUT of the sub-process and pipes with internal writes in the parent
process.

My implementation of IsSocketHandle is a little different and probably
has issues here and there in the diff.

Marc

Marc Calahan

unread,

Jan 18, 2011, 10:15:40 AM1/18/11

to Johannes Schindelin, Johannes Sixt, msy...@googlegroups.com, kusm...@gmail.com

On Mon, Jan 17, 2011 at 8:38 PM, Johannes Schindelin
<Johannes....@gmx.de> wrote:

> On Mon, 17 Jan 2011, Marc Calahan wrote:
>> Sockets are fine with read/write from the same time.
>
> LOL, obviously not, otherwise you would not have had to select() in your
> code! :-D

Agreed :) But I have never had to do that before. Granted, I
usually use the BSD or Winsock APIs; but for fun I switched the code
over to Winsock APIs and the problem has not gone away.

>> > Maybe there is some equivalent to select() which you could use to test
>> > whether there is something read()able, and Sleep() and loop if there
>> > is not? Just to test whether this is _really_ the culprit?
>>
>> This isn't necessary with sockets which is why I didn't want to try
>> it. I expected it to work and it does. Replacing read with the
>> attached removes the deadlocks and makes everything work.
>
> Render me puzzled. The "isn't necessary" part suggests that you think
> your patch should not be needed, but the second part says that your patch
> fixes the issue?
>
> I must be very slow tonight, because I cannot resolve that conundrum.

See above; Reading and writing concurrently to a socket works fine.
There is something amiss here which prevents it.

>> I really don't like the approach though as we shouldn't need to add
>> polling in the first place.
>
> select() != poll(). And you should not need to set the timeout to
> something as small as 50 milliseconds unless the select() itself blocks
> the socket as much as read() does.

No, in this case select is just a cheap peek (with timeout) to see if
data exists to read. Select also blocks write so a long timeout
causes the same issues. Hence the low 50ms.

Marc

Johannes Sixt

unread,

Jan 19, 2011, 3:22:37 PM1/19/11

to Marc Calahan, Johannes Schindelin, msy...@googlegroups.com, kusm...@gmail.com

OK, I do see how your read-replacement can work around a dead-lock *if* there
is one. However, you keep saying that it should not be necessary. What is
going on?

Is it that we are using read/write, i.e., ReadFile/WriteFile on the socket
instead of WSARecv/WSASend?

Is it that we write to the socket, and then inherit it to a child process,
which does not call WSAInit(), but does write() to it?

-- Hannes

Ciaran

unread,

Jan 21, 2011, 8:03:41 AM1/21/11

to Johannes Sixt, Marc Calahan, Johannes Schindelin, msy...@googlegroups.com, kusm...@gmail.com

On Wed, Jan 19, 2011 at 8:22 PM, Johannes Sixt <j...@kdbg.org> wrote:
> On Dienstag, 18. Januar 2011, Marc Calahan wrote:
>> On Mon, Jan 17, 2011 at 8:38 PM, Johannes Schindelin
>> <Johannes....@gmx.de> wrote:
>> > On Mon, 17 Jan 2011, Marc Calahan wrote:

<!--- Snip --- >

> Is it that we are using read/write, i.e., ReadFile/WriteFile on the socket
> instead of WSARecv/WSASend?
>
> Is it that we write to the socket, and then inherit it to a child process,
> which does not call WSAInit(), but does write() to it?
>
> -- Hannes
>

Hopefully this isn't considered noise, but I believe I've had an
incident of this on a machine where we're using gitosis as our server
which is serving over *SSH*, and it seems that all operations (clone,
push + pull) just sit 'waiting' (there is an active SSH process.) I'm
only mentioning this as an observation in case it gives someone a
thought to look at another code path. It is of course possible that
the issue we observed is different. The server operates ok with
MacOSX + Linux clients so the server isn't stalled currently. I would
love to investigate more deeply, but Iack the requisite skills in
C/C++/Socket programming these days :( [which kills me to say it.]

Many Thanks
- Cj.

Erik Faye-Lund

unread,

Jan 21, 2011, 8:11:19 AM1/21/11

to Ciaran, Johannes Sixt, Marc Calahan, Johannes Schindelin, msy...@googlegroups.com

On Fri, Jan 21, 2011 at 2:03 PM, Ciaran <cia...@gmail.com> wrote:
> On Wed, Jan 19, 2011 at 8:22 PM, Johannes Sixt <j...@kdbg.org> wrote:
>> On Dienstag, 18. Januar 2011, Marc Calahan wrote:
>>> On Mon, Jan 17, 2011 at 8:38 PM, Johannes Schindelin
>>> <Johannes....@gmx.de> wrote:
>>> > On Mon, 17 Jan 2011, Marc Calahan wrote:
> <!--- Snip --- >
>> Is it that we are using read/write, i.e., ReadFile/WriteFile on the socket
>> instead of WSARecv/WSASend?
>>
>> Is it that we write to the socket, and then inherit it to a child process,
>> which does not call WSAInit(), but does write() to it?
>>
>> -- Hannes
>>
> Hopefully this isn't considered noise,

Not at all.

> but I believe I've had an
> incident of this on a machine where we're using gitosis as our server
> which is serving over *SSH*, and it seems that all operations (clone,
> push + pull) just sit 'waiting' (there is an active SSH process.)

I doubt it's the same issue; there's been a bunch of reports of hangs
over SSH as well. These have usually turned out to be caused by Plink.
If you're using OpenSSH, then I think it's a bit more worrying.

That being said, AFAICT there's no good reason why this problem should
be unique to the GIT protocol (but I could be mistaken). But my
experience tells me this is not the case.

Ciaran

unread,

Jan 21, 2011, 8:18:30 AM1/21/11

to kusm...@gmail.com, Johannes Sixt, Marc Calahan, Johannes Schindelin, msy...@googlegroups.com

As an update, a reboot of the misbehaving machine fixed the problem
(it was OpenSSH.) I have no idea what caused the issue, but presumably
it is un-related. Apologies for the noise.
- Cj.

Johannes Schindelin

unread,

Feb 17, 2011, 9:07:55 AM2/17/11

to Marc Calahan, Johannes Sixt, msy...@googlegroups.com, kusm...@gmail.com

Hi,

Maybe this issue is resolved with 13af8cbd?

Ciao,
Dscho

Erik Faye-Lund

unread,

Dec 8, 2011, 9:47:09 AM12/8/11

to Johannes Schindelin, Marc Calahan, Johannes Sixt, msy...@googlegroups.com

On Thu, Feb 17, 2011 at 3:07 PM, Johannes Schindelin

I recently read this blog-post, which reminded me of this issue:
http://blogs.msdn.com/b/oldnewthing/archive/2011/12/02/10243553.aspx

I haven't thought this through or anything, but this sounds to me like
it could be related to this issue...

Reply all

Reply to author

Forward