syscall: dup2 blocks contrary to runtime's expectations for fuse file systems

499 views
Skip to first unread message

Aaron Jacobs

unread,
Mar 20, 2015, 12:02:33 AM3/20/15
to golang-nuts
Hi golang-nuts,

I see that syscall.Dup2 uses Syscall instead of RawSyscall, which I understand
to mean that the runtime assumes it cannot block. (Please correct me if I'm
wrong.)

If I understand correctly, this assumption is correct for "real" file systems,
but it does not appear to be true for fuse. The following call chain causes
fuse to send a request to userspace and wait for the response:

    http://goo.gl/KqTtNN sys_dup2 -> sys_dup3 -> do_dup2
    http://goo.gl/J8c1y4 do_dup2 -> filp_close
    http://goo.gl/M04Jf3 filp_close-> f_op->flush (fuse_flush)
    http://goo.gl/YJYx2X fuse_flush -> fuse_request_send
    http://goo.gl/xoRLh9 fuse_request_send -> __fuse_request_send
    http://goo.gl/PTzwCg __fuse_request_send -> request_wait_answer

I discovered this while writing tests for a fuse package. The test runs a fuse
server on one goroutine, and makes file system requests on another. When it
calls syscall.Dup2, it deadlocks very hard, with kernel stacks as pasted below.
I believe what's happening is that the goroutine calling syscall.Dup2 is
holding some lock that a goroutine calling syscall.Read on /dev/fuse is waiting
for. (The latter would otherwise server the fuse flush request.) Indeed if I
call via syscall.RawSyscall, the problem goes away.

(If you want to play, I can reliably reproduce this with "go test -v
github.com/jacobsa/fuse/samples/flushfs --ogletest.run FlushFSTest.Dup2" at
commit ad2317642f1ba7a61263718bde6d83b6a0843e1c in that repo.)

Does it make sense to make Dup2 call via Syscall rather than RawSyscall? I
don't think this problem is due to a buggy fuse server implementation but
rather may be present in some form for any fuse file system, so I guess I'm
asking what level of support for fuse's weirdness I should expect from Go.

Thanks,
Aaron

===================================================

jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% pstree -p 19419
flushfs.test(19419)─┬─{flushfs.test}(19420)
                    ├─{flushfs.test}(19421)
                    ├─{flushfs.test}(19423)
                    ├─{flushfs.test}(19426)
                    └─{flushfs.test}(19427)


jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% cat /proc/19419/stack
[<0000000000000000>] futex_wait_queue_me+0xdd/0x140
[<0000000000000000>] futex_wait+0x182/0x290
[<0000000000000000>] do_futex+0xde/0x760
[<0000000000000000>] SyS_futex+0x71/0x150
[<0000000000000000>] system_call_fastpath+0x1a/0x1f
[<0000000000000000>] 0xffffffffffffffff


jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% cat /proc/19420/stack
[<0000000000000000>] poll_schedule_timeout+0x49/0x70
[<0000000000000000>] do_select+0x5b6/0x780
[<0000000000000000>] core_sys_select+0x1cc/0x2e0
[<0000000000000000>] SyS_select+0xab/0x100
[<0000000000000000>] system_call_fastpath+0x1a/0x1f
[<0000000000000000>] 0xffffffffffffffff


jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% cat /proc/19421/stack
[<0000000000000000>] wait_answer_interruptible+0x6a/0xa0
[<0000000000000000>] __fuse_request_send+0x1fb/0x280
[<0000000000000000>] fuse_request_send+0x12/0x20
[<0000000000000000>] fuse_flush+0xd7/0x120
[<0000000000000000>] filp_close+0x2f/0x70
[<0000000000000000>] do_dup2+0x68/0xb0
[<0000000000000000>] SyS_dup2+0xa6/0x140
[<0000000000000000>] system_call_fastpath+0x1a/0x1f
[<0000000000000000>] 0xffffffffffffffff


jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% cat /proc/19423/stack
[<0000000000000000>] ep_poll+0x262/0x340
[<0000000000000000>] SyS_epoll_wait+0xd5/0x100
[<0000000000000000>] system_call_fastpath+0x1a/0x1f
[<0000000000000000>] 0xffffffffffffffff
jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% cat /proc/19426/stack
[<0000000000000000>] futex_wait_queue_me+0xdd/0x140
[<0000000000000000>] futex_wait+0x182/0x290
[<0000000000000000>] do_futex+0xde/0x760
[<0000000000000000>] SyS_futex+0x71/0x150
[<0000000000000000>] system_call_fastpath+0x1a/0x1f
[<0000000000000000>] 0xffffffffffffffff


jacobsa@fourier:~/go/src/github.com/jacobsa/fuse (dup2-deadlock)% cat /proc/19427/stack
[<0000000000000000>] futex_wait_queue_me+0xdd/0x140
[<0000000000000000>] futex_wait+0x182/0x290
[<0000000000000000>] do_futex+0xde/0x760
[<0000000000000000>] SyS_futex+0x71/0x150
[<0000000000000000>] system_call_fastpath+0x1a/0x1f
[<0000000000000000>] 0xffffffffffffffff

minux

unread,
Mar 20, 2015, 12:19:18 AM3/20/15
to Aaron Jacobs, golang-nuts
On Fri, Mar 20, 2015 at 12:01 AM, 'Aaron Jacobs' via golang-nuts <golan...@googlegroups.com> wrote:
Hi golang-nuts,

I see that syscall.Dup2 uses Syscall instead of RawSyscall, which I understand
Dup2 uses RawSyscall, not Syscall. 
Unfortunately, we can't always use syscall.Syscall for dup2 due to some technical
problems.

For example, dup2 is also used by the child process when Go fork and exec a
new program. The problem is that we can't use Syscall here, as the child is
sharing data with the parent, the scheduler lock might have been acquired by
parent and thus using Syscall will deadlock.

In your test, have you tried to increase GOMAXPROCS to 2 when running the
test? I think it should fix the problem. And generally, it seems like a must for
tests that test against a fuse file system implemented by the same Go program
as the test and the fuse file system goroutines might need to be running at the
same time. And indeed this is the case.

Assuming the dup2 syscall doesn't block shouldn't pose problems besides this
case. If it blocks, it will just block the goroutine without the scheduler noticing.

Ian Lance Taylor

unread,
Mar 20, 2015, 12:51:32 AM3/20/15
to minux, Aaron Jacobs, golang-nuts
On Thu, Mar 19, 2015 at 9:18 PM, minux <mi...@golang.org> wrote:
>
> Unfortunately, we can't always use syscall.Syscall for dup2 due to some
> technical
> problems.
>
> For example, dup2 is also used by the child process when Go fork and exec a
> new program. The problem is that we can't use Syscall here, as the child is
> sharing data with the parent, the scheduler lock might have been acquired by
> parent and thus using Syscall will deadlock.

That turns out not to be the case. We don't call dup2 in the child
process. We call RawSyscall directly.

I think Aaron is correct: we must not mark any syscall that can go
through a FUSE filesystem as sysnb. Fortunately, thanks to some of
Dmitry's work, sysnb is not as much of an advantage as it used to be.

Ian

minux

unread,
Mar 20, 2015, 12:59:27 AM3/20/15
to Ian Lance Taylor, Aaron Jacobs, golang-nuts
On Fri, Mar 20, 2015 at 12:51 AM, Ian Lance Taylor <ia...@golang.org> wrote:
On Thu, Mar 19, 2015 at 9:18 PM, minux <mi...@golang.org> wrote:
>
> Unfortunately, we can't always use syscall.Syscall for dup2 due to some
> technical
> problems.
>
> For example, dup2 is also used by the child process when Go fork and exec a
> new program. The problem is that we can't use Syscall here, as the child is
> sharing data with the parent, the scheduler lock might have been acquired by
> parent and thus using Syscall will deadlock.

That turns out not to be the case.  We don't call dup2 in the child
process.  We call RawSyscall directly.
I wan't clear. I was referring the RawSyscall. If we happen to dup a fuse-backed fd
in the forkAndExec child, it can still block.

I think Aaron is correct: we must not mark any syscall that can go
through a FUSE filesystem as sysnb.  Fortunately, thanks to some of
Dmitry's work, sysnb is not as much of an advantage as it used to be.

OK, we can fix syscall.Dup2.

I take a look at all remaining sysnb uses in syscall:
The problematic ones are:
Dup3, Dup2, Dup on all unix.

The remaining uses are for socket, epoll, time, uid/gid/sid/pid, rlimit, inotfy,
which should be safe.

Ian Lance Taylor

unread,
Mar 20, 2015, 1:10:51 AM3/20/15
to minux, Aaron Jacobs, golang-nuts
On Thu, Mar 19, 2015 at 9:58 PM, minux <mi...@golang.org> wrote:
>
> On Fri, Mar 20, 2015 at 12:51 AM, Ian Lance Taylor <ia...@golang.org> wrote:
>>
>> On Thu, Mar 19, 2015 at 9:18 PM, minux <mi...@golang.org> wrote:
>> >
>> > Unfortunately, we can't always use syscall.Syscall for dup2 due to some
>> > technical
>> > problems.
>> >
>> > For example, dup2 is also used by the child process when Go fork and
>> > exec a
>> > new program. The problem is that we can't use Syscall here, as the child
>> > is
>> > sharing data with the parent, the scheduler lock might have been
>> > acquired by
>> > parent and thus using Syscall will deadlock.
>>
>> That turns out not to be the case. We don't call dup2 in the child
>> process. We call RawSyscall directly.
>
> I wan't clear. I was referring the RawSyscall. If we happen to dup a
> fuse-backed fd
> in the forkAndExec child, it can still block.

Ah, yes, sorry. Although I think we're OK in this specific case,
since the calls to dup2 are only in the child, and we don't use vfork.
It's kind of a good reason for us to never use vfork, though.

Ian

minux

unread,
Mar 20, 2015, 1:14:56 AM3/20/15
to Ian Lance Taylor, Aaron Jacobs, golang-nuts
Yeah, I agree in that case it's fine to block.

I will add a note to #5838, which talks about using vfork to avoid
the cost of duplicating the huge VM space used by Go processes.

Aaron Jacobs

unread,
Mar 20, 2015, 1:21:31 AM3/20/15
to Ian Lance Taylor, minux, golang-nuts
Sorry for confusing the issue by getting the polarity backwards.

Are you guys agreeing that dup2 can be fixed? If so, should I open a bug?

I think GOMAXPROCS=2 will just cause the issue to come up when there are two
simultaneous calls to dup2, right? In general it seems like each call to dup2
will cause a scheduler slot to be lost?

minux

unread,
Mar 20, 2015, 1:25:54 AM3/20/15
to Aaron Jacobs, Ian Lance Taylor, golang-nuts
On Fri, Mar 20, 2015 at 1:21 AM, Aaron Jacobs <jac...@google.com> wrote:
Are you guys agreeing that dup2 can be fixed? If so, should I open a bug?
Yes. Please feel free to open a bug. syscall.Dup3/Dup2/Dup should all be fixed.

I think GOMAXPROCS=2 will just cause the issue to come up when there are two
simultaneous calls to dup2, right? In general it seems like each call to dup2
will cause a scheduler slot to be lost?
Yes. what happens whan RawSyscall blocks is that the scheduler still thinks the
goroutine is executing, so it's still uses one credit in GOMAXPROCS, while actually
it's not executing.

Aaron Jacobs

unread,
Mar 20, 2015, 1:28:52 AM3/20/15
to minux, Ian Lance Taylor, golang-nuts
On Fri, Mar 20, 2015 at 4:24 PM, minux <mi...@golang.org> wrote:
> Yes. Please feel free to open a bug. syscall.Dup3/Dup2/Dup should all be
> fixed.

Okay, I filed it here:

https://github.com/golang/go/issues/10202

I would have assigned it to you or Ian but I don't think I have permission. :-)
Thanks to both of you for investigating.

Aaron Jacobs

unread,
Mar 22, 2015, 8:13:47 PM3/22/15
to minux, Ian Lance Taylor, golang-nuts
By the way, for the benefit of anybody finding this thread in the future:

I found a similar problem with taking a page fault for an mmap'd file
implemented in fuse. The kernel blocks with this stack:

[<0000000000000000>] wait_answer_interruptible+0x6a/0xa0
[<0000000000000000>] __fuse_request_send+0x1fb/0x280
[<0000000000000000>] fuse_request_send+0x12/0x20
[<0000000000000000>] fuse_readpage+0x152/0x1e0
[<0000000000000000>] filemap_fault+0x116/0x410
[<0000000000000000>] __do_fault+0x6f/0x530
[<0000000000000000>] handle_mm_fault+0x482/0xf00
[<0000000000000000>] __do_page_fault+0x184/0x560
[<0000000000000000>] do_page_fault+0x1a/0x70
[<0000000000000000>] page_fault+0x28/0x30
[<0000000000000000>] 0xffffffffffffffff

and my test deadlocks with the faulting thread camping on the Go scheduler slot
and the fuse thread waiting for the scheduler slot. (The workaround in the
particular case of this test is to just raise GOMAXPROCS, of course.)

It seems like this probably comes up for files mmap'd from a "real" file
systems, too: while waiting on disk access for a page fault, the scheduler will
fail to utilize the CPU. But I don't see anything obvious that can be done
about it.

Aaron

ki...@nexedi.com

unread,
Mar 21, 2017, 9:06:51 AM3/21/17
to golang-nuts, mi...@golang.org, ia...@golang.org
Hello up there. I've found this thread via https://github.com/golang/go/issues/19563#issuecomment-287920797 so let me chime in and comment a bit:

1. The problem of sys_dup2 blocking is not FUSE-only. By definition dup2 may end up closing newfd:

    If the file descriptor newfd was previously open, it is silently closed before being reused.

so this will become blocking operation with any filesystem that does blocking work on .flush. One example of such "real" filesystem is NFS:


понедельник, 23 марта 2015 г., 3:13:47 UTC+3 пользователь Aaron Jacobs написал:
Yes, it can come up with any filesystem because on pagefault the filesystem might need to do its loading work which becomes blocking operation. What also can happen is that if there will be IO error for loading, the kernel wil send SIGBUS as result to client code (which golang converts to runtime panics if runtime.debug.SetPanicOnFault() was set true).
 
But I don't see anything obvious that can be done about it.

I came to conclusion the following has to be done around mmap access:

1. protect mmap-area accessing code with debug.SetPanicOnFault(true) to catch IO errors
2. wrap actual access with runtime·entersyscall and runtime·exitsyscall to let go scheduler know code might be blocking and free up a P slot.

This should look somthing like this (but not yet tested):

   ---- 8< ----
   err = nil
   save := debug.SetPanicOnFault(true)
   defer func() {
       if r := recover(); r != nil && r ~= "invalid memory address or nil pointer dereference ..." {
            err = EIO
       }
       runtime·exitsyscall()
       debug.SetPanicOnFault(save)
   }()

   runtime·entersyscall()
   copy(mmapedAddr[pos:pos+length], buf)
   return err
   ---- 8< ----

Thanks,
Kirill

Btw, one can see the cost of 1 minor pagefalt to be ~ 6x time _more_ cost of 1 page copy here:


P.S. maybe sorry for formatting - replied via web.
Reply all
Reply to author
Forward
0 new messages