fuse4x device leak with sshfs/autofs on OSX


Jon Nall

Apr 27, 2012, 4:18:31 AM
to fus...@googlegroups.com
I posted this issue on GitHub, but wasn't sure what kind of audience that reaches, so I'm posting here as well.

I'm using sshfs 2.4.0 with fuse4x 0.10.0 on an OS X 10.7.3 machine, and I'm seeing what I believe amounts to a resource leak.

I have symlinked the sshfs binary to /sbin/mount_sshfs.

I have multiple mounts set up using autofs with the following options (I have more than these, but this gives a sense of what I'm doing):

/DIR1 -fstype=sshfs,auto_cache,allow_other,sshfs_debug,reconnect,defer_permissions,negative_vncache,volname=Dir1 USER@SERVER:/DIR1
/DIR2 -fstype=sshfs,auto_cache,allow_other,sshfs_debug,reconnect,defer_permissions,negative_vncache,volname=Dir2 USER@SERVER:/DIR2

After a while I start seeing this in the log:

Apr 26 19:31:23 monoco com.apple.automountd[10836]: t0x10b40a000      fork_exec: /sbin/mount_sshfs -o nobrowse -o nosuid,nodev -o auto_cache,allow_other,sshfs_debug,reconnect,defer_permissions,negative_vncache,volname=Dir1 -o automounted USER@SERVER:/DIR1 /DIR1 
Apr 26 19:31:23 monoco com.apple.automountd[10836]: SSHFS version 2.4.0
Apr 26 19:31:23 monoco com.apple.automountd[10836]: fuse4x: failed to open device file
Apr 26 19:31:23 monoco automountd[10836]: mount of /DIR1 failed
Apr 26 19:31:23 monoco com.apple.automountd[10836]: t0x10b40a000      fork_exec: returns exit status 1
Apr 26 19:31:23 monoco com.apple.automountd[10836]: t0x10b40a000    MOUNT REPLY    : status=1, AUTOFS_DONE

A little later I start seeing these messages (and they continue until I kill the sshfs processes):

Apr 27 00:29:56 monoco automountd[15039]: Can't open /dev/autofs: Resource busy
Apr 27 00:29:56 monoco com.apple.launchd[1] (com.apple.automountd[15039]): Exited with code: 2
Apr 27 00:29:56 monoco com.apple.launchd[1] (com.apple.automountd): Throttling respawn: Will start in 10 seconds

After doing an lsof on /dev/autofs I see there are 24 sshfs processes with it open (24 is in fact the FUSE4X_NDEVICES value). However, if I look at the commands for those processes, many of them are for the same mount (about 5 distinct mounts total across the 24 processes).
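
Something along these lines shows the count and the distinct commands behind those processes (lsof's -t flag prints just the PIDs; the device path matches my setup):

lsof /dev/autofs | grep sshfs | wc -l
for pid in $(lsof -t /dev/autofs); do ps -o command= -p "$pid"; done | sort | uniq -c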

I loaded one of them into gdb and got the stack traces below. I believe Thread 3 is the interesting one: it seems to have invoked a read syscall that never returns.

This is pretty reproducible, though it takes an hour or two sometimes.

So I'm curious -- do my options above look reasonable? Is there any way to turn on more debugging for the sshfs process? Anyone heard of this sort of thing before or have an inkling as to what might be going on?

Thanks,
nall.

Thread 1
(gdb) where
#0 0x00007fff866fce42 in __semwait_signal ()
#1 0x00007fff8a25f97e in pthread_join ()
#2 0x000000010001abc6 in fuse_session_loop_mt ()
#3 0x000000010000489d in ?? ()
#4 0x00000001000017cc in ?? ()

Thread 2
(gdb) where
#0 0x00007fff866fd00e in __sigsuspend ()
#1 0x00007fff8a25fd3f in pause ()
#2 0x000000010001af28 in fuse_do_work ()
#3 0x00007fff8a2a98bf in _pthread_start ()
#4 0x00007fff8a2acb75 in thread_start ()

Thread 3
(gdb) where
#0 0x00007fff866fdaf2 in read ()
#1 0x00000001000061bf in ?? () // do_read
#2 0x0000000100005b5b in ?? () // sftp_read
#3 0x0000000100007571 in ?? () // process_one_request
#4 0x00007fff8a2a98bf in _pthread_start ()
#5 0x00007fff8a2acb75 in thread_start ()

Anatol Pomozov

Apr 27, 2012, 12:09:16 PM
to fus...@googlegroups.com
Hi, Jon

Thanks for your report.

This is a known deadlock issue in unmount. It has existed in fuse4x from the early days and most likely exists in macfuse as well. The cause is a race condition (most likely in the kernel extension) that marks the filesystem as dead internally but does not send a FUSE_DESTROY message to the userspace filesystem. The userspace filesystem (e.g. sshfs) never receives this 'goodbye' message and never exits, so the filesystem process stays running and never close()s the fuse device file (/dev/fuse4xNN).

autofs mounts and unmounts filesystems a lot, so this race condition quickly eats up the available fuse devices.

This is an annoying issue that I see a lot with sshfs. I am going to address this and some other 'umount' cleanup issues in the 0.10.2 release.


To recover from this error, first kill all the zombie filesystem processes: find them with 'ps ax | grep sshfs', or just kill them all with 'killall -9 sshfs'. *Then* unmount all the zombie mountpoints: list them with 'mount -t fuse4x' and unmount each with 'umount PATH_TO_DIR'.
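
In other words, roughly (PATH_TO_DIR / /DIR1 are just placeholders for whatever 'mount -t fuse4x' lists):

killall -9 sshfs     # kill the stuck sshfs processes first
mount -t fuse4x      # list the leftover fuse4x mountpoints
umount /DIR1         # then unmount each one listed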


A temporary workaround for you is to run sshfs in single-threaded mode; AFAIK the issue happens only in multithreaded mode. Add the '-s' parameter to sshfs.
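
For a mount started by hand that is simply something like (using one of your paths as an example):

sshfs -s USER@SERVER:/DIR1 /DIR1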


PS: While we're here, it is also worth mentioning that 0.10.0 is not recommended for use. It contains a race condition that leads to a kernel panic, and the recommendation is to downgrade back to 0.9.0 (Homebrew has already downgraded the formula). I am going to address the kernel panic in the 0.10.1 bugfix release.

Jon Nall

Apr 27, 2012, 1:31:46 PM
to fus...@googlegroups.com
Anatol,
Thanks for the nice description of the problem.

I tried adding 's' to my mount options like this:
/DIR1 -fstype=sshfs,s,auto_cache,... USER@SERVER:/DIR1
But I get this on the Console:

Apr 27 10:29:47 didot com.apple.automountd[74422]: fuse: unknown option `s'

Is there some other place to put this argument?

Thanks,
nall.

Hans Beckérus

Apr 28, 2012, 3:02:32 AM
to fus...@googlegroups.com
Hi.

-s is a switch rather than a mount option. Try this instead:
-fstype=sshfs,-s,auto_cache,...
That depends on whether the argument parser in sshfs can handle it, though. The built-in argument parsing service in FUSE provides this feature; whether sshfs actually uses it is another question.
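
Taking your original map line, that would be something like this (untested, and only if sshfs passes the switch through):

/DIR1 -fstype=sshfs,-s,auto_cache,allow_other,sshfs_debug,reconnect,defer_permissions,negative_vncache,volname=Dir1 USER@SERVER:/DIR1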

Hans

Anatol Pomozov

Jul 10, 2012, 3:49:20 AM
to fus...@googlegroups.com
Hi, Jon

I think I found the reason for this deadlock. Here is a patched version of fuse4x, https://dl.dropbox.com/u/3842605/Fuse4X-0.9.2-beta.dmg , that does not leak the device files. At least, I no longer see this issue in a test that repeatedly mounts and unmounts fuse filesystems.

If you have a chance, please test it and let me know the result.
