BUG: Select(2) on channel fd hangs and makes system unstable on El Capitan


bill...@navimatics.com

Oct 7, 2015, 5:47:20 PM
to OSXFUSE
OS X version: 10.11
OSXFUSE version: 2.8.1

I have a file system that uses the lowlevel API together with select(2) on the FUSE channel fd. This allows the file system to serve multiple concurrent FUSE requests from within a single thread.

[My understanding is that a select(2) on the channel fd is allowed on OSXFUSE as per this message: https://groups.google.com/d/msg/osxfuse-group/grhxqMaghho/WHveaMtJWGgJ]
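
As a rough illustration of the single-threaded approach described above, the sketch below waits on several descriptors with one select(2) call and reports which of them are readable. It uses only POSIX calls. The helper name wait_any_readable is hypothetical, and it is an assumption that one of the descriptors would be the FUSE channel fd (for example as returned by fuse_chan_fd() in libfuse 2.x), alongside any other descriptors the file system services.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for any descriptor in
 * fds[0..nfds-1] to become readable. On success, ready[i] is 1 if fds[i]
 * is readable and 0 otherwise.
 * Returns the number of readable descriptors, 0 on timeout, -1 on error.
 */
int wait_any_readable(const int *fds, size_t nfds, int timeout_ms, int *ready)
{
    fd_set rfds;
    int maxfd = -1;

    FD_ZERO(&rfds);
    for (size_t i = 0; i < nfds; i++) {
        /* FD_SET is only defined for descriptors below FD_SETSIZE. */
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
        ready[i] = 0;
    }

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (n <= 0)
        return n;   /* 0: timed out, -1: error (errno set by select) */

    for (size_t i = 0; i < nfds; i++)
        ready[i] = FD_ISSET(fds[i], &rfds) ? 1 : 0;
    return n;
}
```

A caller would loop over this, reading and answering one request from each descriptor flagged in ready before waiting again.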

This approach has worked fine on OSX 10.10 and earlier. It also works fine on Linux. Unfortunately it has stopped working as of OSX 10.11 (El Capitan). The result is that the select(2) system call hangs indefinitely and the system becomes extremely unstable, often ending in a kernel panic.

The full list of symptoms is as follows:
  • select(2) hangs indefinitely even when called with a 0 timeout.
  • The FUSE process becomes unkillable (kill -9 does *not* work).
  • The mount_osxfusefs process hangs and becomes unkillable (kill -9 does *not* work).
  • The CPU time for the FUSE process and/or the mount_osxfusefs process shoots to 100%.
  • Every few attempts this also results in the rarely seen OS X BSOD (a kernel panic).
  • If the system does not crash, it cannot be shut down or restarted except by holding down the power button.
  • When the system crashes it leaves a crash report behind. In most of these cases the report points to the OSXFUSE kext as the culprit.
I attach a zip file that includes some example crash reports as well as a slightly modified version of fuse/example/hello_ll.c to demonstrate the problem. The modified hello_ll.c calls select(2) on the channel fd just prior to calling fuse_session_loop(). This appears to be enough to trigger the problem.
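
The attachment is not reproduced here. A minimal sketch of a zero-timeout probe of the kind described might look like the following. The name fd_ready_now is hypothetical, and it is an assumption that in the modified hello_ll.c the descriptor passed in would be the channel fd. On a healthy system this call returns immediately, which is what makes an indefinite hang so notable.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll fd for readability without blocking.
 * Returns 1 if fd is readable right now, 0 if not, -1 on error.
 */
int fd_ready_now(int fd)
{
    /* FD_SET is only defined for descriptors below FD_SETSIZE. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    /* A zeroed timeval asks select(2) to return immediately. */
    struct timeval tv = { 0, 0 };

    int n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```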

I have not found a workaround for this problem yet, but I may try the following ugly kludge: wait until after mount_osxfusefs has completed before issuing the select(2) call for the first time.

Thank you,

Bill Zissimopoulos
SelectBug.zip

bill...@navimatics.com

Oct 21, 2015, 6:09:52 PM
to OSXFUSE
I have confirmed that this problem exists on OSX 10.11.1 as well. Since I have not heard anything on this from the list, I intend to engage Apple directly to see what changes in the OS may have caused this issue.

However the silence on the list makes me somewhat worried about the health of the project and the direction it is taking.

Any input welcome.

Bill

Benjamin Fleischer

Oct 21, 2015, 6:38:37 PM
to osxfus...@googlegroups.com
I’ve been looking into the issue but I do not have a fix for osxfuse 2.8 yet. Btw, great bug report. It seems to be a race condition when mounting a FUSE volume that results in a deadlock. If you sleep for a second before calling is_chan_ready() all is fine.

The good news is that the osxfuse 3.0 pre-release version (https://github.com/osxfuse/osxfuse/releases) does not seem to be affected by this issue. At least I have not been able to reproduce the deadlock or panics.

> However the silence on the list makes me somewhat worried about the health of the project and the direction it is taking.

Please know that there is only one developer actively working on osxfuse. Fixing bugs takes time. If you are worried about the health or the direction of the project then you might as well get your hands dirty and submit a pull request.

Regards,
Benjamin

--
You received this message because you are subscribed to the Google Groups "OSXFUSE" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osxfuse-grou...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

bill...@navimatics.com

Oct 21, 2015, 7:05:00 PM
to OSXFUSE
Benjamin, thank you very much for your answer.

On Wednesday, October 21, 2015 at 3:38:37 PM UTC-7, Benjamin Fleischer wrote:
> I’ve been looking into the issue but I do not have a fix for osxfuse 2.8 yet. Btw, great bug report. It seems to be a race condition when mounting a FUSE volume that results in a deadlock. If you sleep for a second before calling is_chan_ready() all is fine.

Just to let you know that I have tried various "fixes" of that sort, but have not been successful in solving the problem. In one test I delayed issuing the select() call until well after mount_osxfuse had completed. However I still experienced the hang and instability described earlier (except that now there was no hung mount_osxfuse process, as it had already completed).
  
> The good news is that the osxfuse 3.0 pre-release version (https://github.com/osxfuse/osxfuse/releases) does not seem to be affected by this issue. At least I have not been able to reproduce the deadlock or panics.

I will try with the latest 3.0 releases and report back as my select scenario is rather more complicated than the one presented in the modified hello_ll.c.

A quick related question: are there any obvious gotchas in moving a FUSE 2.x file system to 3.x?

> However the silence on the list makes me somewhat worried about the health of the project and the direction it is taking.
>
> Please know that there is only one developer actively working on osxfuse. Fixing bugs takes time. If you are worried about the health or the direction of the project then you might as well get your hands dirty and submit a pull request.

Benjamin, point taken.

Unfortunately I do not have the necessary knowledge to work in kernel mode on OS X, but I will be happy to provide patches/input for those things that I am able.

Bill

Benjamin Fleischer

Oct 22, 2015, 2:31:24 AM
to osxfus...@googlegroups.com
> On 22 Oct 2015, at 01:05, bill...@navimatics.com wrote:
>
> A quick related question: are there any obvious gotchas in moving a FUSE 2.x file system to 3.x?

API and ABI of version 3.x should be mostly compatible with version 2.x. A few functions (e.g. fuse_sem_*) have been deprecated. File systems compiled with version 2 should work with version 3 out of the box without the need to recompile anything. Much has changed internally, though. The most noteworthy change is the move from libfuse 2.7.3 to libfuse 2.9.4. But there are still some libfuse 2.9.4 callbacks that are not implemented yet.

Please note that even though I consider version 3 stable, I’m not done with it yet. I do not recommend shipping it to end users at this point.

bill...@navimatics.com

Oct 22, 2015, 4:09:47 PM
to OSXFUSE
Unfortunately the problem still happens as of the latest release (3.0.6).

When I ran my own file system daemon it simply hung. I was able to execute a few select(2) calls before the hang happened (my debug log showed 2-3). There was no mount_osxfuse process around (as I understand it, fuse_mount_core() now waits for the mount_osxfuse process to complete). However the FUSE process was still unkillable, etc.

The problem did not happen at first with the modified hello_ll.c. As said, the first few select(2) calls seem to succeed. I further modified hello_ll.c to perform a select(2) within fuse_session_loop(). This resulted in two kernel panics, both with the message: "preemption level underflow, possible cause unlocking an unlocked mutex or spinlock".

I attach a zip file containing the newly modified hello_ll.c and the two kernel panic reports.

On a related note, I have used one of my Apple Developer Support TSIs (Technical Support Incidents) in the hope of getting Apple to look into this.

Bill

bill...@navimatics.com

Oct 22, 2015, 4:13:50 PM
to OSXFUSE
I forgot to attach the new zip file. Here goes.
SelectBug2.zip

Benjamin Fleischer

Oct 22, 2015, 4:41:59 PM
to osxfus...@googlegroups.com
There is definitely something wrong with locking. Every hang I have seen in my tests occurred when unlocking the ms_mtx mutex. However, it does not always hang on the same line of code.


bill...@navimatics.com

Oct 22, 2015, 5:29:22 PM
to OSXFUSE
I have received a rather interesting response from Apple. They think that this problem should be handled as a bug report on their side. I will submit a bug report on their bugtracker and see if that takes us somewhere.

BTW, do you have a recommended setup for debugging such issues? I would like to take a look myself beyond just poring over source code.

Bill

bill...@navimatics.com

Oct 22, 2015, 5:55:56 PM
to OSXFUSE
On Thursday, October 22, 2015 at 2:29:22 PM UTC-7, bill...@navimatics.com wrote:
> I have received a rather interesting response from Apple. They think that this problem should be handled as a bug report on their side. I will submit a bug report on their bugtracker and see if that takes us somewhere.

Ok, I have submitted the bug report to Apple.

 

Benjamin Fleischer

Oct 23, 2015, 2:47:12 AM
to osxfus...@googlegroups.com
Thanks!

Apple will need the following dSYMs to symbolicate the osxfuse parts of the stack traces: http://sourceforge.net/projects/osxfuse/files/osxfuse-2.8.1/kext-2.8.1.tbz/download. The archive contains three kernel extensions for different versions of OS X: 10.5, 10.6, and 10.9. The 10.9 one is used on OS X 10.11.

On 22 Oct 2015, at 23:29, bill...@navimatics.com wrote:

> BTW, do you have a recommended setup for debugging such issues? I would like to take a look myself beyond just poring over source code.

I’m using the two machine remote kernel debugging setup described in the following documents:

Kernel Programming Guide

Kernel Extension Programming Topics

Both documents are a bit outdated but they still hold valuable information.

Download and install the kernel debug kit for OS X 10.11 from Apple on the "debugger" machine.

You need to start by setting the boot-args of the machine that you want to debug to

debug=0x144 -v

This allows you to debug the kernel after issuing a non-maskable interrupt (NMI). When the machine being debugged hangs, you trigger the NMI on it by pressing Command-Option-Control-Shift-Escape at the same time.

Then you will need to run lldb on your "debugger" machine and connect to the remote machine. Running the following commands will get you started. You will need to alter the path to the kernel in the lldb command and the IP address for kdp-remote.

$ lldb /path/to/Kernels/kernel
...
(lldb) settings set target.load-script-from-symbol-file true
...
(lldb) kdp-remote 172.16.115.128
...
(lldb) showallkmods
...
(lldb) showallstacks

And when you are done:

(lldb) quit

The showallstacks command prints the kernel stack traces for all processes that were running at the time of the hang.

Regards,
Benjamin

bill...@navimatics.com

Oct 23, 2015, 1:46:42 PM
to OSXFUSE
> Apple will need the following dSYMs to symbolicate the osxfuse parts of the stack traces: http://sourceforge.net/projects/osxfuse/files/osxfuse-2.8.1/kext-2.8.1.tbz/download. The archive contains three kernel extensions for different versions of OS X: 10.5, 10.6, and 10.9. The 10.9 one is used on OS X 10.11.

Thanks. I updated the bug report.

> BTW, do you have a recommended setup for debugging such issues? I would like to take a look myself beyond just poring over source code.
>
> I’m using the two machine remote kernel debugging setup described in the following documents:

Ok, this is rather involved but familiar. Takes me back a long time ago (and on a different OS...)

Many thanks for your attention/involvement.

Bill

Benjamin Fleischer

Oct 25, 2015, 1:10:16 PM
to osxfus...@googlegroups.com
I found the bug. Apple is not to blame for this. osxfuse relies on a kernel private struct for select(2) support that has changed between OS X 10.10 and 10.11. I’m working on a fix.
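
To illustrate why depending on a private struct is fragile, here is a small user-space sketch. The struct names and fields are hypothetical and are not the actual kernel definitions. The point is only that code compiled against an old layout reads the wrong bytes once a field is inserted ahead of the one it needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout that a kernel extension was compiled against. */
struct sel_record_old {
    void *wait_queue;
    short flags;
};

/* Hypothetical layout after a later OS release inserted a leading field. */
struct sel_record_new {
    void *lock;
    void *wait_queue;
    short flags;
};

/*
 * Read `flags` the way code built against the old layout would:
 * from the byte offset that was fixed when that code was compiled.
 */
short read_flags_with_old_layout(const void *record)
{
    short flags;

    assert(record != NULL);
    memcpy(&flags,
           (const char *)record + offsetof(struct sel_record_old, flags),
           sizeof flags);
    return flags;
}
```

Given a sel_record_new, the function above returns bytes from the wait_queue field rather than flags. In kernel code the same mismatch can mean locking or unlocking the wrong memory, which would be consistent with the hangs and panics reported in this thread.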

Benjamin Fleischer

Oct 25, 2015, 3:50:45 PM
to osxfus...@googlegroups.com
The just released version 2.8.2 fixes this issue.

Regards,
Benjamin

bill...@navimatics.com

Oct 26, 2015, 1:11:53 PM
to OSXFUSE
Benjamin, excellent work. Thank you!

I will download the latest 2.8.2 and test it.

Bill

bill...@navimatics.com

Oct 26, 2015, 1:47:30 PM
to OSXFUSE
I have confirmed that my test suite passes with FUSE 2.8.2. Thank you again for the fix, Benjamin.