hangs in flockfile() during fread() or fclose()


Paul Smith

Jul 9, 2014, 8:05:28 PM
to darwi...@lists.apple.com
I'm working on a massively multithreaded application (often there are 30
or more threads) that runs on GNU/Linux, Mac OS X, and Windows. I've
been hitting a problem that only appears on Mac OS X, and I can't
understand it.

I've seen it on various (relatively up-to-date) Mac OS X versions; for
example, the latest occurrence happened here:

$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.9.2
BuildVersion: 13C64

What happens is that my application hangs forever. When I attach with
the debugger or force a core dump, the hang is happening in flockfile():

Thread 14 (core thread 13):
#0 0x00007fff8a997746 in __psynch_mutexwait ()
#1 0x00007fff88cd7779 in _pthread_mutex_lock ()
#2 0x00007fff856c0edd in flockfile ()
#3 0x00007fff856c2c06 in fread ()
#4 0x0000000101fa1eb5 in Stream::load (this=0x1150d9930, fileName=<unavailable>) at /Users/build/src/Stream.cpp:182
...

I've seen this same thing happen on an fclose(), although from a
different function and for a different file. In both cases the thread
just sits there waiting for the lock, which is never released.

Now, I have checked my code and I never invoke flockfile() myself, so
it's only ever invoked as a side-effect of the stdio functions in libc
(I've gotten a copy of the libc source code and looked at the
implementation in the FreeBSD directory).

Further, as you can see above, I'm passing the filename to the
function, not a FILE*, and the function calls fopen(), then fread() in a
loop, then fclose(), and never passes the file pointer to any other code;
basically:

void Stream::load(const char* fileName)
{
    FILE* file = fopen(fileName, "rb");

    if (!file) {
        throw Exception(...);
    }

    Buffer buffer;

    for (size_t n; (n = fread(buffer, 1, buffer.size(), file)) > 0;) {
        append_to_local(buffer, n);
    }

    fclose(file);
}

I'm really confused about this. Google doesn't seem to be helping.
Does anyone have any thoughts or ideas of where to turn or how to
proceed?


Stephen J. Butler

Jul 9, 2014, 8:55:31 PM
to pa...@mad-scientist.net, Darwin Dev
Can you distill this down to a self-contained test case?


Paul Smith

Jul 10, 2014, 12:26:39 AM
to Stephen J. Butler, Darwin Dev
On Wed, 2014-07-09 at 19:55 -0500, Stephen J. Butler wrote:
> Can you distill this down to a self-contained test case?

Difficult. It only happens once in a while in our test suite, which
comprises 14,000 individual tests of our entire application. Every now
and then we notice that the tests are hung, and when we go to investigate
and get a core this is what we find. It happens at different times in
different areas (the fclose() hang was in a totally different function,
reading different files).

Our use of stdio functions is literally as simple as the example I
quoted in my original email: the FILE* is opened and assigned to a local
variable, used, then closed before the function returns and not passed
to any functions (except other stdio functions).

I could try to write a simple program that creates lots of threads and
randomly reads and writes files; I'm not sure how effective that would
be as a repro case.
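
For what it's worth, the kind of thing I had in mind is roughly this (an
untested sketch; the thread count, buffer size, and file names are
arbitrary):

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Each worker repeatedly writes and then reads back its own scratch file
// through stdio, the same way our real code does.
static void worker(int id)
{
    std::string name = "/tmp/stress." + std::to_string(id);
    char buf[4096] = {0};

    for (;;) {
        FILE* f = fopen(name.c_str(), "wb");
        if (!f) continue;
        fwrite(buf, 1, sizeof(buf), f);
        fclose(f);

        f = fopen(name.c_str(), "rb");
        if (!f) continue;
        while (fread(buf, 1, sizeof(buf), f) > 0) { }
        fclose(f);
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 30; ++i)
        threads.emplace_back(worker, i);
    for (std::thread& t : threads)
        t.join();
}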

I'm wondering if someone has pointers on what we might investigate (and
how) when we get a process in this state, that might help us narrow down
where to look or what to concentrate on.

Greg Parker

Jul 10, 2014, 12:46:38 AM
to pa...@mad-scientist.net, Darwin Dev
On Jul 9, 2014, at 9:26 PM, Paul Smith <pa...@mad-scientist.net> wrote:
> I'm wondering if someone has pointers on what we might investigate (and
> how) when we get a process in this state, that might help us narrow down
> where to look or what to concentrate on.

Some possibilities include:
* That thread is deadlocked against itself because it's trying to call fread() from a signal handler and the signal handler interrupted another flockfile-ing call. What is the rest of that stack trace?
* The process is deadlocked because some other thread owns the lock and won't let go for some reason. What are the other threads' stack traces?
* The lock is broken because a memory error smashed it. What do the memory contents of the lock look like? (I don't know what the internals of the current pthread mutex look like, but the first four bytes should be something similar to 'MUTX'.)


--
Greg Parker gpa...@apple.com Runtime Wrangler

Stephen J. Butler

Jul 10, 2014, 1:04:11 AM
to Paul Smith, Darwin Dev
Don't take this the wrong way, but...

Incredible claims -- like stdio being broken -- require incredible evidence. From our perspective, it's much more likely that your code has a memory corruption, double free, or stack smashing bug somewhere than stdio not working correctly. You might want to try Malloc Debug if you haven't already:

https://developer.apple.com/library/mac/documentation/performance/conceptual/managingmemory/articles/MallocDebug.html

valgrind, although a pain to set up, has also helped me find memory-related bugs in the past.

Of course, a reproducible test case would point the finger where it belongs. Or it might help you find a bug in your code. I know there have been times I was ready to blame a framework for something, and when writing the test case I discovered it was something I'd misunderstood.

Paul Smith

Jul 10, 2014, 8:54:36 AM
to Greg Parker, Darwin Dev
On Wed, 2014-07-09 at 21:46 -0700, Greg Parker wrote:
> On Jul 9, 2014, at 9:26 PM, Paul Smith <pa...@mad-scientist.net> wrote:
> > On Wed, 2014-07-09 at 19:55 -0500, Stephen J. Butler wrote:
> >> Can you distill this down to a self-contained test case?
> >
> > I'm wondering if someone has pointers on what we might investigate (and
> > how) when we get a process in this state, that might help us narrow down
> > where to look or what to concentrate on.
>
> Some possibilities include:
> * That thread is deadlocked against itself because it's trying to call
> fread() from a signal handler and the signal handler interrupted
> another flockfile-ing call. What is the rest of that stack trace?

Nope, we don't do signal handlers. Signals are either ignored or cause
the process to crash; we don't register any handlers. The rest of the
stack trace is unremarkable: a bunch of our internal functions, and at
the top (bottom?) of the stack:

#17 0x00007fff88cd4899 in _pthread_body ()
#18 0x00007fff88cd472a in _pthread_start ()
#19 0x00007fff88cd8fc9 in thread_start ()

> * The process is deadlocked because some other thread owns the lock
> and won't let go for some reason. What are the other threads' stack
> traces?

They're all waiting in recv() or sleep(). But unless I'm badly
misunderstanding something, this can't be the explanation, because the
FILE* is opened, used, and closed within this one function, and the lock
is local to the FILE object and never shared between different FILE
objects.

The one thing I was thinking is this: maybe I have an fopen() and then an
fclose(), but then something uses the FILE* again after the fclose(),
which corrupts the stdio structure somehow. Then that FILE object is
handed out again by a later fopen(), and this causes the problem.

I can't find anyplace that this happens, but I will look harder.

> * The lock is broken because a memory error smashed it. What do the
> memory contents of the lock look like? (I don't know what the
> internals of the current pthread mutex look like, but the first four
> bytes should be something similar to 'MUTX'.)

OK, memory errors are always a possibility. I'll see if I can dig into
the lock contents.

On Thu, 10 Jul 2014 at 00:04 -0500, Stephen J. Butler wrote:
> Incredible claims -- like stdio being broken -- require incredible
> evidence.

I definitely agree, that's why I didn't say stdio was broken :-) (or at
least, I didn't mean to do so). I'm just reporting what I see and
asking for tips on where to go from here.

> From our perspective, it's much more likely that your code has a
> memory corruption, double free, or stack smashing bug somewhere than
> stdio not working correctly. You might want to try Malloc Debug if you
> haven't already:
>
> https://developer.apple.com/library/mac/documentation/performance/conceptual/managingmemory/articles/MallocDebug.html

> valgrind, although a pain to set up, has also helped me find
> memory-related bugs in the past.

We do run valgrind on GNU/Linux, but not on Mac OS. 99% of the code is
identical between them, but there are a few differences, none related to
file I/O; in fact, portability is the main reason we're using stdio
instead of open/read/close in the first place. I've considered avoiding
this problem by switching to the system calls, since the way we use
stdio it has no real benefit beyond portability.
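
As an aside, if we do go that route I'd expect the function to end up
looking roughly like this (just a sketch, with the same Buffer and
append_to_local as in my earlier example; EINTR and short-read handling
omitted):

#include <fcntl.h>
#include <unistd.h>

void Stream::load(const char* fileName)
{
    int fd = open(fileName, O_RDONLY);

    if (fd < 0) {
        throw Exception(...);
    }

    Buffer buffer;

    // Same shape as the stdio version, just read(2) instead of fread().
    for (ssize_t n; (n = read(fd, buffer, buffer.size())) > 0;) {
        append_to_local(buffer, n);
    }

    close(fd);
}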

The problem, of course, is that even with an optimized build the full
tests take 6+ hours and the failure doesn't always happen, and running
them under valgrind or even a debugging malloc will take significantly
longer (and could change the timing enough that the failure never
happens).

However, I guess it's up to me now. I'll follow up if I figure out
anything.

Cheers!

Paul Smith

Jul 10, 2014, 10:31:10 AM
to Greg Parker, Darwin Dev
On Thu, 2014-07-10 at 08:53 -0400, Paul Smith wrote:
> However, I guess it's up to me now. I'll follow up if I figure out
> anything.

Hm. OK, I think I have a smoking gun.

I looked at the core again and it turns out there's ANOTHER thread,
which is ALSO hung in flockfile(). I didn't notice it the first time.
So this is a real deadlock rather than a memory stomp.

Examining the other thread, I see a bug in our code: in our logging we
check whether the log FILE* is NULL and, if so, write to stdout instead.
We do this properly everywhere except for the fflush() at the end; there
we just pass the log FILE* without checking. And of course, if you call
fflush(NULL) it tries to flush every open output stream, and this is
where it hangs:

Thread 7 (core thread 6):
#0 0x00007fff8a997746 in __psynch_mutexwait ()
#1 0x00007fff88cd7779 in _pthread_mutex_lock ()
#2 0x00007fff856c0edd in flockfile ()
#3 0x00007fff856c156f in sflush_locked ()
#4 0x00007fff856c3e82 in _fwalk ()
#5 0x0000000101d9112f in Engine::logger (this=<unavailable>, message=<unavailable>) at /Users/build/src/Logger.cpp:887
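
The offending code boils down to something like this (simplified, and
these aren't the real names):

void Engine::logger(const char* message)
{
    // Everywhere else we check for a missing log file and fall back to
    // stdout...
    FILE* out = m_logFile ? m_logFile : stdout;
    fprintf(out, "%s\n", message);

    // ...but not here: when m_logFile is NULL, fflush(NULL) walks every
    // open stream (via _fwalk) and takes each stream's lock as it goes.
    fflush(m_logFile);
}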

Looking at the code in libc I _think_ I see the problem:

* In fopen(), if we need to get new FILE objects, it appears to me
  that they are added into the list _before_ they are completely
  initialized; in particular before the INITEXTRA() macro, which
  is what initializes the _fl_lock mutex, is run.

* In _fwalk(), we walk the list of open FILE objects without taking
  a lock; the comment says:

      * It should be safe to walk the list without locking it;
      * new nodes are only added to the end and none are ever
      * removed.

Then _fwalk() passes the FILE object to sflush_locked() which calls
flockfile() on it.

So I believe that it's possible for one thread to be calling flockfile()
on a FILE object with an uninitialized _fl_lock mutex, leading to
corruption.
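
If that's right, a targeted repro should be much easier than the generic
stress test I was considering: race fopen()/fclose() in a handful of
threads against fflush(NULL) in another. Something like this (untested
sketch) ought to hang eventually:

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Keep linking new FILE objects into stdio's global open-file list.
static void opener(int id)
{
    std::string name = "/tmp/fopen-race." + std::to_string(id);
    for (;;) {
        if (FILE* f = fopen(name.c_str(), "w")) {
            fputc('x', f);
            fclose(f);
        }
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i)
        threads.emplace_back(opener, i);

    // Meanwhile, keep walking the list of open streams, the way our
    // buggy fflush(NULL) call does.
    for (;;) {
        fflush(nullptr);
    }
}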

I will change our code, since we definitely don't want to be calling
fflush(NULL), but it also seems to be a problem in libc, since this kind
of thing is exactly what flockfile() is there to handle, IIUC.

Am I understanding the code correctly, or did I miss something? Should
I file a bug? If so where's the best place?

comex

Jul 10, 2014, 12:02:55 PM
to Paul Smith, Darwin Dev
On Thu, Jul 10, 2014 at 10:30 AM, Paul Smith <pa...@mad-scientist.net> wrote:
> Am I understanding the code correctly, or did I miss something? Should
> I file a bug? If so where's the best place?

Looks like a bug to me. Here is a test case -- compiled with clang and
run with MallocScribble=1 (not sure if that's necessary), it eventually
hangs.

You can file bugs at https://bugreport.apple.com.
(attachment: fopendl.c)

Greg Parker

Jul 10, 2014, 8:43:01 PM
to Paul Smith, Darwin Dev

Your diagnosis sounds plausible to me. That code needs appropriate memory barriers if it wants to play games with lock-free algorithms, and I don't see any. Please file a bug report from http://bugreport.apple.com.
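
In C++11 terms the needed ordering is roughly: fully initialize the node,
then publish it with a release store, and have the walker use an acquire
load. Assuming writers are already serialized by fopen's own list lock
(which I'd expect), and prepending for brevity where the real list appends
at the end, the shape is something like this sketch (hypothetical types,
not the actual Libc code):

#include <atomic>

struct File {
    std::atomic<File*> next{nullptr};
    // lock, buffers, etc. -- all set up before the node is published
};

static std::atomic<File*> g_head{nullptr};

// Writer side (the fopen analogue): the release store makes every prior
// write to *f visible to any reader that sees the new head.
void publish(File* f)
{
    f->next.store(g_head.load(std::memory_order_relaxed),
                  std::memory_order_relaxed);
    g_head.store(f, std::memory_order_release);
}

// Reader side (the _fwalk analogue): the acquire load pairs with the
// release store, so every node reachable from head is fully initialized.
void walk(void (*visit)(File*))
{
    for (File* f = g_head.load(std::memory_order_acquire); f != nullptr;
         f = f->next.load(std::memory_order_relaxed)) {
        visit(f);
    }
}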


--
Greg Parker gpa...@apple.com Runtime Wrangler


Paul Smith

Jul 11, 2014, 4:25:36 PM
to Greg Parker, Darwin Dev
On Thu, 2014-07-10 at 17:43 -0700, Greg Parker wrote:
> > Am I understanding the code correctly, or did I miss something?
> > Should I file a bug? If so where's the best place?
>
> Your diagnosis sounds plausible to me. That code needs appropriate
> memory barriers if it wants to play games with lock-free algorithms,
> and I don't see any. Please file a bug report from
> http://bugreport.apple.com.

OK, 17644882 has been filed. Thanks, all!
