I'm not sure where to post this, but I thought this would be a good start.
I'm writing a C++ thread wrapper library for Win32 & Linux and during
testing, I've found a bug in sem_wait. The bug is as follows: if a thread
with cancellation state/type set to disabled/deferred (yes, I know the
deferred part will be ignored) is waiting on a semaphore and is cancelled
by another thread, sem_wait returns 0 immediately and errno gets set to
EINTR. This is completely wrong -- if sem_wait is going to set errno to
EINTR, it must at least return -1 to tell that something has gone wrong.
sem_wait should not have returned in the first place. Note that this only
occurs with the above cancellation state/type. If the thread's cancellation
state is enabled, sem_wait acts as a cancellation point as expected. This
is especially annoying because the stability of my library is dependent on
the correctness of Linuxthreads.
I've included a bare bones program that reproduces the bug. To compile it,
simply type:
g++ -D_REENTRANT sem_test.cpp -osem_test -lpthread
and run it:
./sem_test
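The attached sem_test.cpp is not reproduced in this thread. What follows is only a rough sketch of what such a reproducer might look like, not the original program: the names WaitResult, waiter and run_sem_cancel_test are made up for illustration, and the sleeps are a crude way to let the waiter block first.

```cpp
// Hypothetical reconstruction; NOT the original sem_test.cpp.
#include <pthread.h>
#include <semaphore.h>
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <thread>

struct WaitResult {
    int rc;      // value returned by sem_wait
    int err;     // errno observed right after sem_wait returned
    bool early;  // true if sem_wait returned before sem_post was called
};

static sem_t g_sem;
static std::atomic<bool> g_posted{false};
static WaitResult g_result;

static void* waiter(void*) {
    int old;
    // The state/type combination described in the report.
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old);

    errno = 0;
    int rc;
    // Retry only on a genuine failure with EINTR (rc == -1).
    while ((rc = sem_wait(&g_sem)) == -1 && errno == EINTR) {
    }
    g_result.rc = rc;
    g_result.err = errno;
    g_result.early = !g_posted.load();
    return nullptr;
}

WaitResult run_sem_cancel_test() {
    g_posted = false;
    sem_init(&g_sem, 0, 0);

    pthread_t t;
    pthread_create(&t, nullptr, waiter, nullptr);
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    pthread_cancel(t);  // cancellation is disabled: must not wake the waiter
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    g_posted = true;    // set before the post so the waiter can tell the difference
    sem_post(&g_sem);
    pthread_join(t, nullptr);
    sem_destroy(&g_sem);
    return g_result;
}
```

On a correct implementation one would expect rc == 0 and early == false. The behaviour reported here would show up as early == true, with err == EINTR.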
Thanks for any help,
Jason
Use the ``glibcbug'' script to submit a bug report. In the meanwhile I will
start looking at it. :)
Never mind. The bug is already fixed in the 2.2 stream. Ulrich Drepper's last
commit to linuxthreads/cancel.c adds a missing test for the disabled
cancellation.
The test wasn't needed once upon a time because cancellation was always handled
by generating a signal to the target thread; the internal signal handler for
that signal would go off and do the cancellability test. That changed for
2.1.3: when a thread is known to be waiting on a cancellation point,
cancellation is handled by kicking it with a restart signal instead. But it's
invalid to do that one unconditionally!
There won't be another 2.1 release; glibc is in beta already for 2.2.
So if you want the bugfix, you have to wait for 2.2 or patch your own 2.1. :(
> On Sun, 05 Nov 2000 05:23:11 GMT, Kaz Kylheku <k...@ashi.footprints.net>
> wrote:
> >On Sun, 05 Nov 2000 04:21:20 GMT, Jason Andrew Nye <jn...@nbnet.nb.ca>
> >wrote:
> >>Hi all,
> >>
> >>I'm not sure where to post this
> >
> >Use the ``glibcbug'' script to submit a bug report. In the meanwhile I
> >will start looking at it. :)
>
> Never mind. The bug is already fixed in the 2.2 stream. Ulrich Drepper's
> last commit to linuxthreads/cancel.c adds a missing test for the disabled
> cancellation.
>
> The test wasn't needed once upon a time because cancellation was always
> handled by generating a signal to the target thread; the internal signal
> handler for that signal would go off and do the cancellability test. That
> changed for
> 2.1.3: when a thread is known to be waiting on a cancellation point,
> cancellation is handled by kicking it with a restart signal instead. But
> it's invalid to do that one unconditionally!
>
> There won't be another 2.1 release; glibc is in beta already for 2.2.
> So if you want the bugfix, you have to wait for 2.2 or patch your own 2.1.
> :(
>
So what you're saying is that although my library is meant to be portable
across Win32 & Linux, it will only run reliably on Win32 through no fault
of its own? I can go ahead and patch my own system, but if my installation
instructions include 'oh, you have to download glibc 2.1.3, patch cancel.c
in the Linuxthreads package, and reinstall the most fundamental library in
your system, which by the way, is not for the faint of heart', then that is
completely unacceptable.
Basically, no software package that uses semaphores on Linux will work
properly -- this constitutes a lot of software: StarOffice, Multithreaded
CORBA ORBs, mozilla, databases, QT, pretty much everything that is useful.
I would *really* reconsider releasing a patch to glibc2.1.3.
Until fundamental libraries in the system are extensively tested and
audited, Linux and all other systems that use these libraries will be
considered academic exercises (again, through no fault of their own). That
bug in cancel.c not only completely breaks any program that uses
semaphores, it was also simple to test for, and the test did not get done. I am a
complete Linux convert, but I've seen lately a rash of supporting-library
bugs that just frustrate me to no end.
I am being forced to release my library with a disclaimer: "This does not
work properly on Linux with glibc 2.1.3". What a farce.
End of rant mode for very frustrated developer. None of this is personal in
any way or directed at any one in particular.
ARRRRRGGGGHH and cheers,
Jason Nye
That is correct; library bugs tend to do that to you.
>I can go ahead and patch my own system, but if my installation
>instructions include 'oh, you have to download glibc 2.1.3, patch cancel.c
>in the Linuxthreads package, and reinstall the most fundamental library in
>your system, which by the way, is not for the faint of heart', then that is
>completely unacceptable.
Alternately, you can give people a ready-made library. You don't have to
reinstall all of glibc; your program can use its own libpthread*.so.
>Basically, no software package that uses semaphores on Linux will work
>properly -- this constitutes a lot of software: StarOffice, Multithreaded
>CORBA ORBs, mozilla, databases, QT, pretty much everything that is useful.
>I would *really* reconsider releasing a patch to glibc2.1.3.
None of these people have complained, to my knowledge. The bug only affects
disabled cancellation, which I suspect is rarely used. The problem with
disabling cancellation is that it has unreliable semantics; cancellation
requests are just dropped on the floor. By turning cancellation off and
then on again, a thread may miss a cancellation request. If a thread turns
off cancellation for good, the application may just as well be coded so that it
never calls pthread_cancel on the thread. It's better to leave cancellation
turned on and just handle it with cleanup handlers. This is especially true
when you are writing thread-safe library code that will be run by threads
supplied by the application.
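As a rough illustration of the "leave cancellation on and use cleanup handlers" advice, here is a minimal sketch. The names unlock_mutex, worker and cancel_with_cleanup are invented for this example. It relies on the POSIX rule that the mutex is re-acquired before cleanup handlers run when a thread is cancelled inside pthread_cond_wait.

```cpp
#include <pthread.h>
#include <atomic>
#include <cassert>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t g_cond = PTHREAD_COND_INITIALIZER;
static bool g_ready = false;  // never set in this sketch, so the worker blocks
static std::atomic<bool> g_cleanup_ran{false};

// Cleanup handler: releases the mutex the cancelled thread was holding.
static void unlock_mutex(void* m) {
    pthread_mutex_unlock(static_cast<pthread_mutex_t*>(m));
    g_cleanup_ran = true;
}

static void* worker(void*) {
    pthread_mutex_lock(&g_lock);
    pthread_cleanup_push(unlock_mutex, &g_lock);
    while (!g_ready) {
        // Cancellation point; cancellation stays enabled (the default).
        pthread_cond_wait(&g_cond, &g_lock);
    }
    pthread_cleanup_pop(1);  // normal path: run the handler to unlock
    return nullptr;
}

// Returns true if the thread was cancelled, the handler ran,
// and the mutex was left unlocked.
bool cancel_with_cleanup() {
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);
    pthread_cancel(t);  // deferred: acted on at the cond_wait

    void* ret = nullptr;
    pthread_join(t, &ret);

    bool free_now = pthread_mutex_trylock(&g_lock) == 0;
    if (free_now) pthread_mutex_unlock(&g_lock);
    return ret == PTHREAD_CANCELED && g_cleanup_ran && free_now;
}
```

The point of the design is that the thread never has to disable cancellation: whatever it holds at a cancellation point is released by the handler it pushed.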
>Until fundamental libraries in the system are extensively tested and
>audited, Linux and all other systems that use these libraries will be
>considered academic exercises (again, through no fault of their own).
That could be said of any operating system that has bugs. Is Solaris
an academic exercise because its recursive mutexes are broken? :)
I don't really want to take on more than I need to here -- I'd have to
support my own version of libpthread and keeping it in sync with the
'standard' version would not be much fun at all.
> None of these people have complained, to my knowledge. The bug only
> affects disabled cancellation, which I suspect is rarely used. The problem
> with disabling cancellation is that it has unreliable semantics;
> cancellation requests are just dropped on the floor. By turning
> cancellation off and then on again, a thread may miss a cancellation
> request. If a thread turns off cancellation for good, the application may
> just as well be coded so that it never calls pthread_cancel on the thread.
> It's better to leave cancellation turned on and just handle it with
> cleanup handlers. This is especially true when you are writing thread-safe
> library code that will be run by threads supplied by the application.
The problem is that POSIX-style cancellation is very dangerous in C++ code
because objects allocated on the stack will never have their destructors
called when a thread is cancelled (leads to memory leaks and other nasty
problems). Typically what a C++ thread library author will do is have
POSIX-style cancellation turned off by default and provide a virtual cancel
operation in the base Thread class so that clients of the library override
this to use more controlled methods of cancellation and, if necessary, they
still have the option of using POSIX-style cancellation in certain areas in
their code (with great care, I might add). Sometimes thread libraries
(Rogue Wave's Threads.h++, for example) throw an exception to handle
cancellation in which case POSIX-style cancellation is not used at all and
should be completely disabled. Just because POSIX-style cancellation is
disabled does not mean a thread cannot be canceled. I've written some
pretty complicated & fancy threaded systems with full cancellation support
and not a single call to pthread_cancel.
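A minimal sketch of the kind of design described here, with cancellation as a virtual operation and no call to pthread_cancel. The classes Thread, QueueWorker and ScopeFlag are hypothetical and are not taken from any library mentioned in this thread; modern standard C++ threading is used purely for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical base class: cancel() is virtual so subclasses can wake
// whatever their run() happens to be blocked on.
class Thread {
public:
    virtual ~Thread() {}
    void start() { worker_ = std::thread([this] { run(); }); }
    void join() { if (worker_.joinable()) worker_.join(); }
    virtual void cancel() { cancelled_ = true; }
protected:
    virtual void run() = 0;
    bool cancel_requested() const { return cancelled_; }
    std::atomic<bool> cancelled_{false};
private:
    std::thread worker_;
};

// Sets a flag from its destructor, to show that stack objects are
// destroyed normally when the thread exits by returning from run().
struct ScopeFlag {
    bool& flag;
    explicit ScopeFlag(bool& f) : flag(f) {}
    ~ScopeFlag() { flag = true; }
};

class QueueWorker : public Thread {
public:
    void cancel() override {
        {
            std::lock_guard<std::mutex> guard(m_);
            cancelled_ = true;
        }
        cv_.notify_all();  // wake run() so it can notice the request
    }
    bool unwound = false;
protected:
    void run() override {
        ScopeFlag marker(unwound);
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return cancel_requested(); });
    }  // ordinary return: destructors run as usual
private:
    std::mutex m_;
    std::condition_variable cv_;
};

bool cooperative_cancel_demo() {
    QueueWorker w;
    w.start();
    w.cancel();
    w.join();
    return w.unwound;
}
```

Because the thread leaves run() by an ordinary return, no special unwinding support is needed from the thread library.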
You mentioned that the bug is in cancel.c. Does this mean that every
deferred cancellation point in Linuxthreads has this problem? If so, I'll
be having an extremely rough time trying to work around this problem...
Cheerio,
Jason
I also maintain a C++ library that has cancellation support without
pthread_cancel, which works over POSIX or Win32. But because I didn't fiddle
with the cancellation at all, I didn't notice the problem. The code doesn't
call pthread_cancel *OR* pthread_setcancelstate; it just pretends that
cancellation doesn't exist. You don't have to disable cancellation in order
to implement your own.
>You mentioned that the bug is in cancel.c. Does this mean that every
>deferred cancellation point in Linuxthreads has this problem?
Yes. Strangely enough, there was only one recent report, from a 2.1.9x beta
tester I think, which led to the fix. I'm not aware of any 2.1.3 users (other
than you) who have reported this. I suspect that it's rather atypical for
software to disable cancellation and then use pthread_cancel anyway.
>If so, I'll
>be having an extremely rough time trying to work around this problem...
Why? If you don't want cancellation, as a workaround, why not just get rid of
calls to pthread_cancel? Or document that users of the library should
not use pthread_cancel on your threads because those requests will be ignored
(or trigger a bug, if using glibc 2.1.3).
I think that if your library silently disables cancellation on the threads that
it creates and then the library user tries to use pthread_cancel, that is
arguably an erroneous situation, regardless of any bugs in the library. The
user expects a response that does not happen. It would be better to have a
cleanup handler which catches the cancellation, and aborts the program with
some message like ``foo library doesn't support POSIX cancellation! Use
ThreadClass::Cancel instead''.
> The problem with disabling cancellation is that it has unreliable semantics;
> cancellation requests are just dropped on the floor.
Are you saying there's a BUG in the implementation of pending cancellation in
Linux? Otherwise, I don't understand this statement (or the following one) at all.
Cancellation can be missed only if it remains disabled for the life of the thread.
(In which case, while it has clearly been "missed", that is also clearly what the
developer intended... whether or not said developer realized that was what was
intended.)
> By turning cancellation off and then on again, a thread may miss a cancellation
> request.
How might this happen in a correct implementation?
> If a thread turns off cancellation for good, the application may just as well be
> coded so that it never calls pthread_cancel on the thread.
Clearly.
> It's better to leave cancellation turned on and just handle it with cleanup
> handlers. This is especially true when you are writing thread-safe library code
> that will be run by threads supplied by the application.
There are cases where cancellable code occurs within critical sections where
"cleanup" is difficult or impossible, or just inconvenient. The standard allows
cancelability to be disabled, and requires that any cancellation request received
while cancelability is disabled will result in a "pending cancel" that will be
delivered at the next cancellation point after cancelability is re-enabled.
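The pending-cancel rule can be sketched as follows. The function and flag names are illustrative only, and the busy-wait with sched_yield is just a simple way to order the two threads for the demonstration.

```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cassert>

static std::atomic<bool> g_in_critical{false};
static std::atomic<bool> g_cancel_sent{false};
static std::atomic<bool> g_critical_done{false};
static std::atomic<bool> g_after_testcancel{false};

static void* worker(void*) {
    int old;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    g_in_critical = true;

    // "Critical section": stay here until the cancel request has been made.
    while (!g_cancel_sent) sched_yield();
    g_critical_done = true;  // still running, so the request is only pending

    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old);
    pthread_testcancel();    // the pending request should be acted on here

    g_after_testcancel = true;  // should not be reached
    return nullptr;
}

// Returns true if the cancel was held pending through the critical
// section and then delivered once cancelability was re-enabled.
bool pending_cancel_is_delivered() {
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);

    while (!g_in_critical) sched_yield();
    pthread_cancel(t);       // arrives while cancelability is disabled
    g_cancel_sent = true;

    void* ret = nullptr;
    pthread_join(t, &ret);
    return g_critical_done && !g_after_testcancel && ret == PTHREAD_CANCELED;
}
```

An implementation that dropped the request instead of holding it pending would let the worker run past pthread_testcancel and return normally.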
/------------------[ David.B...@compaq.com ]------------------\
| Compaq Computer Corporation POSIX Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----[ http://home.earthlink.net/~anneart/family/dave.html ]-----/
> The problem is that POSIX-style cancellation is very dangerous in C++ code
> because objects allocated on the stack will never have their destructors
> called when a thread is cancelled (leads to memory leaks and other nasty
> problems).
This statement is not strictly true. Only an implementation of POSIX thread
cancellation that completely ignores C++, combined with an implementation of
C++ that completely ignores POSIX thread cancellation, results in a dangerous
environment for applications that use both in combination. Because POSIX
cancellation was designed to work with exceptions (it was in fact designed to
be implemented as an exception), the combination is obvious and natural, and
there's simply no good excuse for it to not work.
Personally, I think it's very near criminal to release an implementation where
C++ and cancellation don't work together. Developers who do this may have the
convenient excuse that "nobody made them" do it right. The C++ standard doesn't
recognize threads, and POSIX has never dealt with creating a standard for the
behavior of POSIX interfaces under C++. (Technically, none of the POSIX
interfaces are required to work under C++, though you rarely see a UNIX where
C++ can't call write(), or even printf().) Excuses are convenient, but this is
still shallow and limited thinking. I don't understand why anyone would be
happy with releasing such a system.
I spent a lot of time and energy educating the committee that devised the ABI
specification for UNIX 98 on IA64 to ensure that the ABI didn't allow a broken
implementation. Part of this was simply in self defense because a broken ABI
would prohibit a correct implementation. I'd also had some hope that the
reasonable requirements of the ABI would eventually percolate up to the source
standard. More realistically, though, I hoped that by forcing a couple of C++
and threads groups to get together and do the obviously right (and mandatory)
thing for IA64, they might do the same obviously right (though not mandatory)
thing on their other platforms. Maybe someday it'll even get to Linux.
Please don't settle for this being broken. And especially, don't believe that
it has to be that way. Anyone who can implement C++ with exceptions can create
a language-independent exception facility that can equally well be used by the
thread library -- and, with a few trivial source extensions, by C language code
(e.g., through the POSIX cleanup handler macros).
Actually there is one in the latest glibc-2.1.9x betas. I just submitted
a bug report.
>Otherwise, I don't understand this statement (or the following one) at all.
>Cancellation can be missed only if it remains disabled for the life of the
>thread.
That's what I used to believe, but I think that I got mixed up about this in
some libc-alpha discussions which were related to the ``fix'' which actually
broke the behavior.
What's worse is that the glibc documentation is misleading; it uses the
term ``ignore'' to describe what happens to cancellation requests. So I think
that the documentation is partly to blame for the bug creeping in unnoticed.
Cancellation used to work properly in 2.1.2, except for subtle races in
timed-out waits. Some necessary fixes for those races, introduced in 2.1.3,
accidentally broke things: cancellation could proceed even when disabled.
That was fixed properly in the 2.2 stream, but then an extra test was added
to pthread_cancel to simply bail out of the function without doing anything
if the target thread's cancellation is disabled.
I seem to recall that I complained about this but I didn't properly follow up
with the standard; instead I was somehow convinced by the ensuing debate into
readjusting my thinking that it's actually right. ;)
Luckily this new issue affects only 2.2 beta testers.
What should the semantics be, in your opinion? POSIX cleanup handlers first,
then C++ unwinding? Or C++ unwinding first, then POSIX cleanup handlers? Or
should the proper nesting of cleanup handlers and C++ statement blocks be
observed?
I understand that in the Solaris implementation, the POSIX handlers are done
first and then the C++ cleanup. How about Digital UNIX? My concern is what GNU
libc should do; where there isn't a standard, imitating what some other popular
implementations do would make sense.
I don't think that C++ unwinding can be done first, since POSIX
handlers may refer to objects on the stack. I can see a similar
argument preventing the reverse order (e.g. a class which locks a
mutex in its constructor and unlocks in the destructor).
--
Eppur si muove
> On Mon, 06 Nov 2000 09:16:42 -0500, Dave Butenhof <David.B...@compaq.com>
> wrote:
> >environment for applications that use both in combination. Because POSIX
> >cancellation was designed to work with exceptions (it was in fact designed to
> >be implemented as an exception), the combination is obvious and natural, and
> >there's simply no good excuse for it to not work.
>
> What should the semantics be, in your opinion? POSIX cleanup handlers first,
> then C++ unwinding? Or C++ unwinding first, then POSIX cleanup handlers? Or
> should the proper nesting of cleanup handlers and C++ statement blocks be
> observed?
My opinion is that they should be executed in the only possible correct or useful
order.
( ;-) -- but only for the phrasing, not the message.)
Each active "unwind scope" on the thread must be handled in order. (The opposite
order from that in which they were entered, of course.)
The obvious implementation of this is that both C++ destructors (and catch
clauses) and POSIX cleanup handlers are implemented as stack-frame-scoped
exception handlers, and that each handler is executed, in order, as the frame is
unwound by a single common unwind handler.
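To make the nesting order concrete, here is a small sketch that interleaves C++ objects with a POSIX cleanup handler and records what runs when the thread is cancelled. It assumes a toolchain where cancellation is implemented as a forced stack unwind that also runs C++ destructors (current glibc with g++ is believed to behave this way; that is an assumption about your platform, and it is not how the LinuxThreads discussed in this thread behaved). The names Tracer, record_handler and cancel_and_record_order are invented for the example.

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <cassert>
#include <string>
#include <vector>

static sem_t g_never_posted;
static std::vector<std::string> g_order;  // written by the worker, read after join

// Records its name when destroyed.
struct Tracer {
    const char* name;
    explicit Tracer(const char* n) : name(n) {}
    ~Tracer() { g_order.push_back(name); }
};

static void record_handler(void* arg) {
    g_order.push_back(static_cast<const char*>(arg));
}

static void* worker(void*) {
    Tracer outer("outer destructor");
    pthread_cleanup_push(record_handler, const_cast<char*>("cleanup handler"));
    Tracer inner("inner destructor");
    sem_wait(&g_never_posted);  // cancellation point; the thread is cancelled here
    pthread_cleanup_pop(0);
    return nullptr;
}

std::vector<std::string> cancel_and_record_order() {
    g_order.clear();
    sem_init(&g_never_posted, 0, 0);

    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);
    pthread_cancel(t);
    pthread_join(t, nullptr);

    sem_destroy(&g_never_posted);
    return g_order;
}
```

Under the stated assumption, the recorded order should be the reverse of entry: inner destructor, then cleanup handler, then outer destructor. An implementation that runs all POSIX handlers first, or skips destructors entirely, would produce a different sequence.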
Any other order will break one or the other, or both.
> I understand that in the Solaris implementation, the POSIX handlers are done
> first and then the C++ cleanup. How about Digital UNIX? My concern is what GNU
> libc should do; where there isn't a standard, imitating what some other popular
> implementations do would make sense.
I don't know the details of the Solaris implementation, but what you describe is
clearly broken and useless except in trivial and contrived examples.
We, of course, do it "correctly", though it could be cleaner. For example, right
now C++ code can't catch a cancel or thread exit except with the overly general
"catch(...)", because C++ isn't allowed to use pthread_cleanup_push/pop (and
shouldn't want to, since C++ syntax is more powerful), and C++ doesn't have a name
for those "foreign" exceptions. (Of course destructors work fine.) We've worked
with the compiler group to add some builtin exception subclasses to deal with
that, but we never found the time to finish hooking up all the bits.
Our UNIX was architected from the beginning with a universal calling standard that
supports call-frame based exceptions. All conforming language processors must
provide unwind information (procedure descriptors) for all procedures, and a
common set of procedures (in libc and libexc) support finding and interpreting the
descriptors and in unwinding the stack. Our C compiler provides extensions to
allow handling these native/common exceptions from C language code. Our
<pthread.h> uses these extensions to implement POSIX cleanup handlers. (For other
C compilers, we use a setjmp/longjmp package built on native exceptions "under the
covers", though with some loss of integration when interleaved call frames switch
between the two models. Support for our extensions, or something sufficiently
similar, would allow me to make gcc work properly.) Both cancel delivery and
pthread_exit are implemented as native exceptions. The native unwind mechanism
will unwind all call frames (of whatever origin) and call each frame's handler (if
any) in the proper order. (Another minor glitch is that our exception system has a
single "last chance" handler, on which both we and C++ rely. We set it once at
initialization, but C++ sets it at each "throw" statement, which will break
cancellation or thread exit of the initial thread since we can't put a frame
handler at or below main(). This is also fixed by our not-quite-done integration
effort with C++.)
This is all covered by the IA64 ABI. Of course it specifies API names, and data
structure sizes and contents. It's also somewhat more biased than our
implementation towards C++, since it was a generalization and cleanup of the C++
ABI section on exceptions rather than something designed independently. (The ABI,
and any C++ implementation, had to do this anyway. Making it general was only a
little more work than making it exclusive to C++, and of fairly obvious value.)