
mutex deadlock at libc.so.6


LaBird

May 10, 2004, 12:09:55 PM
Dear all,

(Sorry if this post is off topic. If so please tell me the right
place to post.)

I am a newbie to the gcc compilers, but have used Linux for C++
programming for some time. Recently I ran into a strange error
in a parallel program running on more than one machine. One of
the processes seems to hang forever, causing the other processes
to wait indefinitely for its reply (sent as socket messages).

The hang does not occur at the same point each time
I run the program. However, when I attach gdb to the
hung process, I get output like this:

Attaching to program: /home/srg/benny/mm4, process 10536
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
0x00580c32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) next
Single stepping until exit from function _dl_sysinfo_int80,
which has no line number information.
0x001f93a6 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
(gdb) next
..... <no response, unless I press CTRL-C> ...

It seems to me that the program calls __lll_mutex_lock_wait()
in the C library, fails to get the lock, and waits forever.
I'd like to ask:
(1) What kinds of program statements would cause
__lll_mutex_lock_wait() to execute?
(2) My program uses signals. Could the signal handling be causing
this deadlock (the program takes the lock before a signal arrives,
and the handler then tries to take the same lock)?

I'm using gcc 2.95.66. Versions 3.2.2 and 3.3.2 also seem to have
this problem. But when I used an older version (egcs 2.91), the
program ran successfully to completion.

I would greatly appreciate any advice or comments. Thanks!

Best Regards,
Benny (LaBird).
Email: remove all numerals to get the valid one.


Paul Pluzhnikov

May 10, 2004, 10:51:41 PM
"LaBird" <wlcheu...@hkucs2004.org> writes:

> (1) What kinds of program statements would cause
> the __lll_mutex_lock_wait () to execute?

Any kind of program using mutexes.
Any multi-threaded program using glibc.

> (2) In this program, I use signals. Is it because the signal
> handling that causes this deadlock to occur (the program gets
> the lock before a signal, and the program tries to get the
> same lock after the signal is invoked)?

It is impossible to tell, given the info you've provided.
At a minimum, execute "where" at the gdb prompt and post the result.

That said, note that it is *extremely* difficult to write correct
programs that are multi-threaded and handle signals.

Also note that you are not allowed to call async-signal-unsafe
functions from within a signal handler (even in single-threaded
programs), and pthread_mutex is definitely not async-signal-safe.
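
To make that restriction concrete, here is a minimal sketch (not taken
from the original program) of a handler that limits itself to calls
POSIX lists as async-signal-safe. The names safe_handler and
install_safe_handler are invented for illustration; write() and
sigaction() are standard POSIX.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>

// Hypothetical handler: reports the signal using only write(2).
// No stdio, no malloc, no mutexes are touched here.
extern "C" void safe_handler(int)
{
    int saved_errno = errno;              // write() may change errno
    static const char msg[] = "signal received\n";
    // Result ignored on purpose: a handler can do little about a failed write.
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    errno = saved_errno;                  // leave interrupted code undisturbed
}

// Installs safe_handler for signo; returns true on success.
bool install_safe_handler(int signo)
{
    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_handler = safe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;             // restart interrupted system calls
    return sigaction(signo, &sa, nullptr) == 0;
}
```

A program would call something like install_safe_handler(SIGIO) once
at startup. Anything more elaborate than a fixed message is usually
better done outside the handler.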

> I'm using gcc 2.95.66. It seems 3.22 and 3.32 also have this
> problem. But when I used an older version (egcs 2.91), the
> program can run successfully to completion.

The bug was probably already there, but was simply hiding.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.

LaBird

May 11, 2004, 3:46:01 AM
Hi,

"Paul Pluzhnikov" <ppluzhn...@charter.net> wrote in message
news:m31xlry...@salmon.parasoft.com...


> "LaBird" <wlcheu...@hkucs2004.org> writes:
>
> > (1) What kinds of program statements would cause
> > the __lll_mutex_lock_wait () to execute?
>
> Any kind of program using mutexes.

So the mutexes are implemented by gcc itself rather than
provided by the OS, right? And what kinds of ordinary C++
statements use mutexes? (Does virtually every I/O call and
memory access take a mutex?)

> Any multi-threaded program using glibc.
>
> > (2) In this program, I use signals. Is it because the signal
> > handling that causes this deadlock to occur (the program gets
> > the lock before a signal, and the program tries to get the
> > same lock after the signal is invoked)?
>
> It is impossible to tell, given the info you've provided.
> At a minimum, execute "where" at the gdb prompt and post the result.

Here it is:
Attaching to program: /home/srg/benny/mm4, process 5365


Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2

0x004c5c32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0 0x004c5c32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00ad13a6 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2 0xbf534a60 in ?? ()
#3 0x00b1e998 in __DTOR_END__ () from /lib/tls/libc.so.6
#4 0x00b1f2a0 in _IO_stdfile_0_lock () from /lib/tls/libc.so.6
#5 0x00a308a7 in _L_mutex_lock_6706 () from /lib/tls/libc.so.6
#6 0xbff7898c in ?? ()
#7 0x08063bc0 in _IO_stdin_used ()
#8 0xbff7b080 in ?? ()


(gdb) next
Single stepping until exit from function _dl_sysinfo_int80,
which has no line number information.

0x00ad13a6 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
(gdb) where
#0 0x00ad13a6 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#1 0xbf534a60 in ?? ()
#2 0x00b1e998 in __DTOR_END__ () from /lib/tls/libc.so.6
#3 0x00b1f2a0 in _IO_stdfile_0_lock () from /lib/tls/libc.so.6
#4 0x00a308a7 in _L_mutex_lock_6706 () from /lib/tls/libc.so.6
#5 0xbff7898c in ?? ()
#6 0x08063bc0 in _IO_stdin_used ()
#7 0xbff7b080 in ?? ()
(gdb)

The ?? entries above puzzle me. Do they mean the program
counter has lost track of where it should proceed?

> That said, note that it is *extremely* difficult to write correct
> programs that are multi-threaded and handle signals.

My program is not intended to be multi-threaded (I do not
create threads explicitly in it). Rather, it is multi-process:
it uses "rsh" to start a copy of itself on each of several
machines (one process per machine).

> Also note, that you are not allowed to call async-signal unsafe
> functions from within a signal handler (even in single-threaded
> programs), and pthread_mutex is definitely not async-signal safe.

I have read elsewhere on the web that malloc() (and other functions
that depend on it) should not be called within a signal handler.
I have used memcpy. Does it depend on malloc(), and could that
be causing the problem?

Thank you very much for your help.

Best Regards,
Benny (LaBird).


Paul Pluzhnikov

May 11, 2004, 10:58:56 AM
"LaBird" <wlcheu...@hku2004cs.org> writes:

> Seems the mutexes are implemented by the gcc itself
> rather than provided by the OS, right?

No. Mutexes are implemented in libpthread.so, which is part of glibc,
but is not part of gcc, nor "the OS".

> But what kinds
> of general C++ statements will use mutexes? (Is that
> virtually every I/O and memory access will call mutexes?)

Any use of stdio will use mutexes, if linked into a multithreaded
program.

> Attaching to program: /home/srg/benny/mm4, process 5365
> Reading symbols from /lib/tls/libm.so.6...done.
> Loaded symbols for /lib/tls/libm.so.6
> Reading symbols from /lib/tls/libc.so.6...done.
> Loaded symbols for /lib/tls/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2

Hmm, no libpthread. I would not expect any mutex use in this case ...

> (gdb) where
> #0 0x004c5c32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1 0x00ad13a6 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2 0xbf534a60 in ?? ()
> #3 0x00b1e998 in __DTOR_END__ () from /lib/tls/libc.so.6
> #4 0x00b1f2a0 in _IO_stdfile_0_lock () from /lib/tls/libc.so.6
> #5 0x00a308a7 in _L_mutex_lock_6706 () from /lib/tls/libc.so.6
> #6 0xbff7898c in ?? ()
> #7 0x08063bc0 in _IO_stdin_used ()
> #8 0xbff7b080 in ?? ()

And the stack trace appears to be totally bogus...
You may want to install glibc-debug and try again. I would not
trust the above stack trace.

> The ?? above is annoying me, does that mean the program
> counter has lost where it should proceed?

It means that you are looking at optimized code which did not follow
the "preserve %ebp" convention, and that gdb was unable to correctly
decode the stack trace. The apparent stack trace is unlikely to
reflect *real* application state.

> My program is not intended to be multi-threaded ...

And doesn't have libpthread in it (just as it should not).
I doubt it ever calls __lll_mutex_lock_wait.

You can verify whether it does, by following this procedure:

- run the app under gdb, break on main.
- when BP1 is hit:

p &__lll_mutex_lock_wait ## prints address of the first instruction
b *0x00ad... ## use the address just printed
cont

If BP2 is ever hit, the app does call __lll_mutex_lock_wait
(contrary to my expectations).

Also, from the stack above it appears that you are using Fedora
Core 1 with exec-shield enabled (or one of the other recent "hack
prevention" inventions). Such use on a *development* machine is
ill-advised, because development *is* hacking.

> I read from other web resources that malloc() (and other functions
> depending on it) should not be called within the signal handler.

Including stdio (e.g. printf/fprintf).

> I have used memcpy. Is it dependent to malloc() thus causes
> the problem?

No. memcpy() is async-signal safe.

LaBird

May 11, 2004, 1:00:34 PM
Hi Paul,

> If BP2 is ever hit, the app does call the _lll_mutex... (contrary
> to my expectations).

I tried a few times, and __lll_mutex_lock_wait() never seems to be
called (although its address is different on every run), just as
you predicted. You are right.

> Also, from the stack above it appears that you are using Fedora
> Core1 with exec-shield enabled (or one of the other "hack prevention"
> recent inventions). Such use on a *development* machine is ill
> advised, because development *is* hacking.

Yes, perhaps it still has some bugs to be fixed. That may account
for the messy debugging session shown in my previous post.

So I set Fedora aside for the moment and tested the program
on four other machines running Red Hat 9 with gcc-3.2.2-5.
I got the same problem again, but this time the gdb messages are
more reasonable:

Attaching to program: /home/benny/mm7, process 11134
Reading symbols from /usr/lib/libstdc++.so.5...done.
Loaded symbols for /usr/lib/libstdc++.so.5


Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6

Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1


Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2

0xffffe002 in ?? ()
(gdb) next
Cannot find bounds of current function
(gdb) next
Cannot find bounds of current function
(gdb)

(gdb) where
#0 0xffffe002 in ?? ()
#1 0x420492bf in vfprintf () from /lib/tls/libc.so.6
#2 0x4204f112 in printf () from /lib/tls/libc.so.6
#3 0x08058025 in gettokenserver(dsmmsg) (m=
{op = 14, subseq = 0, ref = 0, endflag = 1, frompid = 2, topid = 1, seqno = 143,
size = 8, refp = 0x0, outbuff = 0x0, data = "\212\002\000\000\002",
'\0' <repeats 65530 times>}) at dsmall.cpp:449
#4 0x080557cc in msgserver() () at dsmcomm.cpp:134
#5 0x08055cf7 in sigio_handler() () at dsmcomm.cpp:278
#6 <signal handler called>
#7 0x420498c6 in buffered_vfprintf () from /lib/tls/libc.so.6
#8 0x420492bf in vfprintf () from /lib/tls/libc.so.6
#9 0x4204f112 in printf () from /lib/tls/libc.so.6
#10 0x080583d0 in gettoken(int, unsigned) (getfrom=3, objid=400)
at dsmall.cpp:527
#11 0x08058714 in dsmmem(unsigned) (objid=400) at dsmall.cpp:614
#12 0x0805a369 in fasttouch(unsigned, unsigned) (fid=400, fstate=1)
at dsmall.cpp:1362
#13 0x0804c382 in Pointer<int>::operator[](int) (this=0x5ffff628, index=0)
at dsm.h:1237
#14 0x0804974b in main (argc=5, argv=0xbffff1d4) at dsminit3.cpp:164
#15 0x42015574 in __libc_start_main () from /lib/tls/libc.so.6

Does this suggest that re-entering the async-signal-unsafe printf()
(frame #9 is interrupted by the signal, and frame #2 calls it again
from inside the handler) causes the problem?

One more, rather general, question: so far I have not used any
optimization (not even -O1), because enabling it seems to make
problems more likely. With -O2, the program hangs as soon as
I start it. Can a program run successfully without optimization
yet fail to complete when compiled with -O1 or -O2?
If so, are there any general programming rules for avoiding problems
when compiling with -O?

Thank you once again.

Best Regards,
Benny (LaBird).


Paul Pluzhnikov

May 11, 2004, 9:01:25 PM
"LaBird" <wlcheu...@hkucs2004.org> writes:

This doesn't *suggest* it. This stack trace *indicates* it.

You are calling an unsafe function (printf) from a signal handler.
Don't do that. Doing that causes undefined behavior.
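
One common way to follow this advice is to have the handler do
nothing except record that the signal arrived, and to do the real
work (including any printf) later from ordinary code. The sketch
below only illustrates that pattern; it is not the original
program's code, and the names g_io_pending, note_sigio and
drain_pending_io are invented.

```cpp
#include <csignal>
#include <cstdio>

// Set by the handler, consumed by normal code. volatile sig_atomic_t is the
// type the C and C++ standards allow a handler to write to.
volatile std::sig_atomic_t g_io_pending = 0;

// Hypothetical SIGIO-style handler: records the event and returns.
extern "C" void note_sigio(int)
{
    g_io_pending = 1;
}

// Called from the main flow, never from the handler. Returns true if a
// pending event was handled. The flag is cleared *before* the work is done,
// so a signal arriving during the work sets it again and is not lost;
// the work itself should process every message that is currently available.
bool drain_pending_io()
{
    if (!g_io_pending)
        return false;
    g_io_pending = 0;
    std::printf("handling deferred message\n");  // ordinary code, so stdio is fine
    return true;
}
```

The main loop would call drain_pending_io() at points where it is
convenient to service messages. The trade-off is latency: a message
is handled at the next such point rather than the instant the
signal arrives.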

> One more rather general question is: Here I have not used any
> optimizations (-O1) because it seems like when I use it, the chance
> of causing problems is larger. If I use -O2, the program hangs once
> I start it. Will some programs that can successfully run without
> optimization, but fails to complete when using -O1 or -O2?
> If so, is there any general programming rules to avoid problems
> under compiling with -O?

Yes: the general rule is to write *correct* programs with
well-defined behavior. If you write an incorrect program (such as
the one above), its behavior is *undefined*.

That means it may run forever, but crash when optimized, or vice
versa; or it can run on machine A but not on identical machine B;
or it can deadlock every 5 minutes; or crash every 23 seconds but
only on Mondays. You get the picture (I hope).

Paul Pluzhnikov

May 11, 2004, 10:39:37 PM
"LaBird" <wlcheu...@hkucs2004.org> writes:

> (gdb) where


> #3 0x08058025 in gettokenserver(dsmmsg) (m=
> {op = 14, subseq = 0, ref = 0, endflag = 1, frompid = 2, topid = 1, seqno = 143,
> size = 8, refp = 0x0, outbuff = 0x0, data = "\212\002\000\000\002",
> '\0' <repeats 65530 times>}) at dsmall.cpp:449

It would appear that you are passing "dsmmsg" into gettokenserver()
by value, and that "dsmmsg" is 64K in size, and mostly filled
with zeros.

That's *insane*. Consider passing a pointer to dsmmsg instead.
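
As a rough illustration of the difference, assuming a message type of
about that size: BigMsg below is a made-up stand-in, since the real
dsmmsg definition is not shown in this thread, and the three function
names are likewise hypothetical.

```cpp
// Hypothetical stand-in for a large message type (not the real dsmmsg).
struct BigMsg {
    int op;
    int seqno;
    char data[65536];
};

// By value: every call copies sizeof(BigMsg) bytes (64K plus) onto the stack.
int op_by_value(BigMsg m) { return m.op; }

// By pointer: only an address is passed; the caller must pass a valid object.
int op_by_ptr(const BigMsg* m) { return m->op; }

// By const reference: same cost as a pointer, and call sites do not change.
int op_by_ref(const BigMsg& m) { return m.op; }
```

In C++ the const-reference form is often the least intrusive change,
because callers keep writing op_by_ref(msg) while the copy disappears.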

LaBird

May 12, 2004, 2:32:45 AM

"Paul Pluzhnikov" <ppluzhn...@charter.net> wrote in message
news:m3fza6l...@salmon.parasoft.com...

> You are calling unsafe function (printf) from a signal handler.
> Don't do that. Doing that causes undefined behavior.
>
>
> Yes: the general rule is to write *correct* programs with
> well-defined behavior. If you write incorrect program (such as
> above), its behaviour is *undefined*.
>
> That means it may run forever, but crash when optimized, or vice
> versa; or it can run on machine A but not on identical machine B;
> or it can deadlock every 5 minutes; or crash every 23 seconds but
> only on Mondays. You get the picture (I hope).

Thank you very much for your help,
I've learned a lot from your reply.

Regards,
Benny (LaBird).

