SIGBUS that doesn't go away even after application restart, only after reboot


AJ ONeal

Dec 20, 2010, 2:26:42 PM
to uta...@googlegroups.com, Provo Linux Users Group
I've got an application which runs for several hours in the same loop correctly and then exits with SIGBUS.

If I try to start the application again it immediately exits with SIGBUS and will not work until rebooted.
If I reboot the system it runs for several hours again.

gdb reports
Program received signal SIGBUS, Bus error.
0x401bd354 in memcpy () from /lib/libc.so.6
(gdb) bt
#0  0x401bd354 in memcpy () from /lib/libc.so.6
#1  0x0000929c in capture_file_write ()
#2  0x0fefeb0a in ?? ()
#3  0x0fefeb0a in ?? ()

Any ideas on what causes this error?
How do I go about finding what #2 and #3 are?

This is the application flow:

count = 0 # this is %d
do
    outfile = /dev/shm/output.%d.dat
    data = read logged.dat for 512kb
    unlink outfile if exists
    truncate outfile to 512kb
    outfile_p = mmap outfile
    memcpy data outfile_p
    count += 1
    if count > MAX; count = 0
    advance or rewind logged.dat
loop
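
For reference, one iteration of the flow above could look roughly like the following in C. This is only a sketch under assumptions, not the actual application code: the helper name `write_chunk_mmap`, the file mode, and the error handling are made up for illustration. It follows the same steps (unlink, truncate, mmap, memcpy) and adds the munmap and close that the pseudocode leaves out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the original application): copy len bytes
 * of data into a freshly created file at path through a shared mapping.
 * Returns 0 on success, -1 on failure. */
int write_chunk_mmap(const char *path, const void *data, size_t len)
{
    assert(data != NULL && len > 0);

    /* "unlink outfile if exists": the error is ignored if it is absent. */
    unlink(path);

    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    /* "truncate outfile to 512kb": give the new file its final size. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    /* "outfile_p = mmap outfile" */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    /* "memcpy data outfile_p" */
    memcpy(p, data, len);

    /* Release the mapping so it does not accumulate across iterations. */
    return munmap(p, len);
}
```

Checking every return value means a failing open, ftruncate, or mmap is reported at the call that failed instead of surfacing later inside memcpy.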

AJ ONeal

Shawn

Dec 20, 2010, 2:41:48 PM
to Utah C Users Group
On Dec 20, 12:26 pm, AJ ONeal <coola...@gmail.com> wrote:
> Any ideas on what causes this error?

SIGBUS is usually caused by one of two things: an unaligned memory
access, or an access to a mapped address that has nothing behind it
(a completely unmapped address would raise SIGSEGV instead). Clearly
your case must be the latter, since memcpy does byte-aligned
reads/writes. I've only seen it happen when a file is mmap'd, then
truncated, then an access is made to the now-nonexistent portion of
the file.

Your pseudocode doesn't show that you ever munmap and close outfile,
but I'm supposing you do. If you didn't, I suppose you might
eventually exhaust the virtual memory space -- but all of that space
should be unmapped as soon as the process terminates, so that wouldn't
explain the need to reboot.

This is very weird.

> How do I go about finding what #2 and #3 are?

There's another very weird thing. Those addresses are funny-looking.
One thing you can do is make sure you're linking to a libgcc that has
debugging symbols, so you can see all of the stack layers that run
before your code is called... but even without that, I'd expect to see
a main() on that stack before you get to frames without symbols. I'm
assuming that all of your code is built with debugging symbols.

The fact that a *reboot* is needed to clear the bad state really
points toward a kernel bug. It might be a good idea to ask on LKML (I
assume this is on Linux).

--
Shawn

Shawn

Dec 20, 2010, 2:45:47 PM
to Utah C Users Group
On Dec 20, 12:41 pm, Shawn <shawnwill...@gmail.com> wrote:
> The fact that a *reboot* is needed to clear the bad state really
> points toward a kernel bug.  It might be a good idea to ask on LKML (I
> assume this is on Linux).

I should be a little clearer about this: I expect it's an application
bug that is causing the problem, but the fact is that once the process
is dead, any outstanding mappings should have been cleaned up by the
kernel.

Do any other programs have problems after the system has gotten into
this state?

--
Shawn