
For what reasons would a process be delivered SIGBUS


Andrew Falanga

Jun 25, 2014, 5:18:51 PM
Hi everyone,

Some code that my team produced generates SIGBUS occasionally (not at all at regular intervals). Since our release code is optimized, I decided to deliver an unoptimized build to the customer for a core file which would be easier to diagnose (or so was the theory). Well, the debug build didn't give an error, so the theory became that the optimization flags we were using were not good (we were using -Os and changed to -O2 in gcc). The problem just happened again.

I've read through some posts to this group, Stack Overflow, and a myriad of others, and most say the same thing: that a bus error comes from dereferencing an unaligned pointer. (With the help of a Wikipedia article for the assembler code, I've even produced code which makes a bus error all the time.) The problem is, as I understand it, the x86 architecture protects against this unless you specifically embed assembler to enable the trap (which we're not doing). Furthermore, the one core file I've examined which came from SIGBUS showed all pointers having DWORD-aligned addresses (this was on a 32-bit Linux build).

Past posts to this group, going back to the 90's, show that it's usually the unaligned pointer. However, newer posts also mention the use of mmap() on files which change in size, or an attempt to read past the end, etc. I've also found a post on Stack Overflow, http://stackoverflow.com/a/2096012/988207 (if anyone is interested), which states that a failed I/O request can also produce a bus error on Linux.

I would simply like to know for what reasons "modern" Linux kernels will deliver SIGBUS. I put modern in quotes because this is a CentOS 6.0 system which is using kernel 2.6.32 ... pretty "old".

Thanks,
Andy

Jorgen Grahn

Jun 25, 2014, 5:49:37 PM
On Wed, 2014-06-25, Andrew Falanga wrote:
> Hi everyone,
>

> Some code that my team produced generates SIGBUS occasionally (not
> at all at regular intervals). Since our release code is optimized, I
> decided to deliver an unoptimized build to the customer for a core
> file which would be easier to diagnose (or so was the theory).

I guess you know you can have optimization /and/ debug info with gcc?
E.g. 'gcc -O3 -g ...'. (Of course, optimization will still make the
result harder to interpret, with values optimized away, functions
inlined and so on.)

IIRC, you can also recompile while adding -g, and that executable will
be compatible with the core dump generated by the stripped executable.

> Well, the debug build didn't give an error so the theory was the
> optimization flags we were using were not good (is -Os and we changed
> to -O2 in gcc). The problem just happened again.

> I've read through some posts to this group and Stack Overflow and
> a myriad of others and most say the same things, that a bus error
> comes from dereferencing an unaligned pointer. (with the help of a
> wikipedia article for the assembler code, I've even produced code
> which makes a bus error all the time). The problem is, as I
> understand it, the x86 architecture protects against this

True; it's unusually tolerant. A coworker often laments the fact that
we abandoned SPARC, because it's unusually intolerant ...

[snip]

I don't know enough offhand to comment on the rest. You seem to be
asking the right questions ... but you also have the core dump: it's
showing you the instructions going on when the SIGBUS happened, along
with the registers. The answer is in there somewhere (but sadly you
need to more or less understand some x86 assembly).

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Kaz Kylheku

Jun 25, 2014, 6:08:32 PM
On 2014-06-25, Andrew Falanga <af30...@gmail.com> wrote:
> Hi everyone,
>
> Some code that my team produced generates SIGBUS occasionally (not at all at
> regular intervals). Since our release code is optimized, I decided to
> deliver an unoptimized build to the customer for a core file which would be
> easier to diagnose (or so was the theory). Well, the debug build didn't give
> an error so the theory was the optimization flags we were using were not good
> (is -Os and we changed to -O2 in gcc). The problem just happened again.

Optimization is usually not the root cause of a bug; but it can change
the behavior so that hidden bugs are revealed. Or possibly vice versa.

Basically, if there is a defect in the program, different optimization settings
change the conditions under which the defect manifests itself.

On architectures that enforce alignment, SIGBUS denotes alignment exceptions.

A SIGBUS also occurs in the situation that a memory mapping extends
beyond the end of a file and is accessed.
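
A minimal sketch of that mapping case, in Python for brevity: a child process maps one page of a scratch file, truncates the file, and then touches the page. The helper name run_truncated_mmap_child and the scratch file are made up for illustration. On Linux the child would normally be expected to die from SIGBUS, which subprocess reports as a negative returncode.

```python
import mmap
import signal
import subprocess
import sys
import tempfile

# Child program (illustrative): map one page, shrink the file, touch the page.
CHILD = r"""
import mmap, os, resource, sys
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))  # avoid leaving a core file
fd = os.open(sys.argv[1], os.O_RDWR)
m = mmap.mmap(fd, mmap.PAGESIZE)  # shared mapping of the whole one-page file
os.ftruncate(fd, 0)               # the mapping now extends past end of file
m[0]                              # reading the unbacked page faults
"""

def run_truncated_mmap_child():
    """Run the child and return its returncode (negative if killed by a signal)."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        proc = subprocess.run([sys.executable, "-c", CHILD, f.name])
        return proc.returncode

if __name__ == "__main__":
    rc = run_truncated_mmap_child()
    print("child returncode:", rc, "; -SIGBUS would be", -signal.SIGBUS)
```

No alignment problem is involved here: the faulting address is perfectly aligned, it just has no file data behind it any more.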

And of course, something can always raise SIGBUS by accident.

int uninited_var; /* contains garbage equivalent to SIGBUS */
raise(uninited_var);

> for the assembler code, I've even produced code which makes a bus error all
> the time). The problem is, as I understand it, the x86 architecture protects
> against this unless you specifically embed assembler to enable the trap
> (which we're not doing). Further more, the one core file I've examined which
> came from SIGBUS showed all pointers having DWORD aligned addresses (this was
> on a 32-bit Linux build).

So alignment is ruled out. If you can rule out mmap'ed files as being the
cause, then the next thing to investigate is various Linux-kernel-specific
SIGBUS situations.

A good way to proceed here is to hunt down kernel code that possibly generates
SIGBUS and add a printk and a dump_stack() call there.

Are you accessing memory-mapped hardware? The "fault" virtual function
in a "struct vm_operations" can return VM_FAULT_SIGBUS, which the kernel
will translate to a SIGBUS. Numerous places return this value.

Oh, and one situation in the kernel that looks like it can trigger a SIGBUS: when
a stack crashes into another memory mapping. Normally, the usual main thread
stack will not crash into anything when you have runaway recursion: it hits the
process limit, and that's a SIGSEGV. But it looks like there is code in Linux
which turns the crashing situation into a SIGBUS. See the do_anonymous_page
function in mm/memory.c, where it calls check_stack_guard_page.

Still, that does not give rise to easy theories about -Os versus -O2.

Jorgen Grahn

Jun 25, 2014, 6:41:27 PM
On Wed, 2014-06-25, Kaz Kylheku wrote:
> On 2014-06-25, Andrew Falanga <af30...@gmail.com> wrote:

[snip good stuff]

>> for the assembler code, I've even produced code which makes a bus error all
>> the time). The problem is, as I understand it, the x86 architecture protects
>> against this unless you specifically embed assembler to enable the trap
>> (which we're not doing). Further more, the one core file I've examined which
>> came from SIGBUS showed all pointers having DWORD aligned addresses (this was
>> on a 32-bit Linux build).
>
> So alignment is ruled out. If you can rule out mmap'ed files as being the
> cause, then the next thing to investigate is various Linux-kernel-specific
> SIGBUS situations.
>
> A good way to proceed here is to hunt down kernel code that possibly generates
> SIGBUS and add a printk and a dump_stack() call there.
>
> Are you accessing memory-mapped hardware? The "fault" virtual function
> in a "struct vm_operations" can return VM_FAULT_SIGBUS, which the kernel
> will translate to a SIGBUS. Numerous places return this value.
>
> Oh, and one situation in the kernel that looks like it can trigger a SIGBUS: when
> a stack crashes into another memory mapping. Normally, the usual main thread
> stack will not crash into anything when you have runaway recursion: it hits the
> process limit, and that's a SIGSEGV. But it looks like there is code in Linux
> which turns the crashing situation into a SIGBUS. See the do_anonymous_page
> function in mm/memory.c, where it calls check_stack_guard_page.
>
> Still, that does not give rise to easy theories about -Os versus -O2.

The good news is that the situations you list should be fairly easy to
spot when debugging the core dump. Write down the memory mappings,
then check stack pointer and other registers, looking for one that's
close to a border. (Simplified.)
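
A rough sketch of that check, assuming the mappings have been saved as text in /proc/<pid>/maps format. The sample mappings, the register value, and the helper names parse_maps and nearest_border are all hypothetical, made up for illustration.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps-style lines into (start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        name = fields[5] if len(fields) > 5 else "[anon]"
        regions.append((start, end, name))
    return regions

def nearest_border(regions, addr):
    """Return (distance, 'start' or 'end', name) for the mapping edge closest to addr."""
    candidates = []
    for start, end, name in regions:
        candidates.append((abs(addr - start), "start", name))
        candidates.append((abs(addr - end), "end", name))
    return min(candidates)

# Hypothetical mappings and stack pointer, not taken from any real core dump.
SAMPLE = """\
08048000-08052000 r-xp 00000000 08:01 1234 /usr/bin/example
b7400000-b7600000 rw-p 00000000 00:00 0
bf800000-bf821000 rw-p 00000000 00:00 0 [stack]
"""
esp = 0xbf800010
print(nearest_border(parse_maps(SAMPLE), esp))
```

A register that sits only a few bytes from the start of the stack mapping, as in this invented sample, would be a hint that the stack is about to run into whatever lies below it.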

Andrew Falanga

Jun 25, 2014, 7:06:06 PM
On Wednesday, June 25, 2014 3:49:37 PM UTC-6, Jorgen Grahn wrote:
> On Wed, 2014-06-25, Andrew Falanga wrote:
> > Hi everyone,
> >
> > Some code that my team produced generates SIGBUS occasionally (not
> > at all at regular intervals). Since our release code is optimized, I
> > decided to deliver an unoptimized build to the customer for a core
> > file which would be easier to diagnose (or so was the theory).
>
> I guess you know you can have optimization /and/ debug info with gcc?
> E.g. 'gcc -O3 -g ...'. (Of course, optimization will still make the
> result harder to interpret, with values optimized away, functions
> inlined and so on.)

I know this now (and not simply because you've posted it). I ran into this while poring over the gcc manual to learn more about the optimization and debug settings.


> IIRC, you can also recompile while adding -g, and that executable will
> be compatible with the core dump generated by the stripped executable.

You recall correctly. When I learned that I could embed gdb-friendly symbols into an optimized build, it was a short leap in thought to realize I could simply rebuild the code, as done for release, with -ggdb added. The result was quite favorable.

> True; it's unusually tolerant. A coworker often laments the fact that
> we abandoned SPARC, because it's unusually intolerant ...


That explains why I first ran into SIGBUS on sparc then.

> [snip]
>
> I don't know enough offhand to comment on the rest. You seem to be
> asking the right questions ... but you also have the core dump: it's
> showing you the instructions going on when the SIGBUS happened, along
> with the registers. The answer is in there somewhere (but sadly you
> need to more or less understand some x86 assembly).

To say the least, this is intimidating. I do not yet know assembly. Looks like I'm in for a crash course. Do you have any links that are *good* for a very green assembly programmer?

Andy

Andrew Falanga

Jun 25, 2014, 7:25:03 PM
On Wednesday, June 25, 2014 4:08:32 PM UTC-6, Kaz Kylheku wrote:

> Optimization is usually not the root cause of a bug; but it can change
> the behavior so that hidden bugs are revealed. Or possibly vice versa.

Yes, this seems to make sense. While investigating the optimization path, I ran into an interesting discrepancy. More about this below ....

> And of course, something can always raise SIGBUS by accident.

Yikes but I guess these things happen.

> So alignment is ruled out. If you can rule out mmap'ed files as being the
> cause, then the next thing to investigate is various Linux-kernel-specific
> SIGBUS situations.
>
> A good way to proceed here is to hunt down kernel code that possibly generates
> SIGBUS and add a printk and a dump_stack() call there.

These are good suggestions. I'll be digging into the kernel code then. One thing I'm curious about: the signal(7) man page states that the numerical values are positive, but the codes being delivered are negative. That is, our code is called by a "master control program" (TRON flashbacks), which stores the return value of Python (our library is C++ exposed to python). The process is delivered SIGBUS, which is defined as 7, but the MCP receives -7. Why is this?

> Are you accessing memory-mapped hardware?

To my knowledge no. However, I'm 1 of 6 developers on this team and I don't know every last corner of the code. I'll check to make sure, but am reasonably certain we're not doing any mmap'ing in the code base.

> The "fault" virtual function
> in a "struct vm_operations" can return VM_FAULT_SIGBUS, which the kernel
> will translate to a SIGBUS. Numerous places return this value.


I'm not familiar with this at all. Is this something I should look for through our code? Or are you mentioning something from within the kernel?

> Oh, and one situation in the kernel that looks like it can trigger a SIGBUS: when
> a stack crashes into another memory mapping. Normally, the usual main thread
> stack will not crash into anything when you have runaway recursion: it hits the
> process limit, and that's a SIGSEGV. But it looks like there is code in Linux
> which turns the crashing situation into a SIGBUS. See the do_anonymous_page
> function in mm/memory.c, where it calls check_stack_guard_page.
>
> Still, that does not give rise to easy theories about -Os versus -O2.

So, I found that what the gcc docs say is happening between -Os and -O2 isn't what is happening with this compiler. CentOS 6.5 is using gcc 4.4.7, which is the latest release of 4.4. I filed a bug to GCC Bugzilla, after posting the question to gcc-help. For the full breakdown, the bug id is 61588. Essentially, there's virtually no difference on this compiler between -Os and -O2 with one exception.

Andy

James K. Lowden

Jun 25, 2014, 10:24:12 PM
On Wed, 25 Jun 2014 16:25:03 -0700 (PDT)
Andrew Falanga <af30...@gmail.com> wrote:

> One thing I'm curious about: the signal(7) man page states that the
> numerical values are positive, but the codes being delivered are
> negative. That is, our code is called by a "master control
> program" (TRON flashbacks), which stores the return value of Python
> (our library is C++ exposed to python). The process is delivered
> SIGBUS, which is defined as 7, but the MCP receives -7. Why is this?

It's a feature of your friendly shell. From the bash manual:

SHELL GRAMMAR
Simple Commands
A simple command is a sequence of optional variable
assignments followed by blank-separated words and redirections, and
terminated by a control operator. The first word specifies the
command to be executed, and is passed as argument zero. The
remaining words are passed as arguments to the invoked command.

The return value of a simple command is its exit status, or
128+n if the command is terminated by signal n.

So your 0x07 becomes 0x87, ne plus ultra negative. ;-)
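
For comparison, the shell convention quoted above can be observed directly. A rough sketch in Python, assuming a POSIX sh whose kill builtin accepts "-s BUS": an inner shell kills itself with SIGBUS, and the outer shell exits with the resulting $?. The status seen this way is a positive number (128+n), so it does not look like the -7 described in the question.

```python
import signal
import subprocess

# The inner shell kills itself with SIGBUS; the outer shell then exits
# with its own view of that termination, which is the value of $?.
script = "sh -c 'kill -s BUS $$'; exit $?"
outer = subprocess.run(["sh", "-c", script], stderr=subprocess.DEVNULL)

shell_status = outer.returncode
if shell_status > 128:
    # Shell convention: 128 + signal number for a signal-terminated command.
    print("shell reported signal %d" % (shell_status - 128))
else:
    print("shell reported exit status %d" % shell_status)
```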

--jkl

Jorgen Grahn

Jun 26, 2014, 2:15:04 AM
On Wed, 2014-06-25, Andrew Falanga wrote:
> On Wednesday, June 25, 2014 3:49:37 PM UTC-6, Jorgen Grahn wrote:
...
>> I don't know enough offhand to comment on the rest. You seem to be
>> asking the right questions ... but you also have the core dump: it's
>> showing you the instructions going on when the SIGBUS happened, along
>> with the registers. The answer is in there somewhere (but sadly you
>> need to more or less understand some x86 assembly).

> To say the least, this is intimidating.

Yeah, but you're looking at a customer core dump. The truth is in
there somewhere, and there's a lot to gain by looking at the assembly.

> I do not yet know assembly.
> Looks like I'm in for a crash course. Do you have any links that are
> *good* for a very green assembly programmer?

Sadly no. I don't know x86 assembly myself. I know MC68000 assembly
from way back, so I think that plus googling (Wikipedia?) would be
barely enough for me.

E.g. looking at an instruction like

mov %rax,0x20(%rsp)

plus a register dump, you should be able to tell what it does[1]. And
you can practice on other object files and executables, using objdump
-dl. Source line numbers are nice to have -- better than intermingled
C source code IMO.

Also, perhaps a coworker can help you? Even if she hasn't worked with
x86, she may have taken a course in low-level programming so that she
understands the basics.

/Jorgen

[1] It either copies register RAX to offset 32 on the stack, or the other
way around. There are two competing syntaxes for assembly, and I
forget if objdump shows source-first or destination-first.

Rainer Weikusat

Jun 26, 2014, 5:24:45 AM
Andrew Falanga <af30...@gmail.com> writes:
> On Wednesday, June 25, 2014 3:49:37 PM UTC-6, Jorgen Grahn wrote:
>> On Wed, 2014-06-25, Andrew Falanga wrote:
>> > Some code that my team produced generates SIGBUS occasionally (not
>> > at all at regular intervals).

[...]

>> showing you the instructions going on when the SIGBUS happened, along
>> with the registers. The answer is in there somewhere (but sadly you
>> need to more or less understand some x86 assembly).
>
> To say the least, this is intimidating. I do not yet know assembly.
> Looks like I'm in for a crash course. Do you have any links that are
> *good* for a very green assembly programmer?

Using the core dump and a debugger should give you a stack
backtrace. This is possibly not that helpful, eg, writing
through a dangling pointer may result in a visible error in a completely
different portion of the program, but it will at least provide some
useful information.

idol programmer

Jun 26, 2014, 7:17:00 AM
Have you tried running this under valgrind or a similar memory checking
tool?

Whenever I've got a SIGBUS it's been due to corrupting memory / stack by
trying to store 100 bytes in a 50 byte field. If you're lucky it'll
crash pretty shortly afterwards, if you're unlucky it'll be randomly
later on.

The optimized code is probably more sensitive to the effects of the bug.

We always run with debug versions of our code in production
(unstripped) because the benefits when something goes wrong far outweigh
the negatives. But your mileage may vary of course.

Nobody

Jun 26, 2014, 7:42:26 AM
On Wed, 25 Jun 2014 16:25:03 -0700, Andrew Falanga wrote:

> These are good suggestions. I'll be digging into the kernel code then.
> One thing I'm curious about: the signal(7) man page states that the
> numerical values are positive, but the codes being delivered are negative.
> That is, our code is called by a "master control program" (TRON
> flashbacks), which stores the return value of Python (our library is C++
> exposed to python). The process is delivered SIGBUS, which is defined as
> 7, but the MCP receives -7. Why is this?

How does the MCP obtain the termination status?

If the MCP is also written in Python, then note that the .returncode field
of a subprocess.Popen object is the process' exit status (the value
passed to exit() or returned from main()) if the process terminated
normally, or the negation of the signal number if the process terminated
due to a signal:

    if _WIFSIGNALED(sts):
        self.returncode = -_WTERMSIG(sts)
    elif _WIFEXITED(sts):
        self.returncode = _WEXITSTATUS(sts)

The intention is probably that this format (non-negative for exit code,
negative for a signal) is more portable and easier to decode than the raw
status returned from wait() etc. E.g.

    if p.returncode < 0:
        print("process terminated on signal %d" % abs(p.returncode))
    else:
        print("process terminated with status %d" % p.returncode)
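
A self-contained version of the same idea, as a sketch: the child below is a stand-in that simply sends SIGBUS to itself (it is not the crashing library from the original post), and the helper name describe is made up for illustration.

```python
import signal
import subprocess
import sys

def describe(returncode):
    """Turn a Popen-style returncode into a readable string."""
    if returncode < 0:
        # Negative values are the negated number of the terminating signal.
        return "killed by %s" % signal.Signals(-returncode).name
    return "exited with status %d" % returncode

# Hypothetical child: raises SIGBUS against itself and dies from it.
child = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"]
result = subprocess.run(child)
print(describe(result.returncode))
```

Mapping the number back to a name with signal.Signals avoids hard-coding 7, since signal numbers differ between platforms.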

Noob

Jun 26, 2014, 8:16:54 AM
On 25/06/2014 23:49, Jorgen Grahn wrote:

> I guess you know you can have optimization /and/ debug info with gcc?
> E.g. 'gcc -O3 -g ...'. (Of course, optimization will still make the
> result harder to interpret, with values optimized away, functions
> inlined and so on.)

It is worth noting that gcc has recently added the -Og optimization
flag, which means:

Optimize debugging experience. -Og enables optimizations that do not
interfere with debugging. It should be the optimization level of
choice for the standard edit-compile-debug cycle, offering a
reasonable level of optimization while maintaining fast compilation
and a good debugging experience.

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-Og-801

It is also worth noting that, while -O2 almost always produces faster
code than -O1, it is not rare for -O3 to produce slower code than -O2.
(And -O3 is more likely to tickle compiler bugs.)

Regards.

Richard Kettlewell

Jun 26, 2014, 9:54:56 AM
Andrew Falanga <af30...@gmail.com> writes:
> I've read through some posts to this group and Stack Overflow and a
> myriad of others and most say the same things, that a bus error comes
> from dereferencing an unaligned pointer. (with the help of a
> wikipedia article for the assembler code, I've even produced code
> which makes a bus error all the time). The problem is, as I
> understand it, the x86 architecture protects against this unless you
> specifically embed assembler to enable the trap (which we're not
> doing). Further more, the one core file I've examined which came from
> SIGBUS showed all pointers having DWORD aligned addresses (this was on
> a 32-bit Linux build).

There are a few things in x86 that are more sensitive to
alignment, e.g. MOVAPS, which requires that the memory operand have
16-byte alignment.

I don’t believe GCC will generate these by default on a 32-bit Linux
build, but if this sounds at all plausible as a line of inquiry you
might want to check the following things:
- whether you use any compiler options that enable additional
instructions
- whether you use any assembly language
- whether any of the above is true of any libraries your program
uses

--
http://www.greenend.org.uk/rjk/

Rainer Weikusat

Jun 26, 2014, 10:25:17 AM
Richard Kettlewell <r...@greenend.org.uk> writes:
> Andrew Falanga <af30...@gmail.com> writes:
>> I've read through some posts to this group and Stack Overflow and a
>> myriad of others and most say the same things, that a bus error comes
>> from dereferencing an unaligned pointer. (with the help of a
>> wikipedia article for the assembler code, I've even produced code
>> which makes a bus error all the time). The problem is, as I
>> understand it, the x86 architecture protects against this unless you
>> specifically embed assembler to enable the trap (which we're not
>> doing). Further more, the one core file I've examined which came from
>> SIGBUS showed all pointers having DWORD aligned addresses (this was on
>> a 32-bit Linux build).
>
> There are a few things in x86 that are more sensitive to
> alignment, e.g. MOVAPS, which requires that the memory operand have
> 16-byte alignment.
>
> I don't believe GCC will generate these by default on a 32-bit Linux
> build, but if this sounds at all plausible as a line of inquiry you
> might want to check the following things:
> - whether you use any compiler options that enable additional
> instructions
> - whether you use any assembly language
> - whether any of the above is true of any libraries your program
> uses

The problem with this is that guessing at possible causes of a problem is
extremely unlikely to yield anything useful, as the number of possible
guesses is essentially infinite. In case a core dump exists (as someone
wrote), it can - in combination with a debugger and a binary with
debugging symbols - be used to determine the immediate cause of the
SIGBUS, because it will usually pinpoint the line of source code
which 'caused' it and provide a stack backtrace showing how the program got
there. At worst, this will (for this case) require looking at the actual
machine code, but problems possibly caused by that can be solved, either
by using publicly available documentation or by asking for specific
help.

In case there's no core dump, there's a chance that the kernel logged
the address of the faulting instruction (at least Linux usually does so,
=> dmesg/ /var/log/kern.log) and a disassembler (eg, objdump -d) can
then be used to locate the actual code.

Rainer Weikusat

Jun 26, 2014, 10:38:12 AM
Jorgen Grahn <grahn...@snipabacken.se> writes:

[...]

> Of course, optimization will still make the result harder to
> interpret, with values optimized away, functions inlined and so on.

Using the various 'automatic inlining' options available for gcc because
of some misguided idea of the deadly harmfulness of 'the function call
overhead' is basically akin to handing the compiler a wooden mallet,

http://www.photo-dictionary.com/photofiles/list/2114/2765mallet.jpg

and asking it to "hit me with this until I become wiser".

Andrew Falanga

Jun 26, 2014, 12:17:37 PM
On Wednesday, June 25, 2014 8:24:12 PM UTC-6, James K. Lowden wrote:
> On Wed, 25 Jun 2014 16:25:03 -0700 (PDT)
>
Wow, thank you very much. That makes it all make sense now.

Andrew Falanga

Jun 26, 2014, 12:29:17 PM
On Thursday, June 26, 2014 6:16:54 AM UTC-6, Noob wrote:
> On 25/06/2014 23:49, Jorgen Grahn wrote:
>
> > I guess you know you can have optimization /and/ debug info with gcc?
> > E.g. 'gcc -O3 -g ...'. (Of course, optimization will still make the
> > result harder to interpret, with values optimized away, functions
> > inlined and so on.)
>
> It is worth noting that gcc has recently added the -Og optimization
> flag, which means:
>
> Optimize debugging experience. -Og enables optimizations that do not
> interfere with debugging. It should be the optimization level of

I agree it is worth noting. However, the system compiler of CentOS 6, gcc 4.4.x, does not support this. :-(

Andrew Falanga

Jun 26, 2014, 12:32:47 PM
On Thursday, June 26, 2014 5:42:26 AM UTC-6, Nobody wrote:
> On Wed, 25 Jun 2014 16:25:03 -0700, Andrew Falanga wrote:
>
> > These are good suggestions. I'll be digging into the kernel code then.
> > One thing I'm curious about: the signal(7) man page states that the
> > numerical values are positive, but the codes being delivered are negative.
> > That is, our code is called by a "master control program" (TRON
> > flashbacks), which stores the return value of Python (our library is C++
> > exposed to python). The process is delivered SIGBUS, which is defined as
> > 7, but the MCP receives -7. Why is this?
>
> How does the MCP obtain the termination status?
>
> If the MCP is also written in Python, then note that the .returncode field
> of a subprocess.Popen object is the process' exit status (the value
> passed to exit() or returned from main()) if the process terminated
> normally, or the negation of the signal number if the process terminated
> due to a signal:

The MCP is Python also. I didn't write this, though, so I don't know what they're looking at when they print the negative value to the log. However, I do know the process died with the signal. The core dump told me so.

Andy