Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

stack smashing detected

13 views
Skip to first unread message

Stan Johnson

unread,
Jan 29, 2023, 11:10:03 PM1/29/23
to
Hello,

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack
smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for
testing if necessary).

I'm using the current Debian SID and Debian kernel, and I've confirmed
the errors on a Mac IIci and SE/30. I haven't seen the errors on any
68040 system (I only tested on a Centris 650 and PowerBook 550c). I also
notice the errors on 68030 systems using custom kernels that I have
cross-compiled using GCC 12 or GCC 10 on a x86_64 system running Debian
SID; however, I do not see the errors as often if I cross-compile using
GCC 8.3.0 on a 686 system (running Debian 10.7 Buster) -- I saw the
errors a few weeks ago with an earlier kernel, but none today using
Linux 6.1.8 cross-compiled with GCC 8.3.0.

I'll be happy to help debug or troubleshoot, though at this point, since
the "stack smashing detected" errors aren't reporting which processes
are being terminated/aborted, I'm not sure where to start.

thanks for any suggestions

-Stan Johnson use...@yahoo.com

Geert Uytterhoeven

unread,
Jan 30, 2023, 5:20:04 AM1/30/23
to
CC linux-m68k

Michael Schmitz

unread,
Jan 30, 2023, 10:10:03 PM1/30/23
to
Hi Stan,
The man page of init states that init logs process and reason for
termination in /var/run/utmp and /var/log/wtmp each time a child process
terminates. You're looking for processed terminated by SIGABRT as far as
I can see.

There does not appear to be any tool to extract that information from
utmp/wtmp files though - utmpdump only shows login process information
for me, nothing on init processes.

Another way may be logging the start of each of the rcS.d or rc2.d
scripts until you know what scripts to look at in more detail, then
adding 'set -v' at the start of those to log every command in the
offending script.

Once the offending binary is known (and the crash can be reproduced
after system boot), gdb can be used to find the function that overwrote
its local stack guard.

That's a lot of work on a 030 Mac - have you tried to reproduce this on
any kind of emulator?

I suppose one difference between your 030 and 040 Macs might be the
amount of RAM available. I wonder if this bug results from a combination
of 030 MMU and memory pressure, or 030 MMU only.

Cheers,

Michael

Eero Tamminen

unread,
Jan 31, 2023, 2:30:03 PM1/31/23
to
Hi,

On 31.1.2023 5.05, Michael Schmitz wrote:
> That's a lot of work on a 030 Mac - have you tried to reproduce this on
> any kind of emulator?
>
> I suppose one difference between your 030 and 040 Macs might be the
> amount of RAM available. I wonder if this bug results from a combination
> of 030 MMU and memory pressure, or 030 MMU only.

As to emulation... 030 and 040 MMUs differ a lot.

E.g. Aranym emulates only the (simpler) 040 MMU, and does not emulate
CPU cache. Qemu does not have any cache emulation either.


WinAUE (and Hatari & Previous) emulate both 030 MMU and CPU cache.

Using Hatari with Linux is documented here:
https://hatari.tuxfamily.org/doc/m68k-linux.txt


AFAIK 030 MMU + cache emulation works well enough in WinUAE (Amiga
emulator) to run Linux kernel + user-space, but with Hatari (Atari
emulator), you need to disable cache emulation to have working
user-space with m68k linux.


- Eero

Michael Schmitz

unread,
Feb 1, 2023, 2:00:04 PM2/1/23
to
Hi Stan,

On 2/02/23 05:38, Stan Johnson wrote:
> On 1/30/23 8:05 PM, Michael Schmitz wrote:
>> ...
>> Am 30.01.2023 um 17:00 schrieb Stan Johnson:
>>> Hello,
>>>
>>> I am seeing anywhere from zero to four of the following errors while
>>> booting Linux on 68030 systems and using sysvinit startup scripts:
>>>
>>> *** stack smashing detected ***: terminated
>>> Aborted
>>>
>>> I usually (but not always) see three of the errors while init is running
>>> the rcS.d scripts, and one while running the rc2.d scripts. The stack
>>> smashing messages appear only on the system console (nothing is logged
>>> in an error log or dmesg). Despite the errors, the system continues
>>> booting to multiuser mode without any obvious additional problems. I
>>> haven't tested systemd, which is too slow to be useful on my m68k
>>> systems (though I have a Debian SID with systemd that I can restore for
>>> testing if necessary).
>>>
>>> ...
>> Another way may be logging the start of each of the rcS.d or rc2.d
>> scripts until you know what scripts to look at in more detail, then
>> adding 'set -v' at the start of those to log every command in the
>> offending script.
> Hi Michael,
>
> Thanks for your reply.
>
> After logging the start and end of each script, I see that the "stack
> smashing detected" error often happens while running
> "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to
> isolate it to a particular command.
>
> This may be a coincidence, but the error seems to happen (up to about 4
> times) after a warm boot into Mac OS 7.5.5 and a subsequent boot into
> Linux that when starting with a cold boot into Mac OS 7.5.5, but it
> doesn't seem that that should make any difference for Linux. This
> morning, after a cold boot, I saw two of the errors, while after a warm
> boot, I saw four.
Hmm - that might well indicate a hardware issue rather than software.
Bits flipping at random in RAM (and getting picked up because the stack
canary changes).
>
>> Once the offending binary is known (and the crash can be reproduced
>> after system boot), gdb can be used to find the function that overwrote
>> its local stack guard.
> Is there a way to configure the kernel to use the stack guard for every
> function, and then log every resulting abort? I realize that that would
> be very slow, but it might also be useful for debugging.

The stack canary mechanism pushes a token on the stack at function
entry, and compares against that token's value at function exit. This is
all code generated by gcc in the user binary.

The kernel is not involved in function calls other than syscalls. For
syscalls, we could try to find the user mode stack, and do a similar
canary trick, but I don't think that would be necessary for all
syscalls. Might be easier to instrument copy_to_user() instead if you're
worried about a syscall receiving result data that way to a variable on
the stack.

But since we're touching on copy_to_user() here - the 'remove set_fs'
patch set by Christoph Hellwig refactored the m68k inline helpers around
July 2021. Can you test a kernel prior to those patches (5.15-rc2)?

>
>> That's a lot of work on a 030 Mac - have you tried to reproduce this on
>> any kind of emulator?
> I haven't seen the error in QEMU.
>
>> I suppose one difference between your 030 and 040 Macs might be the
>> amount of RAM available. I wonder if this bug results from a combination
>> of 030 MMU and memory pressure, or 030 MMU only.
> For some reason, the error seems to happen only with 68030 systems,
> regardless of processor speed or memory:
>
> PB 170 : 68030, 25 MHz, 8 MiB, external SCSI2SD
> Mac IIci : 68030, 25 MHz, 80 MiB, internal SCSI2SD
> SE/30 : 68030, 16 MHz, 128 MiB, external SCSI2SD
> PB 550c : 68040, 33 MHz, 36 MiB, external SCSI2SD
> Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD

Exception handling in copy_to_user() and the related bits in 030 page
fault handling might need another look in then...

Cheers,

    Michael



>
> -Stan

Geert Uytterhoeven

unread,
Feb 2, 2023, 3:00:04 AM2/2/23
to
Hi Michael et al,
You can enable extra debugging options in the kernel, which might help
detecting memory corruption, like CONFIG_DEBUG_LIST and DEBUG_SLAB.
It will slow down your kernel, and make it grow too large, though.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Michael Schmitz

unread,
Feb 2, 2023, 7:20:03 PM2/2/23
to
Hi Stan,

Am 03.02.2023 um 12:16 schrieb Stan Johnson:
> On 2/1/23 11:51 AM, Michael Schmitz wrote:
>> ...
>>
>> The stack canary mechanism pushes a token on the stack at function
>> entry, and compares against that token's value at function exit. This is
>> all code generated by gcc in the user binary.
>>
>> The kernel is not involved in function calls other than syscalls. For
>> syscalls, we could try to find the user mode stack, and do a similar
>> canary trick, but I don't think that would be necessary for all
>> syscalls. Might be easier to instrument copy_to_user() instead if you're
>> worried about a syscall receiving result data that way to a variable on
>> the stack.
>>
>> But since we're touching on copy_to_user() here - the 'remove set_fs'
>> patch set by Christoph Hellwig refactored the m68k inline helpers around
>> July 2021. Can you test a kernel prior to those patches (5.15-rc2)?
>> ...
> I updated my m68k rootfs in QEMU to the latest Debian SID and installed
> the updated rootfs on my Mac IIci. And it's now restoring to my SE/30,
> so I'll be able to test on that system again eventually.
>
> I tested 5.15.0-rc3, 5.15.0-rc2 and 5.15.0-rc1, with inconclusive
> results on the IIci (3 tests per kernel; e.g. "0, 1, 5" in
> "Stack-Smashing" indicates that test 1 had zero smashing errors, test 2
> had 1, and test 3 had 5):
>
> Kernel Stack-Smashing
> 5.15.0-rc3 0, 1, 5
> 5.15.0-rc2 2, 3, 2
> 5.15.0-rc1 7, 1, 3

I think we can rule out those changes to usermode copy routines now.

Good idea to test on some other hardware. I'd also consider Geert's
suggestion to enable kernel level debugging for the kernel's memory
allocators. That does look a lot easier than debugging usermode copy.

Cheers,

Michael


> I saved the serial console logs, but they're probably not too useful at
> this point. Next, I'll confirm a similar failure on the SE/30, then I'll
> check 4.0 and 5.0 (on the IIci) to see whether the issue started in that
> range, then use git bisect in an attempt to isolate the issue further.
> I'll also do a spot check on a different IIci to lessen the chance that
> the issue is being caused by a hardware problem. If it's not caused by a
> kernel bug, the next step will be to isolate the specific offending
> executable(s) in the sysvinit scripts.
>
> Or I could test using systemd instead of sysvinit; that would take
> longer but it might be worthwhile if it doesn't look like a kernel bug.
>
> -Stan
>

Finn Thain

unread,
Feb 3, 2023, 7:00:03 PM2/3/23
to
On Wed, 1 Feb 2023, Stan Johnson wrote:

>
> After logging the start and end of each script, I see that the "stack
> smashing detected" error often happens while running
> "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to
> isolate it to a particular command.
>

That brings to mind some other unresolved initscript failures, also
involving 68030, which were accompanied by "page allocation failure".

But it's hard to blame "stack smashing detected" on a page allocation
failure since the latter always produces a very noisy splat in the kernel
messages.

Have you reproduced the error with Debian's kernel package?

If not, please refer to private correspondence from me dated 10 December
2022 regarding setting up /etc/initramfs-tools so as to produce a suitably
small initramfs.

To save time, I recommend using QEMU and an up-to-date Debian/m68k SID
virtual machine to produce the vmlinux and initrd files needed for use
with Penguin on your slower machines.

Michael Schmitz

unread,
Feb 5, 2023, 5:30:03 PM2/5/23
to
Hi Stan,

Am 02.02.2023 um 07:51 schrieb Michael Schmitz:
>
> But since we're touching on copy_to_user() here - the 'remove set_fs'
> patch set by Christoph Hellwig refactored the m68k inline helpers around
> July 2021. Can you test a kernel prior to those patches (5.15-rc2)?
>
>>
>>> That's a lot of work on a 030 Mac - have you tried to reproduce this on
>>> any kind of emulator?
>> I haven't seen the error in QEMU.
>>
>>> I suppose one difference between your 030 and 040 Macs might be the
>>> amount of RAM available. I wonder if this bug results from a combination
>>> of 030 MMU and memory pressure, or 030 MMU only.
>> For some reason, the error seems to happen only with 68030 systems,
>> regardless of processor speed or memory:
>>
>> PB 170 : 68030, 25 MHz, 8 MiB, external SCSI2SD
>> Mac IIci : 68030, 25 MHz, 80 MiB, internal SCSI2SD
>> SE/30 : 68030, 16 MHz, 128 MiB, external SCSI2SD
>> PB 550c : 68040, 33 MHz, 36 MiB, external SCSI2SD
>> Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD
>
> Exception handling in copy_to_user() and the related bits in 030 page
> fault handling might need another look in then...

Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
his task corruption troubles on 040, I just noticed that I probably
misunderstood how Al's patch works.

Botching up a fault retry and carrying on may well leave the page tables
in a state where some later access could go to the wrong page and
manifest as user space corruption. Could you try Al's patch 4 (m68k: fix
livelock in uaccess) to see if this helps?

Cheers,

Michael

Finn Thain

unread,
Feb 5, 2023, 5:50:03 PM2/5/23
to
On Sun, 5 Feb 2023, Stan Johnson wrote:

> >
> > To save time, I recommend using QEMU and an up-to-date Debian/m68k SID
> > virtual machine to produce the vmlinux and initrd files needed for use
> > with Penguin on your slower machines.
> >
>
> AFAIK, the initrd must be created on the system that is using the initrd
> (or an identical system, at least that was the response I received a
> while back when I was unsuccessfully trying to use the initrd that ships
> on the Debian install CD).

If that was true, how could Debian ship a single initrd that works on
Atari, Amiga, Mac etc.? They could not.

If you boot a random Debian/m68k rootfs with a random Debian/m68k
vmlinux/initrd combination, and if you need a kernal module at some point,
then your random rootfs must contain the modules that match your random
kernel binary (probably not going to work).

If you never need to load a module, perhaps because the important ones all
got loaded from the initrd, then all you need is a valid vmlinux/initrd
combination and the rootfs is not relevant.

When you need to generate a valid Debian/m68k vmlinux/initrd combination
that is also current, then you'll need a Debian/m68k system that is
current. The quickest way to get it is an emulator.

When you need a small initrd, because of RAM limitations, you'll need to
customize your initrd for your hardware using /etc/initramfs-tools (as
mentioned).

If you succeed, you'll get an initrd that is missing all the modules for
all the hardware that you don't own, which saves RAM. It doesn't matter
what system generates that initrd (could be a build farm).

John Paul Adrian Glaubitz

unread,
Feb 5, 2023, 6:20:03 PM2/5/23
to
On Mon, 2023-02-06 at 09:29 +1100, Finn Thain wrote:
> On Sun, 5 Feb 2023, Stan Johnson wrote:
>
> > >
> > > To save time, I recommend using QEMU and an up-to-date Debian/m68k SID
> > > virtual machine to produce the vmlinux and initrd files needed for use
> > > with Penguin on your slower machines.
> > >
> >
> > AFAIK, the initrd must be created on the system that is using the initrd
> > (or an identical system, at least that was the response I received a
> > while back when I was unsuccessfully trying to use the initrd that ships
> > on the Debian install CD).
>
> If that was true, how could Debian ship a single initrd that works on
> Atari, Amiga, Mac etc.? They could not.

You cannot use the initrd that's on the debian-installer medium because it
boots directly into the installer. It's not a regular initrd.

That's completely independent of the architecture/sub-architecture in use.

> When you need to generate a valid Debian/m68k vmlinux/initrd combination
> that is also current, then you'll need a Debian/m68k system that is
> current. The quickest way to get it is an emulator.

Correct.

Adrian


--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer
`. `' Physicist
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

Geert Uytterhoeven

unread,
Feb 6, 2023, 3:00:04 AM2/6/23
to
Hi Stan,

On Mon, Feb 6, 2023 at 4:42 AM Stan Johnson <use...@yahoo.com> wrote:
> On an SE/30 with 128 MiB memory, the latest Debian SID kernel
> (vmlinux-6.1.0-2-m68k), using Debian SID modules, and with
> initrd-6.1.0-2-m68k built on the SE/30, hangs after the initial
> "ABCFGHIJK" (I tried it twice).

If you get to "K", you're almost at the end of arch/m68k/kernel/head.S,
and it is very likely the kernel C-code actually started.
Do you get any output using "debug earlyprintk" on the kernel command
line?

Geert Uytterhoeven

unread,
Feb 6, 2023, 3:40:04 PM2/6/23
to
Hi Stan,

On Mon, Feb 6, 2023 at 9:31 PM Stan Johnson <use...@yahoo.com> wrote:
> On 2/6/23 12:52 AM, Geert Uytterhoeven wrote:
> > On Mon, Feb 6, 2023 at 4:42 AM Stan Johnson <use...@yahoo.com> wrote:
> >> On an SE/30 with 128 MiB memory, the latest Debian SID kernel
> >> (vmlinux-6.1.0-2-m68k), using Debian SID modules, and with
> >> initrd-6.1.0-2-m68k built on the SE/30, hangs after the initial
> >> "ABCFGHIJK" (I tried it twice).
> >
> > If you get to "K", you're almost at the end of arch/m68k/kernel/head.S,
> > and it is very likely the kernel C-code actually started.
> > Do you get any output using "debug earlyprintk" on the kernel command
> > line?
> > ...
>
> Please see the attached serial console log, which all happened within an
> hour after I set the Penguin loose (I have Penguin-19 configured to use
> 32768 K memory). There was nothing more after two additional hours.

Thanks, so the kernel does start, but hangs later.
Adding "initcall_debug" to the kernel command line may reveal more..

Finn Thain

unread,
Feb 6, 2023, 10:40:03 PM2/6/23
to

On Mon, 6 Feb 2023, Stan Johnson wrote:

> > Thanks, so the kernel does start, but hangs later.
> > Adding "initcall_debug" to the kernel command line may reveal more..
> > ...
>
> Please see attached.
>
> ...
>
> [ 34.440000] calling key_proc_init+0x0/0x5e @ 1
> [ 34.470000] initcall key_proc_init+0x0/0x5e returned 0 after 1307 usecs
> [ 34.500000] calling asymmetric_key_init+0x0/0x10 @ 1
> [ 34.520000] Key type asymmetric registered
> [ 34.540000] initcall asymmetric_key_init+0x0/0x10 returned 0 after 22481 usecs
> [ 34.570000] calling x509_key_init+0x0/0x16 @ 1
> [ 34.580000] Asymmetric key parser 'x509' registered
> [ 34.600000] initcall x509_key_init+0x0/0x16 returned 0 after 14116 usecs
> [ 34.620000] calling crypto_kdf108_init+0x0/0x13a @ 1
>

You could try 'initcall_blacklist=key_proc_init' or
initcall_blacklist=x509_key_init' etc.

This kind of thing has come up before.
https://lists.debian.org/debian-68k/2019/06/msg00020.html
https://lists.debian.org/debian-68k/2019/06/msg00066.html

These systems are too slow for needless key generation so a bug report
may be needed.

Brad Boyer

unread,
Feb 7, 2023, 3:10:03 PM2/7/23
to
On Tue, Feb 07, 2023 at 12:31:17AM -0700, Stan Johnson wrote:
> On 2/6/23 8:25 PM, Finn Thain wrote:
> >
> > These systems are too slow for needless key generation so a bug report
> > may be needed.
> >
>
> The Mac IIci (25 MHz) is only about 50% faster that the SE/30 (16 MHz).
> The Debian kernel booted on the IIci, though it took somewhere between
> 30 and 60 minutes. If it were just slowness, shouldn't the SE/30 be
> expected to boot in about 60 to 120 minutes (I let it run for 3 hours)?
>

Is the optional L2 cache card installed in your IIci? If so, that
could give a really big performance boost on something like this. I
suspect the key generation routines and data don't fit in the tiny L1
cache in a 68030, but would fit easily in the L2 cache. Any older
models like the SE/30 weren't designed to be able to have an L2 cache.
This difference alone could probably double or triple the performance
even without the clock speed change.

Brad Boyer

unread,
Feb 7, 2023, 5:20:03 PM2/7/23
to
On Tue, Feb 07, 2023 at 12:41:52PM -0700, Stan Johnson wrote:
> Yes, I do have the L2 cache card installed in the IIci.
>
> If you think it would be useful, I don't mind letting the SE/30 run
> overnight to see if it eventually boots.

I don't know if it would be useful in any practical sense, but I do
admit to being curious how many hours it will take. Overnight should
be enough. That would confirm that it's just extremely slow.

Michael Schmitz

unread,
Feb 7, 2023, 6:00:04 PM2/7/23
to
Thanks Stan,

On 8/02/23 08:37, Stan Johnson wrote:
> Hi Michael,
>
> On 2/5/23 3:19 PM, Michael Schmitz wrote:
>> ...
>>
>> Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
>> his task corruption troubles on 040, I just noticed that I probably
>> misunderstood how Al's patch works.
>>
>> Botching up a fault retry and carrying on may well leave the page tables
>> in a state where some later access could go to the wrong page and
>> manifest as user space corruption. Could you try Al's patch 4 (m68k: fix
>> livelock in uaccess) to see if this helps?
>> ...
> ok, this appears to be the patch:
>
> Signed-off-by: Al Viro <vi...@zeniv.linux.org.uk>
> ---
> arch/m68k/mm/fault.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
> index 4d2837eb3e2a..228128e45c67 100644
> --- a/arch/m68k/mm/fault.c
> +++ b/arch/m68k/mm/fault.c
> @@ -138,8 +138,11 @@ int do_page_fault(struct pt_regs *regs, unsigned
> long address,
> fault = handle_mm_fault(vma, address, flags, regs);
> pr_debug("handle_mm_fault returns %x\n", fault);
>
> - if (fault_signal_pending(fault, regs))
> + if (fault_signal_pending(fault, regs)) {
> + if (!user_mode(regs))
> + goto no_context;
> return 0;
> + }
>
> /* The fault is fully completed (including releasing mmap lock) */
> if (fault & VM_FAULT_COMPLETED)

That's correct.

Your results show improvement but the problem does not entirely go away.

Looking at differences between 030 and 040/040 fault handling, it
appears only 030 handles faults corrected by exception tables (such as
used in uaccess macros) special, i.e. aborting bus error processing
while 040 and 060 carry on in the fault handler.

I wonder if that's the main difference between 030 and 040 behaviour?

I'll try and log such accesses caught by exception tables on 030 to see
if they are rare enough to allow adding a kernel log message...

Cheers,

    Michael

Finn Thain

unread,
Feb 7, 2023, 6:40:04 PM2/7/23
to

On Tue, 7 Feb 2023, Stan Johnson wrote:

> If you think it would be useful, I don't mind letting the SE/30 run
> overnight to see if it eventually boots.
>

Preventing pointless key generation would be beneficial for all Macs,
Amigas, Ataris, emulators etc. and measuring the performance of one model
of Mac versus that of another model seems a bit irrelevant to me.

Moreover, you've shown that your kernel builds produce stack smashing
errors whereas Debian's build does not. To resolve the problem with your
builds, why not begin by eliminating some of the differences between your
build and Debian's?

I suggest you should adopt the current Debian SID build environment and
toolchain use it to build mainline Linux (stock v6.1) using QEMU.

If you use your .config and if you still get stack smashing errors then
you can use the script I wrote to bisect the differences between your
.config and Debian's .config.

John Klos

unread,
Feb 8, 2023, 2:30:03 PM2/8/23
to
> If anyone knows of a 68030 emulator (maybe Basilisk?) that can boot
> Linux, then I might be able to use that for faster testing.

I've played around with NetBSD on FS-UAE. I'd use it more, except for the
fact that the emulation of the Commodore 2065 ethernet card gives very
flakey networking.

The emulation was configured with an m68030 & m68882, and I've heard it
can run Linux, too.

> This difference alone could probably double or triple the performance
> even without the clock speed change.

Even though the external cache can do burst transfers to the CPU's cache,
this definitely wouldn't double or triple performance - most of the time
the CPU isn't waiting for data from memory. It'd be faster, but not
significantly.

John

Eero Tamminen

unread,
Feb 8, 2023, 2:40:03 PM2/8/23
to
Hi,

On 8.2.2023 19.39, Stan Johnson wrote:
> The stack smashing appears to be intermittent. And it doesn't show up
> while booting the kernel; it only shows up while sysvinit scripts are
> running (I haven't tested using systemd, since that would be too painful
> on any 68030 slower than about 40 MHz). It takes too long to boot slow
> systems using Debian's kernel to run repeated tests, and QEMU only
> emulates 68040, so it appears to be necessary to test on real hardware.
> It's not my goal to get my config files closer to Debian's, anyway. My
> goal is to have the fewest number of config files for different groups
> of systems.
>
> If anyone knows of a 68030 emulator (maybe Basilisk?) that can boot
> Linux, then I might be able to use that for faster testing.

On quick search I was not able to find out whether Basilisk II (or mini
vMac) emulate both 030 MMU and CPU i/d-cache. If they don't, they may
not be accurate enough to find issues like this.


In case the issue is only 030, not Mac specific, both WinUAE (Amiga) and
Hatari (Atari) emulators support 68030 + MMU + cache emulation, and can
boot Linux.

Hatari CPU core is based on one from WinUAE.)


Apparently WinUAE has not had problems running Linux, but in Hatari,
enabling 030 _cache_ emulation breaks Linux boot when it reaches user
space (kernel boot works fine).

I'm not sure whether latter is related to the issue you are seeing on Mac.

(There are some differences e.g. in how Amiga and Atari handle CPU
exceptions and how MMU is used, which may explain differences in
behavior. I have no idea whether Mac is closer to Amiga or Atari in
this respect.)


For details on using m68k Linux with Hatari, see:
https://hatari.tuxfamily.org/doc/m68k-linux.txt


- Eero

PS. Debugger in Hatari emulator can provide you with backtraces for
Linux kernel side, and profile where kernel time goes.

For a full boot, callgraphs will be so large that you'll probably need
to throw away 99% fo the data though, to make them readable. For
details, see:
https://hatari.tuxfamily.org/doc/debugger.html#Profiling

Finn Thain

unread,
Feb 8, 2023, 5:30:04 PM2/8/23
to
On Wed, 8 Feb 2023, Stan Johnson wrote:

> On 2/7/23 4:20 PM, Finn Thain wrote:
> >
> > On Tue, 7 Feb 2023, Stan Johnson wrote:
> > ...
> > Preventing pointless key generation would be beneficial for all Macs,
> > Amigas, Ataris, emulators etc. and measuring the performance of one model
> > of Mac versus that of another model seems a bit irrelevant to me.
> >
>
> Sure, but unless Debian unsupported is willing to manage config files
> for the various systems, then it's not likely to happen.

It's easy to refute that. Just read my message from 2 days ago in this
very thread where I pointed to a different Debian kernel key generation
issue that got fixed.

>
> > Moreover, you've shown that your kernel builds produce stack smashing
> > errors whereas Debian's build does not. To resolve the problem with your
> > builds, why not begin by eliminating some of the differences between your
> > build and Debian's?
>
> The stack smashing appears to be intermittent. And it doesn't show up
> while booting the kernel; it only shows up while sysvinit scripts are
> running (I haven't tested using systemd, since that would be too painful
> on any 68030 slower than about 40 MHz).

No-one is asking for systemd tests.

> It takes too long to boot slow systems using Debian's kernel to run
> repeated tests ...

If your m68k machines are too slow, why do you care about stack smashing
errors at all?

>
> If the stack smashing is caused by a kernel bug that is hidden by
> Debian's choice of config options, then it would still be useful to
> identify the bug. If there is something missing from my config files
> that is causing the problem, then that would still be a kernel bug in
> its sanity checking of options.

This is not about sanity checking.

Anyway, if you follow the steps I gave, we all get to learn something
about the cause of the stack smashing error -- if that's what you want.

Michael Schmitz

unread,
Feb 8, 2023, 10:50:03 PM2/8/23
to
Hi Stan,
Following the 040 code a bit further, I suspect that happens in the 040
writeback handler, so this may be a red herring.

> I'll try and log such accesses caught by exception tables on 030 to see
> if they are rare enough to allow adding a kernel log message...

Looks like this kind of event is rare enough to not trigger in a normal
boot on my 030. Please give the attached patch a try so we can confirm
(or rule out) that user space access faults from kernel mode are to
blame for your stack smashes.

Cheers,

Michael


> Cheers,
>
> Michael
>
>
0001-m68k-debug-exception-handling-data-faults-on-030.patch

John Paul Adrian Glaubitz

unread,
Feb 9, 2023, 4:30:03 AM2/9/23
to
On Wed, 2023-02-08 at 10:20 +1100, Finn Thain wrote:
> Preventing pointless key generation would be beneficial for all Macs,
> Amigas, Ataris, emulators etc. and measuring the performance of one model
> of Mac versus that of another model seems a bit irrelevant to me.

Feel free to suggest a kernel configuration update for the m68k Debian kernel
or even a new kernel flavor such as a »-light« one.

I am planning to create a new »-virt« kernel anyway. Maybe we can make the main
m68k kernel flavor leaner and only enable all the »slow« stuff on the »-virt«
flavor.

John Paul Adrian Glaubitz

unread,
Feb 9, 2023, 4:30:04 AM2/9/23
to
On Wed, 2023-02-08 at 10:39 -0700, Stan Johnson wrote:
> On 2/7/23 4:20 PM, Finn Thain wrote:
> >
> > On Tue, 7 Feb 2023, Stan Johnson wrote:
> > ...
> > Preventing pointless key generation would be beneficial for all Macs,
> > Amigas, Ataris, emulators etc. and measuring the performance of one model
> > of Mac versus that of another model seems a bit irrelevant to me.
> >
>
> Sure, but unless Debian unsupported

What do you mean by »Debian unsupported«? You mean »Debian Ports«?

> is willing to manage config files for the various systems, then it's
> not likely to happen. I currently use separate config files for the
> following Macs, to build kernels with no initrd, no modules, and only
> minimal network and video support:
>
> 1) 68030 8 MiB, no network (PB-170)
> 2) 68030 >8 MiB (SE/30, IIci, IIfx, Centris LC III, etc.)
> 3) 68040 (Centris 650, PB 550c, etc.)

Debian supports different kernel configuration per architecture, the concept
is called »flavors«. We could certainly add a »-lean« or »-light« flavor
for smaller systems.

> If the stack smashing is caused by a kernel bug that is hidden by
> Debian's choice of config options, then it would still be useful to
> identify the bug. If there is something missing from my config files
> that is causing the problem, then that would still be a kernel bug in
> its sanity checking of options. Your script will be helpful if it
> becomes necessary to identify specific offending options.

FWIW, I haven't seen this issue on my Amiga although I haven't run a
dist-upgrade for a while now. I guess, I am going to do that in the
near future.

Michael Schmitz

unread,
Feb 10, 2023, 3:00:04 AM2/10/23
to
Hi Stan,

Am 10.02.2023 um 13:24 schrieb Stan Johnson:
> Hi Michael,
>
> On 2/8/23 8:41 PM, Michael Schmitz wrote:
>> ...
>>
>> Following the 040 code a bit further, I suspect that happens in the 040
>> writeback handler, so this may be a red herring.
>>
>>> I'll try and log such accesses caught by exception tables on 030 to see
>>> if they are rare enough to allow adding a kernel log message...
>>
>> Looks like this kind of event is rare enough to not trigger in a normal
>> boot on my 030.
>
> Have you seen the error using my modified config file? Perhaps I'm not
> including something that's causing (or revealing) the problem.

I haven't seen these stack smashing errors ever, but that's with a
really ancient user space and a kernel config for Atari that only
includes the bare minimum.

>> Please give the attached patch a try so we can confirm
>> (or rule out) that user space access faults from kernel mode are to
>> blame for your stack smashes.
>> ...
>
> With "0001-m68k-debug-exception-handling-data-faults-on-030.patch",
>
> Kernel Stack-Smashing
> 6.2.0-rc7 (no patch) 4, 3, 3 (from earlier test)
> 6.2.0-rc7 (new patch) 6, 2, 0
>
> The earlier patch is not applied. Serial console log is attached.

Without Al's patch, I doubt even in case a uaccess fault happens with
signal pending we'd return -1 from send_fault_sig() (the no_context path
isn't taken and do_page_fault() returns without error). No kernel
messages expected in that case. But none seen otherwise either which
indicates exception handling in uaccess isn't a problem.

Not sure it's worth the hassle to retry with both patches applied...

Thanks,

Michael


>
> thanks
>
> -Stan
>

Finn Thain

unread,
Feb 11, 2023, 5:30:03 PM2/11/23
to

On Sat, 11 Feb 2023, Stan Johnson wrote:

> v5.1 x SCSI2SD crashes, goes offline with activity LED on,
> rootfs corrupted, needed to be restored from backups, SCSI2SD SD card
> needed to have the Apple driver updated to boot MacOS
>
> v4.20 bad stack smashing on first boot, corrupted rootfs (bad
> superblock magic number) on second boot, fsck from a different rootfs
> found many block counts wrong, rootfs had to be restored from backups

That looks like an unrelated problem, presumably caused by hardware.

It could be related to the problem you described to me that arises when
you subdivide an SD card into multiple SCSI targets.

Anyway, if you want to pursue the stack smashing error you'll probably
need to use a different storage device.

Finn Thain

unread,
Feb 11, 2023, 6:30:02 PM2/11/23
to
On Sat, 11 Feb 2023, Stan Johnson wrote:

> I think if there were hardware problems with the SCSI2SD board or the SD
> card, then I would be seeing lots of errors in MacOS and with v6.x
> kernels.
>

That's not true in general. But I can agree that certain SCSI device
faults will show up in any operating systems.

Finn Thain

unread,
Feb 11, 2023, 7:40:03 PM2/11/23
to
I'll have to retract that -- there's no reason to blame the SCSI2SD. The
mac_scsi driver did not really stabilize until v5.3 so that seems like the
most likely cause.

Finn Thain

unread,
Feb 15, 2023, 11:40:04 PM2/15/23
to
On Wed, 15 Feb 2023, Stan Johnson wrote:

>
> 1) Default Debian kernel (vmlinux-6.1.0-4-m68k, initrd-6.1.0-4-m68k)
> Default Debian sysvinit scripts
> Boot time (ABC... to login prompt): 15 min 3 sec
> NIC not detected, no stack smashing
>

Did you check whether the NIC module (mac8390) got loaded? I don't see any
sign of that in the console log. You can find out by running 'lsmod'. You
should also check whether that module is present on the initrd. The initrd
contents can be listed with 'zstd -dc < initrd-6.1.0-4-m68k | cpio -t'.

John Paul Adrian Glaubitz

unread,
Feb 16, 2023, 5:10:03 AM2/16/23
to
Hi Stan!

On Wed, 2023-02-15 at 19:44 -0700, Stan Johnson wrote:
> Going from 15 min to about 4 min seems worth the effort on old hardware.
> As always, YMMV. Developers who use QEMU or other emulators likely don't
> always realize how long it takes to boot real hardware.

I am not sure what makes you think we aren't aware of the long boot times,
I just booted Debian unstable on my Amiga 4000/060 yesterday:

root@elgar:~> systemd-analyze
Startup finished in 1min 30.785s (kernel) + 5min 31.586s (userspace) = 7min 2.371s
graphical.target reached after 5min 26.053s in userspace
root@elgar:~>

I understand your frustration, but please keep in mind that modern software
is more complex and therefore runs slower on older hardware.

We will certainly work on providing a leaner kernel in the future to help
alleviate this issue. And rebootstrapping the whole distribution with 32-bit
alignment (-malign-int) should improve code execution as well.

Michael Schmitz

unread,
Feb 16, 2023, 6:20:03 PM2/16/23
to
Hi Stan,

On 16/02/23 15:44, Stan Johnson wrote:
> I could re-install the previous sysvinit over the version in the current
> Debian SID and see if the stack smashing is still gone. I don't know how
> to do that, but if someone has instructions, I'll try. I'm guessing I
> need to download the previous .deb binaries and use dpkg to install the
> older versions over the newer versions, while the newer init is still
> running; maybe rename /sbin/init to /sbin/init.tmp and boot with
> init=/sbin/init.tmp to get it out of the way (?).
No need to move the old init binary out of the way - as long as a file
is still in use, it won't actually be deleted (the directory entry just
points to the new file so subsequent invocations use the new file contents.
>
> Or I could download the source for sysvinit-core 3.06-2 and
> sysvinit-utils 3.06-2 and compile using all of Debian's options plus
> -fstack-protector-all (and perhaps other options?) to see whether there
> might be a bounds issue on an array somewhere. But I also don't know
> where to find the source or the options that Debian uses for compilation.

'apt-get source sysvinit=3.06-2' will download and unpack that specific
version. That should unpack the source in sysvinit-3.06-2/.

To add compile options, look at the patches in debian/patches/ there -
haven't found the 3.06-2 version, but older versions have a patch to add
additional CFLAGS to src/Makefile. Add a new patch in debian/patches/,
add that patch file name to debian/patches/series and use
dpkg-buildpackage to build the package.

Cheers,

    Michael


>
> Otherwise, I'm done looking into the stack smashing for now. If anyone
> is interested in developing "68030-lean" kernel config options or custom
> sysvinit scripts, I'll be happy to contribute.
>
> thanks
>
> -Stan
>

John Paul Adrian Glaubitz

unread,
Feb 17, 2023, 1:10:03 PM2/17/23
to
On Fri, 2023-02-17 at 10:09 -0700, Stan Johnson wrote:
> Would downloading the source from an x86 repository be sufficient, ...?

Yes, there is no architecture-specific source code repository.

Use:

deb-src http://ftp.debian.org/debian/ unstable main

Michael Schmitz

unread,
Feb 17, 2023, 3:40:04 PM2/17/23
to
Hi Stan,

Am 18.02.2023 um 06:09 schrieb Stan Johnson:
> Hi Michael,
>
> On 2/16/23 4:10 PM, Michael Schmitz wrote:
>> ...
>> 'apt-get source sysvinit=3.06-2' will download and unpack that specific
>> version. That should unpack the source in sysvinit-3.06-2/.
>
> That doesn't work for me:
> # apt-get source sysvinit=3.06-2
> Reading package lists... Done
> E: You must put some 'deb-src' URIs in your sources.list
>
> Adding this to /etc/apt/sources.list doesn't help (same error):
> deb-src http://ftp.ports.debian.org/debian-ports/ sid main
>
> Running "apt-get update", I see:
> W: Skipping acquire of configured file 'main/source/Sources' as
> repository 'http://ftp.ports.debian.org/debian-ports sid InRelease' does
> not seem to provide it (sources.list entry misspelt?)
>
> So I might not be pointing to the right source repository. Would
> downloading the source from an x86 repository be sufficient, or do I
> need patches that are specific to m68k?
>
>>
>> To add compile options, look at the patches in debian/patches/ there -
>> haven't found the 3.06-2 version, but older versions have a patch to add
>> additional CFLAGS to src/Makefile. Add a new patch in debian/patches/,
>> add that patch file name to debian/patches/series and use
>> dpkg-buildpackage to build the package.
>
> If I can download the source, I'll do that.
>
> Just to confirm that an earlier version (3.01-2) doesn't cause the stack
> smashing errors with an otherwise up-to-date Debian SID, I can restore

Good work -

> from an earlier backup and run "apt-get upgrade", though I would want to
> "keep back" sysvinit-core 3.01-2 and sysvinit-utils 3.01-2. Please let
> me know how to keep back a package from being upgraded.

'dpkg hold' should mark packages so apt won't consider them for
upgrading. You may want to check what other packages depend on newer
versions of sysvinit before attempting that (search for 'sysvinit-utils
(>=' in the Packages files in /var/lib/apt/lists) before attempting that.

>
> Do you know whether there is some location where all of the older binary
> packages are saved? If there is, I can download and check versions
> between 3.01-2 and 3.06-2 and determine where the issue started.

snapshot.debian.org is what you want. 15 versions to bisect ...

Cheers,

Michael


>
> thanks
>
> -Stan
>

Finn Thain

unread,
Feb 17, 2023, 5:10:03 PM2/17/23
to

On Fri, 17 Feb 2023, Stan Johnson wrote:

> I noticed that /sbin/init seems to ignore SIGABRT, so I thought that
> might mean that init itself was somehow triggering the stack smashing
> but nothing was really aborting, but I could be wrong about that.

Why do you say init ignores SIGABRT? I couldn't find that in the source
code. Did you try 'kill -ABRT 1'?

Finn Thain

unread,
Feb 17, 2023, 6:00:04 PM2/17/23
to
On Fri, 17 Feb 2023, Stan Johnson wrote:

>
> That's not to say a SIGABRT is ignored, it just doesn't kill PID 1.
>

I doubt that /sbin/init is generating the "stack smashing detected" error
but you may need to modify it to find out. If you can't figure out which
userland binary is involved, you'll have to focus on your custom kernel
binary, just as I proposed in my message dated 8 Feb 2023.

Andreas Schwab

unread,
Feb 17, 2023, 6:30:03 PM2/17/23
to
On Feb 18 2023, Finn Thain wrote:

> Why do you say init ignores SIGABRT?

PID 1 is special, it never receives signals it doesn't handle.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Finn Thain

unread,
Feb 17, 2023, 6:40:03 PM2/17/23
to

On Fri, 17 Feb 2023, Stan Johnson wrote:

>
> The error could have been exposed in any package where
> "-fstack-protector-strong" was recently added.
>

And if you find the last good userland binary, what then? Fix the bad
userland binary? That's wonderful but it doesn't explain why the bad
userland binary went undetected with Debian's kernel build. And it doesn't
explain why you can't reproduce the problem in QEMU.

Moreover, the above was always an unlikely scenario because an actual
buffer overrun in a userland binary that only shows up on '030 is
improbable in the first place, because code paths conditional on processor
variant are normally found in the kernel.

Hence the advice I gave 10 days ago.

Finn Thain

unread,
Feb 17, 2023, 7:10:04 PM2/17/23
to
On Sat, 18 Feb 2023, Andreas Schwab wrote:

> On Feb 18 2023, Finn Thain wrote:
>
> > Why do you say init ignores SIGABRT?
>
> PID 1 is special, it never receives signals it doesn't handle.
>

I see. I wonder if there is some way to configure the kernel so that PID 1
could be aborted for fstack-protector. I doubt it.

My gut says that a compiler change somehow made the fstack-protector
implementation insensitive to kernel configuration.

So I still think that, if Stan adopted Debian's build environment, random
processes would cease to be aborted (regardless of kernel .config).

Michael Schmitz

unread,
Feb 17, 2023, 7:50:04 PM2/17/23
to
Hi Finn,

Am 18.02.2023 um 12:49 schrieb Finn Thain:
> On Sat, 18 Feb 2023, Andreas Schwab wrote:
>
>> On Feb 18 2023, Finn Thain wrote:
>>
>>> Why do you say init ignores SIGABRT?
>>
>> PID 1 is special, it never receives signals it doesn't handle.
>>
>
> I see. I wonder if there is some way to configure the kernel so that PID 1
> could be aborted for fstack-protector. I doubt it.

You could add SIGABRT to the list of signals handled by init (see
init.c:init_main() and init.c:process_signals() in the sysvinit source).

Not sure it's wise to allow init to abort though. You could instead try
to use a similar signal handler to segv_handler(), and dump core when
receiving the signal? Maybe re-exec init instead of continuing?

>
> My gut says that a compiler change somehow made the fstack-protector
> implementation insensitive to kernel configuration.
>
> So I still think that, if Stan adopted Debian's build environment, random
> processes would cease to be aborted (regardless of kernel .config).

Changes in compiler version between sysvinit 3.01 and 3.06 might cause a
bisect using snapshot binaries and a bisect using recompiled binaries to
differ. But using a build environment equivalent to that used by the
package autobuilders is certainly good advice.

Or did you mean the kernel build environment?

Cheers.

Michael

Finn Thain

unread,
Feb 17, 2023, 8:10:03 PM2/17/23
to

On Fri, 17 Feb 2023, Stan Johnson wrote:

>
> What's interesting to me is that although there are stack smashing
> errors and "aborted" messages during boot, nothing seems to have
> actually failed or aborted (note that no processes are named as
> "terminated" or "aborted" in the messages).

Initscripts do a lot of "if grep -q" and "this || that" etc. Since a
failure is normal, how would you tell that a process got aborted?

> Once logged in, I can't duplicate the stack smashing (yet), and I don't
> see random executables failing. The stack smashing seems to happen only
> in startup scripts that have heavy access to /proc (the mountkernfs
> script seems particularly vulnerable, as I recall), though none of the
> scripts seems to actually fail.
>

Yes. We've seen processes started from initscripts crash when accessing
/proc before. It was never debugged, though it wasn't a stack smashing
error at the time.

> I'm not a programmer, but I'm willing to learn. I don't know what
> Debian's build environment is, or how to adopt it or set it up, but I
> can look into it with some suggestions.

As I understand it, Debian/m68k SID is built in Debian/m68k SID in QEMU.
Please see,
https://lists.debian.org/debian-68k/2023/02/msg00025.html

Finn Thain

unread,
Feb 17, 2023, 10:20:03 PM2/17/23
to
On Sat, 18 Feb 2023, Michael Schmitz wrote:

> Am 18.02.2023 um 12:49 schrieb Finn Thain:
> > On Sat, 18 Feb 2023, Andreas Schwab wrote:
> >
> >> On Feb 18 2023, Finn Thain wrote:
> >>
> >>> Why do you say init ignores SIGABRT?
> >>
> >> PID 1 is special, it never receives signals it doesn't handle.
> >>
> >
> > I see. I wonder if there is some way to configure the kernel so that
> > PID 1 could be aborted for fstack-protector. I doubt it.
>
> You could add SIGABRT to the list of signals handled by init (see
> init.c:init_main() and init.c:process_signals() in the sysvinit source).
>
> Not sure it's wise to allow init to abort though. You could instead try
> to use a similar signal handler to segv_handler(), and dump core when
> receiving the signal? Maybe re-exec init instead of continuing?
>

I like the idea of patching the kernel so as to log every SIGABRT sent
(even if never delivered), ideally along with the target pid and cmd and
the sending pid and cmd.

> >
> > My gut says that a compiler change somehow made the fstack-protector
> > implementation insensitive to kernel configuration.
> >
> > So I still think that, if Stan adopted Debian's build environment,
> > random processes would cease to be aborted (regardless of kernel
> > .config).
>
> Changes in compiler version between sysvinit 3.01 and 3.06 might cause a
> bisect using snapshot binaries and a bisect using recompiled binaries to
> differ.
>

According to the buildd logs:

package | compiler
---------+-----------
3.01-1 | 11.2.0-12
3.02-1 | 11.2.0-19
3.03-1 | 11.2.0-19
3.04-1 | 12.1.0-7
3.05-7 | 12.2.0-7
3.06-2 | 12.2.0-10

Those debs are all available on
https://snapshot.debian.org/package/sysvinit/
in case Stan wants to bisect. No need to build anything.

> But using a build environment equivalent to that used by the package
> autobuilders is certainly good advice.
>
> Or did you mean the kernel build environment?
>

I meant the kernel build environment which was a double-think on my part
as it isn't really relevant here.

Finn Thain

unread,
Feb 18, 2023, 12:40:04 AM2/18/23
to

On Fri, 17 Feb 2023, Stan Johnson wrote:

> On 2/17/23 4:24 PM, Finn Thain wrote:
> >
> > On Fri, 17 Feb 2023, Stan Johnson wrote:
> >
> >>
> >> The error could have been exposed in any package where
> >> "-fstack-protector-strong" was recently added.
> >>
> >
> > And if you find the last good userland binary, what then? Fix the bad
> > userland binary? That's wonderful but it doesn't explain why the bad
> > userland binary went undetected with Debian's kernel build. And it
> > doesn't explain why you can't reproduce the problem in QEMU.
>
> Maybe the 68030 handles memory differently than the 68040? As I
> mentioned previously, if there is a 68030 Mac emulator that can boot
> Linux, then I could try that. I could also try running QEMU on a slow
> x86 host to see if a slow client combined with the rapid accesses to
> /proc from the scripts might be causing an issue, I suspect if the 68040
> were vulnerable to this problem, we'd be seeing it in QEMU. I'll check
> this weekend to see whether QEMU is still not seeing any problem with my
> latest kernel builds together with the latest Debian SID.
>
> And no, I likely won't be fixing anything, but perhaps I can report the
> problem to the maintainers of the relevant package. And I will likely
> choose (for my own systems) to keep back the last good userland binary
> to prevent it from being overwritten until a fix is found.
>

Good luck with that.

Andreas Schwab

unread,
Feb 18, 2023, 3:20:55 AM2/18/23
to
On Feb 18 2023, Michael Schmitz wrote:

> Not sure it's wise to allow init to abort though.

When PID 1 exits, the kernel panics.

Michael Schmitz

unread,
Feb 19, 2023, 2:00:05 PM2/19/23
to
Hi Andreas,

Am 18.02.2023 um 20:59 schrieb Andreas Schwab:
> On Feb 18 2023, Michael Schmitz wrote:
>
>> Not sure it's wise to allow init to abort though.
>
> When PID 1 exits, the kernel panics.

I might have been a mite subtle there ...

Anyway - Finn's intention was to log signals caused by the stack
protector mechanism. Reading the description of the abort() function, it
appears that allowing init to take the signal through a handler that
returns, abort() restores the default handler and retries. That might
just manage to abort init after all (I haven't found where PID 1 is
treated special in the kernel signal code to rule that out).

Logging ABRT in the kill() syscall path seems much the safer option.
I'll see what I can come up with.

Cheers,

Michael

Finn Thain

unread,
Mar 28, 2023, 12:00:04 AM3/28/23
to
Using the core dump I generated on my Mac LC III, and using a workaround
for the gdb regression, I was able to get the backtrace below.

root@(none):/root# gdb
GNU gdb (Debian 13.1-2) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
...
(gdb) set osabi GNU/Linux
(gdb) exec /bin/dash
(gdb) core /root/core.0
warning: core file may not match specified executable file.
[New LWP 366]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/m68k-linux-gnu/libthread_db.so.1".
Core was generated by `/bin/sh /etc/init.d/mountkernfs.sh reload'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (threadid=3222954656, signo=6, no_tid=0)
at pthread_kill.c:44
44 pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (threadid=3222954656, signo=6, no_tid=0)
at pthread_kill.c:44
#1 0xc00a7080 in __pthread_kill_internal (signo=6, threadid=3222954656)
at pthread_kill.c:78
#2 __GI___pthread_kill (threadid=3222954656, signo=6) at pthread_kill.c:89
#3 0xc0064c22 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4 0xc0052faa in __GI_abort () at abort.c:79
#5 0xc009b328 in __libc_message (action=<optimized out>, fmt=<optimized out>)
at ../sysdeps/posix/libc_fatal.c:155
#6 0xc012a3c2 in __GI___fortify_fail (
msg=0xc0182c5e "stack smashing detected") at fortify_fail.c:26
#7 0xc012a3a0 in __stack_chk_fail () at stack_chk_fail.c:24
#8 0xc00e0172 in __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:41
#9 0xd000c38e in ?? ()
#10 0xefee111e in ?? ()
#11 0x00000000 in ?? ()
(gdb)

It appears that the failure was in glibc (though I guess the root cause
may lie elsewhere). I have two more core files generated by dash (actually
by `/bin/sh /etc/rcS.d/S08mountall.sh start') that give the same
backtrace. So even though the failure is intermittent, the site of the
buffer overrun seems to be consistent.

Looking at sysdeps/unix/sysv/linux/wait3.c, I guess the only possible
place for a buffer overrun would be struct __rusage64 usage64.
https://sources.debian.org/src/glibc/2.36-8/sysdeps/unix/sysv/linux/wait3.c/?hl=41#L41

(gdb) select-frame 8
(gdb) print usage64
$3 = {ru_utime = {tv_sec = 6481621047248640, tv_usec = 91671782025504},
ru_stime = {tv_sec = 25769811968, tv_usec = 8591449888}, {
ru_maxrss = 1515296, __ru_maxrss_word = 1515296}, {ru_ixrss = 1515296,
__ru_ixrss_word = 1515296}, {ru_idrss = 224, __ru_idrss_word = 224}, {
ru_isrss = 224, __ru_isrss_word = 224}, {ru_minflt = 6,
__ru_minflt_word = 6}, {ru_majflt = 4, __ru_majflt_word = 4}, {
ru_nswap = 4, __ru_nswap_word = 4}, {ru_inblock = 372,
__ru_inblock_word = 372}, {ru_oublock = 0, __ru_oublock_word = 0}, {
ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 8,
__ru_msgrcv_word = 8}, {ru_nsignals = 367, __ru_nsignals_word = 367}, {
ru_nvcsw = 10, __ru_nvcsw_word = 10}, {ru_nivcsw = 0,
__ru_nivcsw_word = 0}}
(gdb)

Of course, at this point the damage has already been done and the culprit
has gone. I guess there was a buffer overrun during the call to
__wait4_time64().
https://sources.debian.org/src/glibc/2.36-8/sysdeps/unix/sysv/linux/wait4.c/?hl=26#L26

It's hard to read glibc source code without knowing what all the macros
were set to (such as __KERNEL_OLD_TIMEVAL_MATCHES_TIMEVAL64 and
__TIMESIZE).

It would be disappointing if rusage64_to_rusage() in __wait3() was being
applied to the result of rusage32_to_rusage64() from __wait4_time64().
Perhaps the ifdefs are arranged in such a way that it doesn't happen...

Anyway, does anyone know how to get a hex dump of the whole stack frame
including the canary, in case there is something to be learned from that?

Finn Thain

unread,
Mar 30, 2023, 11:50:04 PM3/30/23
to

On 28 March 2023, I wrote,

>
> Looking at sysdeps/unix/sysv/linux/wait3.c, I guess the only possible
> place for a buffer overrun would be struct __rusage64 usage64.
> https://sources.debian.org/src/glibc/2.36-8/sysdeps/unix/sysv/linux/wait3.c/?hl=41#L41
>

... but now I see the usage64 variable is not involved at all because
__wait3() was passed a NULL pointer:
https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1179#L1179

So NULL (rather than &usage64) was passed to the wait4() syscall which
means the kernel didn't invoke copy_to_user() at all. AFAICS there's no
possible buffer overflow in __wait3(), __wait4_time64() etc.

That suggests to a problem with GCC's SSP detector.

Here's a more complete backtrace and some disassembly.

# gdb
GNU gdb (Debian 13.1-2) 13.1
...
(gdb) set osabi GNU/Linux
(gdb) file /bin/dash
Reading symbols from /bin/dash...
Reading symbols from /usr/lib/debug/.build-id/aa/4160f84f3eeee809c554cb9f3e1ef0686b8dcc.debug...
(gdb)
#9 0xd000c38e in waitproc (status=0xefee110e, block=1) at jobs.c:1179
#10 waitone (block=1, job=0xd0021930) at jobs.c:1055
#11 0xd000c5b8 in dowait (block=1, jp=0xd0021930) at jobs.c:1137
#12 0xd000ddb0 in waitforjob (jp=0xd0021930) at jobs.c:1014
#13 0xd000aade in expbackq (flag=324, cmd=0xd00222c4) at expand.c:520
#14 argstr (p=<optimized out>, flag=68) at expand.c:335
#15 0xd000b5ce in expandarg (arg=0xd00222ac, arglist=0xefee13bc, flag=4)
at expand.c:192
#16 0xd0007e2a in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:855
#17 0xd0006ffc in evaltree (n=0xd0022294, flags=0) at eval.c:300
#18 0xd0006e96 in evaltree (n=0xd0022294, flags=0) at eval.c:300
#19 0xd0006e6a in evaltree (n=0xd002224c, flags=0) at eval.c:292
#20 0xd0006e6a in evaltree (n=0xd00220d4, flags=0) at eval.c:292
#21 0xd0006e6a in evaltree (n=0xd002208c, flags=0) at eval.c:292
#22 0xd000746a in evalfun (func=0xd0022078, argc=<optimized out>,
argv=0xd001e61c <stackbase+376>, flags=<optimized out>) at eval.c:1009
#23 0xd0008176 in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:921
#24 0xd0006ffc in evaltree (n=0xd001e588 <stackbase+228>, flags=1)
at eval.c:300
#25 0xd00084c8 in evaltreenr (flags=1, n=0xd001e588 <stackbase+228>)
at eval.c:347
#26 evalbackcmd (n=<optimized out>, result=0xefee17d4) at eval.c:650
#27 0xd000a984 in expbackq (flag=324, cmd=0xd001e588 <stackbase+228>)
at expand.c:495
#28 argstr (p=<optimized out>, flag=68) at expand.c:335
#29 0xd000b5ce in expandarg (arg=0xd001e5b0 <stackbase+268>,
arglist=0xefee191c, flag=4) at expand.c:192
#30 0xd0007e2a in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:855
#31 0xd0006ffc in evaltree (n=0xd001e5c0 <stackbase+284>, flags=0)
at eval.c:300
#32 0xd000e3c0 in cmdloop (top=0) at main.c:246
#33 0xd000e588 in dotcmd (argc=2, argv=<optimized out>) at main.c:341
#34 0xd0007a12 in evalbltin (cmd=0xd001b598 <builtincmd>,
argc=<optimized out>, argv=<optimized out>, flags=<optimized out>)
at eval.c:967
#35 0xd00080ca in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:910
#36 0xd0006ffc in evaltree (n=0xd001e4e8 <stackbase+68>, flags=0) at eval.c:300
#37 0xd000e3c0 in cmdloop (top=1) at main.c:246
#38 0xd0005018 in main (argc=<optimized out>, argv=<optimized out>)
at main.c:181
(gdb) frame 8
#8 0xc00e0172 in __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:41
41 ../sysdeps/unix/sysv/linux/wait3.c: No such file or directory.
(gdb) info frame
Stack level 8, frame at 0xefee10e0:
pc = 0xc00e0172 in __wait3 (../sysdeps/unix/sysv/linux/wait3.c:41);
saved pc = 0xd000c38e
called by frame at 0xefee11dc, caller of frame at 0xefee106c
source language c.
Arglist at 0xefee10d8, args: stat_loc=<optimized out>,
options=<optimized out>, usage=<optimized out>
Locals at 0xefee10d8, Previous frame's sp is 0xefee10e0
Saved registers:
a2 at 0xefee106c, a3 at 0xefee1070, a5 at 0xefee1074, fp at 0xefee10d8,
pc at 0xefee10dc
(gdb) x/32z 0xefee1060
0xefee1060: 0xc0182c5e 0xc0198000 0xc00e0172 0xd001e718
0xefee1070: 0xd001e498 0xd001b874 0x00170700 0x00170700
0xefee1080: 0x00170700 0x00005360 0x0000e920 0x00000006
0xefee1090: 0x00002000 0x00000002 0x00171f20 0x00171f20
0xefee10a0: 0x00171f20 0x000000e0 0x000000e0 0x00000006
0xefee10b0: 0x00000004 0x00000004 0x00000174 0x00000000
0xefee10c0: 0x00000000 0x00000008 0x0000016f 0x0000000a
0xefee10d0: 0x00000000 0x00ac3dbe 0xd001c1ec 0xd000c38e
(gdb)
0xefee10e0: 0xefee111e 0x00000000 0x00000000 0x00000001
0xefee10f0: 0x00000001 0xefee1284 0x00000044 0xd0017714
0xefee1100: 0x00000100 0xd0021930 0xd001c1ec 0xd001e498
0xefee1110: 0xd001b874 0xefee1308 0xc0023e8c 0xefee0000
0xefee1120: 0x00000044 0xd0017714 0x00000100 0xefee1274
0xefee1130: 0xc0023e8c 0xd001c028 0xd001b874 0xefee1208
0xefee1140: 0x00000000 0xc0023e8c 0x00000000 0x00000000
0xefee1150: 0x00000000 0x00000000 0x00000000 0x00000000
(gdb) print &usage64
$1 = (struct __rusage64 *) 0xefee107c
(gdb) disass
Dump of assembler code for function __wait3:
0xc00e0070 <+0>: linkw %fp,#-96
0xc00e0074 <+4>: moveml %a2-%a3/%a5,%sp@-
0xc00e0078 <+8>: lea %pc@(0xc0198000),%a5
0xc00e0080 <+16>: movel %fp@(8),%d0
0xc00e0084 <+20>: moveal %fp@(16),%a2
0xc00e0088 <+24>: moveal %a5@(108),%a3
0xc00e008c <+28>: movel %a3@,%fp@(-4)
0xc00e0090 <+32>: tstl %a2
0xc00e0092 <+34>: beqw 0xc00e0152 <__wait3+226>
0xc00e0096 <+38>: pea %fp@(-92)
0xc00e009a <+42>: movel %fp@(12),%sp@-
0xc00e009e <+46>: movel %d0,%sp@-
0xc00e00a0 <+48>: pea 0xffffffff
0xc00e00a4 <+52>: bsrl 0xc00e0174 <__GI___wait4_time64>
0xc00e00aa <+58>: lea %sp@(16),%sp
0xc00e00ae <+62>: tstl %d0
0xc00e00b0 <+64>: bgts 0xc00e00c8 <__wait3+88>
0xc00e00b2 <+66>: moveal %fp@(-4),%a0
0xc00e00b6 <+70>: movel %a3@,%d1
0xc00e00b8 <+72>: cmpl %a0,%d1
0xc00e00ba <+74>: bnew 0xc00e016c <__wait3+252>
0xc00e00be <+78>: moveml %fp@(-108),%a2-%a3/%a5
0xc00e00c4 <+84>: unlk %fp
0xc00e00c6 <+86>: rts
0xc00e00c8 <+88>: pea 0x44
0xc00e00cc <+92>: clrl %sp@-
0xc00e00ce <+94>: pea %a2@(4)
0xc00e00d2 <+98>: movel %d0,%fp@(-96)
0xc00e00d6 <+102>: bsrl 0xc00b8850 <__GI_memset>
0xc00e00dc <+108>: movel %fp@(-88),%a2@
0xc00e00e0 <+112>: movel %fp@(-80),%a2@(4)
0xc00e00e6 <+118>: movel %fp@(-72),%a2@(8)
0xc00e00ec <+124>: movel %fp@(-64),%a2@(12)
0xc00e00f2 <+130>: movel %fp@(-60),%a2@(16)
0xc00e00f8 <+136>: movel %fp@(-56),%a2@(20)
0xc00e00fe <+142>: movel %fp@(-52),%a2@(24)
0xc00e0104 <+148>: movel %fp@(-48),%a2@(28)
0xc00e010a <+154>: movel %fp@(-44),%a2@(32)
0xc00e0110 <+160>: movel %fp@(-40),%a2@(36)
0xc00e0116 <+166>: movel %fp@(-36),%a2@(40)
0xc00e011c <+172>: movel %fp@(-32),%a2@(44)
0xc00e0122 <+178>: movel %fp@(-28),%a2@(48)
0xc00e0128 <+184>: movel %fp@(-24),%a2@(52)
0xc00e012e <+190>: movel %fp@(-20),%a2@(56)
0xc00e0134 <+196>: movel %fp@(-16),%a2@(60)
0xc00e013a <+202>: movel %fp@(-12),%a2@(64)
0xc00e0140 <+208>: movel %fp@(-8),%a2@(68)
0xc00e0146 <+214>: lea %sp@(12),%sp
0xc00e014a <+218>: movel %fp@(-96),%d0
0xc00e014e <+222>: braw 0xc00e00b2 <__wait3+66>
0xc00e0152 <+226>: clrl %sp@-
0xc00e0154 <+228>: movel %fp@(12),%sp@-
0xc00e0158 <+232>: movel %d0,%sp@-
0xc00e015a <+234>: pea 0xffffffff
0xc00e015e <+238>: bsrl 0xc00e0174 <__GI___wait4_time64>
0xc00e0164 <+244>: lea %sp@(16),%sp
0xc00e0168 <+248>: braw 0xc00e00b2 <__wait3+66>
0xc00e016c <+252>: bsrl 0xc012a38c <__stack_chk_fail>
End of assembler dump.
(gdb)

Note that __wait3(stat_loc, options, NULL) reduces to,

return __wait4_time64(-1, stat_loc, options, NULL);

So I think the branch at __wait3+34 was taken, and after bsr
__GI___wait4_time64, the branch at __wait3+248 would have been taken. Then
the canary located at %fp@(-4) was compared with %a3@. From the hex dump
above, %fp@(-4) is 0xd000c38e.

As for %a3, we know its value when SIGABRT was caught, and if I'm not
mistaken, %a3 was not altered by __stack_chk_fail or
__GI___fortify_fail...

(gdb) frame 7
#7 0xc012a3a0 in __stack_chk_fail () at stack_chk_fail.c:24
24 stack_chk_fail.c: No such file or directory.
(gdb) info frame
Stack level 7, frame at 0xefee106c:
pc = 0xc012a3a0 in __stack_chk_fail (stack_chk_fail.c:24);
saved pc = 0xc00e0172
called by frame at 0xefee10e0, caller of frame at 0xefee1060
source language c.
Arglist at 0xefee105c, args:
Locals at 0xefee105c, Previous frame's sp is 0xefee106c
Saved registers:
a5 at 0xefee1064, pc at 0xefee1068
(gdb) disass
Dump of assembler code for function __stack_chk_fail:
0xc012a38c <+0>: movel %a5,%sp@-
0xc012a38e <+2>: lea %pc@(0xc0198000),%a5
0xc012a396 <+10>: movel %a5@(10696),%sp@-
0xc012a39a <+14>: bsrl 0xc012a3a0 <__GI___fortify_fail>
End of assembler dump.
(gdb) frame 6
#6 0xc012a3c2 in __GI___fortify_fail (
msg=0xc0182c5e "stack smashing detected") at fortify_fail.c:26
26 fortify_fail.c: No such file or directory.
(gdb) info frame
Stack level 6, frame at 0xefee1060:
pc = 0xc012a3c2 in __GI___fortify_fail (fortify_fail.c:26);
saved pc = 0xc012a3a0
called by frame at 0xefee106c, caller of frame at 0xefee1044
source language c.
Arglist at 0xefee1040, args: msg=0xc0182c5e "stack smashing detected"
Locals at 0xefee1040, Previous frame's sp is 0xefee1060
Saved registers:
d2 at 0xefee1050, d3 at 0xefee1054, a5 at 0xefee1058, pc at 0xefee105c
(gdb) disass
Dump of assembler code for function __GI___fortify_fail:
0xc012a3a0 <+0>: moveml %d2-%d3/%a5,%sp@-
0xc012a3a4 <+4>: lea %pc@(0xc0198000),%a5
0xc012a3ac <+12>: movel %sp@(16),%d3
0xc012a3b0 <+16>: movel %a5@(10700),%d2
0xc012a3b4 <+20>: movel %d3,%sp@-
0xc012a3b6 <+22>: movel %d2,%sp@-
0xc012a3b8 <+24>: pea 0x1
0xc012a3bc <+28>: bsrl 0xc009b1d8 <__libc_message>
=> 0xc012a3c2 <+34>: lea %sp@(12),%sp
0xc012a3c6 <+38>: movel %d3,%sp@-
0xc012a3c8 <+40>: movel %d2,%sp@-
0xc012a3ca <+42>: pea 0x1
0xc012a3ce <+46>: bsrl 0xc009b1d8 <__libc_message>
0xc012a3d4 <+52>: lea %sp@(12),%sp
0xc012a3d8 <+56>: bras 0xc012a3b4 <__GI___fortify_fail+20>
End of assembler dump.
(gdb) info reg
d0 0x0 0
d1 0x16e 366
d2 0xc0182c76 -1072157578
d3 0xc0182c5e -1072157602
d4 0xefee1122 -269610718
d5 0x1 1
d6 0xd0021930 -805168848
d7 0x100 256
a0 0xc01a62a0 0xc01a62a0
a1 0xffffffe6 0xffffffe6
a2 0x0 0x0
a3 0xefee1068 0xefee1068
a4 0xd001e71c 0xd001e71c <pending_sig>
a5 0xc0198000 0xc0198000
fp 0xefee10d8 0xefee10d8
sp 0xefee1044 0xefee1044
ps 0x10 [ X ]
pc 0xc012a3c2 0xc012a3c2 <__GI___fortify_fail+34>
fpcontrol 0x0 0
fpstatus 0x0 0
fpiaddr 0x0 0x0

So %a3 was a pointer into stack frame 6??

(gdb) x/z $a3
0xefee1068: 0xc00e0172

Clearly 0xd000c38e != 0xc00e0172 (that is, %fp@(-4) != %a3@) but did the
canary value change? It rather looks like the canary pointer is wrong...

Another way to find the value of %a3 during __wait3() execution is to look
at its initialization: moveal %a5@(108),%a3. And we can see from 'info
frame' above that __stack_chk_fail() saved %a5 at 0xefee1064.

(gdb) x/4z 0xefee1060
0xefee1060: 0xc0182c5e 0xc0198000 0xc00e0172 0xd001e718
(gdb) x/z *0xefee1064+108
0xc019806c: Cannot access memory at address 0xc019806c

Anyway, if the analysis is right (hopefully someone can confirm that)
this looks like a GCC bug.

I'm not sure why it only shows up during (sysvinit) init script execution.
The canary value is derived from /dev/urandom so I guess the failure is
intermittent because it is connected to kernel PRNG state during early
startup.

Finn Thain

unread,
Apr 1, 2023, 5:50:02 AM4/1/23
to
On 31 Mar 2023, I wrote,

> ...
>
> So %a3 was a pointer into stack frame 6??
>
> (gdb) x/z $a3
> 0xefee1068: 0xc00e0172
>
> Clearly 0xd000c38e != 0xc00e0172 (that is, %fp@(-4) != %a3@) but did the
> canary value change? It rather looks like the canary pointer is wrong...
>

So the question is, why is %a3 wrong? We know it was wrong because %a3 and
%fp got saved on the stack when __libc_message was called:

(gdb) info frame 5
Stack frame at 0xefee1044:
pc = 0xc009b328 in __libc_message (../sysdeps/posix/libc_fatal.c:155);
saved pc = 0xc012a3c2
called by frame at 0xefee1060, caller of frame at 0xefee0fd0
source language c.
Arglist at 0xefee103c, args: action=<optimized out>, fmt=<optimized out>
Locals at 0xefee103c, Previous frame's sp is 0xefee1044
Saved registers:
d2 at 0xefee100c, d3 at 0xefee1010, d4 at 0xefee1014, d5 at 0xefee1018,
d6 at 0xefee101c, d7 at 0xefee1020, a2 at 0xefee1024, a3 at 0xefee1028,
a4 at 0xefee102c, a5 at 0xefee1030, fp at 0xefee103c, pc at 0xefee1040
(gdb) x/z 0xefee1028
0xefee1028: 0xefee1068
(gdb) x/z 0xefee1068
0xefee1068: 0xc00e0172
(gdb) x/z 0xefee103c
0xefee103c: 0xefee10d8
(gdb) x/z 0xefee10d8-4
0xefee10d4: 0x00ac3dbe

That value for %fp@(-4) is the correct canary value (see below) showing
that there's no buffer overrun in __wait3, and also showing that %a3 was
wrong when the canary was to be validated.

> Another way to find the value of %a3 during __wait3() execution is to look
> at its initialization: moveal %a5@(108),%a3. And we can see from 'info
> frame' above that __stack_chk_fail() saved %a5 at 0xefee1064.
>
> (gdb) x/4z 0xefee1060
> 0xefee1060: 0xc0182c5e 0xc0198000 0xc00e0172 0xd001e718
> (gdb) x/z *0xefee1064+108
> 0xc019806c: Cannot access memory at address 0xc019806c
>

That was a mistake on my part. It should have been,

(gdb) x/z 0xefee1064
0xefee1064: 0xc0198000
(gdb) x/z 0xc0198000+108
0xc019806c: 0xc0023e8c

So that's the value %a3 was given at __wait3+24, and it looks fine:

(gdb) x/z 0xc0023e8c
0xc0023e8c <__stack_chk_guard>: 0x00ac3dbe

... the expected value for %a3 at __wait3+70 is <stack_chk_guard> as this
is used to validate the canary at %fp-4.

So, in summary, the canary validation failed in this case not because the
canary got clobbered but because %a3 got clobbered, somewhere between
__wait3+24 and __wait3+70 (below).

The call to __GI___wait4_time64 causes %a3 to be saved to and restored
from the stack, so stack corruption seems to be a strong possibility to
explain the change in %a3.

But if that's what happened, I'd expect __GI___wait4_time64 to report
stack smashing, not __wait3... And it just begs the question, what then
caused the corruption? Was it the wait4 syscall? Was it another thread?
And why is it so rare?

(gdb) disass __wait3
(gdb) disass __GI___wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e0174 <+0>: lea %sp@(-80),%sp
0xc00e0178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e017c <+8>: lea %pc@(0xc0198000),%a5
0xc00e0184 <+16>: movel %sp@(116),%d2
0xc00e0188 <+20>: moveal %sp@(124),%a2
0xc00e018c <+24>: moveal %a5@(108),%a3
0xc00e0190 <+28>: movel %a3@,%sp@(104)
0xc00e0194 <+32>: bsrl 0xc0052e2c <__m68k_read_tp@plt>
0xc00e019a <+38>: movel %a0@(-29920),%d4
0xc00e019e <+42>: bnew 0xc00e026c <__GI___wait4_time64+248>
0xc00e01a2 <+46>: tstl %a2
0xc00e01a4 <+48>: beqs 0xc00e01aa <__GI___wait4_time64+54>
0xc00e01a6 <+50>: moveq #32,%d4
0xc00e01a8 <+52>: addl %sp,%d4
0xc00e01aa <+54>: movel %sp@(120),%d3
0xc00e01ae <+58>: movel %sp@(112),%d1
0xc00e01b2 <+62>: moveq #114,%d0
0xc00e01b4 <+64>: trap #0
0xc00e01b6 <+66>: cmpil #-4096,%d0
0xc00e01bc <+72>: bhiw 0xc00e02a6 <__GI___wait4_time64+306>
0xc00e01c0 <+76>: tstl %d0
0xc00e01c2 <+78>: blew 0xc00e0256 <__GI___wait4_time64+226>
0xc00e01c6 <+82>: tstl %a2
0xc00e01c8 <+84>: beqw 0xc00e0256 <__GI___wait4_time64+226>
0xc00e01cc <+88>: movel %sp@(32),%a2@(4)
0xc00e01d2 <+94>: smi %d1
0xc00e01d4 <+96>: extbl %d1
0xc00e01d6 <+98>: movel %d1,%a2@
0xc00e01d8 <+100>: movel %sp@(36),%a2@(12)
0xc00e01de <+106>: smi %d1
0xc00e01e0 <+108>: extbl %d1
0xc00e01e2 <+110>: movel %d1,%a2@(8)
0xc00e01e6 <+114>: movel %sp@(40),%a2@(20)
0xc00e01ec <+120>: smi %d1
0xc00e01ee <+122>: extbl %d1
0xc00e01f0 <+124>: movel %d1,%a2@(16)
0xc00e01f4 <+128>: movel %sp@(44),%a2@(28)
0xc00e01fa <+134>: smi %d1
0xc00e01fc <+136>: extbl %d1
0xc00e01fe <+138>: movel %d1,%a2@(24)
0xc00e0202 <+142>: movel %sp@(48),%a2@(32)
0xc00e0208 <+148>: movel %sp@(52),%a2@(36)
0xc00e020e <+154>: movel %sp@(56),%a2@(40)
0xc00e0214 <+160>: movel %sp@(60),%a2@(44)
0xc00e021a <+166>: movel %sp@(64),%a2@(48)
0xc00e0220 <+172>: movel %sp@(68),%a2@(52)
0xc00e0226 <+178>: movel %sp@(72),%a2@(56)
0xc00e022c <+184>: movel %sp@(76),%a2@(60)
0xc00e0232 <+190>: movel %sp@(80),%a2@(64)
0xc00e0238 <+196>: movel %sp@(84),%a2@(68)
0xc00e023e <+202>: movel %sp@(88),%a2@(72)
0xc00e0244 <+208>: movel %sp@(92),%a2@(76)
0xc00e024a <+214>: movel %sp@(96),%a2@(80)
0xc00e0250 <+220>: movel %sp@(100),%a2@(84)
0xc00e0256 <+226>: moveal %sp@(104),%a0
0xc00e025a <+230>: movel %a3@,%d1
0xc00e025c <+232>: cmpl %a0,%d1
0xc00e025e <+234>: bnew 0xc00e02f2 <__GI___wait4_time64+382>
0xc00e0262 <+238>: moveml %sp@+,%d2-%d5/%a2-%a3/%a5
0xc00e0266 <+242>: lea %sp@(80),%sp
0xc00e026a <+246>: rts
0xc00e026c <+248>: bsrl 0xc00a1b88 <__GI___pthread_enable_asynccancel>
0xc00e0272 <+254>: movel %d0,%d5
0xc00e0274 <+256>: tstl %a2
0xc00e0276 <+258>: beqs 0xc00e02c4 <__GI___wait4_time64+336>
0xc00e0278 <+260>: moveq #32,%d4
0xc00e027a <+262>: addl %sp,%d4
0xc00e027c <+264>: movel %sp@(120),%d3
0xc00e0280 <+268>: movel %sp@(112),%d1
0xc00e0284 <+272>: moveq #114,%d0
0xc00e0286 <+274>: trap #0
0xc00e0288 <+276>: cmpil #-4096,%d0
0xc00e028e <+282>: bhis 0xc00e02c8 <__GI___wait4_time64+340>
0xc00e0290 <+284>: movel %d5,%sp@-
0xc00e0292 <+286>: movel %d0,%sp@(32)
0xc00e0296 <+290>: bsrl 0xc00a1bea <__GI___pthread_disable_asynccancel>
0xc00e029c <+296>: addql #4,%sp
0xc00e029e <+298>: movel %sp@(28),%d0
0xc00e02a2 <+302>: braw 0xc00e01c0 <__GI___wait4_time64+76>
0xc00e02a6 <+306>: movel %d0,%sp@(28)
0xc00e02aa <+310>: bsrl 0xc0052e2c <__m68k_read_tp@plt>
0xc00e02b0 <+316>: addal %a5@(2cf8),%a0
0xc00e02b8 <+324>: movel %sp@(28),%d0
0xc00e02bc <+328>: negl %d0
0xc00e02be <+330>: movel %d0,%a0@
0xc00e02c0 <+332>: moveq #-1,%d0
0xc00e02c2 <+334>: bras 0xc00e0256 <__GI___wait4_time64+226>
0xc00e02c4 <+336>: clrl %d4
0xc00e02c6 <+338>: bras 0xc00e027c <__GI___wait4_time64+264>
0xc00e02c8 <+340>: movel %d0,%sp@(28)
0xc00e02cc <+344>: bsrl 0xc0052e2c <__m68k_read_tp@plt>
0xc00e02d2 <+350>: addal %a5@(2cf8),%a0
0xc00e02da <+358>: movel %sp@(28),%d0
0xc00e02de <+362>: negl %d0
0xc00e02e0 <+364>: movel %d0,%a0@
0xc00e02e2 <+366>: movel %d5,%sp@-
0xc00e02e4 <+368>: bsrl 0xc00a1bea <__GI___pthread_disable_asynccancel>
0xc00e02ea <+374>: addql #4,%sp
0xc00e02ec <+376>: moveq #-1,%d0
0xc00e02ee <+378>: braw 0xc00e0256 <__GI___wait4_time64+226>
0xc00e02f2 <+382>: bsrl 0xc012a38c <__stack_chk_fail>
End of assembler dump.

Andreas Schwab

unread,
Apr 1, 2023, 6:40:04 AM4/1/23
to
On Apr 01 2023, Finn Thain wrote:

> So, in summary, the canary validation failed in this case not because the
> canary got clobbered but because %a3 got clobbered, somewhere between
> __wait3+24 and __wait3+70 (below).
>
> The call to __GI___wait4_time64 causes %a3 to be saved to and restored
> from the stack, so stack corruption seems to be a strong possibility to
> explain the change in %a3.
>
> But if that's what happened, I'd expect __GI___wait4_time64 to report
> stack smashing, not __wait3...

The stask smashing probably didn't fire in __wait4_time64, because it
hit the saved register area, not the canary (which reside on the
opposite ends of the stack frame).

Michael Schmitz

unread,
Apr 2, 2023, 12:20:04 AM4/2/23
to
Hi Finn,

nice work!

Am 01.04.2023 um 22:27 schrieb Finn Thain:
> So, in summary, the canary validation failed in this case not because the
> canary got clobbered but because %a3 got clobbered, somewhere between
> __wait3+24 and __wait3+70 (below).
>
> The call to __GI___wait4_time64 causes %a3 to be saved to and restored
> from the stack, so stack corruption seems to be a strong possibility to
> explain the change in %a3.
>
> But if that's what happened, I'd expect __GI___wait4_time64 to report
> stack smashing, not __wait3... And it just begs the question, what then
> caused the corruption? Was it the wait4 syscall? Was it another thread?

Saved registers are restored from the stack before return from
__GI___wait4_time64 but we don't know which of the two wait4 call sites
was used, do we?

What registers does __m68k_read_tp@plt clobber?

> And why is it so rare?

Maybe an interaction between (multiple?) signals and syscall return...
depends on how long we sleep in wait4, and whether a signal happens just
during that time.

%a3 is the first register saved to the switch stack BTW.

That kernel does contain Al Viro's patch that corrected our switch stack
handling in the signal return path? I wonder whether there's a potential
race lurking in there?

And I just notice that we had had trouble with a copy_to_user in
setup_frame() earlier (reason for my buserr handler patch). I wonder
whether something's gone wrong there. Do you get a segfault instead of
the abort signal if you drop my patch?

Cheers,

Michael

Finn Thain

unread,
Apr 2, 2023, 5:50:03 AM4/2/23
to
On Sun, 2 Apr 2023, Michael Schmitz wrote:

> Saved registers are restored from the stack before return from
> __GI___wait4_time64 but we don't know which of the two wait4 call sites
> was used, do we?
>
> What registers does __m68k_read_tp@plt clobber?
>

But that won't matter to the caller, __wait3, right?

I did check that %a3 was saved on entry, before any wait4 syscall or
__m68k_read_tp call etc. I also looked at the rts and %a3 did get restored
there. Is it worth the effort to trace every branch, in case there's some
way to reach an rts without having first restored the saved registers?

>
> Maybe an interaction between (multiple?) signals and syscall return...

When running dash from gdb in QEMU, there's only one signal (SIGCHLD) and
it gets handled before __wait3() returns. (Of course, the "stack smashing
detected" failure never shows up in QEMU.)

> depends on how long we sleep in wait4, and whether a signal happens just
> during that time.
>

I agree, there seems to be a race condition there. (And dash's waitproc()
seems to take pains to reap the child and handle the signal in any order.)
I wouldn't be surprised if this race somehow makes the failure rare.

I don't want to recompile any userland binaries at this stage, so it would
be nice if we could modify the kernel to keep track of exactly how that
race gets won and lost. Or perhaps there's an easy way to rig the outcome
one way or the other.

> %a3 is the first register saved to the switch stack BTW.
>
> That kernel does contain Al Viro's patch that corrected our switch stack
> handling in the signal return path? I wonder whether there's a potential
> race lurking in there?
>

I'm not sure which patch you're referring to, but I think Al's signal
handling work appeared in v5.15-rc4. I have reproduced the "stack smashing
detected" failure with v5.14.0 and with recent mainline (62bad54b26db from
March 30th).

> And I just notice that we had had trouble with a copy_to_user in
> setup_frame() earlier (reason for my buserr handler patch). I wonder
> whether something's gone wrong there. Do you get a segfault instead of
> the abort signal if you drop my patch?
>

Are you referring to e36a82bebbf7? I doubt that it's related. I believe
that copy_to_user is not involved here for the reason already given i.e.
wait3(status, flags, NULL) means wait4 gets a NULL pointer for the struct
rusage * parameter. Also, Stan first reported this failure in December
with v6.0.9.

Finn Thain

unread,
Apr 2, 2023, 7:10:03 AM4/2/23
to

On Sat, 1 Apr 2023, Andreas Schwab wrote:

> On Apr 01 2023, Finn Thain wrote:
>
> > So, in summary, the canary validation failed in this case not because
> > the canary got clobbered but because %a3 got clobbered, somewhere
> > between __wait3+24 and __wait3+70 (below).
> >
> > The call to __GI___wait4_time64 causes %a3 to be saved to and restored
> > from the stack, so stack corruption seems to be a strong possibility
> > to explain the change in %a3.
> >
> > But if that's what happened, I'd expect __GI___wait4_time64 to report
> > stack smashing, not __wait3...
>
> The stask smashing probably didn't fire in __wait4_time64, because it
> hit the saved register area, not the canary (which reside on the
> opposite ends of the stack frame).
>

OK.

This is odd:

https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1165#L1165

1176 do {
1177 gotsigchld = 0;
1178 do
1179 err = wait3(status, flags, NULL);
1180 while (err < 0 && errno == EINTR);
1181
1182 if (err || (err = -!block))
1183 break;
1184
1185 sigblockall(&oldmask);
1186
1187 while (!gotsigchld && !pending_sig)
1188 sigsuspend(&oldmask);
1189
1190 sigclearmask();
1191 } while (gotsigchld);
1192
1193 return err;

Execution of dash under gdb doesn't seem to agree with the source code
above.

If wait3() returns the child pid then the break should execute. And it
does return the pid (4107) but the while loop was not terminated. Hence
wait3() was called again and the same breakpoint was hit again. Also, the
while loop should have ended after the first iteration because gotsigchild
should have been set by the signal handler which executed before wait3()
even returned...

...
(gdb) c
Continuing.
#
#
# x=$(:)
[Detaching after fork from child process 4107]

Program received signal SIGCHLD, Child status changed.
0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff87a, options=2,
usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
35 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 3, waitproc (status=0xeffff86a, block=1) at jobs.c:1180
1180 jobs.c: No such file or directory.
(gdb) info locals
oldmask = {__val = {1101825, 3844132865, 2072969216, 192511, 4190371840,
4509697, 3836788738, 1049415681, 3837317121, 3094671359, 4184080384,
536870943, 717475840, 3485913089, 3836792833, 2072969216, 184321,
3844141055, 4190425089, 4127248385, 3094659084, 597610497, 4137734145,
3844079616, 131072, 269156352, 184320, 3878473729, 3844132865, 3094663168,
3549089793, 3844132865}}
flags = 2
err = 4107
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) print errno
$6 = 2
(gdb) c
Continuing.

Breakpoint 3, waitproc (status=0xeffff86a, block=0) at jobs.c:1180
1180 in jobs.c
(gdb) info locals
oldmask = {__val = {1101825, 3844132865, 2072969216, 192511, 4190371840,
4509697, 3836788738, 1049415681, 3837317121, 3094671359, 4184080384,
536870943, 717475840, 3485913089, 3836792833, 2072969216, 184321,
3844141055, 4190425089, 4127248385, 3094659084, 597610497, 4137734145,
3844079616, 131072, 269156352, 184320, 3878473729, 3844132865, 3094663168,
3549089793, 3844132865}}
flags = 3
err = -1
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) print errno
$7 = 10
(gdb)

Michael Schmitz

unread,
Apr 2, 2023, 6:10:04 PM4/2/23
to
Hi Finn,
I wonder whether line 1182 got miscompiled by gcc. As err == 4107 it's >
0 and the break clearly ought to have been taken, and the second
condition (which changes err) does not need to be examined.  Do the same
ordering constraints apply to '||' as to '&&' ?

What does the disassembly of this section look like?

> while loop should have ended after the first iteration because gotsigchild
> should have been set by the signal handler which executed before wait3()
> even returned...

Setting gotsigchild > 0 would cause the while loop to continue, no?

The second wait3 call then returns an error (errno ECHILD), leaves
gotsigchild clear and exits the loop (returning from waitproc with
error, which it shouldn't have).
Is that ENOENT the returned status from the child process?

Cheers,

    Michael

Michael Schmitz

unread,
Apr 3, 2023, 4:30:04 AM4/3/23
to
Hi Finn,

Am 02.04.2023 um 21:31 schrieb Finn Thain:
> On Sun, 2 Apr 2023, Michael Schmitz wrote:
>
>> Saved registers are restored from the stack before return from
>> __GI___wait4_time64 but we don't know which of the two wait4 call sites
>> was used, do we?
>>
>> What registers does __m68k_read_tp@plt clobber?
>>
>
> But that won't matter to the caller, __wait3, right?

Not if they are properly restored ... best not go there.

> I did check that %a3 was saved on entry, before any wait4 syscall or
> __m68k_read_tp call etc. I also looked at the rts and %a3 did get restored
> there. Is it worth the effort to trace every branch, in case there's some
> way to reach an rts without having first restored the saved registers?

No, I dont think that's possible - from inspection, I now see
__GI___wait4_time64 does not allow that, and I think the same is true
for wait3 (haven't spent quite long enough on that).

>
>>
>> Maybe an interaction between (multiple?) signals and syscall return...
>
> When running dash from gdb in QEMU, there's only one signal (SIGCHLD) and
> it gets handled before __wait3() returns. (Of course, the "stack smashing
> detected" failure never shows up in QEMU.)

Might be a clue that we need multiple signals to force the stack
smashing error. And we might not get that in QEMU, due to the faster
execution in emulating on a modern processor.

Thinking a bit more about interactions between signal delivery and
syscall return, it turns out that we don't check for pending signals
when returning from a syscall. That's OK on SMP systems, because we
don't have another process running while we execute the syscall (and we
_do_ run signal handling when scheduling, i.e. when wait4 sleeps or is
woken up)?

Seems we can forget about that interaction then.

>
>> depends on how long we sleep in wait4, and whether a signal happens just
>> during that time.
>>
>
> I agree, there seems to be a race condition there. (And dash's waitproc()
> seems to take pains to reap the child and handle the signal in any order.)

Yes, it makes sure the SIGCHLD is seen no matter in what order the
signals are delivered ...

> I wouldn't be surprised if this race somehow makes the failure rare.
>
> I don't want to recompile any userland binaries at this stage, so it would
> be nice if we could modify the kernel to keep track of exactly how that
> race gets won and lost. Or perhaps there's an easy way to rig the outcome
> one way or the other.

A race between syscall return due to child exit and signal delivery
seems unlikely, but maybe there is a race between syscall return due to
a timer firing and signal delivery. Are there any timers set to
periodically interrupt wait3?

>
>> %a3 is the first register saved to the switch stack BTW.
>>
>> That kernel does contain Al Viro's patch that corrected our switch stack
>> handling in the signal return path? I wonder whether there's a potential
>> race lurking in there?
>>
>
> I'm not sure which patch you're referring to, but I think Al's signal
> handling work appeared in v5.15-rc4. I have reproduced the "stack smashing

I have it in 5.15-rc2 in my tree but that's probably from my running
tests on that patch series.

> detected" failure with v5.14.0 and with recent mainline (62bad54b26db from
> March 30th).

OK, so it's not related (or the patch did not fix all the problems with
multiple signals, but a) that's unlikely and b) signals during wait4
should not matter, see above).

So the fact that %a3 is involved here is probably just coincidence.

>
>> And I just notice that we had had trouble with a copy_to_user in
>> setup_frame() earlier (reason for my buserr handler patch). I wonder
>> whether something's gone wrong there. Do you get a segfault instead of
>> the abort signal if you drop my patch?
>>
>
> Are you referring to e36a82bebbf7? I doubt that it's related. I believe
> that copy_to_user is not involved here for the reason already given i.e.
> wait3(status, flags, NULL) means wait4 gets a NULL pointer for the struct
> rusage * parameter. Also, Stan first reported this failure in December
> with v6.0.9.

Can't be related then.

Still no nearer to a solution - something smashes the stack near %sp,
causes the %a3 register restore after __GI___wait4_time64 to return a
wrong pointer to the stack canary, and triggers a stack smashing warning
in this indirect way. But what??

Cheers,

Michael

Finn Thain

unread,
Apr 3, 2023, 8:30:03 PM4/3/23
to
AFAICT, the source code is valid. This article has some information:
https://stackoverflow.com/questions/628526/is-short-circuiting-logical-operators-mandated-and-evaluation-order

It looks like I messed up. waitproc() appears to have been invoked
twice, which is why wait3 was invoked twice...

GNU gdb (Debian 13.1-2) 13.1
...
(gdb) set osabi GNU/Linux
(gdb) file /bin/dash
Reading symbols from /bin/dash...
Reading symbols from /usr/lib/debug/.build-id/aa/4160f84f3eeee809c554cb9f3e1ef0686b8dcc.debug...
(gdb) b waitproc
Breakpoint 1 at 0xc346: file jobs.c, line 1168.
(gdb) b jobs.c:1180
Breakpoint 2 at 0xc390: file jobs.c, line 1180.
(gdb) run
Starting program: /usr/bin/dash
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/m68k-linux-gnu/libthread_db.so.1".
# x=$(:)
[Detaching after fork from child process 570]

Breakpoint 1, waitproc (status=0xeffff86a, block=1) at jobs.c:1168
1168 jobs.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 2, waitproc (status=0xeffff86a, block=1) at jobs.c:1180
1180 in jobs.c
(gdb) info locals
oldmask = {__val = {1997799424, 49154, 396623872, 184321, 3223896090, 53249,
3836788738, 1049411610, 867225601, 3094609920, 0, 1048580, 2857693183,
4184129547, 3435708442, 863764480, 184321, 3844141055, 4190425089,
4127248385, 3094659084, 597610497, 4135112705, 3844079616, 131072,
37355520, 184320, 3878473729, 3844132865, 3094663168, 3549089793,
3844132865}}
flags = 2
err = 570
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) c
Continuing.

Breakpoint 1, waitproc (status=0xeffff86a, block=0) at jobs.c:1168
1168 in jobs.c
(gdb) c
Continuing.

Breakpoint 2, waitproc (status=0xeffff86a, block=0) at jobs.c:1180
1180 in jobs.c
(gdb) info locals
oldmask = {__val = {1997799424, 49154, 396623872, 184321, 3223896090, 53249,
3836788738, 1049411610, 867225601, 3094609920, 0, 1048580, 2857693183,
4184129547, 3435708442, 863764480, 184321, 3844141055, 4190425089,
4127248385, 3094659084, 597610497, 4135112705, 3844079616, 131072,
37355520, 184320, 3878473729, 3844132865, 3094663168, 3549089793,
3844132865}}
flags = 3
err = -1
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) c
Continuing.
#

> What does the disassembly of this section look like?
>
> > while loop should have ended after the first iteration because gotsigchild
> > should have been set by the signal handler which executed before wait3()
> > even returned...
>
> Setting gotsigchild > 0 would cause the while loop to continue, no?
>

Right. Sorry for the noise.

Finn Thain

unread,
Apr 4, 2023, 12:20:02 AM4/4/23
to
On Mon, 3 Apr 2023, Michael Schmitz wrote:

> Am 02.04.2023 um 21:31 schrieb Finn Thain:
> >
> >>
> >> Maybe an interaction between (multiple?) signals and syscall
> >> return...
> >
> > When running dash from gdb in QEMU, there's only one signal (SIGCHLD)
> > and it gets handled before __wait3() returns. (Of course, the "stack
> > smashing detected" failure never shows up in QEMU.)
>
> Might be a clue that we need multiple signals to force the stack
> smashing error. And we might not get that in QEMU, due to the faster
> execution in emulating on a modern processor.
>

Right -- being that the failure is intermittent on real hardware, it's not
surprising that I can't make it show up in QEMU or Aranym.

But no-one has reproduced it on Atari or Amiga hardware yet so I guess it
could be a driver issue...

I wonder whether anyone else is actually running recent Debian/SID with
sysvinit and without a Debian initrd on a Motorola 68030 system.

> Thinking a bit more about interactions between signal delivery and
> syscall return, it turns out that we don't check for pending signals
> when returning from a syscall. That's OK on SMP systems, because we
> don't have another process running while we execute the syscall (and we
> _do_ run signal handling when scheduling, i.e. when wait4 sleeps or is
> woken up)?
>
> Seems we can forget about that interaction then.
>
> >
> >> depends on how long we sleep in wait4, and whether a signal happens
> >> just during that time.
> >>
> >
> > I agree, there seems to be a race condition there. (And dash's
> > waitproc() seems to take pains to reap the child and handle the signal
> > in any order.)
>
> Yes, it makes sure the SIGCHLD is seen no matter in what order the
> signals are delivered ...
>
> > I wouldn't be surprised if this race somehow makes the failure rare.
> >
> > I don't want to recompile any userland binaries at this stage, so it
> > would be nice if we could modify the kernel to keep track of exactly
> > how that race gets won and lost. Or perhaps there's an easy way to rig
> > the outcome one way or the other.
>
> A race between syscall return due to child exit and signal delivery
> seems unlikely, but maybe there is a race between syscall return due to
> a timer firing and signal delivery. Are there any timers set to
> periodically interrupt wait3?
>

I searched the source code and SIGALRM appears to be unused by dash. And
'timeout' is not a dash builtin. But that doesn't mean we don't get
multiple signals. One crashy script looks like this:

TMPFS_SIZE="$(tmpfs_size_vm "$TMPFS_SIZE")"
RUN_SIZE="$(tmpfs_size_vm "$RUN_SIZE")"
LOCK_SIZE="$(tmpfs_size_vm "$LOCK_SIZE")"
SHM_SIZE="$(tmpfs_size_vm "$SHM_SIZE")"
TMP_SIZE="$(tmpfs_size_vm "$TMP_SIZE")"

Is it possible that the SIGCHLD from the first sub-shell got delayed?

>
> Still no nearer to a solution - something smashes the stack near %sp,
> causes the %a3 register restore after __GI___wait4_time64 to return a
> wrong pointer to the stack canary, and triggers a stack smashing warning
> in this indirect way. But what??
>

I've no idea.

The actual corruption might offer a clue here. I believe the saved %a3 was
clobbered with the value 0xefee1068 which seems to be a pointer into some
stack frame that would have come into existence shortly after
__GI___wait4_time64 was called. That stack frame is gone by the time the
core dump was made. Was it dash's signal handler, onsig(), or some libc
subroutine called by __GI___wait4_time64(), or was it something that the
kernel put there?

Dash's SIGCHLD handler looks safe enough -- I don't see how it could
corrupt the saved registers in the __GI___wait4_time64 stack frame (it's
not like 1 was stored in the wrong place).
https://sources.debian.org/src/dash/0.5.12-2/src/trap.c/?hl=285#L285

Finn Thain

unread,
Apr 4, 2023, 7:20:04 AM4/4/23
to
On Tue, 4 Apr 2023, I wrote:

>
> The actual corruption might offer a clue here. I believe the saved %a3
> was clobbered with the value 0xefee1068 which seems to be a pointer into
> some stack frame that would have come into existence shortly after
> __GI___wait4_time64 was called.

Wrong... it is a pointer to the location below the __wait3 stack frame.

(gdb) info frame
Stack level 8, frame at 0xefee10e0:
pc = 0xc00e0172 in __wait3 (../sysdeps/unix/sysv/linux/wait3.c:41);
saved pc = 0xd000c38e
called by frame at 0xefee11dc, caller of frame at 0xefee106c
source language c.
Arglist at 0xefee10d8, args: stat_loc=<optimized out>,
options=<optimized out>, usage=<optimized out>
Locals at 0xefee10d8, Previous frame's sp is 0xefee10e0
Saved registers:
a2 at 0xefee106c, a3 at 0xefee1070, a5 at 0xefee1074, fp at 0xefee10d8,
pc at 0xefee10dc

That shows %a2 was saved at 0xefee106c, and the address of interest is the
stack location immediately below that. But it has no particular
significance: it holds a NULL pointer when the struct __rusage64 *usage
argument to __wait4_time64() gets pushed there:

0xc00e8152 <__wait3+226>: clrl %sp@-
0xc00e8154 <__wait3+228>: movel %fp@(12),%sp@-
0xc00e8158 <__wait3+232>: movel %d0,%sp@-
0xc00e815a <__wait3+234>: pea 0xffffffff
0xc00e815e <__wait3+238>: bsrl 0xc00e8174 <__GI___wait4_time64>

But it's no longer a NULL pointer at the time of the crash, though it
should be, since that stack frame is still active.

(gdb) x/16z 0xefee1068
0xefee1068: 0xc00e0172 0xd001e718 0xd001e498 0xd001b874
0xefee1078: 0x00170700 0x00170700 0x00170700 0x00005360
0xefee1088: 0x0000e920 0x00000006 0x00002000 0x00000002
0xefee1098: 0x00171f20 0x00171f20 0x00171f20 0x000000e0

Beats me.

Michael Schmitz

unread,
Apr 4, 2023, 7:30:03 PM4/4/23
to
Hi Finn,
That means we may well see both signals delivered at the same time if
the parent shell wasn't scheduled to run until the second subshell
terminated (answering the question I was about to ask on your other
mail, the one about the crashy script with multiple subshells).

Now does waitproc() handle that case correctly? The first signal
delivered results in err == child PID so the break is taken, causing
exit from waitproc(). Does waitproc() get called repeatedly until an
error is returned?

Cheers,

    Michael

Finn Thain

unread,
Apr 4, 2023, 10:20:03 PM4/4/23
to
How is that possible? If the parent does not get scheduled, the second
fork will not take place.

> Now does waitproc() handle that case correctly? The first signal
> delivered results in err == child PID so the break is taken, causing
> exit from waitproc().

I don't follow. Can you rephrase that perhaps?

For a single subshell, the SIGCHLD signal can be delivered before wait4 is
called or after it returns. For example, $(sleep 5) seems to produce the
latter whereas $(:) tends to produce the former.

> Does waitproc() get called repeatedly until an error is returned?
>

It's complicated...
https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1122#L1122

I don't care that much what dash does as long as it isn't corrupting it's
own stack, which is a real possibility, and one which gdb's data watch
point would normally resolve. And yet I have no way to tackle that.

I've been running gdb under QEMU, where the failure is not reproducible.
Running dash under gdb on real hardware is doable (RAM permitting). But
the failure is intermittent even then -- it only happens during execution
of certain init scripts, and I can't reproduce it by manually running
those scripts.

(Even if I could reproduce the failure under gdb, instrumenting execution
in gdb can alter timing in undesirable ways...)

So, again, the best avenue I can think of for such experiments to modify
the kernel to either keep track of the times of the wait4 syscalls and
signal delivery and/or push the timing one way or the other e.g. by
delaying signal delivery, altering scheduler behaviour, etc. But I don't
have code for that. I did try adding random delays around kernel_wait4()
but it didn't have any effect...

Michael Schmitz

unread,
Apr 6, 2023, 10:00:03 PM4/6/23
to
Hi Finn,
I assumed subshells could run asynchronously, and the parent shell
continue until it hits a statement that needs the result of one of the
subshells.

What is the point of subshells, if not to allow this?

>
>> Now does waitproc() handle that case correctly? The first signal
>> delivered results in err == child PID so the break is taken, causing
>> exit from waitproc().
>
> I don't follow. Can you rephrase that perhaps?

The first subshell exiting causes the wait3() to return, and the return
code is > 0 (child PID). The break statement executes and waitproc()
returns that PID.

I was wondering how multiple child processes exiting would be handled,
but I now see that repeatedly calling waitproc() until all outstanding
jobs have completed is the only way. dash implements this method, so it
must expect multiple jobs / subshells to run concurrently.

> For a single subshell, the SIGCHLD signal can be delivered before wait4 is
> called or after it returns. For example, $(sleep 5) seems to produce the
> latter whereas $(:) tends to produce the former.

I don't think wait4 can return with success before SIGCHLD has been
delivered. Delivery of SIGCHLD is what makes the wait syscalls unblock
as far as I understand.

>> Does waitproc() get called repeatedly until an error is returned?
>>
>
> It's complicated...

Yep, but I'm now satisfied that is the only way ... sorry for the noise.

> https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1122#L1122
>
> I don't care that much what dash does as long as it isn't corrupting it's
> own stack, which is a real possibility, and one which gdb's data watch
> point would normally resolve. And yet I have no way to tackle that.
>
> I've been running gdb under QEMU, where the failure is not reproducible.
> Running dash under gdb on real hardware is doable (RAM permitting). But
> the failure is intermittent even then -- it only happens during execution
> of certain init scripts, and I can't reproduce it by manually running
> those scripts.
>
> (Even if I could reproduce the failure under gdb, instrumenting execution
> in gdb can alter timing in undesirable ways...)
>
> So, again, the best avenue I can think of for such experiments to modify
> the kernel to either keep track of the times of the wait4 syscalls and

The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages go
to the serial console though. Is that what you have in mind?

> signal delivery and/or push the timing one way or the other e.g. by
> delaying signal delivery, altering scheduler behaviour, etc. But I don't
> have code for that. I did try adding random delays around kernel_wait4()
> but it didn't have any effect...
>

I wonder whether it's possible to delay process exit (and parent process
signaling) by placing the exit syscall on a timer workqueue. But the
same effect could be had by inserting a sleep before subshell exit ...

And causing a half-dead task to schedule in order to delay signaling
doesn't seem safe to me ...

Cheers,

Michael

Finn Thain

unread,
Apr 7, 2023, 12:22:28 AM4/7/23
to
On Fri, 7 Apr 2023, Michael Schmitz wrote:

> Am 05.04.2023 um 14:00 schrieb Finn Thain:
> > On Wed, 5 Apr 2023, Michael Schmitz wrote:
> >
> >> That means we may well see both signals delivered at the same time if
> >> the parent shell wasn't scheduled to run until the second subshell
> >> terminated (answering the question I was about to ask on your other
> >> mail, the one about the crashy script with multiple subshells).
> >>
> >
> > How is that possible? If the parent does not get scheduled, the second
> > fork will not take place.
>
> I assumed subshells could run asynchronously, and the parent shell
> continue until it hits a statement that needs the result of one of the
> subshells.
>

That would be nice but I don't think dash is so sophisticated as to keep
track of data dependencies between the various expressions and commands in
a shell script.

> What is the point of subshells, if not to allow this?
>

$ x=$(exit 123)
$ echo $?
123

$ set -e
$ x=$(false)

Anyway, my gut says that we're barking up the wrong tree. My recent tests
show that the failure is not uniformly random. Either the script fails
often or it fails not at all. It's as if there was some unknown variable
that caused dash to corrupt its own stack.

> > Running dash under gdb on real hardware is doable (RAM permitting).
> > But the failure is intermittent even then -- it only happens during
> > execution of certain init scripts, and I can't reproduce it by
> > manually running those scripts.
> >
> > (Even if I could reproduce the failure under gdb, instrumenting
> > execution in gdb can alter timing in undesirable ways...)
> >
> > So, again, the best avenue I can think of for such experiments to
> > modify the kernel to either keep track of the times of the wait4
> > syscalls and
>
> The easiest way to do that is to log all wait and signal syscalls, as
> well as process exit. That might alter timing if these log messages go
> to the serial console though. Is that what you have in mind?
>

What I had in mind was collecting measurements in such way that would not
impact timing, perhaps by storing them somewhere they could be retrieved
from the process core dump.

But that's probably not realistic and it's probably pointless anyway -- I
don't expect to find an old bug in common code like kernel/exit.c, or in a
hot path like those in arch/m68k/kernel/entry.S.

More likely is that some kind of bug in dash causes it to corrupt its own
stack when conditions are just right. I just need to figure out how to
recreate those conditions. :-/

When dash is feeling crashy, you can get results like this:

root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~#

But when it's not feeling crashy, you can't:

root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.

The only way I have found to alter dash's inclination to crash is to
reboot. (I said previously I was unable to reproduce this in a single user
mode shell but it turned out to be more subtle.)

Geert Uytterhoeven

unread,
Apr 7, 2023, 8:10:04 AM4/7/23
to
Hi Michael,

On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz <schmi...@gmail.com> wrote:
> The easiest way to do that is to log all wait and signal syscalls, as
> well as process exit. That might alter timing if these log messages go
> to the serial console though. Is that what you have in mind?

Store to RAM, retrieve through a new /proc file?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Geert Uytterhoeven

unread,
Apr 7, 2023, 8:20:04 AM4/7/23
to
Hi Finn,
That sounds like memory corruption somewhere else, e.g.
in the buffer cache...

Can you reproduce with CONFIG_DEBUG_SLAB=y?

Finn Thain

unread,
Apr 8, 2023, 1:50:05 AM4/8/23
to
On Fri, 7 Apr 2023, Geert Uytterhoeven wrote:

> >
> > The only way I have found to alter dash's inclination to crash is to
> > reboot. (I said previously I was unable to reproduce this in a single
> > user mode shell but it turned out to be more subtle.)
>
> That sounds like memory corruption somewhere else, e.g. in the buffer
> cache...
>

If so, once the corruption showed up, you would expect the same crash next
time...

root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# echo 3 > /proc/sys/vm/drop_caches
[ 937.250000] bash (717): drop_caches: 3
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
*** stack smashing detected ***: terminated
Aborted (core dumped)

I'd say it's probably not buffer cache corruption causing this because we
can see two subshells fail, then just one.

For that build I enabled SLUB_DEBUG but forgot to enable SLUB_DEBUG_ON --

CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_DEBUG_ON is not set
CONFIG_PAGE_POISONING=y
CONFIG_SCHED_STACK_END_CHECK=y

Below you can see the console log for a build with CONFIG_SLUB_DEBUG_ON=y.

BTW, these builds have a hacked RNG which produces only zeros, to try to
get more consistent behaviour. It seems to have made no difference. None
of the kernel changes I've made seem to make much difference, besides
perturbing timing slightly. I tend to think this is a dash bug that's
confined to slow machines only because it's timing sensitive.

ABCFGHIJK
[ 0.000000] Linux version 6.3.0-rc5-mac-00001-gfaf1ca712d79 (fthain@nippy) (m68k-unknown-linux-musl-gcc (Gentoo 12.2.1_p20221224 p7) 12.2.1 20221224, GNU ld (Gentoo 2.39 p5) 2.39.0) #4 Sat Apr 8 14:25:16 AEST 2023
[ 0.000000] Saving 402 bytes of bootinfo
[ 0.000000] printk: debug: ignoring loglevel setting.
[ 0.000000] printk: bootconsole [debug0] enabled
[ 0.000000] Detected Macintosh model: 27
[ 0.000000] Penguin bootinfo data:
[ 0.000000] Video: addr 0x60b00000 row 0x50 depth 1 dimensions 640 x 480
[ 0.000000] Videological 0xf0300000 phys. 0x60b00000, SCC at 0x50f04000
[ 0.000000] Boottime: 0x73e GMTBias: 0x0
[ 0.000000] Machine ID: 27 CPUid: 0x1 memory size: 0x24
[ 0.000000] Apple Macintosh LC III
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000000000-0x00000023ffffffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x00000000023fffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000000023fffff]
[ 0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[ 0.000000] pcpu-alloc: [0] 0
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 9135
[ 0.000000] Kernel command line: console=ttyS0 ignore_loglevel ro earlyprintk fbcon=font:ProFont6x11 coherent_pool=4k root=/dev/sda6 single
[ 0.000000] Unknown kernel command line parameters "single", will be passed to user space.
[ 0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes, linear)
[ 0.000000] Sorting __ex_table...
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 31008K/36864K available (4193K kernel code, 283K rwdata, 808K rodata, 132K init, 109K bss, 5856K reserved, 0K cma-reserved)
[ 0.000000] **********************************************************
[ 0.000000] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.000000] ** **
[ 0.000000] ** This system shows unhashed kernel memory addresses **
[ 0.000000] ** via the console, logs, and other interfaces. This **
[ 0.000000] ** might reduce the security of your system. **
[ 0.000000] ** **
[ 0.000000] ** If you see this message and you are not debugging **
[ 0.000000] ** the kernel, report this immediately to your system **
[ 0.000000] ** administrator! **
[ 0.000000] ** **
[ 0.000000] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.000000] **********************************************************
[ 0.000000] SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] NR_IRQS: 72
[ 0.000000] clocksource: via1: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 2439823894983 ns
[ 0.050000] Console: colour dummy device 80x25
[ 0.060000] printk: console [ttyS0] enabled
[ 0.060000] printk: console [ttyS0] enabled
[ 0.080000] printk: bootconsole [debug0] disabled
[ 0.080000] printk: bootconsole [debug0] disabled
[ 0.110000] Calibrating delay loop... 5.83 BogoMIPS (lpj=29184)
[ 0.270000] pid_max: default: 32768 minimum: 301
[ 0.320000] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.340000] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.730000] devtmpfs: initialized
[ 1.060000] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 1.070000] futex hash table entries: 256 (order: -1, 3072 bytes, linear)
[ 1.280000] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[ 1.340000] DMA: preallocated 4 KiB GFP_KERNEL pool for atomic allocations
[ 1.360000] DMA: preallocated 4 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[ 1.940000] NuBus: Scanning NuBus slots.
[ 2.180000] SCSI subsystem initialized
[ 2.250000] libata version 3.00 loaded.
[ 2.510000] clocksource: Switched to clocksource via1
[ 5.540000] NET: Registered PF_INET protocol family
[ 5.590000] IP idents hash table entries: 2048 (order: 2, 16384 bytes, linear)
[ 5.780000] tcp_listen_portaddr_hash hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 5.790000] Table-perturb hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 5.810000] TCP established hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 5.820000] TCP bind hash table entries: 1024 (order: 1, 8192 bytes, linear)
[ 5.830000] TCP: Hash tables configured (established 1024 bind 1024)
[ 5.880000] UDP hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 5.890000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 5.930000] NET: Registered PF_UNIX/PF_LOCAL protocol family
[ 6.090000] RPC: Registered named UNIX socket transport module.
[ 6.110000] RPC: Registered udp transport module.
[ 6.120000] RPC: Registered tcp transport module.
[ 6.130000] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 6.430000] workingset: timestamp_bits=30 max_order=13 bucket_order=0
[ 6.910000] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[ 6.920000] io scheduler mq-deadline registered
[ 6.930000] macfb: framebuffer at 0x60b00000, mapped to 0xd0000000, size 37k
[ 6.940000] macfb: mode is 640x480x1, linelength=80
[ 8.510000] Console: switching to mono frame buffer device 106x43
[ 9.670000] fb0: Sonora frame buffer device
[ 9.750000] scc.0: ttyS0 at MMIO 0x50f04002 (irq = 4, base_baud = 230400) is a Z85c30 ESCC - Serial port
[ 9.820000] scc.1: ttyS1 at MMIO 0x50f04000 (irq = 4, base_baud = 230400) is a Z85c30 ESCC - Serial port
[ 9.950000] Non-volatile memory driver v1.3
[ 9.980000] SWIM floppy driver Version 0.2 (2008-10-30)
[ 10.670000] brd: module loaded
[ 10.690000] cuda_input: 01 00 03 7c 25 b7 d6
[ 10.710000] Macintosh Cuda and Egret driver.
[ 11.270000] scsi host0: Macintosh NCR5380 SCSI, irq 19, io_port 0x0, base 0x50f10000, can_queue 16, cmd_per_lun 2, sg_tablesize 1, this_id 7, flags { }
[ 11.420000] scsi 0:0:0:0: Direct-Access codesrc SCSI2SD 4.2 PQ: 0 ANSI: 2
[ 11.800000] adb device [2]: 2 0x2
[ 11.830000] adb device [3]: 3 0x1
[ 11.880000] ADB keyboard at 2 has handler 0x3
[ 11.900000] Detected ADB keyboard, type ANSI.
[ 11.960000] input: ADB keyboard as /devices/virtual/input/input0
[ 12.180000] ADB mouse (standard) at 3 has handler 0x2
[ 12.240000] input: ADB mouse as /devices/virtual/input/input1
[ 13.720000] sd 0:0:0:0: [sda] 4194303 512-byte logical blocks: (2.15 GB/2.00 GiB)
[ 13.750000] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 13.800000] mac8390 slot.E: Memory length resource not found, probing
[ 13.810000] mac8390 slot.E: Ethernet A-Series (type farallon)
[ 13.820000] mac8390 slot.E: MAC 00:40:10:07:60:c8, IRQ 61, 32 KB shared memory at 0xfeed0000, 32-bit access.
[ 14.150000] sd 0:0:0:0: [sda] Write Protect is off
[ 14.160000] sd 0:0:0:0: [sda] Mode Sense: 8d 00 00 08
[ 14.180000] sd 0:0:0:0: [sda] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[ 14.480000] sda: [mac] sda1 sda2 sda3 sda4 sda5 sda6 sda7
[ 14.980000] sd 0:0:0:0: [sda] Attached SCSI disk
[ 15.050000] aoe: AoE v85 initialised.
[ 15.190000] mousedev: PS/2 mouse device common for all mice
[ 15.320000] rtc-generic rtc-generic: registered as rtc0
[ 15.410000] NET: Registered PF_PACKET protocol family
[ 15.440000] printk: bootconsole [debug0] disabled
[ 21.960000] EXT4-fs (sda6): mounting ext2 file system using the ext4 subsystem
[ 22.620000] EXT4-fs (sda6): mounted filesystem d9738810-a43e-43d5-8eea-fd50d23f69b4 without journal. Quota mode: disabled.
[ 22.650000] VFS: Mounted root (ext2 filesystem) readonly on device 8:6.
[ 22.690000] Freeing unused kernel image (initmem) memory: 132K
[ 22.700000] This architecture does not have kernel memory protection.
[ 22.710000] Run /sbin/init as init process
[ 22.720000] with arguments:
[ 22.730000] /sbin/init
[ 22.740000] single
[ 22.750000] with environment:
[ 22.760000] HOME=/
[ 22.770000] TERM=linux
INIT: version 3.06 booting
INIT: No inittab.d directory found
[ 40.210000] random: crng init done
Setting hostname to 'debian'...done.
findfs: unable to resolve 'UUID=d9738810-a43e-43d5-8eea-fd50d23f69b4'
Files under mount point '/run' will be hidden. ... (warning).
Starting hotplug events dispatcher: systemd-udevd.
Synthesizing the initial hotplug events (subsystems)...done.
Synthesizing the initial hotplug events (devices)...done.
Waiting for /dev to be fully populated...Failed to wait for daemon to reply: Connection timed out
done (timeout).
Press Enter for maintenance
(or press Control-D to continue):
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~#
exit
Activating swap:swapon: /dev/sda5: found signature [pagesize=4096, signature=swap]
swapon: /dev/sda5: pagesize=4096, swapsize=268435456, devsize=268435456
swapon /dev/sda5
[ 1768.650000] Adding 262140k swap on /dev/sda5. Priority:-2 extents:1 across:262140k
.
[ 1778.640000] EXT4-fs (sda6): re-mounted d9738810-a43e-43d5-8eea-fd50d23f69b4. Quota mode: disabled.
Will now check root file system:[/sbin/fsck.ext2 (1) -- /] fsck.ext2 -a -C0 /dev/sda6
/dev/sda6: clean, 24004/425152 files, 234859/424872 blocks
.
[ 1797.940000] EXT4-fs (sda6): re-mounted d9738810-a43e-43d5-8eea-fd50d23f69b4. Quota mode: disabled.
*** stack smashing detected ***: terminated
Aborted (core dumped)
*** stack smashing detected ***: terminated
Aborted (core dumped)
Will now check all file systems.
Checking all file systems.
UUID=d9738810-a43e-43d5-8eea-fd50d23f69b4 is mounted
Done checking file systems.
Log is being saved in /var/log/fsck/checkfs if that location is writable.
Cleaning up temporary files...Cleaning /tmp...done.
/tmp.
Will now mount local filesystems:.
Will now activate swapfile swap, if any:swapon: /dev/sda5: already active -- ignored
done.
Checking minimum space in /tmp...done.
Cleaning up temporary files....
Starting Setting kernel variables: sysctl.
Initializing random number generator...done.
Configuring network interfaces...done.
Starting nftables: nonemnl.c:60: Unable to initialize Netlink socket: Protocol not supported
failed!
Cleaning up temporary files....
Press Enter for maintenance
(or press Control-D to continue):
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# reboot
INIT: Switching to runlevel: 6
INIT: No inittab.d directory found
INIT: Sending processes configured via /etc/inittab the TERM signal
INIT: Sending processes configured via /etc/inittab the KILL signal
INIT: Pid 1006 [id ~~] seems to hang
Stopping Flexible I/O Tester as service: fio is not running.
Stopping nftables: nonemnl.c:60: Unable to initialize Netlink socket: Protocol not supported
failed!
Asking all remaining processes to terminate...done.
All processes ended within 1 seconds...done.
Stopping hotplug events dispatcher: systemd-udevd.
Saving random seed...done.
Stopping enhanced syslogd: rsyslogd already stopped.
Saving the system clock to /dev/rtc0.
Hardware Clock updated to Thu Jan 1 11:13:47 AEST 1970.
Deconfiguring network interfaces...done.
Will now deactivate swap:swapoff /dev/sda5
.
Mounting root filesystem read-only...[ 2590.610000] EXT4-fs (sda6): re-mounted d9738810-a43e-43d5-8eea-fd50d23f69b4. Quota mode: disabled.
done.
Will now restart.
[ 2597.720000] reboot: Restarting system

Michael Schmitz

unread,
Apr 8, 2023, 10:00:03 PM4/8/23
to
Hi Geert,

Am 08.04.2023 um 00:06 schrieb Geert Uytterhoeven:
> Hi Michael,
>
> On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz <schmi...@gmail.com> wrote:
>> The easiest way to do that is to log all wait and signal syscalls, as
>> well as process exit. That might alter timing if these log messages go
>> to the serial console though. Is that what you have in mind?
>
> Store to RAM, retrieve through a new /proc file?

Yes, that could be done, though I'd rather avoid duplicating a lot of
the generic message formatting code (printk and friends).

I'll have a look around ...

Cheers,

Michael

>
> Gr{oetje,eeting}s,
>
> Geert
>

Michael Schmitz

unread,
Apr 8, 2023, 11:50:03 PM4/8/23
to
Hi Finn,

Am 07.04.2023 um 16:03 schrieb Finn Thain:
>>> So, again, the best avenue I can think of for such experiments to
>>> modify the kernel to either keep track of the times of the wait4
>>> syscalls and
>>
>> The easiest way to do that is to log all wait and signal syscalls, as
>> well as process exit. That might alter timing if these log messages go
>> to the serial console though. Is that what you have in mind?
>>
>
> What I had in mind was collecting measurements in such way that would not
> impact timing, perhaps by storing them somewhere they could be retrieved
> from the process core dump.

Storing such information to the process core dump would need
modifications to dash (allocate log buffer and pass the address to the
kernel) in addition to the kernel. Taking Geert's suggestions to have
the kernel allocate a log buffer (similar to the kernel message log) and
read that out through procfs after the tests would be a lot easier.
I wonder what could change from one boot to another - can you have dash
(and its subshells) dump /proc/self/maps and see whether there's any
variation in that? But what we really need is the physical mappings. How
can we find those?

With the kernel RNG disabled, I would expect neither of these mappings
to change between boots?

Cheers,

Michael

Finn Thain

unread,
Apr 9, 2023, 12:20:03 AM4/9/23
to
At the time of the crash, the corrupted %a3 was a pointer to location in
__wait3's stack. That location was a NULL pointer (the *usage parameter)
when __GI___wait4_time64 was called but now points to 0xc00e0172, which is
just after the __wait3 text and just before __GI___wait4_time64 text.

(gdb) disass __wait3
Dump of assembler code for function __wait3:
...
0xc00e015e <+238>: bsrl 0xc00e0174 <__GI___wait4_time64>
0xc00e0164 <+244>: lea %sp@(16),%sp
0xc00e0168 <+248>: braw 0xc00e00b2 <__wait3+66>
0xc00e016c <+252>: bsrl 0xc012a38c <__stack_chk_fail>
End of assembler dump.
(gdb) disass __GI___wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e0174 <+0>: lea %sp@(-80),%sp
0xc00e0178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e017c <+8>: lea %pc@(0xc0198000),%a5
0xc00e0184 <+16>: movel %sp@(116),%d2
...

But I realize now that this stack location gets overwritten with the
return address for bsrl __stack_chk_fail, so there's nothing wrong there.

Perhaps its just a coincidence that the saved %a3, once corrupted, ended
up pointing to the *usage parameter... I don't know what to make of that.

Finn Thain

unread,
Apr 9, 2023, 1:00:03 AM4/9/23
to
On Sun, 9 Apr 2023, Michael Schmitz wrote:

> >
> > The only way I have found to alter dash's inclination to crash is to
> > reboot. (I said previously I was unable to reproduce this in a single
> > user mode shell but it turned out to be more subtle.)
>
> I wonder what could change from one boot to another - can you have dash
> (and its subshells) dump /proc/self/maps and see whether there's any
> variation in that? But what we really need is the physical mappings. How
> can we find those?
>
> With the kernel RNG disabled, I would expect neither of these mappings
> to change between boots?
>

It looks like the stack area still changes across invocations:

# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
eff9f000-effc0000 rw-p 00000000 00:00 0 [stack]
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effd8000-efff9000 rw-p 00000000 00:00 0 [stack]
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effdf000-f0000000 rw-p 00000000 00:00 0 [stack]
#

That can be disabled easily though (see below). I'll have to modify some
init scripts to find out what effect it has.

# setarch -R sh
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effdf000-f0000000 rw-p 00000000 00:00 0 [stack]
# md5sum < /proc/self/maps
baacbaf944fb01d3200d924da7f7a815 -
# sh
# md5sum < /proc/self/maps
baacbaf944fb01d3200d924da7f7a815 -
# sh
# md5sum < /proc/self/maps
baacbaf944fb01d3200d924da7f7a815 -

Michael Schmitz

unread,
Apr 9, 2023, 1:00:04 AM4/9/23
to
Hi Finn,

Am 09.04.2023 um 16:42 schrieb Finn Thain:
> On Sun, 9 Apr 2023, Michael Schmitz wrote:
>
>>>
>>> The only way I have found to alter dash's inclination to crash is to
>>> reboot. (I said previously I was unable to reproduce this in a single
>>> user mode shell but it turned out to be more subtle.)
>>
>> I wonder what could change from one boot to another - can you have dash
>> (and its subshells) dump /proc/self/maps and see whether there's any
>> variation in that? But what we really need is the physical mappings. How
>> can we find those?
>>
>> With the kernel RNG disabled, I would expect neither of these mappings
>> to change between boots?
>>
>
> It looks like the stack area still changes across invocations:

Yep, but running the same commands in the same order across different
boots, does it still change?

(I'm making a huge assumption here - that timing of the boot process and
hence evolution of the kernel RNG is sufficiently deterministic. And
this might apply only to the shells run from sysvinit, since that does
require no keyboard input ...)

Looks like cat < /proc/self/maps | grep stack would give us enough
information without overwhelming the serial console?

OTOH - if you can show the error is gone without stack address
randomization, that would be a hint maybe?

Cheers,

Michael

Finn Thain

unread,
Apr 9, 2023, 3:50:03 AM4/9/23
to
On Sun, 9 Apr 2023, Michael Schmitz wrote:

> Am 09.04.2023 um 16:42 schrieb Finn Thain:
> > On Sun, 9 Apr 2023, Michael Schmitz wrote:
> >
> >>>
> >>> The only way I have found to alter dash's inclination to crash is to
> >>> reboot. (I said previously I was unable to reproduce this in a
> >>> single user mode shell but it turned out to be more subtle.)
> >>
> >> I wonder what could change from one boot to another - can you have
> >> dash (and its subshells) dump /proc/self/maps and see whether there's
> >> any variation in that? But what we really need is the physical
> >> mappings. How can we find those?
> >>
> >> With the kernel RNG disabled, I would expect neither of these
> >> mappings to change between boots?
> >>
> >
> > It looks like the stack area still changes across invocations:
>
> Yep, but running the same commands in the same order across different
> boots, does it still change?
>
> (I'm making a huge assumption here - that timing of the boot process and
> hence evolution of the kernel RNG is sufficiently deterministic. And
> this might apply only to the shells run from sysvinit, since that does
> require no keyboard input ...)
>
> Looks like cat < /proc/self/maps | grep stack would give us enough
> information without overwhelming the serial console?
>
> OTOH - if you can show the error is gone without stack address
> randomization, that would be a hint maybe?
>

The results below were produced with 'norandmaps' added to the kernel
parameters to avoid ASLR.

root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# echo 3 > /proc/sys/vm/drop_caches
[ 913.560000] bash (1024): drop_caches: 3
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "cat < /proc/self/maps"
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effdf000-f0000000 rw-p 00000000 00:00 0 [stack]


So I guess this bug has more to do with timing and little to do with
state, contrary to my guesswork above. And no doubt I will have to
contradict myself again if/when it turns out that uninitialized memory is
a factor :-/

Geert Uytterhoeven

unread,
Apr 9, 2023, 4:40:04 AM4/9/23
to
Hi Finn,
OK.

> For that build I enabled SLUB_DEBUG but forgot to enable SLUB_DEBUG_ON --

FWIW, unfortunately not all SL*B_DEBUG are equal. I've seen
memory corruptions that were identified by SLAB_DEBUG, but not by
the other allocators.

Michael Schmitz

unread,
Apr 10, 2023, 4:20:04 AM4/10/23
to
Hi Finn,
What may still vary is physical mapping - I remember you had used some
tool before to parse proc/<pid>/pagemap to determine the physical
addresses for task stack areas? Or am I misremembering that from some
other bug?

> contradict myself again if/when it turns out that uninitialized memory is
> a factor :-/

I haven't found a config option to initialize memory returned by the
kernel page allocators, so not sure how to test that ...

Cheers,

Michael

Finn Thain

unread,
Apr 10, 2023, 6:00:04 AM4/10/23
to
On Mon, 10 Apr 2023, Michael Schmitz wrote:

> >
> > So I guess this bug has more to do with timing and little to do with
> > state, contrary to my guesswork above. And no doubt I will have to
>
> What may still vary is physical mapping - I remember you had used some
> tool before to parse proc/<pid>/pagemap to determine the physical
> addresses for task stack areas? Or am I misremembering that from some
> other bug?
>

You're right, back in September 2021 when I was chasing a different bug we
did discuss tools to look at physical mappings. I don't think that would
help here though. We know the failure is not bad RAM because multiple Macs
fail in the same way. Also, there's no DMA taking place on these
particular machines.

> > contradict myself again if/when it turns out that uninitialized memory
> > is a factor :-/
>
> I haven't found a config option to initialize memory returned by the
> kernel page allocators, so not sure how to test that ...
>

I was able to find some command line options (init_on_alloc, init_on_free)
and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
CONFIG_INIT_ON_FREE_DEFAULT_ON).

Given the compiler supports -fzero-call-used-regs=used-gpr there's also
CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO
(-ftrivial-auto-var-init=zero).

The problem with these options is that they may produce a large effect on
the timing of events but they should still have no effect on the behaviour
of a correct userspace program.

Since we are dealing with a suspect userspace program, what could we learn
from such a test? E.g. if the crashing stopped one could simply attribute
that to the timing change. I suppose, if the crashing became more
frequent, perhaps that would help debug the userspace program. So maybe
it's worth a try...

Finn Thain

unread,
Apr 10, 2023, 8:40:03 PM4/10/23
to
On Sun, 9 Apr 2023, Michael Schmitz wrote:

> Am 08.04.2023 um 00:06 schrieb Geert Uytterhoeven:
> >
> > On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz <wrote:
> >> The easiest way to do that is to log all wait and signal syscalls, as
> >> well as process exit. That might alter timing if these log messages
> >> go to the serial console though. Is that what you have in mind?
> >
> > Store to RAM, retrieve through a new /proc file?
>
> Yes, that could be done, though I'd rather avoid duplicating a lot of
> the generic message formatting code (printk and friends).
>
> I'll have a look around ...
>

A better solution might be be to port the existing instrumentation like
ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
how portable that stuff is.

If you use printk, you could probably avoid most of the delays by enabling
the dummy console. Then the kernel messages would be collected with dmesg,
given a sufficiently large CONFIG_LOG_BUF_SHIFT. But it would be
inconvenient to have no serial console available for the usual purposes.

Michael Schmitz

unread,
Apr 11, 2023, 12:30:04 AM4/11/23
to
Hi Finn,

Am 10.04.2023 um 21:39 schrieb Finn Thain:
> On Mon, 10 Apr 2023, Michael Schmitz wrote:
>
>>>
>>> So I guess this bug has more to do with timing and little to do with
>>> state, contrary to my guesswork above. And no doubt I will have to
>>
>> What may still vary is physical mapping - I remember you had used some
>> tool before to parse proc/<pid>/pagemap to determine the physical
>> addresses for task stack areas? Or am I misremembering that from some
>> other bug?
>>
>
> You're right, back in September 2021 when I was chasing a different bug we
> did discuss tools to look at physical mappings. I don't think that would
> help here though. We know the failure is not bad RAM because multiple Macs
> fail in the same way. Also, there's no DMA taking place on these
> particular machines.
>
>>> contradict myself again if/when it turns out that uninitialized memory
>>> is a factor :-/
>>
>> I haven't found a config option to initialize memory returned by the
>> kernel page allocators, so not sure how to test that ...
>>
>
> I was able to find some command line options (init_on_alloc, init_on_free)
> and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
> CONFIG_INIT_ON_FREE_DEFAULT_ON).

Right - not sure how I managed to miss those.

init_on_free might delay the boot process a while! But I would guesss
init_on_alloc should be OK in the first instance.

>
> Given the compiler supports -fzero-call-used-regs=used-gpr there's also
> CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO
> (-ftrivial-auto-var-init=zero).
>
> The problem with these options is that they may produce a large effect on
> the timing of events but they should still have no effect on the behaviour
> of a correct userspace program.
>
> Since we are dealing with a suspect userspace program, what could we learn
> from such a test? E.g. if the crashing stopped one could simply attribute

We don't know for definite that we deal with a suspect user space
program - it might just be a change in a previously fine program that
now exposes a subtle kernel bug (undetected for quite a long time, but
we've seen a few of those now...)?

> that to the timing change. I suppose, if the crashing became more
> frequent, perhaps that would help debug the userspace program. So maybe
> it's worth a try...

We'd then have to try and minimize the impact on timing, by instead
initializing a 'shadow' page reserved for that purpose. Though I suspect
the loop over the pages might be optimized away in that case. See
include/linux/highmem.h:clear_highpage_kasan_tagged() and
mm/page_alloc.c:kernel_init_pages() ...

Cheers,

Michael

Michael Schmitz

unread,
Apr 11, 2023, 1:00:03 AM4/11/23
to
Hi Finn,
Can we disable the serial console after boot, by registering the dummy
console? Or will that just log messages to both?

Cheers,

Michael

Finn Thain

unread,
Apr 11, 2023, 1:10:04 AM4/11/23
to
On Tue, 11 Apr 2023, Michael Schmitz wrote:

> >
> > I was able to find some command line options (init_on_alloc,
> > init_on_free) and the related Kconfig symbols
> > (CONFIG_INIT_ON_ALLOC_DEFAULT_ON, CONFIG_INIT_ON_FREE_DEFAULT_ON).
>
> Right - not sure how I managed to miss those.
>
> init_on_free might delay the boot process a while! But I would guesss
> init_on_alloc should be OK in the first instance.
>
> >
> > Given the compiler supports -fzero-call-used-regs=used-gpr there's
> > also CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO
> > (-ftrivial-auto-var-init=zero).
> >

With all of those options enabled I still see dash crash sometimes. I
don't think I've learned anything new about the bug from that test.

> > The problem with these options is that they may produce a large effect
> > on the timing of events but they should still have no effect on the
> > behaviour of a correct userspace program.
> >
> > Since we are dealing with a suspect userspace program, what could we
> > learn from such a test? E.g. if the crashing stopped one could simply
> > attribute
>
> We don't know for definite that we deal with a suspect user space
> program - it might just be a change in a previously fine program that
> now exposes a subtle kernel bug (undetected for quite a long time, but
> we've seen a few of those now...)?
>

That's right -- the kernel is also suspect. As is glibc. I will keep
looking for some way to narrow down the search.

Finn Thain

unread,
Apr 11, 2023, 2:10:03 AM4/11/23
to
I don't know. I'd have to experiment in QEMU or Aranym.

Geert Uytterhoeven

unread,
Apr 11, 2023, 3:20:04 AM4/11/23
to
Hi Michael,
You can increase loglevel.

Geert Uytterhoeven

unread,
Apr 11, 2023, 3:30:06 AM4/11/23
to

Michael Schmitz

unread,
Apr 11, 2023, 4:30:04 AM4/11/23
to
Hi Geert,

Am 11.04.2023 um 19:19 schrieb Geert Uytterhoeven:
> Hi Michael,
>
> On Tue, Apr 11, 2023 at 6:56 AM Michael Schmitz <schmi...@gmail.com> wrote:
>> Am 11.04.2023 um 12:20 schrieb Finn Thain:
>>> On Sun, 9 Apr 2023, Michael Schmitz wrote:
>>>> Am 08.04.2023 um 00:06 schrieb Geert Uytterhoeven:
>>>>> On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz <wrote:
>>>>>> The easiest way to do that is to log all wait and signal syscalls, as
>>>>>> well as process exit. That might alter timing if these log messages
>>>>>> go to the serial console though. Is that what you have in mind?
>>>>>
>>>>> Store to RAM, retrieve through a new /proc file?
>>>>
>>>> Yes, that could be done, though I'd rather avoid duplicating a lot of
>>>> the generic message formatting code (printk and friends).
>>>>
>>>> I'll have a look around ...
>>>>
>>>
>>> A better solution might be be to port the existing instrumentation like
>>> ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
>>> how portable that stuff is.

Looking at a few arch implementations, I'm utterly confused. Wouldn't
know where to start.

>>>
>>> If you use printk, you could probably avoid most of the delays by enabling
>>> the dummy console. Then the kernel messages would be collected with dmesg,
>>> given a sufficiently large CONFIG_LOG_BUF_SHIFT. But it would be
>>> inconvenient to have no serial console available for the usual purposes.
>>
>> Can we disable the serial console after boot, by registering the dummy
>> console? Or will that just log messages to both?
>
> You can increase loglevel.

Yes, we could do that. I thought /proc has to be mounted for the
loglevel to be changed, but I might be wrong.

OTOH, if we log wait and signal actions at level debug, and set loglevel
to info or notice through the command line, we won't miss much of the
usual boot messages on the serial console ...

Cheers,

Michael

Eero Tamminen

unread,
Apr 12, 2023, 7:40:04 PM4/12/23
to
Hi Michael,

On 11.4.2023 11.24, Michael Schmitz wrote:
> Am 11.04.2023 um 19:19 schrieb Geert Uytterhoeven:
>> On Tue, Apr 11, 2023 at 6:56 AM Michael Schmitz <schmi...@gmail.com>
>>> Am 11.04.2023 um 12:20 schrieb Finn Thain:
>>>> A better solution might be be to port the existing instrumentation like
>>>> ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
>>>> how portable that stuff is.

Another possibility is just using manually added kernel tracepoints:
https://www.kernel.org/doc/html/latest/trace/tracepoints.html

"git grep TRACE_EVENT" did not given any hits for arch/m68k/ though.

(I don't think those need any particular architecture support. One just
adds them to locations where it makes sense to trace activity that is
important, but which does not happen too often for the very small
overhead from disabled static trace point to be significant.)


> Looking at a few arch implementations, I'm utterly confused. Wouldn't
> know where to start.

Good article: https://lwn.net/Articles/132196/

(Ignore jprobes stuff as that's deprecated.)


Because probe replaces instruction at the probe point with a breakpoint:
https://www.kernel.org/doc/html/latest/trace/kprobes.html#how-does-a-kprobe-work

It needs architecture specific code to decode and simulate the replaced
instruction (its side-effects) when it is later executed "out-of-line".

This also causes architecture specific limitations on where the probe
can be set. Architecture may not support simulating all possible
instructions, just the most common ones (e.g. ones typically used on
function entry and exit points).

Besides that, AFAIK plain kprobes support should not have much
architecture specific stuff.


(I have no experience with kernel code, I just remember some LWN
articles about kprobes during past 2 decades, and I was involved with a
tool that did something similar in user-space, before Linux got uprobes
support.)



As to what architecture to use as an example, I think it would be best
to ask tracing maintainers for advice, as there seems to be a lot of
code which has been copied from each other at different points in time,
that might be nowadays better as shared rather than under each arch...

These architectures support kprobes:
------------------------------------
$ grep " ok " Documentation/features/debug/kprobes/arch-support.txt
| arc: | ok |
| arm: | ok |
| arm64: | ok |
| csky: | ok |
| ia64: | ok |
| mips: | ok |
| parisc: | ok |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
| x86: | ok |
------------------------------------

But only these seem to have probing functionality isolated nicely to its
own directory:
------------------------------------
$ ls arch/*/kernel/probes/

arch/arm64/kernel/probes/:
decode-insn.c decode-insn.h kprobes.c kprobes_trampoline.S Makefile
simulate-insn.c simulate-insn.h uprobes.c

arch/csky/kernel/probes/:
decode-insn.c decode-insn.h ftrace.c kprobes.c kprobes_trampoline.S
Makefile simulate-insn.c simulate-insn.h uprobes.c

arch/riscv/kernel/probes/:
decode-insn.c decode-insn.h ftrace.c kprobes.c Makefile rethook.c
rethook.h rethook_trampoline.S simulate-insn.c simulate-insn.h uprobes.c
------------------------------------


As to testing whether kprobes work, that could be easier with ftrace
support also present:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/ftrace/README


- Eero

PS. Looking at the kernel features list, m68k arch seems to be missing a
lot of other features too:
https://www.kernel.org/doc/html/latest/m68k/features.html

Finn Thain

unread,
Apr 14, 2023, 5:50:03 AM4/14/23
to
On Wed, 5 Apr 2023, I wrote:

>
> I don't care that much what dash does as long as it isn't corrupting
> it's own stack, which is a real possibility, and one which gdb's data
> watch point would normally resolve. And yet I have no way to tackle
> that.
>
> I've been running gdb under QEMU, where the failure is not reproducible.
> Running dash under gdb on real hardware is doable (RAM permitting). But
> the failure is intermittent even then -- it only happens during
> execution of certain init scripts, and I can't reproduce it by manually
> running those scripts.
>
> (Even if I could reproduce the failure under gdb, instrumenting
> execution in gdb can alter timing in undesirable ways...)
>

Somewhat optimistically, I upgraded the RAM on this system to 36 MB so I
can run dash under gdb (20 MB was not enough). But, as expected, the crash
went away when I did so.

Outside of gdb, I was able to reproduce the same failure with a clean
build from the dash repo (commit b00288f). I can get a crash with
optimization level -O1 and -O though it becomes even more rare. So it's
easier to use Debian's build (-O2).

One of the difficulties with the core dump is that it happens too late.
After the canary check fails, __stack_chk_fail() is called, which then
calls a bunch of other stuff until finally abort() is called. This
obliterates whatever was below the stack pointer at the time of the
failure.

So I modified libc.so.6 and now it just crashes with an illegal
instruction in __wait3 rather than branching to __stack_chk_fail. This let
me see whatever was left behind in stack memory by __wait4_time64() etc.

__wait4_time64() calls __m68k_read_tp(), and the return address from
__m68k_read_tp() can still be seen in stack memory, which suggests that
the stack never grew after that call. (So __m68k_read_tp() is implicated.)

Would signal delivery erase any of the memory immediately below the USP?
If so, it would erase those old stack frames, which would give some
indication of the timing of signal delivery.

If I run dash under gdb under QEMU, I can break on entry to onsig() and
find the signal frame on the stack. But when I examine stack memory from
the core dump, I can't find 0x70774e40 (i.e. moveq __NR_sigreturn,%d0 ;
trap #0) which the kernel puts on the stack in my QEMU experiments.

That suggests that no signal was delivered... and yet gotsigchld == 1 at
the time of the coredump, after having been initialized by waitproc()
prior to calling __wait3(). So the signal handler onsig() must have
executed during __wait3() or __wait4_time64(). I can't explain this.

Eero Tamminen

unread,
Apr 15, 2023, 12:00:05 PM4/15/23
to
Hi,
These are the kernel symbol entry point instructions that get executed
when booting Atari Falcon from IDE (using emulator LILO) to Busybox init:
4524 link.w
4 movea.l
3 move.l
3 clr.l
2 movem.l
2 dc.w
2 btst.b
1 subq.l
1 ori.w
1 move.w
1 lea.l
1 jsr
1 bra.b

For Linus' 6.2 kernel built with this config:
https://git.tuxfamily.org/hatari/hatari.git/tree/tools/linux/kernel.config

Using "gcc-m68k-linux-gnu" (4:10.2.1-1) in Debian Stable.


Count is for how many of the executed kernel symbol addresses had given
instruction.

I'm a bit surprised how many functions (symbol addresses) start with
link.w instruction, instead of e.g. move* instruction.

(dc.w is check for emulator NatFeats API.)


- Eero

PS. Above info is produced by building & booting Linux under Hatari
emulator as documented here:
https://hatari.tuxfamily.org/doc/m68k-linux.txt

And using:
-------------------
hatari ... --parse sym-breakpoint.ini > instr.txt
-------------------

Where "sym-breakpoint.ini" is:
-------------------
b PConSymbol > 0 :trace :quiet :file pc-disasm.ini
-------------------

And "pc-disasm.ini" file is:
-------------------
d pc-"pc+2"
-------------------

Then to get counts from the output:
-------------------
$ cat instr.txt | grep -E -e '^\$' | sort -u |\
cut -b46-52 | sort | uniq -c | sort -nr
-------------------

(Match disassembly lines, drop duplicate function calls, count uniq
entry point instructions.)

Brad Boyer

unread,
Apr 15, 2023, 6:40:03 PM4/15/23
to
On Sat, Apr 15, 2023 at 06:50:08PM +0300, Eero Tamminen wrote:
> I'm a bit surprised how many functions (symbol addresses) start with link.w
> instruction, instead of e.g. move* instruction.

That seems pretty normal to me. The LINK instruction sets up a stack
frame for a function. It takes an address register and size (the .w
suffix indicates a 16-bit size, while .l would indicate a 32-bit value
of size). The old value of the register is pushed to the stack, the
current stack pointer is saved into the given register, then the stack
is increased by the given size. There is a corresponding UNLK
instruction to tear down the stack frame which will frequently be the
last instruction before returning to the caller. The compiler will use
these in the function prologue and epilogue in all but the most
trivial functions.

Brad Boyer
fl...@allandria.com

Michael Schmitz

unread,
Apr 15, 2023, 10:00:04 PM4/15/23
to
Hi Finn,

Am 14.04.2023 um 21:30 schrieb Finn Thain:
> Would signal delivery erase any of the memory immediately below the USP?
> If so, it would erase those old stack frames, which would give some
> indication of the timing of signal delivery.

The signal stack is set up immediately below USP, from my reading of
signal.c:setup_frame(). Old stack frames will be overwritten.

>
> If I run dash under gdb under QEMU, I can break on entry to onsig() and
> find the signal frame on the stack. But when I examine stack memory from
> the core dump, I can't find 0x70774e40 (i.e. moveq __NR_sigreturn,%d0 ;
> trap #0) which the kernel puts on the stack in my QEMU experiments.

As I understand this, the call to sys_sigreturn() removes both this code
(signal trampoline IIRC) and the signal stack. At that point, the kernel
signal handler (which in turn calls onsig()) has already been run.

Again as far as I understand, the core dump happens on process exit.
Stack smashing is detected and process exit is forced only at exit from
__wait3() or __wait4_time64(), and those would have only been woken by
receiving a signal. Both signal delivery and removal of signal stack by
sys_sigreturn() must have happened before return from the wait syscalls.

Cheers,

Michael

Finn Thain

unread,
Apr 16, 2023, 3:00:03 AM4/16/23
to
On Sun, 16 Apr 2023, Michael Schmitz wrote:

> Am 14.04.2023 um 21:30 schrieb Finn Thain:
> > Would signal delivery erase any of the memory immediately below the
> > USP? If so, it would erase those old stack frames, which would give
> > some indication of the timing of signal delivery.
>
> The signal stack is set up immediately below USP, from my reading of
> signal.c:setup_frame(). Old stack frames will be overwritten.
>

OK.

> >
> > If I run dash under gdb under QEMU, I can break on entry to onsig()
> > and find the signal frame on the stack. But when I examine stack
> > memory from the core dump, I can't find 0x70774e40 (i.e. moveq
> > __NR_sigreturn,%d0 ; trap #0) which the kernel puts on the stack in my
> > QEMU experiments.
>
> As I understand this, the call to sys_sigreturn() removes both this code
> (signal trampoline IIRC) and the signal stack...

I don't see that stuff getting removed when I run dash under gdb under
QEMU. With breakpoints at the head of onsig() and the tail of __wait3(),
the memory under USP is the same when examined at either juncture.

The backtrace confirms that this signal was delivered during execution of
__wait3(). (Delivery can happen during execution of __libc_fork() but I
just repeat the test until I get these ducks in a row.)

(gdb) c
Continuing.
# x=$(:)
[Detaching after fork from child process 1055]

Breakpoint 6.1, onsig (signo=17) at trap.c:286
286 trap.c: No such file or directory.
(gdb) bt
#0 onsig (signo=17) at trap.c:286
#1 <signal handler called>
#2 0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff86a, options=2,
usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
#3 0xc00e8164 in __GI___wait3_time64 (usage=0x0, options=<optimized out>,
stat_loc=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:26
#4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35
#5 0xd000c38e in waitproc (status=0xeffff85a, block=1) at jobs.c:1179
#6 waitone (block=1, job=0xd001f618) at jobs.c:1055
#7 0xd000c5b8 in dowait (block=1, jp=0xd001f618) at jobs.c:1137
#8 0xd000ddb0 in waitforjob (jp=0xd001f618) at jobs.c:1014
#9 0xd000aade in expbackq (flag=68, cmd=0xd001e4c8 <stackbase+36>)
at expand.c:520
#10 argstr (p=<optimized out>, flag=68) at expand.c:335
#11 0xd000b5ce in expandarg (arg=0xd001e4e8 <stackbase+68>,
arglist=0xeffffb08, flag=4) at expand.c:192
#12 0xd0007e2a in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:855
#13 0xd0006ffc in evaltree (n=0xd001e4f8 <stackbase+84>, flags=0) at eval.c:300
#14 0xd000e3c0 in cmdloop (top=1) at main.c:246
#15 0xd0005018 in main (argc=<optimized out>, argv=<optimized out>)
at main.c:181


0xeffff750: 0xc01a0000 saved $a5 == libc .got
0xeffff74c: 0xc0023e8c saved $a3 == &__stack_chk_guard
0xeffff748: 0x00000000 saved $a2
0xeffff744: 0x00000001 saved $d5
0xeffff740: 0xeffff86e saved $d4
0xeffff73c: 0xeffff86a saved $d3
0xeffff738: 0x00000002 saved $d2
0xeffff734: 0x00000000
0xeffff730: 0x00000000
0xeffff72c: 0x00000000
0xeffff728: 0x00000000
0xeffff724: 0x00000000
0xeffff720: 0x00000000
0xeffff71c: 0x00000000
0xeffff718: 0x00000000
0xeffff714: 0x00000000
0xeffff710: 0x00000000
0xeffff70c: 0x00000000
0xeffff708: 0x00000000
0xeffff704: 0x00000000
0xeffff700: 0x00000000
0xeffff6fc: 0x00000000
0xeffff6f8: 0x00000000
0xeffff6f4: 0x00000000
0xeffff6f0: 0x00000000
0xeffff6ec: 0x00000000
0xeffff6e8: 0x00000000
0xeffff6e4: 0x00000000
0xeffff6e0: 0x00000000
0xeffff6dc: 0x00000000
0xeffff6d8: 0x00000000
0xeffff6d4: 0x00000000
0xeffff6d0: 0x00000000
0xeffff6cc: 0x00000000
0xeffff6c8: 0x00000000
0xeffff6c4: 0x00000000
0xeffff6c0: 0x00000000
0xeffff6bc: 0x00000000
0xeffff6b8: 0x00000000
0xeffff6b4: 0x00000000
0xeffff6b0: 0x00000000
0xeffff6ac: 0x00000000
0xeffff6a8: 0x00000000
0xeffff6a4: 0x00000000
0xeffff6a0: 0x00000000
0xeffff69c: 0x00000000
0xeffff698: 0x00000000
0xeffff694: 0x00000000
0xeffff690: 0x00000000
0xeffff68c: 0x00000000
0xeffff688: 0x00000000
0xeffff684: 0x00000000
0xeffff680: 0x00000000
0xeffff67c: 0x00000000
0xeffff678: 0x00000000
0xeffff674: 0x00000000
0xeffff670: 0x00000000
0xeffff66c: 0x00000000
0xeffff668: 0x00000000
0xeffff664: 0x00000000
0xeffff660: 0x41000000
0xeffff65c: 0x00000000
0xeffff658: 0x00000000
0xeffff654: 0x00000000
0xeffff650: 0x00000000
0xeffff64c: 0x80000000
0xeffff648: 0x3fff0000
0xeffff644: 0x00000000
0xeffff640: 0xd0000000
0xeffff63c: 0x40020000
0xeffff638: 0x81b60080
0xeffff634: 0x0000c00e
0xeffff630: 0xd001e4e3 saved a1?
0xeffff62c: 0xc0028780 saved a0?
0xeffff628: 0xffffffff saved d1?
0xeffff624: 0x0000041f saved d0?
0xeffff620: 0xeffff738 saved sp?
0xeffff61c: 0x00000000
0xeffff618: 0x00000000
0xeffff614: 0x00000000
0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
0xeffff60c: 0xeffff61c
0xeffff608: 0x00000080
0xeffff604: 0x00000011
0xeffff600: 0xeffff610 return address

The above comes from dash running under gdb under qemu, which does not
exhibit the failure but is convenient for that kind of experiment.

> Again as far as I understand, the core dump happens on process exit.
> Stack smashing is detected and process exit is forced only at exit from
> __wait3() or __wait4_time64(),

I placed an illegal instruction in __wait3. This executes instead of the
call to __stack_chk_fail because that obliterates stack memory of
interest.

Consequently the latest core dump still contains dead stack frames (see
below) of subroutines that returned before __wait3() dumped core. You can
see the return address for the branch to __wait4_time64() and below that
you can see the return address for the branch to __m68k_read_tp().

(gdb) disas __wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e4174 <+0>: lea %sp@(-80),%sp
0xc00e4178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e417c <+8>: lea %pc@(0xc019c000),%a5
0xc00e4184 <+16>: movel %sp@(116),%d2
0xc00e4188 <+20>: moveal %sp@(124),%a2
0xc00e418c <+24>: moveal %a5@(108),%a3
0xc00e4190 <+28>: movel %a3@,%sp@(104)
0xc00e4194 <+32>: bsrl 0xc0056e2c <__m68k_read_tp@plt>

I gather the signal was delivered before __wait4_time64+38, otherwise the
return address 0xc00e419a (which appears below) would have been
overwritten by the signal frame. The signal must have been delivered after
waitproc() initialized gotsigchld = 0 since gotsigchld is 1 at the time of
the coredump.

I assume the %a3 corruption happened after __wait4_time64+8 because that's
when %a3 first appears on the stack. And the corruption must have happened
before __wait4_time64+238, which is when %a3 was restored.

If it was the signal which somehow corrupted the saved %a3, there's only a
small window for that. The only syscall in that window is get_thread_area.

Here's some stack memory from the core dump.

0xeffff0dc: 0xd000c38e return address waitproc+124
0xeffff0d8: 0xd001c1ec frame 0 $fp == &suppressint
0xeffff0d4: 0x00add14b canary
0xeffff0d0: 0x00000000
0xeffff0cc: 0x0000000a
0xeffff0c8: 0x00000202
0xeffff0c4: 0x00000008
0xeffff0c0: 0x00000000
0xeffff0bc: 0x00000000
0xeffff0b8: 0x00000174
0xeffff0b4: 0x00000004
0xeffff0b0: 0x00000004
0xeffff0ac: 0x00000006
0xeffff0a8: 0x000000e0
0xeffff0a4: 0x000000e0
0xeffff0a0: 0x00171f20
0xeffff09c: 0x00171f20
0xeffff098: 0x00171f20
0xeffff094: 0x00000002
0xeffff090: 0x00002000
0xeffff08c: 0x00000006
0xeffff088: 0x0000e920
0xeffff084: 0x00005360
0xeffff080: 0x00170700
0xeffff07c: 0x00170700
0xeffff078: 0x00170700 frame 0 $fp - 96
0xeffff074: 0xd001b874 saved $a5 == dash .got
0xeffff070: 0xd001e498 saved $a3 == &dash_errno
0xeffff06c: 0xd001e718 frame 0 $sp saved $a2 == &gotsigchld
0xeffff068: 0x00000000
0xeffff064: 0x00000000
0xeffff060: 0xeffff11e
0xeffff05c: 0xffffffff
0xeffff058: 0xc00e4164 return address __wait3+244
0xeffff054: 0x00add14b canary
0xeffff050: 0x00000001
0xeffff04c: 0x00000004
0xeffff048: 0x0000000d
0xeffff044: 0x0000000d
0xeffff040: 0x0015ef82
0xeffff03c: 0x0015ef82
0xeffff038: 0x0015ef82
0xeffff034: 0x00000003
0xeffff030: 0x00000004
0xeffff02c: 0x00000004
0xeffff028: 0x00000140
0xeffff024: 0x00000140
0xeffff020: 0x00000034
0xeffff01c: 0x00000034
0xeffff018: 0x00000034
0xeffff014: 0x00000006
0xeffff010: 0x003b003a
0xeffff00c: 0x000a0028
0xeffff008: 0x00340020
0xeffff004: 0xc019c000 saved $a5 == libc .got
0xeffff000: 0xeffff068 saved $a3 (corrupted)
0xefffeffc: 0x00000000 saved $a2
0xefffeff8: 0x00000001 saved $d5
0xefffeff4: 0xeffff122 saved $d4
0xefffeff0: 0xeffff11e saved $d3
0xefffefec: 0x00000000 saved $d2
0xefffefe8: 0xc00e419a return address __GI___wait4_time64+38
0xefffefe4: 0xc0028780
0xefffefe0: 0x3c344bfb
0xefffefdc: 0x000af353
0xefffefd8: 0x3c340170
0xefffefd4: 0x00000000
0xefffefd0: 0xc00e417c
0xefffefcc: 0xc00e417e
0xefffefc8: 0xc00e4180
0xefffefc4: 0x48e73c34
0xefffefc0: 0x00000000
0xefffefbc: 0xefffeff8
0xefffefb8: 0xefffeffc
0xefffefb4: 0x4bfb0170
0xefffefb0: 0x0eee0709
0xefffefac: 0x00000000
0xefffefa8: 0x00000000
0xefffefa4: 0x00000000
0xefffefa0: 0x00000000
0xefffef9c: 0x00000000
0xefffef98: 0x00000000
0xefffef94: 0x00000000
0xefffef90: 0x00000000
0xefffef8c: 0x00000000
0xefffef88: 0x00000000
0xefffef84: 0x00000000
0xefffef80: 0x00000000
0xefffef7c: 0x00000000
0xefffef78: 0x00000000
0xefffef74: 0x00000000
0xefffef70: 0x00000000
0xefffef6c: 0x00000000
0xefffef68: 0x00000000
0xefffef64: 0x00000000
0xefffef60: 0x00000000
0xefffef5c: 0x00000000
0xefffef58: 0x00000000
0xefffef54: 0x00000000
0xefffef50: 0x00000000
0xefffef4c: 0x00000000
0xefffef48: 0x00000000
0xefffef44: 0x00000000
0xefffef40: 0x00000000
0xefffef3c: 0x00000000
0xefffef38: 0x00000000
0xefffef34: 0x00000000
0xefffef30: 0x00000000
0xefffef2c: 0x00000000
0xefffef28: 0x00000000
0xefffef24: 0x00000000
0xefffef20: 0x00000000
0xefffef1c: 0x00000000
0xefffef18: 0x00000000
0xefffef14: 0x00000000
0xefffef10: 0x7c0effff
0xefffef0c: 0xffffffff
0xefffef08: 0xaaaaaaaa
0xefffef04: 0xaf54eaaa
0xefffef00: 0x40040000
0xefffeefc: 0x40040000
0xefffeef8: 0x2b000000
0xefffeef4: 0x00000000
0xefffeef0: 0x00000000
0xefffeeec: 0x408ece9a
0xefffeee8: 0x00000000
0xefffeee4: 0xf0ff0000
0xefffeee0: 0x0f800000
0xefffeedc: 0xf0fff0ff
0xefffeed8: 0x1f380000
0xefffeed4: 0x00000000
0xefffeed0: 0x00000000
0xefffeecc: 0x00000000
0xefffeec8: 0xffffffff
0xefffeec4: 0xffffffff
0xefffeec0: 0x7fff0000
0xefffeebc: 0xffffffff
0xefffeeb8: 0xffffffff
0xefffeeb4: 0x7fff0000

The signal frame is not readily apparent (to me).

Also, stack memory in the core dump and stack memory as observed upon
signal handler breakpoint do not agree very well... I'd expect the values
immediately below the stack pointer to be zeros once the signal frame was
put there.

I can't explain all of that unless it's discontiguous stack corruption.

Michael Schmitz

unread,
Apr 17, 2023, 7:40:03 PM4/17/23
to
Hi Finn,

Am 16.04.2023 um 18:44 schrieb Finn Thain:
>> As I understand this, the call to sys_sigreturn() removes both this code
>> (signal trampoline IIRC) and the signal stack...
>
> I don't see that stuff getting removed when I run dash under gdb under
> QEMU. With breakpoints at the head of onsig() and the tail of __wait3(),
> the memory under USP is the same when examined at either juncture.

OK - with usp restored to what it was before the signal stack was added,
the stack below original usp can be left as is (will be overwritten
eventually).

>
> The backtrace confirms that this signal was delivered during execution of
> __wait3(). (Delivery can happen during execution of __libc_fork() but I
> just repeat the test until I get these ducks in a row.)
>
> (gdb) c
> Continuing.
> # x=$(:)
> [Detaching after fork from child process 1055]
>
> Breakpoint 6.1, onsig (signo=17) at trap.c:286
> 286 trap.c: No such file or directory.
> (gdb) bt
> #0 onsig (signo=17) at trap.c:286
> #1 <signal handler called>
> #2 0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff86a, options=2,
> usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
> #3 0xc00e8164 in __GI___wait3_time64 (usage=0x0, options=<optimized out>,
> stat_loc=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:26

Where did that one come from? I don't think we saw __GI___wait3_time64
called in your disassembly of __wait3 ...
> 0xeffff63c: 0x40020000 <= (sc.formatvec& 0xffff) << 16; fpregs from here on
> 0xeffff638: 0x81b60080 <= (sc.pc & 0xffff) << 16 | sc.formatvec >> 16
> 0xeffff634: 0x0000c00e <= sc.sr << 16 sc.pc >> 16
> 0xeffff630: 0xd001e4e3 saved a1? <= sc.a1
> 0xeffff62c: 0xc0028780 saved a0? <= sc.a0
> 0xeffff628: 0xffffffff saved d1? <= sc.d1
> 0xeffff624: 0x0000041f saved d0? <= sc.d0
> 0xeffff620: 0xeffff738 saved sp? <= sc.usp
> 0xeffff61c: 0x00000000 <= sc.mask
> 0xeffff618: 0x00000000 <= extramask
> 0xeffff614: 0x00000000 <= frame.retcode[1]
> 0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
> 0xeffff60c: 0xeffff61c <= frame->sc
> 0xeffff608: 0x00000080 <= tregs->vector
> 0xeffff604: 0x00000011 <= signal no.
> 0xeffff600: 0xeffff610 return address
>
> The above comes from dash running under gdb under qemu, which does not
> exhibit the failure but is convenient for that kind of experiment.

I would have expected to see a different signal trampoline (for
sys_rt_sigreturn) ... But anyway:

The saved pc is 0xc00e81b6 which does match the backtrace above. Vector
offset 80 matches trap 0 which suggests 0xc00e81b6 should be the
instruction after a trap 0 instruction. d0 is 1055 which is not a signal
number I recognize.

>
>> Again as far as I understand, the core dump happens on process exit.
>> Stack smashing is detected and process exit is forced only at exit from
>> __wait3() or __wait4_time64(),
>
> I placed an illegal instruction in __wait3. This executes instead of the
> call to __stack_chk_fail because that obliterates stack memory of
> interest.

OK.

>
> Consequently the latest core dump still contains dead stack frames (see
> below) of subroutines that returned before __wait3() dumped core. You can
> see the return address for the branch to __wait4_time64() and below that
> you can see the return address for the branch to __m68k_read_tp().
>
> (gdb) disas __wait4_time64
> Dump of assembler code for function __GI___wait4_time64:
> 0xc00e4174 <+0>: lea %sp@(-80),%sp
> 0xc00e4178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
> 0xc00e417c <+8>: lea %pc@(0xc019c000),%a5
> 0xc00e4184 <+16>: movel %sp@(116),%d2
> 0xc00e4188 <+20>: moveal %sp@(124),%a2
> 0xc00e418c <+24>: moveal %a5@(108),%a3
> 0xc00e4190 <+28>: movel %a3@,%sp@(104)
> 0xc00e4194 <+32>: bsrl 0xc0056e2c <__m68k_read_tp@plt>
>
> I gather the signal was delivered before __wait4_time64+38, otherwise the
> return address 0xc00e419a (which appears below) would have been
> overwritten by the signal frame. The signal must have been delivered after
> waitproc() initialized gotsigchld = 0 since gotsigchld is 1 at the time of
> the coredump.
>
> I assume the %a3 corruption happened after __wait4_time64+8 because that's
> when %a3 first appears on the stack. And the corruption must have happened
> before __wait4_time64+238, which is when %a3 was restored.
>
> If it was the signal which somehow corrupted the saved %a3, there's only a
> small window for that. The only syscall in that window is get_thread_area.

I see sys_wait4 called in two places (0xc00e01b4, and then 0xc00e0286
depending on the return code of the first). The second one again would
have called __m68k_read_tp so would have left a return address on the
stack (0xc00e02d2). Leaves the first.
From looking at the above stack dump, sc ought to start at 0xefffee90,
and the trampoline would be three words below that. The last address you
show corresponds to 0xeffff640 in first dump above, which is at the
start of the saved fpregs. I'd say we just miss the beginning of the
signal frame?

(My reasoning is that copy_siginfo_to_user clears the end of the signal
stack, which is what we can see in both cases.)

Can't explain the 14 words below the saved return address though.

>
> Also, stack memory in the core dump and stack memory as observed upon
> signal handler breakpoint do not agree very well... I'd expect the values
> immediately below the stack pointer to be zeros once the signal frame was
> put there.

Yes, but what you get to see in the core dump is the stack after return
from __GI___wait4_time64 ... and differences in timing may have resulted
in different code executed (assuming the gdb trace is from the first
subshell signaling, and the dump from one of the later ones ...).

>
> I can't explain all of that unless it's discontiguous stack corruption.
>

Me neither, sorry.

Cheers,

Michael

Finn Thain

unread,
Apr 17, 2023, 10:20:03 PM4/17/23
to
On Tue, 18 Apr 2023, Michael Schmitz wrote:

> Am 16.04.2023 um 18:44 schrieb Finn Thain:
>
> >
> > The backtrace confirms that this signal was delivered during execution
> > of __wait3(). (Delivery can happen during execution of __libc_fork()
> > but I just repeat the test until I get these ducks in a row.)
> >
> > (gdb) c
> > Continuing.
> > # x=$(:)
> > [Detaching after fork from child process 1055]
> >
> > Breakpoint 6.1, onsig (signo=17) at trap.c:286
> > 286 trap.c: No such file or directory.
> > (gdb) bt
> > #0 onsig (signo=17) at trap.c:286
> > #1 <signal handler called>
> > #2 0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff86a,
> > options=2,
> > usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
> > #3 0xc00e8164 in __GI___wait3_time64 (usage=0x0, options=<optimized out>,
> > stat_loc=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:26
>
> Where did that one come from? I don't think we saw __GI___wait3_time64
> called in your disassembly of __wait3 ...
>
> > #4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
> > usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35


Well spotted. However, it turns out there is a good explanation for that:

(gdb) print __GI___wait3_time64
$3 = {pid_t (int *, int, struct __rusage64 *)} 0xc00e4054 <__GI___wait3_time64>
(gdb) disass __GI___wait3_time64
Dump of assembler code for function __GI___wait3_time64:
0xc00e4054 <+0>: movel %sp@(12),%sp@-
0xc00e4058 <+4>: movel %sp@(12),%sp@-
0xc00e405c <+8>: movel %sp@(12),%sp@-
0xc00e4060 <+12>: pea 0xffffffff
0xc00e4064 <+16>: bsrl 0xc00e4174 <__GI___wait4_time64>
0xc00e406a <+22>: lea %sp@(16),%sp
0xc00e406e <+26>: rts
End of assembler dump.
(gdb) print __wait3
$2 = {pid_t (int *, int, struct rusage *)} 0xc00e4070 <__wait3>
(gdb) disass __wait3
Dump of assembler code for function __wait3:
0xc00e4070 <+0>: linkw %fp,#-96
0xc00e4074 <+4>: moveml %a2-%a3/%a5,%sp@-
0xc00e4078 <+8>: lea %pc@(0xc019c000),%a5
0xc00e4080 <+16>: movel %fp@(8),%d0
0xc00e4084 <+20>: moveal %fp@(16),%a2
0xc00e4088 <+24>: moveal %a5@(108),%a3
0xc00e408c <+28>: movel %a3@,%fp@(-4)
0xc00e4090 <+32>: tstl %a2
0xc00e4092 <+34>: beqw 0xc00e4152 <__wait3+226>
0xc00e4096 <+38>: pea %fp@(-92)
0xc00e409a <+42>: movel %fp@(12),%sp@-
0xc00e409e <+46>: movel %d0,%sp@-
0xc00e40a0 <+48>: pea 0xffffffff
0xc00e40a4 <+52>: bsrl 0xc00e4174 <__GI___wait4_time64>
0xc00e40aa <+58>: lea %sp@(16),%sp
0xc00e40ae <+62>: tstl %d0
0xc00e40b0 <+64>: bgts 0xc00e40c8 <__wait3+88>
0xc00e40b2 <+66>: moveal %fp@(-4),%a0
0xc00e40b6 <+70>: movel %a3@,%d1
0xc00e40b8 <+72>: cmpl %a0,%d1
0xc00e40ba <+74>: bnew 0xc00e416c <__wait3+252>
0xc00e40be <+78>: moveml %fp@(-108),%a2-%a3/%a5
0xc00e40c4 <+84>: unlk %fp
0xc00e40c6 <+86>: rts
0xc00e40c8 <+88>: pea 0x44
0xc00e40cc <+92>: clrl %sp@-
0xc00e40ce <+94>: pea %a2@(4)
0xc00e40d2 <+98>: movel %d0,%fp@(-96)
0xc00e40d6 <+102>: bsrl 0xc00bc850 <__GI_memset>
0xc00e40dc <+108>: movel %fp@(-88),%a2@
0xc00e40e0 <+112>: movel %fp@(-80),%a2@(4)
0xc00e40e6 <+118>: movel %fp@(-72),%a2@(8)
0xc00e40ec <+124>: movel %fp@(-64),%a2@(12)
0xc00e40f2 <+130>: movel %fp@(-60),%a2@(16)
0xc00e40f8 <+136>: movel %fp@(-56),%a2@(20)
0xc00e40fe <+142>: movel %fp@(-52),%a2@(24)
0xc00e4104 <+148>: movel %fp@(-48),%a2@(28)
0xc00e410a <+154>: movel %fp@(-44),%a2@(32)
0xc00e4110 <+160>: movel %fp@(-40),%a2@(36)
0xc00e4116 <+166>: movel %fp@(-36),%a2@(40)
0xc00e411c <+172>: movel %fp@(-32),%a2@(44)
0xc00e4122 <+178>: movel %fp@(-28),%a2@(48)
0xc00e4128 <+184>: movel %fp@(-24),%a2@(52)
0xc00e412e <+190>: movel %fp@(-20),%a2@(56)
0xc00e4134 <+196>: movel %fp@(-16),%a2@(60)
0xc00e413a <+202>: movel %fp@(-12),%a2@(64)
0xc00e4140 <+208>: movel %fp@(-8),%a2@(68)
0xc00e4146 <+214>: lea %sp@(12),%sp
0xc00e414a <+218>: movel %fp@(-96),%d0
0xc00e414e <+222>: braw 0xc00e40b2 <__wait3+66>
0xc00e4152 <+226>: clrl %sp@-
0xc00e4154 <+228>: movel %fp@(12),%sp@-
0xc00e4158 <+232>: movel %d0,%sp@-
0xc00e415a <+234>: pea 0xffffffff
0xc00e415e <+238>: bsrl 0xc00e4174 <__GI___wait4_time64>
0xc00e4164 <+244>: lea %sp@(16),%sp
0xc00e4168 <+248>: braw 0xc00e40b2 <__wait3+66>
=> 0xc00e416c <+252>: illegal
End of assembler dump.


(gdb) info frame
Stack level 3, frame at 0xeffff82c:
pc = 0xc00e8164 in __GI___wait3_time64
(../sysdeps/unix/sysv/linux/wait3.c:26); saved pc = 0xd000c38e
inlined into frame 4, caller of frame at 0xeffff7a8
source language c.
Arglist at unknown address.
Locals at unknown address, Previous frame's sp is 0xeffff7a8
Saved registers:
d2 at 0xeffff738, d3 at 0xeffff73c, d4 at 0xeffff740, d5 at 0xeffff744,
a2 at 0xeffff748, a3 at 0xeffff74c, a5 at 0xeffff750, pc at 0xeffff7a4


(gdb) up
#4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35
35 in ../sysdeps/unix/sysv/linux/wait3.c
(gdb) info frame
Stack level 4, frame at 0xeffff82c:
pc = 0xc00e8164 in __wait3 (../sysdeps/unix/sysv/linux/wait3.c:35);
saved pc = 0xd000c38e
called by frame at 0xeffff928, caller of frame at 0xeffff82c
source language c.
Arglist at 0xeffff824, args: stat_loc=<optimized out>,
options=<optimized out>, usage=<optimized out>
Locals at 0xeffff824, Previous frame's sp is 0xeffff82c
Saved registers:
a2 at 0xeffff7b8, a3 at 0xeffff7bc, a5 at 0xeffff7c0, fp at 0xeffff824,
pc at 0xeffff828


Note that frame 3 was "inlined into frame 4". The inlined code can be seen
above at 0xc00e4154. So the backtrace is misleading inasmuchas it
represents the source code rather than the disassembly.
> > 0xeffff63c: 0x40020000 <= (sc.formatvec & 0xffff) << 16; fpregs from here on
> > 0xeffff638: 0x81b60080 <= (sc.pc & 0xffff) << 16 | sc.formatvec >> 16
> > 0xeffff634: 0x0000c00e <= sc.sr << 16 sc.pc >> 16
> > 0xeffff630: 0xd001e4e3 <= sc.a1
> > 0xeffff62c: 0xc0028780 <= sc.a0
> > 0xeffff628: 0xffffffff <= sc.d1
> > 0xeffff624: 0x0000041f <= sc.d0
> > 0xeffff620: 0xeffff738 <= sc.usp
> > 0xeffff61c: 0x00000000 <= sc.mask
> > 0xeffff618: 0x00000000 <= extramask
> > 0xeffff614: 0x00000000 <= frame.retcode[1]
> > 0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
> > 0xeffff60c: 0xeffff61c <= frame->sc
> > 0xeffff608: 0x00000080 <= tregs->vector
> > 0xeffff604: 0x00000011 <= signal no.
> > 0xeffff600: 0xeffff610 return address
> >
> > The above comes from dash running under gdb under qemu, which does not
> > exhibit the failure but is convenient for that kind of experiment.
>
> I would have expected to see a different signal trampoline (for
> sys_rt_sigreturn) ...

Well, this seems to be the trampoline from setup_frame() and not
setup_rt_frame().

> But anyway:
>
> The saved pc is 0xc00e81b6 which does match the backtrace above. Vector
> offset 80 matches trap 0 which suggests 0xc00e81b6 should be the
> instruction after a trap 0 instruction. d0 is 1055 which is not a signal
> number I recognize.
>

I don't know what d0 represents here. But &frame->sig == 0x11 is correct
(SIGCHLD).
That's why my analysis stopped at __wait4_time64+38: the rest of
__wait4_time64 is not relevant to the dead stack contents. (It would have
left a different return address in that memory location.)
> > 0xefffeeb4: 0x7fff0000 sc_formatvec
> >
> > The signal frame is not readily apparent (to me).
>
> From looking at the above stack dump, sc ought to start at 0xefffee90,
> and the trampoline would be three words below that.

0xefffeeb0: 0x4178b008 sc_pc, sc_formatvec
0xefffeeac: 0x0008c00e sc_sr, sc_pc
0xefffeea8: 0xd00223bb sc_a1
0xefffeea4: 0xd001e32c sc_a0
0xefffeea0: 0x00000003 sc_d1
0xefffee9c: 0xeffff11e sc_d0
0xefffee98: 0xeffff004 sc_usp
0xefffee94: 0x00000000 sc_mask
0xefffee90: 0x00000000 extramask
0xefffee8c: 0xc0024a90 retcode[1]
0xefffee88: 0x70774e40 retcode[0]
0xefffee84: 0xefffee94 psc
0xefffee80: 0x00000008 code
0xefffee7c: 0x00000011 sig
0xefffee78: 0xefffee88 pretcode
0xefffee74: 0xc019c000
0xefffee70: 0x00000000
0xefffee6c: 0xc0025878
0xefffee68: 0xc0007ed4
0xefffee64: 0xc0024000
0xefffee60: 0xefffef50
0xefffee5c: 0xc0024000
0xefffee58: 0xc002a034
0xefffee54: 0xc0024a90
0xefffee50: 0xc0025878
0xefffee4c: 0x00000001
0xefffee48: 0x0017f020
0xefffee44: 0x0000002c
0xefffee40: 0x0000000f
0xefffee3c: 0x00000000
0xefffee38: 0xfffff7fa
0xefffee34: 0xffffffff
0xefffee30: 0x00009782
0xefffee2c: 0x00000000
0xefffee28: 0x0000001e
0xefffee24: 0xc0025858
0xefffee20: 0xc0025af8
0xefffee1c: 0xc000b376
0xefffee18: 0xc0024000
0xefffee14: 0xc0025878
0xefffee10: 0x0000001d
0xefffee0c: 0xd0001b60
0xefffee08: 0x0000002f
0xefffee04: 0xc002563e
0xefffee00: 0xc0025490

> The last address you show corresponds to 0xeffff640 in first dump above,
> which is at the start of the saved fpregs. I'd say we just miss the
> beginning of the signal frame?
>

It looks like you're right. I'm not sure how I missed that.

So when the signal was delivered, PC == 0xc00e4178 and USP == 0xc00e4178.
Those addresses can be found in the disassembly and the stack contents I
sent previously (quoted above) and it all seems to line up.

> (My reasoning is that copy_siginfo_to_user clears the end of the signal
> stack, which is what we can see in both cases.)
>
> Can't explain the 14 words below the saved return address though.
>

Right. Is it sc_fpstate? Perhaps we should expect QEMU to differ here.

Bottom line is, the corrupted %a3 register would have been saved by the
MOVEM instruction at 0xc00e4178, which turns out to be the PC in the
signal frame. So it certainly looks like the kernel was the culprit here.

Michael Schmitz

unread,
Apr 18, 2023, 1:30:04 AM4/18/23
to
Hi Finn,
OK then ...
According to the manpages I've seen, glibc ought to pick rt signals if
the kernel supports those (which I suppose it does).

>
>> But anyway:
>>
>> The saved pc is 0xc00e81b6 which does match the backtrace above. Vector
>> offset 80 matches trap 0 which suggests 0xc00e81b6 should be the
>> instruction after a trap 0 instruction. d0 is 1055 which is not a signal
>> number I recognize.
>>
>
> I don't know what d0 represents here. But &frame->sig == 0x11 is correct
> (SIGCHLD).

Correct - that all works out. But d0 holds the syscall number when we
enter the kernel via trap 0, and that one is odd.
OK, that's our SIGCHLD. But the signal frame format is odd ...

Frame format b, vector offset 008. That's a bus error? How does that get
on the user mode stack?
USP is 0xeffff004 AFAICS. That's the location 15 was saved to above
(holding libc .got according to your interpretation).

The saved PC is that from the exception frame, in this case a long bus
error sequence fault frame. The PC is that of the instruction executing
when the fault occurred. As you say, that's the moveml saving registers
to the stack.

I don't believe the whole fault frame is on the signal stack in one
contiguous piece, just the first four words, then we have struct
sigcontext. But after that, the extra contents follows, and that nicely
explains the extra bits right below the return address from the
__m68k_read_tp call.

> Those addresses can be found in the disassembly and the stack contents I
> sent previously (quoted above) and it all seems to line up.
>
>> (My reasoning is that copy_siginfo_to_user clears the end of the signal
>> stack, which is what we can see in both cases.)
>>
>> Can't explain the 14 words below the saved return address though.
>>
>
> Right. Is it sc_fpstate? Perhaps we should expect QEMU to differ here.

See above - I think what's stored there is the extra frame content for a
format b bus error frame. But that extra frame is incomplete at best
(should be 22 longwords, only a4 are seen). Probably overwritten by the
stack frame from __GI___wait4_time64.

Let's parse what's left:
<=
>>> 0xefffefe4: 0xc0028780 <= internal registers (6x)
>>> 0xefffefe0: 0x3c344bfb <=
>>> 0xefffefdc: 0x000af353 <=
>>> 0xefffefd8: 0x3c340170 <= internal reg; version no.
>>> 0xefffefd4: 0x00000000 <= data input buffer
>>> 0xefffefd0: 0xc00e417c <= internal registers (2x)
>>> 0xefffefcc: 0xc00e417e <= stage b address
>>> 0xefffefc8: 0xc00e4180 <= internal registers (4x)
>>> 0xefffefc4: 0x48e73c34 <=
>>> 0xefffefc0: 0x00000000 <= data output buffer
>>> 0xefffefbc: 0xefffeff8 <= internal registers (2x)
>>> 0xefffefb8: 0xefffeffc <= data fault address
>>> 0xefffefb4: 0x4bfb0170 <= ins stage c, stage b
>>> 0xefffefb0: 0x0eee0709 <= internal register; ssw

The fault address is the location on the stack where a2 is saved. That
does match the data output buffer contents BTW. fc, fb, rc, rb bits
clear means the fault didn't occur in stage b or c instructions. ssw bit
8 set indicates a data fault - the data cycle should be rerun on rte. rm
and rw bits clear tell us it's a write fault. If the moveml instruction
copies registers to the stack in descending order, the fault address
makes sense - the stack pointer just crossed a page boundary.

>
> Bottom line is, the corrupted %a3 register would have been saved by the
> MOVEM instruction at 0xc00e4178, which turns out to be the PC in the
> signal frame. So it certainly looks like the kernel was the culprit here.

I think the moveml instruction did cause a bus error, and on return from
that exception the signal got delivered.

On entering the buserror handler, only a1 and a2 are saved, but the
comment in entry.h states that a3-a6 and d6, d7 are preserved by C code.
After buserr_c returns, a3 should be restored to what it was when taking
the bus error. All registers restored before rte, the moveml instruction
ought to be able to resume normally.

Unless that register use constraint has changed, I don't see how a3
could have changed midway during return from the bus error exception.
But maybe a disassembly of buserr_c from your kernel could confirm that?

Cheers,

Michael

Finn Thain

unread,
Apr 18, 2023, 10:10:03 PM4/18/23
to
It's got to be the trampoline from setup_frame() because dash did this:

act.sa_flags = 0;
sigfillset(&act.sa_mask);
sigaction(signo, &act, 0);

and the kernel did this:

/* set up the stack frame */
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
err = setup_rt_frame(ksig, oldset, regs);
else
err = setup_frame(ksig, oldset, regs);

> >
> >> But anyway:
> >>
> >> The saved pc is 0xc00e81b6 which does match the backtrace above.
> >> Vector offset 80 matches trap 0 which suggests 0xc00e81b6 should be
> >> the instruction after a trap 0 instruction. d0 is 1055 which is not a
> >> signal number I recognize.
> >>
> >
> > I don't know what d0 represents here. But &frame->sig == 0x11 is
> > correct (SIGCHLD).
>
> Correct - that all works out. But d0 holds the syscall number when we
> enter the kernel via trap 0, and that one is odd.
>

Well, you showed subsequently that the kernel was probably entered via a
page fault and not the get_thread_area trap. Would that explain the d0
value?

> >>> ...
Right, it was a typo. USP is 0xeffff004, where a5 is to be saved.

> The saved PC is that from the exception frame, in this case a long bus
> error sequence fault frame. The PC is that of the instruction executing
> when the fault occurred. As you say, that's the moveml saving registers
> to the stack.
>
> I don't believe the whole fault frame is on the signal stack in one
> contiguous piece, just the first four words, then we have struct
> sigcontext. But after that, the extra contents follows, and that nicely
> explains the extra bits right below the return address from the
> __m68k_read_tp call.
>
> > Those addresses can be found in the disassembly and the stack contents
> > I sent previously (quoted above) and it all seems to line up.
> >
> >> (My reasoning is that copy_siginfo_to_user clears the end of the
> >> signal stack, which is what we can see in both cases.)
> >>
> >> Can't explain the 14 words below the saved return address though.
> >>
> >
> > Right. Is it sc_fpstate? Perhaps we should expect QEMU to differ here.
>
> See above - I think what's stored there is the extra frame content for a
> format b bus error frame. But that extra frame is incomplete at best
> (should be 22 longwords, only a4 are seen). Probably overwritten by the
> stack frame from __GI___wait4_time64.
>

Maybe the exception frame leaked onto the user stack via setup_frame()?

> Let's parse what's left:
> <=
> >>> 0xefffefe4: 0xc0028780 <= internal registers (6x)
> >>> 0xefffefe0: 0x3c344bfb <=
> >>> 0xefffefdc: 0x000af353 <=
> >>> 0xefffefd8: 0x3c340170 <= internal reg; version no.
> >>> 0xefffefd4: 0x00000000 <= data input buffer
> >>> 0xefffefd0: 0xc00e417c <= internal registers (2x)
> >>> 0xefffefcc: 0xc00e417e <= stage b address
> >>> 0xefffefc8: 0xc00e4180 <= internal registers (4x)
> >>> 0xefffefc4: 0x48e73c34 <=
> >>> 0xefffefc0: 0x00000000 <= data output buffer
> >>> 0xefffefbc: 0xefffeff8 <= internal registers (2x)
> >>> 0xefffefb8: 0xefffeffc <= data fault address
> >>> 0xefffefb4: 0x4bfb0170 <= ins stage c, stage b
> >>> 0xefffefb0: 0x0eee0709 <= internal register; ssw
>
> The fault address is the location on the stack where a2 is saved. That
> does match the data output buffer contents BTW. fc, fb, rc, rb bits
> clear means the fault didn't occur in stage b or c instructions. ssw bit
> 8 set indicates a data fault - the data cycle should be rerun on rte. rm
> and rw bits clear tell us it's a write fault. If the moveml instruction
> copies registers to the stack in descending order, the fault address
> makes sense - the stack pointer just crossed a page boundary.
>

Well spotted!

> >
> > Bottom line is, the corrupted %a3 register would have been saved by
> > the MOVEM instruction at 0xc00e4178, which turns out to be the PC in
> > the signal frame. So it certainly looks like the kernel was the
> > culprit here.
>
> I think the moveml instruction did cause a bus error, and on return from
> that exception the signal got delivered.
>

Maybe the signal frame was partially overwritten by the resumed MOVEM?

I wonder what we'd see if we patched the kernel to log every user data
write fault caused by a MOVEM instruction. I'll try to code that up.

> On entering the buserror handler, only a1 and a2 are saved, but the
> comment in entry.h states that a3-a6 and d6, d7 are preserved by C code.
> After buserr_c returns, a3 should be restored to what it was when taking
> the bus error. All registers restored before rte, the moveml instruction
> ought to be able to resume normally.
>
> Unless that register use constraint has changed, I don't see how a3
> could have changed midway during return from the bus error exception.
> But maybe a disassembly of buserr_c from your kernel could confirm that?
>

I disassembled the relevant build. AFAICT, buserr_c() saves and restores
those registers in the right places.

BTW, I've reproduced the failures with kernels built with both GCC 12 and
GCC 6.

Michael Schmitz

unread,
Apr 19, 2023, 4:20:03 AM4/19/23
to
Hi Finn,

Am 19.04.2023 um 13:50 schrieb Finn Thain:
>>>> I would have expected to see a different signal trampoline (for
>>>> sys_rt_sigreturn) ...
>>>
>>> Well, this seems to be the trampoline from setup_frame() and not
>>> setup_rt_frame().
>>
>> According to the manpages I've seen, glibc ought to pick rt signals if
>> the kernel supports those (which I suppose it does).
>>
>
> It's got to be the trampoline from setup_frame() because dash did this:
>
> act.sa_flags = 0;
> sigfillset(&act.sa_mask);
> sigaction(signo, &act, 0);

Ah - dash explicitly requests the old format. Make sense then.
>
> and the kernel did this:
>
> /* set up the stack frame */
> if (ksig->ka.sa.sa_flags & SA_SIGINFO)
> err = setup_rt_frame(ksig, oldset, regs);
> else
> err = setup_frame(ksig, oldset, regs);
>
>>>
>>>> But anyway:
>>>>
>>>> The saved pc is 0xc00e81b6 which does match the backtrace above.
>>>> Vector offset 80 matches trap 0 which suggests 0xc00e81b6 should be
>>>> the instruction after a trap 0 instruction. d0 is 1055 which is not a
>>>> signal number I recognize.
>>>>
>>>
>>> I don't know what d0 represents here. But &frame->sig == 0x11 is
>>> correct (SIGCHLD).
>>
>> Correct - that all works out. But d0 holds the syscall number when we
>> enter the kernel via trap 0, and that one is odd.
>>
>
> Well, you showed subsequently that the kernel was probably entered via a
> page fault and not the get_thread_area trap. Would that explain the d0
> value?

That d0 was from the dash under gdb run. But I got my signal delivery
mixed up - d0 is only expected to hold the syscall number when we issue
a syscall. That would be in the child process, not the parent which we
debug.

d0 is just whatever the parent had in its register when it started
do_signal_return after exception or syscall. On return after syscall, d0
holds the task info flags, maybe that's what we see here.

>> See above - I think what's stored there is the extra frame content for a
>> format b bus error frame. But that extra frame is incomplete at best
>> (should be 22 longwords, only a4 are seen). Probably overwritten by the
>> stack frame from __GI___wait4_time64.
>>
>
> Maybe the exception frame leaked onto the user stack via setup_frame()?

Yes, for exception frames larger than four words the excess is copied
after the end of the sigcontext block.
That's possible - the saved usp in the signal frame is that of the first
register saved to the stack (before the page fault).

> I wonder what we'd see if we patched the kernel to log every user data
> write fault caused by a MOVEM instruction. I'll try to code that up.

If these instructions did always cause stack corruption on 030, I think
we would have noticed long ago?

>
>> On entering the buserror handler, only a1 and a2 are saved, but the
>> comment in entry.h states that a3-a6 and d6, d7 are preserved by C code.
>> After buserr_c returns, a3 should be restored to what it was when taking
>> the bus error. All registers restored before rte, the moveml instruction
>> ought to be able to resume normally.
>>
>> Unless that register use constraint has changed, I don't see how a3
>> could have changed midway during return from the bus error exception.
>> But maybe a disassembly of buserr_c from your kernel could confirm that?
>>
>
> I disassembled the relevant build. AFAICT, buserr_c() saves and restores
> those registers in the right places.
>
> BTW, I've reproduced the failures with kernels built with both GCC 12 and
> GCC 6.

Thanks - that was highly unlikely but had to be checked.

Leaves the possibility that some kernel bug did corrupt the saved a3
copy in struct switch_stack... but that is not used in bus error
exceptions. And the only other use of a3 is in ret_from_kernel_thread
which is called only from copy_thread() ...

Still baffled...

Cheers,

Michael

Finn Thain

unread,
Apr 19, 2023, 7:10:03 AM4/19/23
to
On Tue, 18 Apr 2023, Michael Schmitz wrote:

>
> ... I think what's stored there is the extra frame content for a format
Inspired by your observation about the page fault and stack growth, I
wrote a small test program (given below) that just pushes registers onto
the stack recursively while forking processes and collecting the SIGCHLD
signals.

On a Motorola '030 the stack grows to about 7 MiB before it gets
corrupted. The program detects the stack corruption and terminates
immediately with an illegal instruction. Oddly, the program never detects
any stack corruption when run on the QEMU '040.

root@debian:~# ./movem
Illegal instruction
root@debian:~# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 242
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 242
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
root@debian:~# ulimit -s 7200
root@debian:~# ./movem
Illegal instruction
root@debian:~# ulimit -s 7000
root@debian:~# ./movem
Segmentation fault
root@debian:~# ulimit -s 16384
root@debian:~# ./movem
Illegal instruction
root@debian:~#

Looking at the core dump in gdb, the backtrace has 189869 frames. The dead
stack frames confirm the recursion depth reached the limit I set at 200000
before the stack began to reduce again. This was also confirmed by the
lowest page fault address that was logged by the custom kernel.

That means validation succeeded 200000 - 189869 == 10131 times before it
encountered corruption (I should try to figure out whether this varies).

The registers %a2, %a3 and %a4 below should contain 0x91929394, 0xa1a2a3a4
and 0xb1b2b3b4 respectively. But they don't. Their values were restored
from a corrupted stack by the returning rec() function call.

(gdb) info reg
d0 0x91929394 -1852664940
d1 0xf3 243
d2 0xd1d2d3d4 -774712364
d3 0xe1e2e3e4 -505224220
d4 0xf1f2f3f4 -235736076
d5 0x80003f0c -2147467508
d6 0xd014c528 -803945176
d7 0x0 0
a0 0xc0021708 0xc0021708
a1 0xc0023e8c 0xc0023e8c <__stack_chk_guard>
a2 0xf3 0xf3
a3 0x1464000 0x1464000
a4 0xef97bf44 0xef97bf44
a5 0xc1c2c3c4 0xc1c2c3c4
fp 0xef97b034 0xef97b034
sp 0xef97b018 0xef97b018
ps 0x8 [ N ]
pc 0x800005f6 0x800005f6 <rec+262>
fpcontrol 0x0 0
fpstatus 0x0 0
fpiaddr 0x0 0x0

(gdb) x/z $sp - 36
0xef97aff4: 0xd1d2d3d4
(gdb)
0xef97aff8: 0xe1e2e3e4
(gdb)
0xef97affc: 0xf1f2f3f4
(gdb)
0xef97b000: 0x000000f3
(gdb)
0xef97b004: 0x01464000
(gdb)
0xef97b008: 0xef97bf44
(gdb)
0xef97b00c: 0xc1c2c3c4
(gdb)
0xef97b010: 0xef97b034
(gdb)
0xef97b014: 0x8000055c

As with dash, the corruption lies the page boundary.

Any signal frames or exception frames have been completely overwritten
because the recursion continued after the corruption took place. So
there's not much to see in the core dump.

(gdb) disass rec
Dump of assembler code for function rec:
0x800004f0 <+0>: linkw %fp,#0
0x800004f4 <+4>: moveml %d2-%d4/%a2-%a5,%sp@-
0x800004f8 <+8>: moveal 0x80000672 <i0>,%a2
0x800004fe <+14>: moveal 0x80000676 <i1>,%a3
0x80000504 <+20>: moveal 0x8000067a <i2>,%a4
0x8000050a <+26>: moveal 0x8000067e <i3>,%a5
0x80000510 <+32>: movel 0x80000682 <i4>,%d2
0x80000516 <+38>: movel 0x80000686 <i5>,%d3
0x8000051c <+44>: movel 0x8000068a <i6>,%d4
0x80000522 <+50>: movel 0x80004034 <depth>,%d0
0x80000528 <+56>: andil #2047,%d0
0x8000052e <+62>: bnes 0x80000542 <rec+82>
0x80000530 <+64>: jsr 0x8000042c <fork@plt>
0x80000536 <+70>: tstl %d0
0x80000538 <+72>: bnes 0x80000542 <rec+82>
0x8000053a <+74>: clrl %sp@-
0x8000053c <+76>: jsr 0x80000404 <exit@plt>
0x80000542 <+82>: movel 0x80004034 <depth>,%d0
0x80000548 <+88>: subql #1,%d0
0x8000054a <+90>: movel %d0,0x80004034 <depth>
0x80000550 <+96>: movel 0x80004034 <depth>,%d0
0x80000556 <+102>: beqs 0x8000055c <rec+108>
0x80000558 <+104>: jsr %pc@(0x800004f0 <rec>)
0x8000055c <+108>: movel %a2,0x8000403c <o0>
0x80000562 <+114>: movel %a3,0x80004040 <o1>
0x80000568 <+120>: movel %a4,0x80004044 <o2>
0x8000056e <+126>: movel %a5,0x80004048 <o3>
0x80000574 <+132>: movel %d2,0x8000404c <o4>
0x8000057a <+138>: movel %d3,0x80004050 <o5>
0x80000580 <+144>: movel %d4,0x80004054 <o6>
0x80000586 <+150>: movel 0x8000403c <o0>,%d1
0x8000058c <+156>: movel #-1852664940,%d0
0x80000592 <+162>: cmpl %d1,%d0
0x80000594 <+164>: bnes 0x800005f6 <rec+262>
0x80000596 <+166>: movel 0x80004040 <o1>,%d1
0x8000059c <+172>: movel #-1583176796,%d0
0x800005a2 <+178>: cmpl %d1,%d0
0x800005a4 <+180>: bnes 0x800005f6 <rec+262>
0x800005a6 <+182>: movel 0x80004044 <o2>,%d1
0x800005ac <+188>: movel #-1313688652,%d0
0x800005b2 <+194>: cmpl %d1,%d0
0x800005b4 <+196>: bnes 0x800005f6 <rec+262>
0x800005b6 <+198>: movel 0x80004048 <o3>,%d1
0x800005bc <+204>: movel #-1044200508,%d0
0x800005c2 <+210>: cmpl %d1,%d0
0x800005c4 <+212>: bnes 0x800005f6 <rec+262>
0x800005c6 <+214>: movel 0x8000404c <o4>,%d1
0x800005cc <+220>: movel #-774712364,%d0
0x800005d2 <+226>: cmpl %d1,%d0
0x800005d4 <+228>: bnes 0x800005f6 <rec+262>
0x800005d6 <+230>: movel 0x80004050 <o5>,%d1
0x800005dc <+236>: movel #-505224220,%d0
0x800005e2 <+242>: cmpl %d1,%d0
0x800005e4 <+244>: bnes 0x800005f6 <rec+262>
0x800005e6 <+246>: movel 0x80004054 <o6>,%d1
0x800005ec <+252>: movel #-235736076,%d0
0x800005f2 <+258>: cmpl %d1,%d0
0x800005f4 <+260>: beqs 0x800005f8 <rec+264>
=> 0x800005f6 <+262>: illegal
0x800005f8 <+264>: nop
0x800005fa <+266>: moveml %fp@(-28),%d2-%d4/%a2-%a5
0x80000600 <+272>: unlk %fp
0x80000602 <+274>: rts
End of assembler dump.

---

#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>

int depth = 200000;

const unsigned long i0 = 0x91929394;
const unsigned long i1 = 0xa1a2a3a4;
const unsigned long i2 = 0xb1b2b3b4;
const unsigned long i3 = 0xc1c2c3c4;
const unsigned long i4 = 0xd1d2d3d4;
const unsigned long i5 = 0xe1e2e3e4;
const unsigned long i6 = 0xf1f2f3f4;

unsigned long o0;
unsigned long o1;
unsigned long o2;
unsigned long o3;
unsigned long o4;
unsigned long o5;
unsigned long o6;

static void rec(void)
{
// initialize registers
asm( " move.l %0, %%a2\n"
" move.l %1, %%a3\n"
" move.l %2, %%a4\n"
" move.l %3, %%a5\n"
" move.l %4, %%d2\n"
" move.l %5, %%d3\n"
" move.l %6, %%d4\n"
:
: "m" (i0), "m" (i1), "m" (i2),
"m" (i3), "m" (i4), "m" (i5), "m" (i6)
: "a2", "a3", "a4", "a5", "d2", "d3", "d4"
);

// maybe fork a short-lived process
if ((depth & 0x7ff) == 0)
if (fork() == 0)
exit(0);

if (--depth)
rec(); // callee to save & restore registers

// compare register contents
asm( " move.l %%a2, %0\n"
" move.l %%a3, %1\n"
" move.l %%a4, %2\n"
" move.l %%a5, %3\n"
" move.l %%d2, %4\n"
" move.l %%d3, %5\n"
" move.l %%d4, %6\n"
: "=m" (o0), "=m" (o1), "=m" (o2),
"=m" (o3), "=m" (o4), "=m" (o5), "=m" (o6)
:
:
);
if (o0 != i0 || o1 != i1 || o2 != i2 ||
o3 != i3 || o4 != i4 || o5 != i5 || o6 != i6)
asm("illegal");
}

static void handler(int)
{
}

int main(void)
{
struct sigaction act;

memset(&act, 0, sizeof(act));
act.sa_handler = handler;
sigaction(SIGCHLD, &act, NULL);

rec();
}

Geert Uytterhoeven

unread,
Apr 19, 2023, 8:00:03 AM4/19/23
to
Hi Finn,

On Wed, Apr 19, 2023 at 12:53 PM Finn Thain <fth...@linux-m68k.org> wrote:
> Inspired by your observation about the page fault and stack growth, I
> wrote a small test program (given below) that just pushes registers onto
> the stack recursively while forking processes and collecting the SIGCHLD
> signals.
>
> On a Motorola '030 the stack grows to about 7 MiB before it gets
> corrupted. The program detects the stack corruption and terminates
> immediately with an illegal instruction. Oddly, the program never detects
> any stack corruption when run on the QEMU '040.

Nice! (the reproducer, not the bug ;-)

Does it also fail on a very old kernel image you still have lying around?
Just to rule out a recent kernel bug.

Michael Schmitz

unread,
Apr 19, 2023, 9:00:04 PM4/19/23
to
Hi Finn,
That's great - finally irrefutable confirmation that we're onto a kernel
bug.
Hence implies a page fault handled at the page boundary.

Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we
would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?

> Any signal frames or exception frames have been completely overwritten
> because the recursion continued after the corruption took place. So
> there's not much to see in the core dump.

We'd need a way to stop recursion once the first corruption has taken
place. If the 'safe' recursion depth of 10131 is constant, the dump
taken at that point should look similar to what you saw in dash
(assuming it is the page fault and subsequent signal return that causes
the corruption).

I'll give your test case a spin on my Falcon. I should also have a range
of kernels to test and answer Geert's question...

Cheers,

    Michael

Finn Thain

unread,
Apr 19, 2023, 9:20:03 PM4/19/23
to

On Wed, 19 Apr 2023, Geert Uytterhoeven wrote:

> Does it also fail on a very old kernel image you still have lying
> around? Just to rule out a recent kernel bug.
>

These are two mainline builds I've tried (among others):

[ 0.000000] Linux version 4.14.0-mac (fthain@nippy) (gcc version 6.4.0
(btc)) #1 Mon May 23 15:03:18 AEST 2022

[ 0.000000] Linux version 6.3.0-rc5-mac (fthain@nippy)
(m68k-unknown-linux-musl-gcc (Gentoo 12.2.1_p20221224 p7) 12.2.1 20221224,
GNU ld (Gentoo 2.39 p5) 2.39.0) #1 Thu Apr 20 10:10:37 AEST 2023

The results are the same.

Finn Thain

unread,
Apr 19, 2023, 11:20:03 PM4/19/23
to
On Thu, 20 Apr 2023, Michael Schmitz wrote:

> Can you try and fault in as many of these stack pages as possible, ahead
> of filling the stack? (Depending on how much RAM you have ...). Maybe we
> would need to lock those pages into memory? Just to show that with no
> page faults (but still signals) there is no corruption?
>

OK.

> > Any signal frames or exception frames have been completely overwritten
> > because the recursion continued after the corruption took place. So
> > there's not much to see in the core dump.
>
> We'd need a way to stop recursion once the first corruption has taken
> place. If the 'safe' recursion depth of 10131 is constant, the dump
> taken at that point should look similar to what you saw in dash
> (assuming it is the page fault and subsequent signal return that causes
> the corruption).
>

It turns out that the recursion depth can be set a lot lower than the
200000 that I chose in that test program. (I used that value as it kept
the stack size just below the default 8192 kB limit.)

At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
still get an intermittent failure. This only required 21 stack pagefaults
and one fork.

I suspect that the location of the corruption is probably somewhat random,
and the larger the stack happens to be when the signal comes in, the
better the odds of detection.

Michael Schmitz

unread,
Apr 20, 2023, 12:10:03 AM4/20/23
to
Hi Finn,

reproduced on my Falcon (with minor mods to the C source - my version of
gcc didn't like asm with no clobbers, so I added "memory" as clobber in
the second asm block). In this case it's a4 that is corrupted, but that
varies.

depth of 4096 gets me two core dumps on 20 attempts so this isn't quite
as fast on my Falcon. With 8192, it's nine.

Example:

Core was generated by `./moveml'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0 0x8000060e in rec ()
(gdb) info reg
d0 0x8000057c -2147482244
d1 0xc0017000 -1073647616
d2 0xd1d2d3d4 -774712364
d3 0xe1e2e3e4 -505224220
d4 0xf1f2f3f4 -235736076
d5 0x80096168 -2146868888
d6 0x80093108 -2146881272
d7 0x0 0
a0 0x0 0x0
a1 0xefdadbdc 0xefdadbdc
a2 0x91929394 0x91929394
a3 0xa1a2a3a4 0xa1a2a3a4
a4 0x8000057c 0x8000057c
a5 0xc1c2c3c4 0xc1c2c3c4
fp 0xef87402c 0xef87402c
sp 0xef874010 0xef874010
ps 0x209 521
pc 0x8000060e 0x8000060e <rec+242>
fpcontrol 0x0 0
fpstatus 0x0 0
fpiaddr 0x0 0
(gdb)


Am 20.04.2023 um 14:57 schrieb Finn Thain:
> On Thu, 20 Apr 2023, Michael Schmitz wrote:
>
>> Can you try and fault in as many of these stack pages as possible, ahead
>> of filling the stack? (Depending on how much RAM you have ...). Maybe we
>> would need to lock those pages into memory? Just to show that with no
>> page faults (but still signals) there is no corruption?
>>
>
> OK.
>
>>> Any signal frames or exception frames have been completely overwritten
>>> because the recursion continued after the corruption took place. So
>>> there's not much to see in the core dump.
>>
>> We'd need a way to stop recursion once the first corruption has taken
>> place. If the 'safe' recursion depth of 10131 is constant, the dump
>> taken at that point should look similar to what you saw in dash
>> (assuming it is the page fault and subsequent signal return that causes
>> the corruption).
>>
>
> It turns out that the recursion depth can be set a lot lower than the
> 200000 that I chose in that test program. (I used that value as it kept
> the stack size just below the default 8192 kB limit.)

And it does keep the core a lot smaller. Still not hard to work with on
my 14 MB RAM Falcon...

>
> At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
> still get an intermittent failure. This only required 21 stack pagefaults
> and one fork.
>
> I suspect that the location of the corruption is probably somewhat random,
> and the larger the stack happens to be when the signal comes in, the
> better the odds of detection.

Yep, but there must me some more to that. Timing of page faults due to
swap bandwidth, perhaps?

Cheers,

Michael

>

Finn Thain

unread,
Apr 20, 2023, 1:40:03 AM4/20/23
to
On Thu, 20 Apr 2023, Michael Schmitz wrote:

> >
> > As with dash, the corruption lies the page boundary.
>
> Hence implies a page fault handled at the page boundary.
>
> Can you try and fault in as many of these stack pages as possible, ahead
> of filling the stack? (Depending on how much RAM you have ...). Maybe we
> would need to lock those pages into memory? Just to show that with no
> page faults (but still signals) there is no corruption?
>

I modified the test program to execute rec() to full depth with no
forking, then do it again with forking.

root@(none):/root# while ./stack-test 5000 ; do : ; done
starting recursion
done.
starting recursion with fork
done.
starting recursion
done.
starting recursion with fork
Illegal instruction
root@(none):/root#

I can't get this to crash during the first descent. The second descent
always crashes, given sufficient depth:

root@(none):/root# while ./stack-test 50000 ; do : ; done
starting recursion
done.
starting recursion with fork
Illegal instruction

So all the stack pages would have been faulted in well before the failure
shows up. It appears to be the signal that's the problem and not the page
fault. That's not surprising considering the PC in the signal frame in the
dash crash was a MOVEM saving registers onto the stack.

It's worth noting that the test program never crashes with a corrupted
return address. Random corruption would have clobbered that address about
10% of the time, since the entire rec() stack frame is 9 long words. So it
must be that a MOVEM went awry when a signal got delivered.

Finn Thain

unread,
Apr 20, 2023, 2:10:04 AM4/20/23
to
On Thu, 20 Apr 2023, I wrote:

>
> So it must be that a MOVEM went awry when a signal got delivered.

Or signal delivery went awry after a MOVEM got resumed?

Finn Thain

unread,
Apr 20, 2023, 2:20:03 AM4/20/23
to
On Wed, 19 Apr 2023, I wrote:

> Oddly, the program never detects any stack corruption when run on the
> QEMU '040.
>

I tested a Motorola '040 and got the same result.
It is loading more messages.
0 new messages