On Thu, Feb 26, 2026 at 9:40 AM Russ Cox <r...@swtch.com> wrote:
> Apologies for the excessive snark.
Ha, no, it's fine: in this case, I was agreeing with your assessment:
the `forkall` behavior _is_ insane.
> However, I seriously can't imagine why you would ever fork all the threads. It makes no sense at all. At the point where one thread is calling fork, that thread is in a state where it expects to be duplicated and has a way to signal to the two copies which one they are. Any other thread is just doing its thing independently and then bam! there are two copies of the thread doing the same exact thing, none the wiser, sharing output file descriptors, network connections, and so on. You need some kind of coordination to make this reasonable. If you were coordinating with all the threads and quiesced them all somehow and then did a forkall and woke them back up with different messages in the parent and child, maybe that could work, but I still don't quite see the utility of that approach versus having the child kick off new threads instead. And it's a lot of kernel work to duplicate all the threads, all for this use case that is almost impossible to invoke correctly.
I'm speculating, but my guess is that this was/is for compatibility
with the earlier `libthread` library that implemented M:N threading. I
don't know that anybody thought it was a good idea, really, but
Solaris has an almost obsessive commitment to backwards compatibility
(in contrast to SunOS4, I think).
> Fork1 has problems, of course, but forkall has so many more problems.
>
> Anyway, enough speculation. I wrote a C program to test the hypothesis, and it seems to indicate that the thread stack is lost across fork. This runs fine on my Mac but crashes on one of the Go solaris builders, which has this uname -a output:
>
> SunOS s11-i386.foss 5.11 11.4.93.215.0 i86pc i386 i86pc kernel-zone
>
> [snip]
Thanks, this is very handy; I was able to reproduce this behavior on
illumos. Here's `truss` output:
: spitfire; truss -f ./rsc
8669: execvex("rsc", 0xFFFFFC7FFFDF9138, 0xFFFFFC7FFFDF9148, 0) argc = 1
8669: sysinfo(SI_MACHINE, "i86pc", 257) = 6
8669: mmap(0x00000000, 56, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FE3030000
8669: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FE5AC0000
8669: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FED2A0000
8669: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FEDAC0000
8669: memcntl(0xFFFFFC7FEF399000, 88824, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
8669: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FEF320000
8669: memcntl(0x00400000, 4528, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
8669: resolvepath("/usr/lib/amd64/ld.so.1", "/lib/amd64/ld.so.1", 1023) = 18
8669: getcwd("/home/cross", 1019) = 0
8669: resolvepath("/home/cross/rsc", "/home/cross/rsc", 1023) = 15
8669: stat("/home/cross/rsc", 0xFFFFFC7FFFDF8D80) = 0
8669: open("/var/ld/64/ld.config", O_RDONLY) Err#2 ENOENT
8669: stat("/usr/gcc/14/lib/amd64/libc.so.1", 0xFFFFFC7FFFDF8150) Err#2 ENOENT
8669: stat("/lib/64/libc.so.1", 0xFFFFFC7FFFDF8150) = 0
8669: resolvepath("/lib/64/libc.so.1", "/lib/amd64/libc.so.1", 1023) = 20
8669: open("/lib/64/libc.so.1", O_RDONLY) = 3
8669: mmapobj(3, MMOBJ_INTERPRET, 0xFFFFFC7FEF320BE8,
0xFFFFFC7FFFDF80AC, 0x00000000) = 0
8669: close(3) = 0
8669: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FEDEE0000
8669: memcntl(0xFFFFFC7FEDD00000, 446360, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
8669: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FEE1D0000
8669: mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFFFFFC7FEF1F0000
8669: getcontext(0xFFFFFC7FFFDF88A0)
8669: getrlimit(RLIMIT_STACK, 0xFFFFFC7FFFDF8890) = 0
8669: getpid() = 8669 [8668]
8669: lwp_private(0, 0, 0xFFFFFC7FEF1F2A40) = 0x00000000
8669: getrandom("C1AC HDA TBB13 X", 8, 0) = 8
8669: setustack(0xFFFFFC7FEF1F2AE8)
8669: lwp_cond_broadcast(0xFFFFFC7FEDEE01A8) = 0
8669: lwp_cond_broadcast(0xFFFFFC7FEF3201A8) = 0
8669: sysi86(SI86FPSTART, 0xFFFFFC7FFFDF90DC, 0x0000137F, 0x00001F80)
= 0x00000001
8669: sysconfig(_CONFIG_PAGESIZE) = 4096
8669: schedctl() = 0xFFFFFC7FEF31E000
8669: priocntlsys(1, 0xFFFFFC7FFFDF8DE0, 3, 0xFFFFFC7FFFDF8FF0, 0) = 8669
8669: priocntlsys(1, 0xFFFFFC7FFFDF8D50, 1, 0xFFFFFC7FFFDF8F10, 0) = 4
8669: priocntlsys(1, 0xFFFFFC7FFFDF8CF0, 0, 0xFFFFFC7FEDE90BA8, 0) = 4
8669: mmap(0x00000000, 131072, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FEF218000
8669: mmap(0x00000000, 65536, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANON, 4294967295, 0) = 0xFFFFFC7FEF2F0000
8669: sigaction(SIGCANCEL, 0xFFFFFC7FFFDF8BC0, 0x00000000) = 0
8669: sysconfig(_CONFIG_STACK_PROT) = 3
8669: mmap(0x00000000, 2088960, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_NORESERVE|MAP_ANON, 4294967295, 0) =
0xFFFFFC7FEE379000
8669: mmap(0x00010000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON|MAP_ALIGN, 4294967295, 0) = 0xFFFFFC7FEF070000
8669: uucopy(0xFFFFFC7FFFDF8B70, 0xFFFFFC7FEE576FE8, 24) = 0
8669: lwp_create(0xFFFFFC7FFFDF8C80, LWP_SUSPENDED, 0xFFFFFC7FFFDF8C7C) = 2
8669/2: lwp_create() (returning as new lwp ...) = 0
8669/1: lwp_continue(2) = 0
8669/2: setustack(0xFFFFFC7FEF0702E8)
8669/1: mmap(0x00000000, 2088960, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_NORESERVE|MAP_ANON, 4294967295, 0) =
0xFFFFFC7FEEAF7000
8669/2: schedctl() = 0xFFFFFC7FEF31E010
8669/1: uucopy(0xFFFFFC7FFFDF8B70, 0xFFFFFC7FEECF4FE8, 24) = 0
8669/1: lwp_create(0xFFFFFC7FFFDF8C80, LWP_SUSPENDED, 0xFFFFFC7FFFDF8C7C) = 3
8669/3: lwp_create() (returning as new lwp ...) = 0
8669/1: lwp_continue(3) = 0
8669/3: setustack(0xFFFFFC7FEF070AE8)
8669/3: schedctl() = 0xFFFFFC7FEF31E020
8669/3: ioctl(1, TCGETA, 0xFFFFFC7FEECF3C50) = 0
8669/3: fstat(1, 0xFFFFFC7FEECF3BD0) = 0
parent: *p = 42
8669/3: write(1, " p a r e n t : * p =".., 16) = 16
8669/3: lwp_suspend(1) = 0
8669/3: lwp_suspend(2) = 0
8669/3: forkx(0) = 8670
8669/3: lwp_continue(1) = 0
8670: forkx() (returning as child ...) = 8669
8669/3: lwp_continue(2) = 0
8670: getpid() = 8670 [8669]
8669/3: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000,
0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
parent: pid = 8670
8669/3: write(1, " p a r e n t : p i d ".., 19) = 19
8670: lwp_self() = 3
8670: munmap(0xFFFFFC7FEE379000, 2088960) = 0
8670: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000,
0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
8670: Incurred fault #6, FLTBOUNDS %pc = 0x004014B6
8670: siginfo: SIGSEGV SEGV_MAPERR addr=0xFFFFFC7FEE576FAC
8670: Received signal #11, SIGSEGV [default]
8670: siginfo: SIGSEGV SEGV_MAPERR addr=0xFFFFFC7FEE576FAC
8669/3: Received signal #18, SIGCLD, in waitid() [default]
8669/3: siginfo: SIGCLD CLD_DUMPED pid=8670 status=0x000B
8669/3: waitid(P_PID, 8670, 0xFFFFFC7FEECF4E50, WEXITED|WTRAPPED) = 0
child signal 11
8669/3: write(1, " c h i l d s i g n a l".., 16) = 16
8669/3: lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0xFFFFFFF7, 0x000003FF,
0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
8669/3: open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_SGS.mo",
O_RDONLY) Err#2 ENOENT
8669/3: mmap(0x00010000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON|MAP_ALIGN, 4294967295, 0) = 0xFFFFFC7FEEFA0000
8669/3: lwp_exit()
8669/1: lwp_wait(3, 0xFFFFFC7FFFDF904C) = 0
8669/1: _exit(0)
: spitfire;
Note that the faulting address in the child is 0xfffffc7fee576fac.
That lies within the anonymous mapping the parent created at
[0xfffffc7fee379000..0xfffffc7fee577000), which the child explicitly
`munmap`ped shortly after it started running. So it appears that the
mapping _was_ copied into the child, but the child elected to discard
it.
I wonder where that came from. Ah, I see it; the pthreads library
has a function, `postfork1_child`, that runs in the child after a
`fork1` and clears out LWPs that are no longer runnable. It contains
this comment:
```
/*
* All lwps except ourself are gone. Mark them so.
* First mark all of the lwps that have already been freed.
* Then mark and free all of the active lwps except ourself.
* Since we are single-threaded, no locks are required here.
*/
```
The code stanza following that comment does exactly what it says. The
LWP free routine puts the stacks of now-dead LWPs into a per-process
cache, but then trims that cache to stay within some bound (which
appears to be tied to the number of active threads), and the trim
function will `munmap` a stack.
So that's almost certainly what's going on: the memory _is_ copied
into the child, but the child's post-fork thread cleanup goo unmaps
it.
- Dan C.