Linux: debugging leftover processes and zombies


Paweł Hajdan, Jr.

May 15, 2012, 10:47:18 AM
to chromium-dev
It seems there are problems with processes still running after quitting chrome, some of them zombies. See https://bugs.gentoo.org/show_bug.cgi?id=413637 and also Chromium bugs linked from there:


Is there some info needed from the users to fix those issues? Should those bugs be marked as important and get owners?

Paweł Hajdan, Jr.

May 31, 2012, 12:35:53 PM
to chromium-dev
FYI, those problems were caused by tcmalloc. After I disabled it with the gyp switch -Dlinux_use_tcmalloc=0, users reported that the problems stopped.

William Chan (陈智昌)

May 31, 2012, 12:48:38 PM
to phajd...@chromium.org, chromium-dev
Not that it's impossible, but I'm kinda skeptical of that claim :) You have any stronger evidence?

--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

Evan Martin

May 31, 2012, 1:01:57 PM
to will...@chromium.org, phajd...@chromium.org, chromium-dev
From a comment on the bug:
"I have been experiencing the same problem. It was random - sometimes
it shut down nicely, sometimes it left those processes consuming ~1%
CPU. Gdb showed that it is a problem related to Chrome's custom malloc
(tcmalloc) in combination with the Nvidia drivers (in particular, the
driver calling tcmalloc's malloc(), and the malloc waiting for
something that never happened)."

We've had these sorts of bugs before. Doesn't seem so surprising to
me except in how it manifests as a hang instead of a crash.

William Chan (陈智昌)

May 31, 2012, 1:10:11 PM
to Evan Martin, phajd...@chromium.org, chromium-dev
OK, that counts as stronger evidence :) Why aren't we terminating processes with prejudice when the parent (browser) process goes away?

William Chan (陈智昌)

May 31, 2012, 1:36:43 PM
to Evan Martin, phajd...@chromium.org, chromium-dev
Oh, I see in the comments this logic is flawed because the browser process may die anyway. OnChannelError() is supposed to lead to renderer suicide, but I guess if the IPC thread is stuck in malloc() (can i haz stacktrace plz?) then we'll never process OnChannelError() and never complete. Sounds kinda crappy I guess. We could throw a thread at it to detect channel errors, but it might be difficult to fix this case because we have to avoid re-entering malloc() within that thread. Really, the root of the problem is most likely the nvidia driver, although I can't figure it out without at least a stacktrace. I hate nvidia linux drivers.

Sad panda.

Paweł Hajdan, Jr.

Jun 1, 2012, 10:28:14 AM
to William Chan (陈智昌), Evan Martin, chromium-dev
On Thu, May 31, 2012 at 7:36 PM, William Chan (陈智昌) <will...@chromium.org> wrote:
Oh, I see in the comments this logic is flawed because the browser process may die anyway.

I'd expect that to also result in termination of all child processes, because of the PID namespace.
 
OnChannelError() is supposed to lead to renderer suicide, but I guess if the IPC thread is stuck in malloc() (can i haz stacktrace plz?) then we'll never process OnChannelError() and never complete. Sounds kinda crappy I guess.

Yeah, I asked the user for the stacktrace.
 
We could throw a thread at it to detect channel errors, but it might be difficult to fix this case because we have to avoid re-entering malloc() within that thread.

Sounds like a hack.
 
Really, the root of the problem is most likely the nvidia driver, although I can't figure it out without at least a stacktrace. I hate nvidia linux drivers.

I'm not a fan of them either, but Chrome is not without fault. It seems that using glibc's malloc consistently doesn't trigger this issue. I think you might be able to test this on some machine with nvidia drivers... or just take a look whether tcmalloc hooks all available malloc functions in glibc, and that the nvidia driver sees and uses those hooks.

David Klempner

Aug 23, 2012, 4:11:28 AM
to chromi...@chromium.org, Evan Martin, phajd...@chromium.org


On Thursday, May 31, 2012 10:36:43 AM UTC-7, William Chan wrote:
Oh, I see in the comments this logic is flawed because the browser process may die anyway. OnChannelError() is supposed to lead to renderer suicide, but I guess if the IPC thread is stuck in malloc() (can i haz stacktrace plz?) then we'll never process OnChannelError() and never complete. Sounds kinda crappy I guess. We could throw a thread at it to detect channel errors, but it might be difficult to fix this case because we have to avoid re-entering malloc() within that thread. Really, the root of the problem is most likely the nvidia driver, although I can't figure it out without at least a stacktrace.

I can get you a full fledged coredump if you want. I fairly reliably get a few every time I restart Chrome on my home desktop. From 21.0.1180.81:

(gdb) bt
#0  0x00007fd498c2d839 in syscall () from /lib64/libc.so.6
#1  0x00007fd49f3e784e in ?? ()
#2  0x00007fd49f3e76ac in ?? ()
#3  0x00007fd4a1e837dd in calloc ()
#4  0x00007fd496a162c3 in ?? () from /usr/lib64/libGL.so.1
#5  0x00007fd49316555f in ?? () from /usr/lib64/libnvidia-glcore.so.302.17
#6  0x00007fd4969eff2d in ?? () from /usr/lib64/libGL.so.1
#7  0x00007fd4969f6a0f in ?? () from /usr/lib64/libGL.so.1
#8  0x00007fd4969f6b28 in ?? () from /usr/lib64/libGL.so.1
#9  0x00007fd4969f700c in ?? () from /usr/lib64/libGL.so.1
#10 0x00007fd498c00866 in fork () from /lib64/libc.so.6
#11 0x00007fd49fb94fd5 in ?? ()
#12 0x00007fd4a155febf in ?? ()
#13 0x00007fd4a155fbb5 in ?? ()
#14 0x00007fd4a148c7fb in ?? ()
#15 0x00007fd4a148d8fd in ?? ()
#16 0x00007fd49fb84c39 in ?? ()
#17 0x00007fd49fb854eb in ?? ()
#18 0x00007fd49fb85af8 in ?? ()
#19 0x00007fd49fb89099 in ?? ()
#20 0x00007fd49fb81e5c in ?? ()
#21 0x00007fd49fba813d in ?? ()
#22 0x00007fd49fba5562 in ?? ()
#23 0x00007fd49afa0ec6 in start_thread () from /lib64/libpthread.so.0
#24 0x00007fd498c30b8d in clone () from /lib64/libc.so.6

I believe it is stuck in SpinLockDelay.

Paweł Hajdan, Jr.

Dec 17, 2012, 2:50:17 PM
to chromi...@chromium.org, Evan Martin, phajd...@chromium.org
Yup, that seems to be a problem with fork() and nvidia-drivers. William, could you take a look? Please let me know if you need more info, I'm pretty sure I can get it. Another Gentoo developer and chromium package maintainer can reproduce this.

More detailed stack trace from https://413637.bugs.gentoo.org/attachment.cgi?id=332444 (I've stripped out info about locals etc to make it shorter; the full backtrace is at the mentioned url) :

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
#1  0x00007f838d23e5cf in base::internal::SpinLockDelay (
    w=0x7f839136df40 <tcmalloc::Static::pageheap_lock_>, value=2, loop=<optimized out>)
    at third_party/tcmalloc/chromium/src/base/spinlock_linux-inl.h:98
#2  0x00007f838d23e4cc in SpinLock::SlowLock (
    this=0x7f839136df40 <tcmalloc::Static::pageheap_lock_>)
    at third_party/tcmalloc/chromium/src/base/spinlock.cc:132
#3  0x00007f838ffbc97e in Lock (this=<optimized out>)
    at ./third_party/tcmalloc/chromium/src/base/spinlock.h:75
#4  SpinLockHolder (l=<optimized out>, this=<synthetic pointer>)
    at ./third_party/tcmalloc/chromium/src/base/spinlock.h:141
#5  do_malloc_pages (size=253952, heap=0x7f83931199c0)
    at ./third_party/tcmalloc/chromium/src/tcmalloc.cc:1062
#6  do_malloc (size=253743) at ./third_party/tcmalloc/chromium/src/tcmalloc.cc:1100
#7  cpp_alloc (size=<optimized out>, nothrow=<optimized out>)
    at ./third_party/tcmalloc/chromium/src/tcmalloc.cc:1400
#8  do_malloc_or_cpp_alloc (size=253743) at ./third_party/tcmalloc/chromium/src/tcmalloc.cc:1023
#9  do_calloc (elem_size=<optimized out>, n=<optimized out>)
    at ./third_party/tcmalloc/chromium/src/tcmalloc.cc:1112
#10 tc_calloc (n=<optimized out>, elem_size=<optimized out>)
    at ./third_party/tcmalloc/chromium/src/tcmalloc.cc:1557
#11 0x00007f838055e3ac in ?? () from /usr/lib64/libGL.so.1
#12 0x00007f837d4e69ef in ?? () from /usr/lib64/libnvidia-glcore.so.310.19
#13 0x00007f83805371fd in ?? () from /usr/lib64/libGL.so.1
#14 0x00007f838053d9ff in ?? () from /usr/lib64/libGL.so.1
#15 0x00007f838053db34 in ?? () from /usr/lib64/libGL.so.1
#16 0x00007f838053e33f in ?? () from /usr/lib64/libGL.so.1
#17 0x00007f83823eb7ae in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:189
#18 0x00007f838d8ddd45 in base::LaunchProcess (argv=std::vector of length 4, capacity 4 = {...}, 
    options=..., process_handle=0x7f837746cf7c) at base/process_util_posix.cc:592
#19 0x00007f838f1eedc6 in content::ZygoteHostImpl::AdjustRendererOOMScore (this=<optimized out>, 
    pid=8257, score=300) at content/browser/zygote_host/zygote_host_impl_linux.cc:396
#20 0x00007f838f1efd29 in content::ZygoteHostImpl::ForkRequest (this=0x7f839319a640, argv=..., 
    mapping=..., process_type=...) at content/browser/zygote_host/zygote_host_impl_linux.cc:335
#21 0x00007f838f0e458a in content::ChildProcessLauncher::Context::LaunchInternal (this_object=..., 
    client_thread_id=content::BrowserThread::UI, child_process_id=10, use_zygote=true, 
    env=std::vector of length 0, capacity 0, ipcfd=142, cmd_line=0x7f8396dec870)
    at content/browser/child_process_launcher.cc:206
#22 0x00007f838f0e3fda in Run (a5=..., a1=..., this=<synthetic pointer>, a2=<optimized out>, 
    a3=<optimized out>, a4=<optimized out>, a6=<optimized out>, a7=<optimized out>)
    at ./base/bind_internal.h:584
#23 MakeItSo (a7=@0x7f839392afb0: 0x7f8396dec870, a6=@0x7f839392afa8: 142, 
    a5=std::vector of length 0, capacity 0, a4=@0x7f839392af88: true, a3=@0x7f839392af84: 10, 
    a2=@0x7f839392af80: content::BrowserThread::UI, a1=0x7f8396a64510, runnable=...)
    at ./base/bind_internal.h:1068
#24 base::internal::Invoker<7, base::internal::BindState<base::internal::RunnableAdapter<void (*)(scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>, void (scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*), void (scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > >, int, CommandLine*)>, void (scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>::Run(base::internal::BindStateBase*) (base=0x7f839392af60)
    at ./base/bind_internal.h:2518
#25 0x00007f838d8c4e13 in Run (this=0x7f837746e1e8) at ./base/callback.h:396
#26 MessageLoop::RunTask (this=this@entry=0x7f837746eac0, pending_task=...)
    at base/message_loop.cc:473
#27 0x00007f838d8c5cf8 in MessageLoop::DeferOrRunPendingTask (this=this@entry=0x7f837746eac0, 
    pending_task=...) at base/message_loop.cc:485
#28 0x00007f838d8c78a9 in DoWork (this=<optimized out>) at base/message_loop.cc:668
#29 MessageLoop::DoWork (this=0x7f837746eac0) at base/message_loop.cc:647
#30 0x00007f838d8caba9 in base::MessagePumpDefault::Run (this=0x7f83932678e0, 
    delegate=0x7f837746eac0) at base/message_pump_default.cc:29
#31 0x00007f838d8c7964 in MessageLoop::RunInternal (this=0x7f837746eac0)
    at base/message_loop.cc:430
#32 0x00007f838d8df798 in base::RunLoop::Run (this=0x7f837746e5b0) at base/run_loop.cc:45
#33 0x00007f838d8c4234 in MessageLoop::Run (this=<optimized out>) at base/message_loop.cc:310
#34 0x00007f838f0df4f4 in content::BrowserThreadImpl::ProcessLauncherThreadRun (
    this=this@entry=0x7f8393233910, message_loop=message_loop@entry=0x7f837746eac0)
    at content/browser/browser_thread_impl.cc:137
#35 0x00007f838f0e0143 in content::BrowserThreadImpl::Run (this=0x7f8393233910, 
    message_loop=0x7f837746eac0) at content/browser/browser_thread_impl.cc:173
#36 0x00007f838d8f923f in base::Thread::ThreadMain (this=0x7f8393233910)
    at base/threading/thread.cc:195
#37 0x00007f838d8f4b11 in base::(anonymous namespace)::ThreadFunc (params=<optimized out>)
    at base/threading/platform_thread_posix.cc:65
#38 0x00007f838c7af006 in start_thread (arg=0x7f837746f700) at pthread_create.c:305
#39 0x00007f838241bbad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Evan Martin

Dec 17, 2012, 3:34:17 PM
to Paweł Hajdan, Jr., chromium-dev
From the stack, it's hanging while attempting to spawn a process: the "adjust OOM score" mode of the sandbox binary.  Frame 17 indicates __libc_fork() is somehow calling into nvidia's libGL.

Some searching online with keywords in that area indicates that the nvidia libGL drivers are trying to use pthread_atfork() or some equivalent.  If that is true, then the call stack indicates they're allocating memory in an atfork handler, which is definitely not OK.  Some people suggest using preload-like tricks to disable any atfork-like behavior; it's not exactly clear to me what the nvidia driver could usefully do when the process image is about to get replaced by an exec() anyway.

(It's not clear to me why atfork exists at all.  One man page suggests it's so multi-threaded libraries can attempt to get out of the way of ignorant single-threaded programs that don't know the trickiness between threads and forking.  Since we are careful about it, maybe disabling nvidia's behavior is safe.)

I hope the above at least gives you enough keywords to write a test patch.
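
To make the mechanism above concrete: handlers registered with pthread_atfork() run on the forking thread as part of fork() itself, which would explain libGL frames appearing underneath __libc_fork() in the trace. Below is a minimal C sketch, not taken from Chromium or from the nvidia driver; the handler and counter names are invented for illustration. The handlers only bump counters and allocate nothing.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Counters bumped by the handlers; names are illustrative only. */
static volatile int g_prepare_calls = 0;
static volatile int g_parent_calls = 0;

static void on_prepare(void) { g_prepare_calls++; } /* parent, before the fork */
static void on_parent(void) { g_parent_calls++; }   /* parent, after the fork */
static void on_child(void) { /* child, after the fork: keep this trivial */ }

/* Registers the handlers; returns 0 on success, an error number otherwise. */
static int install_atfork_probe(void) {
  return pthread_atfork(on_prepare, on_parent, on_child);
}

/* Forks a child that exits immediately and reaps it.
 * Returns the child's exit status, or -1 on failure. */
static int fork_once(void) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) _exit(0);
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After install_atfork_probe(), each call to fork_once() should raise both counters by one in the parent. On older glibc versions this may need to be built with -pthread.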

William Chan (陈智昌)

Dec 17, 2012, 3:54:33 PM
to Evan Martin, Paweł Hajdan, Jr., chromium-dev, Markus Gutschke
+markus

Yes, as expected, nvidia's libGL sucks. The question is what to do.
1) Try to do evil, hacky things to intercept the pthread_atfork()
handler and disable it.
2) Do something quick and easy (timer-based) in ZygoteHostLinux to
detect hung process creation and terminate with prejudice.
3) Turn off TCMalloc on Linux.
4) Wait for nvidia to fix their crap.

(1) sounds crazy and a lot of work for an edge case. I am not in
favor, but maybe markus@ will claim it's easy and doable and he loves
crazy stuff anyways. (2) is easy and will solve 90% of issues with
this edge case anyway. It's not a problem with the shutdown code per
se, but rather just when you create enough processes, we'll hit this
edge case eventually and have hung child processes that become zombies
when the browser process goes away. We could just add a hacky timer to
reap processes that hang on creation. This papers over nvidia's bug.
(3) This papers over nvidia's bug, but isn't a true fix, since even
the default libc malloc() implementation uses global locks, so it can
still deadlock on fork() if we enter malloc(). This also significantly
hurts Chrome performance, so it's a very sucky solution. (4) I'm not
inclined to spend any energy on fixing this, so I like (4), but
depending on how many Linux users encounter this, it may be worth
someone's time to work around nvidia's issue. Pawel, you know more and
care more here, so you can probably evaluate an appropriate fix.
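
A rough C sketch of what the timer-based idea in (2) could look like, assuming the launcher knows the child's pid. This is not ZygoteHostLinux code; wait_or_kill and its return convention are hypothetical. It polls with waitpid(WNOHANG) and falls back to SIGKILL after a deadline.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Waits up to timeout_ms for child `pid` to terminate, polling every 10 ms.
 * Returns 0 if the child terminated on its own, 1 if it had to be killed,
 * and -1 on error. Name and return convention are hypothetical. */
static int wait_or_kill(pid_t pid, int timeout_ms) {
  const struct timespec tick = {0, 10 * 1000 * 1000}; /* 10 ms */
  int status = 0;
  for (int waited = 0; waited < timeout_ms; waited += 10) {
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == pid) return 0;
    if (r < 0 && errno != EINTR) return -1;
    nanosleep(&tick, NULL);
  }
  /* Still running after the deadline: terminate with prejudice and reap. */
  if (kill(pid, SIGKILL) != 0 && errno != ESRCH) return -1;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  /* The child may have exited just before the kill; report what happened. */
  return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 1 : 0;
}
```

A real version would presumably run off the launcher's message loop rather than sleeping, and would need a timeout long enough not to kill slow but healthy children.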

At this point I'm ready to plug my ears and go lalala whenever
someone mentions nvidia on Linux.

Paweł Hajdan, Jr.

Dec 17, 2012, 4:11:31 PM
to William Chan (陈智昌), Evan Martin, chromium-dev, Markus Gutschke
On Mon, Dec 17, 2012 at 12:54 PM, William Chan (陈智昌) <will...@chromium.org> wrote:
+markus

Yes, as expected, nvidia's libGL sucks. The question is what to do.
1) Try to do evil, hacky things to intercept the pthread_atfork()
handler and disable it.
2) Do something quick and easy (timer-based) in ZygoteHostLinux to
detect hung process creation and terminate with prejudice.
3) Turn off TCMalloc on Linux.
4) Wait for nvidia to fix their crap.

(1) sounds crazy and a lot of work for an edge case. I am not in
favor, but maybe markus@ will claim it's easy and doable and he loves
crazy stuff anyways.

I was thinking about that. Note it affects everything in our address space and not just nvidia (I know you know that, just warning about possible additional weirdness).
 
(2) is easy and will solve 90% of issues with
this edge case anyway. It's not a problem with the shutdown code per
se, but rather just when you create enough processes, we'll hit this
edge case eventually and have hung child processes that become zombies
when the browser process goes away. We could just add a hacky timer to
reap processes that hang on creation. This papers over nvidia's bug.

That seems very hacky to me. :-/
 
(3) This papers over nvidia's bug, but isn't a true fix, since even
the default libc malloc() implementation uses global locks, so it can
still deadlock on fork() if we enter malloc(). This also significantly
hurts Chrome performance, so it's a very sucky solution.

Maybe nvidia does something special to make that work with glibc's malloc - nobody could reproduce the issue after turning off tcmalloc. I think that's what nvidia test with - other apps would likely hit similar problems.
 
(4) I'm not inclined to spend any energy on fixing this, so I like (4), but
depending on how many Linux users encounter this, it may be worth
someone's time to work around nvidia's issue. Pawel, you know more and
care more here, so you can probably evaluate an appropriate fix.

Yeah, I can work on that. I've noticed pthread_atfork calls in other malloc libraries - what do you think about adding similar calls to tcmalloc? Example:

src/third_party/jemalloc/chromium/jemalloc.c

	/* Prevent potential deadlock on malloc locks after fork. */
	pthread_atfork(_malloc_prefork, _malloc_postfork, _malloc_postfork);

Evan Martin

Dec 17, 2012, 4:21:39 PM
to William Chan (陈智昌), Paweł Hajdan, Jr., chromium-dev, Markus Gutschke
I don't think this is necessarily a bug.  It may be that there is some good reason for their at-fork handler.  GL drivers are complex and it seems nvidia has tried pretty hard to write a good one, so I'd be hesitant to accuse them until I had more information.

We already hook malloc etc for tcmalloc, so maybe hooking one more function isn't so hard.
Hooking pthread_atfork with a function that just prints "was called" would at least let you definitively judge whether pthread_atfork is at fault.


However, this bit from their docs also implicates some sort of strange behavior in the vicinity of forking:

===
CONTROLLING FORK(2) HANDLING BEHAVIOR

In order to clean up and reinitialize system resources the NVIDIA OpenGL
implementation needs to be aware of fork(2) system calls. The mechanism used
by the NVIDIA OpenGL implementation to detect fork(2) system calls does not
work well on systems using the LinuxThreads implementation of pthreads. For
thread safety the NVIDIA OpenGL implementation disables its fork(2) detection
on LinuxThreads-based systems. Setting the environment variable
__GL_ALWAYS_HANDLE_FORK to a non-zero value will enable fork(2) detection on
all systems. Setting the environment variable __GL_ALWAYS_HANDLE_FORK will
reduce thread safety on LinuxThreads-based systems. It is strongly recommended
to only set the __GL_ALWAYS_HANDLE_FORK environment variable when running
single threaded applications that are known to use fork.
===

Unfortunately, we want to force this off, not on, so I don't think it helps.

I discovered this via:
$ strings /usr/lib/nvidia-current/libGL.so.1  | grep __GL
which I think reveals all the environment variables available.


Evan Martin

Dec 17, 2012, 4:23:48 PM
to Paweł Hajdan, Jr., William Chan (陈智昌), chromium-dev, Markus Gutschke
On Mon, Dec 17, 2012 at 1:11 PM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:

(3) This papers over nvidia's bug, but isn't a true fix, since even
the default libc malloc() implementation uses global locks, so it can
still deadlock on fork() if we enter malloc(). This also significantly
hurts Chrome performance, so it's a very sucky solution.

Maybe nvidia does something special to make that work with glibc's malloc - nobody could reproduce the issue after turning off tcmalloc. I think that's what nvidia test with - other apps would likely hit similar problems.

I found one discussion of related issues in glibc where someone was suggesting the glibc malloc use recursive mutexes.  I already forget the thread or its conclusion, but it's plausible to me that glibc does extra work to work around these kinds of problems.

Markus Gutschke

Dec 17, 2012, 4:37:42 PM
to Evan Martin, Paweł Hajdan, Jr., William Chan (陈智昌), chromium-dev
This is all really really badly broken. I was all in a nice Christmas'y mood, and all of that is gone now. Arrgh.

First and foremost, calling fork() from a multi-threaded application just does not work correctly, ever. In Google3 code we have an implementation that works, but it is 1) very specific to Linux, 2) requires assembly code, and 3) allows pretty much nothing else to be done between fork() and exec(). Even calling otherwise-safe code in a different compilation unit will fail!

I can port this code to Chrome, if we think it is worthwhile. But it is a non-trivial amount of work. It took us a couple of years to get this right in Google3, and we are careful to not touch it again, because it is incredibly easy to break it inadvertently. The code is extremely subtle.

A Zygote that launches before the first thread is a much better solution.

pthread_atfork() was a response to this problem and sounds like a good solution. In practice, it usually doesn't work and just fails in different ways. And again, within a pthread_atfork() handler, there is very little that can be called safely. Calling malloc() from pthread_atfork() is definitely a very bad idea.

If we think we want to patch things up, I could imagine that we could stack our own pthread_atfork() handler that swaps out the malloc implementation for something really simplistic: a) cannot modify memory allocations made prior to fork(), b) maybe, cannot even free memory at all, c) uses mmap() for all allocations (i.e. doesn't have any user-mode data structures at all).

I still think, it is mostly just papering over the problem, but it is sufficiently easy to implement that it might be worth a try.


Markus

Roland McGrath

Dec 17, 2012, 4:42:39 PM
to ev...@chromium.org, William Chan (陈智昌), Paweł Hajdan, Jr., chromium-dev, Markus Gutschke
On Mon, Dec 17, 2012 at 1:21 PM, Evan Martin <ev...@chromium.org> wrote:
> However, this bit from their docs also implicates some sort of strange
> behavior in the vicinity of forking:
[...]
> on LinuxThreads-based systems. Setting the environment variable

Nobody has a LinuxThreads-based system. It's ancient and obsolete.
I don't think any of that release note applies today.

Roland McGrath

Dec 17, 2012, 4:53:25 PM
to ev...@chromium.org, Paweł Hajdan, Jr., William Chan (陈智昌), chromium-dev, Markus Gutschke
On Mon, Dec 17, 2012 at 1:23 PM, Evan Martin <ev...@chromium.org> wrote:
> I found one discussion of related issues in glibc where someone was
> suggesting the glibc malloc use recursive mutexes. I already forget the
> thread or its conclusion, but it's plausible to me that glibc does extra
> work to work around these kinds of problems.

Indeed it does. libc's malloc has an internal atfork handler that takes
all the internal locks on the forking thread so their state is consistent
in the child's copied memory image; then installs a hook so that any
malloc/free calls during other atfork handlers use a special path that
doesn't do normal locking, while any calls on other threads just block;
then the corresponding post-fork handler cleans up and releases the locks.
In short, it makes malloc use in atfork handlers safe.
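
A much-simplified C sketch of the lock-around-fork part of this, using a toy bump allocator with invented names. It is not glibc's or tcmalloc's code, and it omits the special lock-free path for malloc calls made from other atfork handlers.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for an allocator's global lock. Everything here is a toy. */
static pthread_mutex_t g_heap_lock = PTHREAD_MUTEX_INITIALIZER;
static _Alignas(16) char g_arena[1 << 16];
static size_t g_used = 0;

/* Toy bump allocator guarded by the global lock. */
static void *toy_alloc(size_t n) {
  void *p = NULL;
  if (n == 0 || n > sizeof g_arena) return NULL;
  n = (n + 15) & ~(size_t)15; /* keep returned pointers 16-byte aligned */
  pthread_mutex_lock(&g_heap_lock);
  if (n <= sizeof g_arena - g_used) {
    p = g_arena + g_used;
    g_used += n;
  }
  pthread_mutex_unlock(&g_heap_lock);
  return p;
}

/* Take the lock before fork so the child's copy is in a known state,
 * then release it on both sides afterwards. */
static void heap_prefork(void) { pthread_mutex_lock(&g_heap_lock); }
static void heap_postfork(void) { pthread_mutex_unlock(&g_heap_lock); }

static int toy_alloc_install_fork_handlers(void) {
  return pthread_atfork(heap_prefork, heap_postfork, heap_postfork);
}

/* Forks, allocates in the child, and reports whether that succeeded. */
static int child_can_allocate(void) {
  pid_t pid = fork();
  if (pid < 0) return 0;
  if (pid == 0) _exit(toy_alloc(32) != NULL ? 0 : 1);
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return 0;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

POSIX runs prepare handlers in reverse registration order and parent/child handlers in registration order, so this simple version only tolerates allocation from other handlers if the allocator registers first. A handler that runs while heap_prefork holds the lock would deadlock on the non-recursive mutex, which is presumably the gap the special path described above closes.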

James Robinson

Dec 17, 2012, 4:59:21 PM
to mcgr...@chromium.org, ev...@chromium.org, Paweł Hajdan, Jr., William Chan (陈智昌), chromium-dev, Markus Gutschke
So should we do the same in tcmalloc, or should we ask nvidia to make their atfork handler not call malloc?

- James

Roland McGrath

Dec 17, 2012, 5:08:41 PM
to James Robinson, ev...@chromium.org, Paweł Hajdan, Jr., William Chan (陈智昌), chromium-dev, Markus Gutschke
On Mon, Dec 17, 2012 at 1:59 PM, James Robinson <jam...@google.com> wrote:
> So should we do the same in tcmalloc, or should we ask nvidia to make their
> atfork handler not call malloc?

Probably both? That is, nvidia is doing something that is probably not
really POSIX-compliant (though I can't immediately tell whether it's
technically permissible or not) and certainly quite fiddly. OTOH, if
tcmalloc intends or purports to be a drop-in replacement for the system
malloc, then it should support all the usage patterns that the system does.

Christopher Cameron

Dec 17, 2012, 5:22:09 PM
to chromi...@chromium.org, James Robinson, ev...@chromium.org, Paweł Hajdan, Jr., William Chan (陈智昌), Markus Gutschke
On Monday, December 17, 2012 2:08:41 PM UTC-8, Roland McGrath wrote:
On Mon, Dec 17, 2012 at 1:59 PM, James Robinson <jam...@google.com> wrote:
> So should we do the same in tcmalloc, or should we ask nvidia to make their
> atfork handler not call malloc?

Probably both?  That is, nvidia is doing something that is probably not
really POSIX-compliant (though I can't immediately tell whether it's
technically permissible or not) and certainly quite fiddly.

I'll follow up with NVIDIA's Linux team on this.  Is there anything that authoritatively says "this isn't supposed to work" which I can cite?

Roland McGrath

Dec 17, 2012, 5:53:15 PM
to ccam...@chromium.org, chromi...@chromium.org, James Robinson, ev...@chromium.org, Paweł Hajdan, Jr., William Chan (陈智昌), Markus Gutschke
On Mon, Dec 17, 2012 at 2:22 PM, Christopher Cameron
<ccam...@chromium.org> wrote:
> I'll follow up with NVIDIA's Linux team on this. Is there anything that
> authoritatively says "this isn't supposed to work" which I can cite?

I don't think so. In fact, I think it may very well be kosher by POSIX.
But the whole subject of fork and pthread_atfork is a complex area of the
specification and one has to read several different parts of the standard
and glean how they fit together. Certainly keeping the actions and
interactions one does in atfork handlers as simple and minimal as possible
is a good rule of thumb.

William Chan (陈智昌)

Dec 17, 2012, 6:24:08 PM
to Roland McGrath, ccam...@chromium.org, chromium-dev, James Robinson, Evan Martin, Paweł Hajdan, Jr., Markus Gutschke
OK wow, I learned a lot. Given that all the other malloc
implementations do implement pthread_atfork() handlers here, I'm
inclined to retract my earlier complaint about nvidia and propose
patching tcmalloc instead. That said, I'm still down on the wisdom of
using a pthread_atfork() handler in general for all the reasons
already stated.

Jeffrey Yasskin

Dec 17, 2012, 6:24:15 PM
to ccam...@chromium.org, chromium-dev, James Robinson, Evan Martin, Paweł Hajdan, Jr., William Chan (陈智昌), Markus Gutschke
On Mon, Dec 17, 2012 at 2:22 PM, Christopher Cameron
<ccam...@chromium.org> wrote:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_atfork.html
is the current standard for pthread_atfork(). It and
http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
say that in the child process after fork(), you can only call
async-signal-safe functions (this would cover the pthread_atfork
'child' handler), but it doesn't say that for the pthread_atfork
'prepare' or 'parent' handlers.

http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
lists the async-signal-safe functions, and of course that list doesn't
include malloc().
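
For illustration, a conservative fork-then-exec in C that stays within that rule might look like the sketch below. The helper name spawn_and_wait is made up; in the child it calls only execv() and _exit(), both of which are on the async-signal-safe list.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] with the given arguments and returns its exit status,
 * or -1 if fork/wait failed or the child died from a signal.
 * Hypothetical helper, not Chromium's LaunchProcess. */
static int spawn_and_wait(char *const argv[]) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    execv(argv[0], argv);
    _exit(127); /* exec failed; no stdio, no malloc, no atexit handlers */
  }
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that handlers registered with pthread_atfork() still run inside fork() itself, so being careful in the child's own code does not by itself avoid a hang inside a third-party handler.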

HTH. If Roland or Markus disagree with me, believe them.
Jeffrey

Markus Gutschke

Dec 17, 2012, 7:16:16 PM
to Jeffrey Yasskin, ccam...@chromium.org, chromium-dev, James Robinson, Evan Martin, Paweł Hajdan, Jr., William Chan (陈智昌)
My understanding is that a) this is very poorly spec'd, and b) even harder to get right in practice. In general I would agree with Jeffrey's statements though. It's best to be conservative in interpreting the standards. This means, as the pthread_atfork() handlers run after calling fork(), it would be best to avoid anything that isn't async signal safe.

What glibc does with its locks seems quite reasonable though. It will make a lot of pthread_atfork() handlers happy, even if they do things that they really shouldn't be doing. I am still not convinced it fixes all the problems with pthread_atfork(), though. It'll probably just reduce the number of bug reports to a smaller number without eliminating them entirely. Maybe, that's good enough.


Markus

Christopher Cameron

Dec 17, 2012, 7:49:30 PM
to chromi...@chromium.org, Jeffrey Yasskin, ccam...@chromium.org, James Robinson, Evan Martin, Paweł Hajdan, Jr., William Chan (陈智昌)
I talked with some folks at NVIDIA and they see the issue and will investigate it.  I filed a bug report to track it (incident #121217-000242).

C

Evan Martin

Dec 17, 2012, 7:56:16 PM
to Christopher Cameron, chromium-dev, Jeffrey Yasskin, James Robinson, Paweł Hajdan, Jr., William Chan (陈智昌)