Another futures-related bug hunt

44 views
Skip to first unread message

Dominik Pantůček

unread,
May 8, 2020, 3:34:38 AM5/8/20
to Racket Users
Hello fellow Racketeers,

my spare-time out-of-curiosity venture into using HPR (High-Performance
Racket) for creating a software 3D rendering pipeline seems to be
pushing the futures into rough edges.

The scenario is sort of "usual":

* 7 futures + 1 in RTT that form a binary tree
* GUI thread running

But this time, the futures perform not only data-heavy fixnums
operations, but flonums operations as well.

Something along the lines of 2560x1440 fixnums and the same number of
flonums is being handled in 8 threads effectively (give or take some
optimizations that slightly lower the 1440 height usually).

The code in question is relatively short - say 60 lines of code -
however it does not make much sense without the remaining 2k lines :)

If the operation runs without futures in RTT, nothing happens. But under
a heavy load and VERY varying amount of time (seconds to hours), it
completely freezes with:

* 1 CPU being used at 100% (top/htop shows)
* Does not handle socket operations (X11 WM message for closing the window)
* Does not respond to keyboard (or via kill) SIGINT
* Can only be forcibly stopped by SIGKILL (or similar) or forcefully
closing the window from WM which sort of gets handled probably in the
lower-level parts of GDK completely without Racket runtime intervention
(just prints Killed and the exit code is 137)

Based on these observations I can only conclude that it is the RTT that
gets stuck - but that is only the native thread perspective. From Racket
thread perspective, it can be either the "main" application thread that
is in (thread-wait) for the thread that performs the futures stuff and
it can also be the GUI thread which is created with parameterizing the
eventspace (that is just some trickery to allow me to send breaks when I
receive window close event).

Apart from obvious strace (after freeze) and gdb (before/after freeze)
debugging to find possible sources of this bug, is there even a remote
possibility of getting any clue how can this happen based on the
information gathered so far? My thought go along the lines:

* flonums are boxed - but for some operations they may be immediate
* apparently it is a busy-wait loop in RTT, otherwise 100% CPU usage is
impossible with this workload
* unsafe ops are always suspicious, but again, the problem shows up even
when I switch to the safe versions - it just takes longer time
* which means, the most probable cause is a race condition

And that is basically all I can tell right now.

Of course, any suggestions would be really welcome.

Cheers,
Dominik

P.S.: I am really curious, what will I find when I finally put
fsemaphores into the mix...




Matthew Flatt

unread,
May 8, 2020, 8:27:44 AM5/8/20
to Dominik Pantůček, Racket Users
At Fri, 8 May 2020 09:34:32 +0200, Dominik Pantůček wrote:
> Apart from obvious strace (after freeze) and gdb (before/after freeze)
> debugging to find possible sources of this bug, is there even a remote
> possibility of getting any clue how can this happen based on the
> information gathered so far? My thought go along the lines:
>
> * flonums are boxed - but for some operations they may be immediate
> * apparently it is a busy-wait loop in RTT, otherwise 100% CPU usage is
> impossible with this workload
> * unsafe ops are always suspicious, but again, the problem shows up even
> when I switch to the safe versions - it just takes longer time
> * which means, the most probable cause is a race condition

The most useful information here is likely to be a stack trace from
each OS-level thread at the point where the application is stuck.

That could potentially tell us, for example, that it's a problem with
synchronization for a GC (where one of the OS threads that run futures
doesn't cooperate for some reason) or a problem with a the main thread
performing some specific work on a future thread's behalf.

Dominik Pantůček

unread,
May 8, 2020, 9:36:06 AM5/8/20
to racket...@googlegroups.com
Hello,
I am using the build from master branch with the patch for #3145 and
cannot make it run under gdb:

$ gdb ../racket-lang/racket/racket/bin/racket
GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../racket-lang/racket/racket/bin/racket...
(No debugging symbols found in ../racket-lang/racket/racket/bin/racket)
(gdb) run
Starting program:
/home/joe/Projects/Programming/racket-lang/racket/racket/bin/racket
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Welcome to Racket v7.7.0.4.
[New Thread 0x7ffff7fcb700 (LWP 6410)]

Thread 1 "racket" received signal SIGSEGV, Segmentation fault.
0x00005555555e14fe in scheme_gmp_tls_unload ()
(gdb)

The same happens for the binary with debug symbols:

gdb ../racket-lang/racket/racket/src/build/racket/racket3m
GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from
../racket-lang/racket/racket/src/build/racket/racket3m...
(gdb) run
Starting program:
/home/joe/Projects/Programming/racket-lang/racket/racket/src/build/racket/racket3m

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Welcome to Racket v7.7.0.4.
[New Thread 0x7ffff7fcb700 (LWP 6422)]

Thread 1 "racket3m" received signal SIGSEGV, Segmentation fault.
scheme_gmp_tls_unload (s=0x7ffff6114480, data=0x0) at
../../../racket/gc2/../src/gmp/gmp.c:5822
5822 s[0] = 0;
(gdb)

I am running Ubuntu 19.10's default gdb:

$ gdb --version
GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3

I assume gmp is used for bignum implementation (didn't check yet), so it
might be relevant as well:

ii libgmp-dev:amd64 2:6.1.2+dfsg-4
amd64 Multiprecision arithmetic library
developers tools
ii libgmp10:amd64 2:6.1.2+dfsg-4
amd64 Multiprecision arithmetic library
ii libgmp10:i386 2:6.1.2+dfsg-4
i386 Multiprecision arithmetic library
ii libgmpxx4ldbl:amd64 2:6.1.2+dfsg-4
amd64 Multiprecision arithmetic library (C++
bindings)
ii python-gmpy:amd64 1.17-4
amd64 interfaces GMP to Python for fast,
unbound-precision computations


I will pull latest master and re-try, but that is really just a blind guess.


Cheers,
Dominik

Sam Tobin-Hochstadt

unread,
May 8, 2020, 9:45:48 AM5/8/20
to Dominik Pantůček, Racket Users
You will want to do `handle SIGSEGV nostop noprint` when you start
gdb. Racket BC uses the SEGV handler to implement the GC write
barrier, so you'll want to skip those.

Sam
> --
> You received this message because you are subscribed to the Google Groups "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/01505bda-23c8-c7b0-398a-ad55fe9d4c6c%40trustica.cz.

Dominik Pantůček

unread,
May 9, 2020, 1:18:21 AM5/9/20
to racket...@googlegroups.com
Hello,


> The most useful information here is likely to be a stack trace from
> each OS-level thread at the point where the application is stuck.
>

would this be enough to open an issue for that?

(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7c1b300 (LWP 19075) "tut22.rkt"
mark_backpointers (gc=gc@entry=0x5555559d10c0) at
../../../racket/gc2/newgc.c:4078
2 Thread 0x7ffff7fcb700 (LWP 19076) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559d7d78)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
3 Thread 0x7fffe65a6700 (LWP 19077) "gmain"
0x00007ffff7d34c2f in __GI___poll (fds=0x555555b82520, nfds=2,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
4 Thread 0x7fffe5da5700 (LWP 19078) "gdbus"
0x00007ffff7d34c2f in __GI___poll (fds=0x555555b94ce0, nfds=3,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
7 Thread 0x7fffd77fe700 (LWP 19082) "dconf worker"
0x00007ffff7d34c2f in __GI___poll (fds=0x555555e9e5e0, nfds=1,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
8 Thread 0x7fffe40d4800 (LWP 19083) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
9 Thread 0x7fffd4602800 (LWP 19084) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
10 Thread 0x7fffd4586800 (LWP 19085) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
11 Thread 0x7fffd450a800 (LWP 19086) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
12 Thread 0x7fffd448e800 (LWP 19087) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
13 Thread 0x7fffd4412800 (LWP 19088) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
14 Thread 0x7fffd4396800 (LWP 19089) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
15 Thread 0x7fffd431a800 (LWP 19090) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
16 Thread 0x7ffff65b2800 (LWP 21691) "tut22.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c4998)
at ../sysdeps/unix/sysv/linux/futex-internal.h:80
(gdb) bt
#0 0x00005555557f5064 in mark_backpointers (gc=gc@entry=0x5555559d10c0)
at ../../../racket/gc2/newgc.c:4078
#1 0x00005555557edb2b in garbage_collect (gc=gc@entry=0x5555559d10c0,
force_full=force_full@entry=0, no_full=no_full@entry=0,
switching_master=switching_master@entry=0, lmi=lmi@entry=0x0)
at ../../../racket/gc2/newgc.c:5646
#2 0x00005555557f0ff2 in collect_now (nomajor=<optimized out>,
major=<optimized out>, gc=<optimized out>) at
../../../racket/gc2/newgc.c:875
#3 0x00005555557f0ff2 in collect_now (gc=0x5555559d10c0, major=0,
nomajor=0) at ../../../racket/gc2/newgc.c:855
#4 0x00005555557f9124 in allocate_slowpath (newptr=<optimized out>,
allocate_size=<optimized out>, gc=<optimized out>) at
../../../racket/gc2/newgc.c:1607
#5 0x00005555557f9124 in allocate (type=1, request_size=<optimized
out>) at ../../../racket/gc2/newgc.c:1671
#6 0x00005555557f9124 in allocate (type=<optimized out>,
request_size=<optimized out>) at ../../../racket/gc2/newgc.c:1636
#7 0x00005555557f9124 in GC_malloc_atomic (s=<optimized out>) at
../../../racket/gc2/newgc.c:1792
#8 0x00005555557f9124 in GC_malloc_atomic (s=<optimized out>) at
../../../racket/gc2/newgc.c:1792
#9 0x0000555555605406 in prepare_retry_alloc (p2=<optimized out>,
p=<optimized out>) at ../../../racket/gc2/../src/jitalloc.c:47
#10 0x0000555555605406 in ts_prepare_retry_alloc (p=<optimized out>,
p2=<optimized out>) at ../../../racket/gc2/../src/jitalloc.c:73
#11 0x00007ffff7fbe62b in ()
#12 0x0000000000000000 in ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff7fcb700 (LWP 19076))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559d7d78) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559d7d78) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559d7d28, cond=0x5555559d7d50) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559d7d50, mutex=mutex@entry=0x5555559d7d28) at
pthread_cond_wait.c:638
#3 0x00005555557209fe in green_thread_timer
(data=data@entry=0x5555559d7d10) at ../../../racket/gc2/../src/port.c:6659
#4 0x00005555556bb8be in mzrt_thread_stub (data=0x5555559d7dc0) at
../../../racket/gc2/../src/mzrt.c:91
#5 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#6 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fffe65a6700 (LWP 19077))]
#0 0x00007ffff7d34c2f in __GI___poll (fds=0x555555b82520, nfds=2,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0 0x00007ffff7d34c2f in __GI___poll (fds=0x555555b82520, nfds=2,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fffe8e4daae in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007fffe8e4dbe3 in g_main_context_iteration () at
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007fffe8e4dc31 in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#4 0x00007fffe8e77181 in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#6 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fffe5da5700 (LWP 19078))]
#0 0x00007ffff7d34c2f in __GI___poll (fds=0x555555b94ce0, nfds=3,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0 0x00007ffff7d34c2f in __GI___poll (fds=0x555555b94ce0, nfds=3,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fffe8e4daae in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007fffe8e4de33 in g_main_loop_run () at
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007fffe79f08da in () at /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0
#4 0x00007fffe8e77181 in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#6 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fffd77fe700 (LWP 19082))]
#0 0x00007ffff7d34c2f in __GI___poll (fds=0x555555e9e5e0, nfds=1,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0 0x00007ffff7d34c2f in __GI___poll (fds=0x555555e9e5e0, nfds=1,
timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fffe8e4daae in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007fffe8e4dbe3 in g_main_context_iteration () at
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007fffe439c81d in () at
/usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so
#4 0x00007fffe8e77181 in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#6 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fffe40d4800 (LWP 19083))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffae80) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555edcc70) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 9
[Switching to thread 9 (Thread 0x7fffd4602800 (LWP 19084))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffae20) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555a5a9a0) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 10
[Switching to thread 10 (Thread 0x7fffd4586800 (LWP 19085))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffadc0) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555a8ce90) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 11
[Switching to thread 11 (Thread 0x7fffd450a800 (LWP 19086))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffff8fd0) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555ecbb90) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 12
[Switching to thread 12 (Thread 0x7fffd448e800 (LWP 19087))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffa6e0) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555a11160) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 13
[Switching to thread 13 (Thread 0x7fffd4412800 (LWP 19088))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffaaf0) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555a120b0) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 14
[Switching to thread 14 (Thread 0x7fffd4396800 (LWP 19089))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffab50) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x5555559df7d0) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 15
[Switching to thread 15 (Thread 0x7fffd431a800 (LWP 19090))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c499c) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c499c) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffaaf0) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x555555a5b7f0) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 16
[Switching to thread 16 (Thread 0x7ffff65b2800 (LWP 21691))]
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559c4998) at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007ffff7e202c6 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x5555559c4998) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1 0x00007ffff7e202c6 in __pthread_cond_wait_common (abstime=0x0,
clockid=0, mutex=0x5555559c4948, cond=0x5555559c4970) at
pthread_cond_wait.c:508
#2 0x00007ffff7e202c6 in __pthread_cond_wait
(cond=cond@entry=0x5555559c4970, mutex=mutex@entry=0x5555559c4948) at
pthread_cond_wait.c:638
#3 0x00005555556bbea3 in mzrt_sema_wait (s=0x5555559c4940) at
../../../racket/gc2/../src/mzrt.c:514
#4 0x00005555555d2c56 in worker_thread_future_loop
(arg=arg@entry=0x7fffffffa830) at ../../../racket/gc2/../src/future.c:2246
#5 0x00005555556bb8be in mzrt_thread_stub (data=0x5555560771e0) at
../../../racket/gc2/../src/mzrt.c:91
#6 0x00007ffff7e19669 in start_thread (arg=<optimized out>) at
pthread_create.c:479
#7 0x00007ffff7d41323 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Good thing is that gdb installs its own signal handlers, so it is
possible to break the infinite loop (or whatever that is). Bad thing is,
I just left it running overnight and I have no idea how long did it take
(1 to 8 hours).

The Racket binary was built on x86_64 from commit
6a1328cd775f1e44180ce5c84244483b5df9ca91

Looks like it's going through the list of memory pages (probably), but
nothing immediately suspicious about the pointers:

(gdb) print gc
$1 = (NewGC *) 0x5555559d10c0
(gdb) print gc->modified_next
$2 = (struct mpage *) 0x555555c41610
(gdb) print work
$3 = (mpage *) 0x555555a18ac0
(gdb) print work->modified_next
$4 = (struct mpage *) 0x555555a18ac0

Just in case, I saved a core dump:

(gdb) gcore
warning: target file /proc/19075/cmdline contained unexpected null
characters
warning: Memory read failed for corefile section, 4096 bytes at
0xffffffffff600000.
Saved corefile core.19075

It is however quite large:

-rw-r--r-- 1 joe joe 1.5G May 9 07:14 core.19075


Cheers,
Dominik

Matthew Flatt

unread,
May 9, 2020, 9:15:25 AM5/9/20
to Dominik Pantůček, racket...@googlegroups.com
At Sat, 9 May 2020 07:18:01 +0200, Dominik Pantůček wrote:
> would this be enough to open an issue for that?
>
> (gdb) info threads
> Id Target Id Frame
> * 1 Thread 0x7ffff7c1b300 (LWP 19075) "tut22.rkt"
> mark_backpointers (gc=gc@entry=0x5555559d10c0) at
> ../../../racket/gc2/newgc.c:4078

Yes, this might identify the problem. Being stuck in a linked-list
iteration often means that there was a race updating the list.

The GC's write barrier is implemented by write-protecting pages and
handling SIGSEGV to record the modification (and remove write
protection until the next GC). If that handler is called in different
future threads, though, then there's currently a race on the list of
modified pages.

This race doesn't happen with places, because different places have
different GC instances. And it won't happen on Mac OS, because the
fault is handled at the Mach layer and routes exceptions for all
threads to a single handler thread.

I'll add a lock at lines 1092-1096 of "newgc.c", and we'll see if that
helps.

Thanks very much for your help!
Matthew

Dominik Pantůček

unread,
May 9, 2020, 9:32:31 AM5/9/20
to Matthew Flatt, racket...@googlegroups.com
Hello,

>
> I'll add a lock at lines 1092-1096 of "newgc.c", and we'll see if that
> helps.

should I open the issue or will you do it? (Speaking of race conditions...).

I'll re-run the tests with the lock once it is in the repo - sometimes
it takes hours for this bug to exhibit and with 8 HTs the process in
question consumes slightly more than 500% of CPU time - which means the
computer sounds it's going to take off and fly. I'll keep it up and
running overnight again.


And thank you for the explanation, digging in Racket internals has a
very varying degree of difficulty :)


Cheers,
Dominik
Reply all
Reply to author
Forward
0 new messages