Occasional hard lockup w/ 100% CPU usage in Go applications


Steven Sokol

Oct 25, 2022, 6:05:25 PM10/25/22
to golang-nuts
Weird one. I have a Go application that, on occasion, will suddenly jump from very low CPU usage (1% - 10%) to 400% (it's a 4-core system). I run this app at a heightened priority, and it can nearly lock the machine (a Pi CM4) up.

I can kill it with SIGKILL and I'm able to get a stack trace by sending SIGQUIT. (Here's a gist of the stack trace.)

The trace seems to indicate that everything is idle - virtually all the goroutines are parked in runtime.gopark, and none of them ever get out of it. I've tried adding a watchdog timer, but it locks up along with all the other goroutines, so it never fires once the runaway event starts.
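
For reference, the watchdog was nothing exotic - roughly this shape (a simplified sketch, not my exact code; the interval is made up):

package main

import (
    "log"
    "sync/atomic"
    "time"
)

var lastBeat int64 // unix nanos of the most recent heartbeat

func beat() { atomic.StoreInt64(&lastBeat, time.Now().UnixNano()) }

// watchdog logs if nothing has called beat() within limit. Of course, if the
// whole runtime wedges, this goroutine is parked along with everything else,
// so it never gets a chance to fire - which is exactly what I see.
func watchdog(limit time.Duration) {
    for range time.Tick(limit / 4) {
        last := time.Unix(0, atomic.LoadInt64(&lastBeat))
        if since := time.Since(last); since > limit {
            log.Printf("watchdog: no heartbeat for %v", since)
        }
    }
}

func main() {
    beat()
    go watchdog(5 * time.Second)
    for {
        beat() // the real worker loops would do this
        time.Sleep(time.Second)
    }
}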

The app itself isn't anything special. It listens for data on a set of about 16 Redis pubsub channels and forwards any data it receives to a very limited number of clients over UDP. Throughput ranges from 34 - 38 KB/sec, and the message counts are in the hundreds. Most of the time it eats about 5% of one core on the CM4.

I suppose this could be an error on my part, but there's nothing in my code that seems likely to cause this - no obvious place for a multi-threaded tight loop. For a while I thought it might be some sort of issue in Redigo (the Go Redis client), but from what the stack trace shows, all the Redis listeners look happy and healthy, stuck waiting in gopark along with every other goroutine.
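
The Redis side is basically stock Redigo pubsub fanned out over UDP - a minimal sketch of the pattern (not my actual code; addresses and channel names are made up):

package main

import (
    "log"
    "net"

    "github.com/gomodule/redigo/redis"
)

func main() {
    rc, err := redis.Dial("tcp", "127.0.0.1:6379")
    if err != nil {
        log.Fatal(err)
    }
    defer rc.Close()

    udp, err := net.Dial("udp", "192.168.1.50:9999") // placeholder client
    if err != nil {
        log.Fatal(err)
    }
    defer udp.Close()

    psc := redis.PubSubConn{Conn: rc}
    if err := psc.Subscribe("chan.1", "chan.2"); err != nil { // ~16 channels in reality
        log.Fatal(err)
    }

    for {
        switch v := psc.Receive().(type) {
        case redis.Message:
            // Forward each pubsub payload as one UDP datagram.
            if _, err := udp.Write(v.Data); err != nil {
                log.Printf("udp write: %v", err)
            }
        case redis.Subscription:
            log.Printf("%s %s (%d)", v.Kind, v.Channel, v.Count)
        case error:
            log.Fatalf("pubsub receive: %v", v)
        }
    }
}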

Has anyone seen this kind of weirdness before?

Thanks,

-S

robert engels

Oct 25, 2022, 7:42:14 PM10/25/22
to Steven Sokol, golang-nuts
Are you certain you haven’t capped the memory and all of the GC threads are spinning trying to claim/allocate memory?
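
Something like this, run in its own goroutine, would show whether the GC is running hot - a rough sketch using runtime.MemStats (assuming the runtime is still scheduling goroutines at all):

package main

import (
    "log"
    "runtime"
    "time"
)

func main() {
    // Log GC counters every few seconds. If the collector were thrashing
    // against a memory cap you would expect NumGC to climb rapidly and
    // HeapAlloc to sit pinned near NextGC.
    var m runtime.MemStats
    for range time.Tick(5 * time.Second) {
        runtime.ReadMemStats(&m)
        log.Printf("heap=%dKB next=%dKB numGC=%d pauseTotal=%v",
            m.HeapAlloc/1024, m.NextGC/1024, m.NumGC,
            time.Duration(m.PauseTotalNs))
    }
}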


Steven Sokol

Oct 25, 2022, 8:28:30 PM10/25/22
to golang-nuts
I don't think so. I don't have it capped in the systemd configuration and it's not using very much memory even in the runaway state. Here's a screenshot of htop - check out the fvUnisocket process at the top of the list:

Eric Hubbard

Oct 25, 2022, 9:06:35 PM10/25/22
to Steven Sokol, golang-nuts
What version of Go? Have you tried different ones?

robert engels

Oct 25, 2022, 9:10:12 PM10/25/22
to Steven Sokol, golang-nuts
I would try external sampling. The stack trace you are looking at is somewhat artificial; a native sample will help pin down exactly what is consuming the CPU (switching to thread mode in htop can be a substitute). You might also run strace on the PID when it gets into this state for additional clues.
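
If the process still answers HTTP when it runs away (it may not, given that strace hangs), an in-process CPU profile is a cheap complement to native sampling - a sketch using the standard net/http/pprof package, port chosen arbitrarily:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers
)

func main() {
    // While the runaway is happening, from another shell:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    select {} // stand-in for the real application
}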

Steven Sokol

Oct 26, 2022, 9:27:49 AM10/26/22
to golang-nuts
I've tried using strace and it just locks up (and requires SIGKILL to break out of). What's the best way to get a native sample - gdb? (Delve won't run on this platform.)

Steven Sokol

Oct 26, 2022, 9:28:16 AM10/26/22
to golang-nuts
Currently running go 1.9.1, but I've been having this same issue since I started working on this project about two years ago on 1.5.x.

Robert Engels

Oct 26, 2022, 9:45:34 AM10/26/22
to Steven Sokol, golang-nuts
Trigger a core dump then use gdb on the core file. 

Jan Mercl

Oct 26, 2022, 9:52:16 AM10/26/22
to Steven Sokol, golang-nuts
On Wed, Oct 26, 2022 at 3:28 PM Steven Sokol <st...@stevensokol.com> wrote:

> Currently running go 1.9.1, but I've been having this same issue since I started working on this project about two years ago on 1.5.x.

That version is 5 years old: https://go.dev/doc/devel/release#go1.9

Can you try a recent, supported version? (Go 1.18+)

Steven Sokol

Oct 26, 2022, 10:02:29 AM10/26/22
to golang-nuts
Typo. 1.19.1

Konstantin Khomoutov

Oct 26, 2022, 11:03:30 AM10/26/22
to Steven Sokol, Robert Engels, golang-nuts
On Wed, Oct 26, 2022 at 08:45:00AM -0500, Robert Engels wrote:

> Trigger a core dump then use gdb on the core file.

Also, gdb usually ships with the `gcore` helper, which can attach to the specified PID and dump its core.

Steven Sokol

Oct 31, 2022, 10:55:22 AM10/31/22
to golang-nuts
I tried doing that and it did not seem to work:

root@pi4cm1(rw):/# gcore -o ~/lockup.core 22022
Unable to attach: program exited with code 1.
You can't do that without a process to debug.
The program is not being run.
gcore: failed to create /root/lockup.core.22022

The attempt somehow killed the process.

Robert Engels

Oct 31, 2022, 11:03:27 AM10/31/22
to Steven Sokol, golang-nuts
I would simply use kill to send a signal that generates a core dump and exits. Make sure you have the shell and security settings set up to allow this to happen.


Eric Hubbard

Oct 31, 2022, 11:23:55 AM10/31/22
to Robert Engels, Steven Sokol, golang-nuts
I'm eager to hear what the final answer is. These are the "fun" ones. :)

Steven Sokol

Oct 31, 2022, 12:32:21 PM10/31/22
to golang-nuts
I tried sending SIGABRT, but the process did not respond - it continued running in the runaway state. I can kill it with SIGKILL, but that doesn't generate a core. Can you suggest a specific kill command that will? Thanks!

Steven Sokol

Oct 31, 2022, 12:52:24 PM10/31/22
to golang-nuts
Ok, here's what perf shows:

root@pi4cm1(rw):/# perf record -a -g sleep 5
Reading /proc/4199/task/4199/maps time out. You may want to increase the time limit by --proc-map-timeout
[ perf record: Woken up 8 times to write data ]
Warning:
1 map information files for pre-existing threads were
not processed, if there are samples for addresses they
will not be resolved, you may find out which are these
threads by running with -v and redirecting the output
to a file.
The time limit to process proc map is too short?
Increase it by --proc-map-timeout
[ perf record: Captured and wrote 6.510 MB perf.data (80088 samples) ]



Samples: 80K of event 'cpu-clock:pppH', Event count (approx.): 20022000000
  Children      Self  Command          Shared Object             Symbol
+   41.96%     0.00%  fvUnisocket      [vectors]                 [.] 0xffff0fc0
+   41.65%    41.65%  fvUnisocket      [vectors]                 [.] 0x00000fc0
+   30.14%    29.94%  fvUnisocket      fvUnisocket               [.] kernelcas
+    5.05%     0.00%  fvUnisocket      [vectors]                 [.] 0xffff0fa0
+    5.04%     5.01%  fvUnisocket      fvUnisocket               [.] runtime/internal/atomic.Cas
+    5.01%     5.01%  fvUnisocket      [vectors]                 [.] 0x00000fa0
+    4.56%     4.53%  fvUnisocket      fvUnisocket               [.] runtime/internal/atomic.goLoad64
+    2.97%     0.00%  fvUnisocket      [vectors]                 [.] 0xffff0fd8
+    2.95%     2.95%  fvUnisocket      [vectors]                 [.] 0x00000fd8
+    2.52%     0.00%  fvUnisocket      [vectors]                 [.] 0xffff0fcc
+    2.50%     2.50%  fvUnisocket      [vectors]                 [.] 0x00000fcc
+    2.26%     0.00%  swapper          [kernel.kallsyms]         [k] cpu_startup_entry
+    2.25%     0.00%  swapper          [kernel.kallsyms]         [k] do_idle
+    2.12%     0.00%  swapper          [kernel.kallsyms]         [k] default_idle_call
+    2.09%     2.09%  swapper          [kernel.kallsyms]         [k] arch_cpu_idle
+    1.74%     0.00%  swapper          [unknown]                 [k] 0x002018ac
+    1.74%     0.00%  swapper          [kernel.kallsyms]         [k] secondary_start_kernel
+    1.48%     1.47%  fvUnisocket      fvUnisocket               [.] cas
+    1.19%     1.19%  fvUnisocket      fvUnisocket               [.] runtime/internal/atomic.goStore64
     0.96%     0.95%  dump1090         dump1090                  [.] convert_uc8_nodc_nopower
+    0.65%     0.00%  fvUnisocket      [kernel.kallsyms]         [k] __irq_usr
+    0.65%     0.00%  fvUnisocket      [kernel.kallsyms]         [k] gic_handle_irq
+    0.65%     0.00%  fvUnisocket      [kernel.kallsyms]         [k] __handle_domain_irq
+    0.65%     0.00%  fvUnisocket      [kernel.kallsyms]         [k] irq_exit
     0.65%     0.17%  fvUnisocket      [kernel.kallsyms]         [k] __softirqentry_text_start
+    0.52%     0.00%  swapper          [kernel.kallsyms]         [k] start_kernel
+    0.52%     0.00%  swapper          [kernel.kallsyms]         [k] arch_call_rest_init
+    0.52%     0.00%  swapper          [kernel.kallsyms]         [k] __noinstr_text_end
     0.38%     0.02%  fvUnisocket      [kernel.kallsyms]         [k] tasklet_action_common.constprop.4
     0.38%     0.00%  fvUnisocket      [kernel.kallsyms]         [k] tasklet_hi_action
     0.36%     0.05%  fvUnisocket      [kernel.kallsyms]         [k] usb_giveback_urb_bh
     0.31%     0.00%  fvUnisocket      [kernel.kallsyms]         [k] __usb_hcd_giveback_urb
     0.29%     0.00%  fvUnisocket      [uvcvideo]                [k] uvc_video_complete
     0.26%     0.00%  swapper          [ipv6].rodata.str1.4      [k] .LC0
     0.24%     0.00%  swapper          [ipv6]_ftrace_events      [k] __event_fib6_table_lookup+0x0
     0.19%     0.00%  perf_5.10        [kernel.kallsyms]         [k] __ret_fast_syscall
     0.18%     0.00%  perf_5.10        [unknown]                 [k] 00000000
     0.18%     0.00%  perf_5.10        libpthread-2.28.so        [.] __libc_write
     0.18%     0.00%  perf_5.10        [kernel.kallsyms]         [k] __se_sys_write
     0.18%     0.00%  perf_5.10        [kernel.kallsyms]         [k] ksys_write
     0.17%     0.00%  perf_5.10        [kernel.kallsyms]         [k] vfs_write
     0.17%     0.00%  perf_5.10        [kernel.kallsyms]         [k] ext4_file_write_iter
     0.17%     0.00%  perf_5.10        [kernel.kallsyms]         [k] ext4_buffered_write_iter
     0.17%     0.00%  perf_5.10        [kernel.kallsyms]         [k] generic_perform_write
     0.17%     0.00%  fvUnisocket      [uvcvideo]                [k] uvc_video_decode_isoc
     0.17%     0.17%  fvUnisocket      [uvcvideo]                [k] uvc_video_decode_start
     0.15%     0.15%  fvUnisocket      [kernel.kallsyms]         [k] _raw_spin_unlock_irqrestore
     0.14%     0.00%  dump1090         [unknown]                 [.] 0x807f7e7f
     0.12%     0.00%  redis-server     [kernel.kallsyms]         [k] __ret_fast_syscall
     0.12%     0.00%  dump1090         [unknown]                 [.] 0x80808180
     0.11%     0.00%  dump1090         [unknown]                 [.] 0x81807e7f
     0.11%     0.00%  dump1090         [unknown]                 [.] 0x7e7f8080
     0.11%     0.00%  dump1090         [unknown]                 [.] 0x807e817e
     0.11%     0.00%  dump1090         [unknown]                 [.] 0x80817f80
     0.10%     0.00%  dump1090         [unknown]                 [.] 0x807e7f7f
     0.10%     0.00%  fvLogger         [unknown]                 [k] 00000000
     0.10%     0.00%  redis-server     [unknown]                 [.] 0x00000011
     0.09%     0.09%  dump1090         dump1090                  [.] demodulate2400
     0.09%     0.00%  fvLogger         [kernel.kallsyms]         [k] __ret_fast_syscall
     0.09%     0.00%  dump1090         [unknown]                 [.] 0x00041002
     0.08%     0.01%  fvUnisocket      [kernel.kallsyms]         [k] usb_submit_urb
     0.08%     0.00%  fvLogger         fvLogger                  [.] 0x000804c4
     0.08%     0.00%  fvHealth         [kernel.kallsyms]         [k] __ret_fast_syscall
     0.08%     0.01%  fvUnisocket      [kernel.kallsyms]         [k] usb_hcd_submit_urb
     0.07%     0.00%  swapper          [kernel.kallsyms]         [k] arch_cpu_idle_exit
     0.07%     0.01%  swapper          [kernel.kallsyms]         [k] ledtrig_cpu
     0.07%     0.01%  fvUnisocket      [dwc2]                    [k] _dwc2_hcd_urb_enqueue
     0.07%     0.00%  swapper          [kernel.kallsyms]         [k] led_trigger_event
     0.06%     0.06%  swapper          [kernel.kallsyms]         [k] _raw_read_unlock_irqrestore


I've no idea what all that means. Any thoughts?

Robert Engels

Nov 1, 2022, 11:38:12 AM11/1/22
to Steven Sokol, golang-nuts
Perf doesn't help here. Perf only shows a sampling over time (i.e. where the CPU has spent its cycles), and as you can see in the log it didn't work for the pre-existing threads.

You need to figure out how to generate a core dump. 

When your Go process dies, does the system function as normal?


Steven Sokol

Nov 1, 2022, 12:12:01 PM11/1/22
to golang-nuts
Yep, I need to figure out how to generate a core dump. I've tried all the usual methods without success. Any suggestions for alternate methods would be greatly appreciated.

When the process is killed (SIGKILL) the system does return to its normal load. (Under production conditions systemd immediately restarts it.)

Robert Engels

Nov 1, 2022, 12:14:54 PM11/1/22
to Steven Sokol, golang-nuts
Are you certain you aren’t integrating with some C library that is interfering with the default signal handling?


Robert Engels

Nov 1, 2022, 12:30:56 PM11/1/22
to Steven Sokol, golang-nuts
Also, enabling and generating core dumps is somewhat OS version specific - even between Linux versions - so you probably need to check your docs. 

Roland Müller

Nov 1, 2022, 4:25:07 PM11/1/22
to golang-nuts, Robert Engels, Steven Sokol
Hello,

On Tuesday, 1 November 2022, Robert Engels <ren...@ix.netcom.com> wrote:
> Also, enabling and generating core dumps is somewhat OS version specific - even between Linux versions - so you probably need to check your docs. 
>

Dumping core files is a kernel feature, but I think most Linux distros have it disabled by default nowadays.

To allow unlimited core file sizes, enter the following command as root:
# ulimit -c unlimited

BR,
Roland

Konstantin Khomoutov

Nov 3, 2022, 5:13:25 AM11/3/22
to golang-nuts
On Mon, Oct 31, 2022 at 09:32:21AM -0700, Steven Sokol wrote:

> I tried sending SIGABRT, but the process did not respond - continued
> running in the runaway state. I can kill it with SIGKILL but that doesn't
> generate a core. Can you suggest a specific kill command that will? Thanks!

Sending SIGKILL is worthless because this signal is never delivered to the
target process: it's actually a request to the kernel to forcibly bring the
process down without any extra fuss.

To make the process drop a core, you should first make sure that this is
allowed for the process - usually by running

ulimit -c unlimited

in the shell which starts the process (exactly how to get this ulimit
parameter set for your process depends heavily on how it starts up) - and
then send the process any of the signals whose default disposition is "drop
the core" (see the signal(7) manual page) and which the process does not
handle itself via the os/signal package.
On Linux, SIGQUIT is typically the signal you want.
On a terminal (including virtual terminals such as Linux's virtual consoles,
XTerm, GNOME Terminal, screen, tmux and so on), sending this signal is bound
to Ctrl-\ by default. So, if you can run the program in the foreground - that
is, from a shell session attached to a terminal - you can do:

ulimit -c unlimited
/path/to/the/process
# identify the lockup
# press Ctrl-\ to send the process SIGQUIT
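
Note also that a stock Go binary installs its own handler for SIGQUIT/SIGABRT
(it prints the goroutine stacks and exits), so to get an actual kernel core
dump you generally also want GOTRACEBACK=crash in the process environment, in
addition to the ulimit. And if the program catches the signal itself via
os/signal, even that won't happen - a minimal illustration (hypothetical code,
just to show the mechanism):

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // If the application does this anywhere, SIGQUIT is delivered to the
    // channel instead of triggering the runtime's default behaviour
    // (goroutine dump, or a core dump when GOTRACEBACK=crash is set),
    // so no core will ever be produced by that signal.
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGQUIT)
    go func() {
        log.Printf("caught %v - swallowed, no core", <-c)
    }()
    time.Sleep(time.Minute)
}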

Steven Sokol

Nov 7, 2022, 12:09:27 PM11/7/22
to golang-nuts
FWIW - I opened a bug with the Go core team and, after some discussion, one of them suggested I build the program with GOARM=7. That appears to have resolved the issue - I've been running the full stack of programs on two different units (a Pi 4 Model B and a CM4) for the past week without any sign of the runaway. A friend with a great deal more low-level Linux experience has some thoughts on what might be happening:

Here is my speculation. There is a latent locking issue somewhere in the runtime, and the much slower workaround implementation that goes through the kuser helper makes the problem statistically more likely because the probability of contention is higher. The native version makes the lock operation much faster and reduces that probability, but if you want to be a gracious Go citizen, get a stack dump. If you change to v7 it will likely go away, but you'll never be sure :)

With regards to my total failure to create a core dump, he posits:

The reason SIGABRT and friends aren't working is that the kuser helper may be treated as a syscall, or as being in the kernel, so the process will only abort once it leaves that call - but if it spins in there forever...

So not a particularly satisfying outcome, but the problem is probably (possibly?) solved. If I'm reading the perf output above correctly, it does fit the theory: most of the samples land in [vectors] at 0xffff0fc0, the ARM kuser cmpxchg helper, and in a runtime symbol called kernelcas. Anyone have any thoughts on either the fix or my friend's theory?

robert engels

Nov 7, 2022, 7:15:07 PM11/7/22
to Steven Sokol, golang-nuts
That all sounds probable and reasonable.

https://wiki.archlinux.org/title/Core_dump has additional options that may help generate the core dump.

Still, if the process is stuck in the kernel, the only option should be a kill signal. If the other signals can make the process exit but no core appears, then something isn't configured properly to allow the core to be generated. The core file is written by the kernel, not by the libc linked into the application.

These are all of the signals that generate a core dump in the kernel when the process exits:

#define SIG_KERNEL_COREDUMP_MASK (\
        rt_sigmask(SIGQUIT)   |  rt_sigmask(SIGILL)    | \
        rt_sigmask(SIGTRAP)   |  rt_sigmask(SIGABRT)   | \
        rt_sigmask(SIGFPE)    |  rt_sigmask(SIGSEGV)   | \
        rt_sigmask(SIGBUS)    |  rt_sigmask(SIGSYS)    | \
        rt_sigmask(SIGXCPU)   |  rt_sigmask(SIGXFSZ)   | \
        SIGEMT_MASK                                      )

so you might try killing with each of these and see if you can get a core dump.

