The program is stuck on __vdso_clock_gettime (x86_64)


Chenhong Liu

Jul 4, 2022, 3:14:40 PM7/4/22
to golang-nuts
Hi,

I encountered a weird problem recently.

Environment

Arch: x86_64
CPU Vendor: hygon (AMD Zen1 OEM) (www.hygon.cn) (Hygon C86 7285)
OS: CentOS 7.6
Kernel: 3.10.0-957
Go version: 1.15.6

What happened

Our pure Go program, named controllerd, stopped doing useful work and instead spun on one CPU core (according to the top command, it consumed one full core). Before it stopped working, it had been running from mid-April to June 12.

After profiling the process with the perf command, the perf data showed the program looping in calls to __vdso_clock_gettime:

64.20% controllerd [vdso] [.] __vdso_clock_gettime
8.82% controllerd controllerd [.] runtime.procyield
8.82% controllerd controllerd [.] runtime.suspendG
8.01% controllerd controllerd [.] runtime.nanotime1
1.47% controllerd [kernel.kallsyms] [k] __enqueue_entity
0.79% controllerd [kernel.kallsyms] [k] system_call_after_swapgs
0.79% controllerd [kernel.kallsyms] [k] set_next_entity
0.68% controllerd [kernel.kallsyms] [k] _raw_spin_lock
0.68% controllerd [kernel.kallsyms] [k] change_pte_range
0.65% controllerd [kernel.kallsyms] [k] auditsys
0.57% controllerd [kernel.kallsyms] [k] update_curr
0.45% controllerd [kernel.kallsyms] [k] native_sched_clock
0.45% controllerd [kernel.kallsyms] [k] cpuacct_charge
0.45% controllerd [kernel.kallsyms] [k] __x86_indirect_thunk_rax
0.45% controllerd [kernel.kallsyms] [k] __audit_syscall_exit
0.34% controllerd [kernel.kallsyms] [k] pick_next_task_fair
0.23% controllerd controllerd [.] runtime.osyield
0.23% controllerd [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
0.23% controllerd [kernel.kallsyms] [k] dput
0.23% controllerd [kernel.kallsyms] [k] __schedule
0.23% controllerd [kernel.kallsyms] [k] put_prev_task_fair
0.23% controllerd [kernel.kallsyms] [k] yield_task_fair
0.22% controllerd [kernel.kallsyms] [k] sys_sched_yield
0.11% controllerd [kernel.kallsyms] [k] clear_buddies
0.11% controllerd [kernel.kallsyms] [k] update_rq_clock.part.78
0.11% controllerd [kernel.kallsyms] [k] update_min_vruntime
0.11% controllerd [kernel.kallsyms] [k] tick_do_update_jiffies64
0.11% controllerd [kernel.kallsyms] [k] system_call
0.11% controllerd [kernel.kallsyms] [k] rb_next
0.11% controllerd [kernel.kallsyms] [k] rb_insert_color
0.00% controllerd controllerd [.] runtime.notesleep
0.00% controllerd [kernel.kallsyms] [k] load_balance
0.00% controllerd controllerd [.] runtime.runqgrab
0.00% controllerd controllerd [.] runtime.findrunnable
0.00% controllerd [kernel.kallsyms] [k] update_numa_stats
0.00% controllerd [kernel.kallsyms] [k] __lock_task_sighand
0.00% controllerd [kernel.kallsyms] [k] idle_cpu
0.00% controllerd [kernel.kallsyms] [k] kmem_cache_free_bulk
0.00% controllerd [kernel.kallsyms] [k] task_numa_find_cpu
0.00% controllerd [kernel.kallsyms] [k] __queue_work
0.00% controllerd [kernel.kallsyms] [k] __perf_event_task_sched_in
0.00% controllerd [kernel.kallsyms] [k] finish_task_switch
0.00% controllerd [kernel.kallsyms] [k] perf_ctx_unlock
0.00% controllerd [kernel.kallsyms] [k] native_write_msr_safe
0.00% controllerd [kernel.kallsyms] [k] __perf_event_enable

By the way, the perf command was:

perf record -o controllerd_perf.data -F 99 -p PID sleep 10

I'm not familiar with the details of Go's runtime, but after collecting the perf data and reading the code of the runtime's suspendG function, my guess is that the program was looping in suspendG and failing to find any other goroutine to execute.

What we did

  • First, I suspected the clock source was not set correctly on the system, but it was set to tsc, which is correct.
  • Then, I tried sending SIGUSR1 to the program to see whether it could be revived (controllerd can be triggered with SIGUSR1 to flush its logs; a sketch of that handler follows this list), but nothing changed.
  • Finally, we had to restart the program to recover.
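
For context, the SIGUSR1 hook in controllerd is wired up roughly like the sketch below; flushLogs is a hypothetical stand-in for our real log-flushing call:

package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// flushLogs is a hypothetical placeholder for the real flushing logic.
func flushLogs() { log.Println("flushing logs") }

func main() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			flushLogs()
		}
	}()
	// ... the rest of controllerd ...
	select {}
}

In hindsight, such a handler runs on an ordinary goroutine, so if the scheduler cannot schedule goroutines, the signal is delivered but never acted on, which might explain why nothing changed.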

Asking for help

  1. Does anybody know the cause of this problem?
  2. What can I do to get more information if this problem happens again?
Thanks.

Konstantin Khomoutov

Jul 5, 2022, 5:56:04 AM7/5/22
to Chenhong Liu, golang-nuts
On Mon, Jul 04, 2022 at 01:57:06AM -0700, Chenhong Liu wrote:

[...]
[...]

To me, this looks like the program is merely idling, and what you're
observing is the Go runtime's scheduler reading the current time.

[...]
> 1. Does anybody know the cause of this problem?

I'm inclined to think that you're merely facing a deadlock or some similar
condition which prevents your normal Go code from making useful progress while
the runtime itself is OK.

> 2. What can I do to get more information if this problem happens again?

Do the following:

- Have a way to collect whatever the process is writing to its stderr.

- When the process wedges in the same way, send it SIGQUIT or SIGABRT so that
it dumps the stacks of all the active goroutines to the stderr and exits.
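
For example, reusing the PID placeholder from your perf invocation:

kill -QUIT PID

Assuming controllerd does not install its own SIGQUIT handler, the resulting stack dump is performed by the Go runtime's own signal handler rather than by a goroutine, so unlike your SIGUSR1 hook it should work even when no goroutine can be scheduled.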

If that service handles HTTP requests, you might enable the handlers of the
standard net/http/pprof package and, when the process is apparently wedged,
navigate to its /debug/pprof/goroutine?debug=2 endpoint (unless you have
grafted the root of those endpoints somewhere else) to have the stacks sent
in the response.
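
Enabling those handlers is just an import plus an HTTP listener; a minimal sketch (the listen address is only an example):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// The address is an example; pick whatever fits your deployment.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}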

Neo Liu

Jul 5, 2022, 9:57:30 AM7/5/22
to Konstantin Khomoutov, golang-nuts


> To me, this looks like the program is merely idling, and what you're
> observing is the Go runtime's scheduler reading the current time.

I didn't try it; the environment is located in a customer's data center, and
we're not allowed to try this kind of thing on it.

[...]
>    1. Does anybody know the cause of this problem?

> I'm inclined to think that you're merely facing a deadlock or some similar
> condition which prevents your normal Go code from making useful progress while
> the runtime itself is OK.

I agree with this, too. But the program didn't respond to any pprof interaction (we implemented a pprof server over a Unix domain socket; see the sketch below), so I could not get any goroutine information.
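
For reference, that server looks roughly like the sketch below (the socket path is illustrative, not our actual one):

package main

import (
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Listen on a Unix domain socket instead of a TCP port.
	l, err := net.Listen("unix", "/tmp/controllerd-pprof.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()
	log.Println(http.Serve(l, nil))
}

We then query it with, e.g., curl --unix-socket /tmp/controllerd-pprof.sock 'http://localhost/debug/pprof/goroutine?debug=2'. Since serving such a request needs a runnable goroutine, it stops responding in exactly the situation we hit.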

>    2. What can I do to get more information if this problem happens again?

> Do the following:
>
>  - Have a way to collect whatever the process is writing to its stderr.
>
>  - When the process wedges in the same way, send it SIGQUIT or SIGABRT so that
>    it dumps the stacks of all the active goroutines to stderr and exits.

I will try this next time.
> If that service handles HTTP requests, you might enable the handlers of the
> standard net/http/pprof package and, when the process is apparently wedged,
> navigate to its /debug/pprof/goroutine?debug=2 endpoint (unless you have
> grafted the root of those endpoints somewhere else) to have the stacks sent
> in the response.

As I mentioned above, pprof is enabled by default, but the program ceased to respond to anything. 

Thanks very much.