Small question about the CPU Profiler


Felix Geisendoerfer

Jun 25, 2020, 9:06:07 AM
to golang-dev
Hi,

I have a small question about the CPU profiler, specifically this function in src/runtime/signal_unix.go:

// setThreadCPUProfiler makes any thread-specific changes required to
// implement profiling at a rate of hz.
func setThreadCPUProfiler(hz int32) {
	var it itimerval
	if hz == 0 {
		setitimer(_ITIMER_PROF, &it, nil)
	} else {
		it.it_interval.tv_sec = 0
		it.it_interval.set_usec(1000000 / hz)
		it.it_value = it.it_interval
		setitimer(_ITIMER_PROF, &it, nil)
	}
	_g_ := getg()
	_g_.m.profilehz = hz
}

The comment says the function makes a thread-specific change, but from my understanding setitimer(2) sends process-directed signals.

Does this mean the Go profiler will end up sampling a "random" thread [1] each time a signal arrives? Or is it somehow able to sample all threads at the given hz rate?

I feel like I'm missing something.

Cheers
Felix

[1] https://man7.org/linux/man-pages/man7/signal.7.html says "A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal."

Keith Randall

Jun 25, 2020, 10:54:00 AM
to Felix Geisendoerfer, golang-dev
We have both setThreadCPUProfiler and setProcessCPUProfiler. I think the names are somewhat unfortunate. Both are process-wide. The former calls setitimer to request SIGPROF interrupts. The latter actually installs the interrupt handler.
The sampling happens on whatever thread the kernel picks. Hopefully it is picking randomly (among OS threads currently executing the process, or in the kernel on behalf of the process).


--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/5fa67bdc-fd90-45e3-9b40-ef36427087b1n%40googlegroups.com.

Felix Geisendoerfer

Jun 25, 2020, 11:25:34 AM
to Keith Randall, golang-dev
Thanks for clarifying!

The reason I asked is because I published a user-land Go profiler [1] yesterday, and I wanted to confirm that I accurately describe the current implementation [2]. Some conversations on the gopher #performance slack channel [5] and the code comments made me doubt myself.

I've also followed the recent "hardware performance counters for CPU profiling" with great interest [3], which mentions this sampling bias [4] as a potential source of errors that motivated the proposal.

I currently don't have access to a machine with lots of CPUs to verify, but I think my profiler [1] might not suffer from this particular sampling bias. Considering the simplicity and portability of my implementation, I'm wondering if it would make sense to also propose it for inclusion in the Go core. I'd be happy to do the required work.

Either way, I'd also appreciate any feedback, especially why my naive approach might be doomed in different ways so that I can warn potential users. If this feels too off-topic for the mailing list, feel free to open a GH issue on my repo for it.

Cheers
Felix

[1] https://github.com/felixge/fgprof
[2] https://github.com/felixge/fgprof#gos-builtin-cpu-profiler
[3] https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md
[4] https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md#sampling-bias
[5] https://gophers.slack.com/archives/C0VP8EF3R/p1592920757468600

Chris Hines

Jun 25, 2020, 11:31:33 AM
to golang-dev
Felix,

I have access to high CPU count Linux (CentOS) machines I can run tests on. Write me a test procedure and I would be happy to run it and report the results.

Chris

Felix Geisendörfer

Jun 25, 2020, 11:51:29 AM
to Chris Hines, golang-dev
Hey Chris,

thanks for the offer! I've put some code and instructions here:

https://gist.github.com/felixge/a91cc39e44d0699b8151b63c7ea0f50b#file-usage-txt

Let me know what you see on your machine, I'm very curious!

Note: Please use a recent Go 1.14 version, I suspect fgprof might not do well on this test without async preemption.

Cheers
Felix

Michael Pratt

Jun 25, 2020, 1:04:02 PM
to Felix Geisendörfer, Chris Hines, golang-dev
On Thu, Jun 25, 2020 at 11:51 AM Felix Geisendörfer <fe...@felixge.de> wrote:
> The reason I asked is because I published a user-land Go profiler [1] yesterday, and I wanted to confirm that I accurately describe the current implementation [2]. Some conversations on the gopher #performance slack channel [5] and the code comments made me doubt myself.
>
> I've also followed the recent "hardware performance counters for CPU profiling" with great interest [3], which mentions this sampling bias [4] as a potential source of errors that motivated the proposal.

The description in [4] is rather short, and while I wouldn't say it is wrong, I think it implies more bias than occurs in a typical program.

On Linux, ITIMER_PROF timers are updated from the scheduler tick interrupt that occurs on each CPU (see tick_periodic -> update_process_times -> run_posix_cpu_timers -> check_process_timers). This accumulates the new CPU time of the current task (==thread) into the total CPU time for the thread group. If this value now exceeds the timer threshold, the kernel sends a SIGPROF. The kernel _does_ send a group signal (__group_send_sig_info), which means any thread may technically receive it; however, it prefers to deliver to the current thread provided it is running (it must be, if we interrupted it) and doesn't have the signal masked (I think you'd need to explicitly call sigprocmask to get a Go program to mask SIGPROF). Thus in normal circumstances, we expect the SIGPROF to be delivered to the thread that used the next quantum of CPU. You can also statistically test the fairness of SIGPROF delivery.

Note of course that there are many other sources of bias, such as skid (discussed in the doc), or events happening around the same rate as the timer appearing less often than they should.

 

Felix Geisendörfer

Jun 25, 2020, 4:14:56 PM
to Michael Pratt, Chris Hines, golang-dev
Hi Michael,

On 25. Jun 2020, at 19:03, Michael Pratt <mpr...@google.com> wrote:

[...] You can also statistically test the fairness of SIGPROF delivery.

Thanks for the explanation.

Unfortunately your link to the statistical test is giving me: Permission denied. Please contact the administrator. Any chance you could unlock it for me?

Thanks
Felix

Michael Pratt

Jun 25, 2020, 4:21:50 PM
to Felix Geisendörfer, Chris Hines, golang-dev
On Thu, Jun 25, 2020 at 4:14 PM Felix Geisendörfer <fe...@felixge.de> wrote:
Unfortunately your link to the statistical test is giving me: Permission denied. Please contact the administrator. Any chance you could unlock it for me?

Felix Geisendörfer

Jun 25, 2020, 4:43:08 PM
to Michael Pratt, Chris Hines, golang-dev
Hi Michael,

On 25. Jun 2020, at 22:21, Michael Pratt <mpr...@google.com> wrote:





Thanks for the new link.

I took a look at the test, and it seems like it's meant to fail if the error exceeds 20%. That's quite a lot, isn't it?

Also, unless I'm missing something, the calculation in the test seems wrong.


abs(a-b) < ((a + b) * 0.2)

should be something like:

(1-min(a,b)/max(a,b)) < 0.2

Under the current equation a = 35 and b = 50 would seem to pass the test.
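For concreteness, the difference between the two checks is easy to see in code (function names are mine, not from the test):

```go
package main

import (
	"fmt"
	"math"
)

// currentCheck is the tolerance as written in the test: abs(a-b) < ((a + b) * 0.2)
func currentCheck(a, b float64) bool {
	return math.Abs(a-b) < (a+b)*0.2
}

// strictCheck is the alternative: (1 - min(a,b)/max(a,b)) < 0.2
func strictCheck(a, b float64) bool {
	return 1-math.Min(a, b)/math.Max(a, b) < 0.2
}

func main() {
	a, b := 35.0, 50.0
	fmt.Println(currentCheck(a, b)) // true:  |35-50| = 15 < 0.2*(35+50) = 17
	fmt.Println(strictCheck(a, b))  // false: 1 - 35/50 = 0.3, which is not < 0.2
}
```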

Cheers
Felix

Felix Geisendörfer

Jun 25, 2020, 5:04:51 PM
to Michael Pratt, Chris Hines, golang-dev
Ignore my comment on the calculation. I had a stricter definition of what the 20% should mean in mind, but the current meaning seems reasonable.

Ian Lance Taylor

Jun 25, 2020, 5:37:54 PM
to Keith Randall, Felix Geisendoerfer, golang-dev
On Thu, Jun 25, 2020 at 7:53 AM 'Keith Randall' via golang-dev
<golan...@googlegroups.com> wrote:
>
> We have both setThreadCPUProfiler and setProcessCPUProfiler. I think the names are somewhat unfortunate. Both are process-wide. The former calls setitimer to request SIGPROF interrupts. The latter actually installs the interrupt handler.
> The sampling happens on whatever thread the kernel picks. Hopefully it is picking randomly (among OS threads currently executing the process, or in the kernel on behalf of the process).

It's called setThreadCPUProfiler because it sets profilehz for the
current M. It's called on each M as we notice that we haven't yet
updated profilehz for that M.

But I'm not sure why we call setitimer once per thread. That's how
we've done it since Go 1.0, but I'm not sure why.

Ian

Felix Geisendoerfer

Jun 26, 2020, 9:32:11 AM
to Milind Chabbi, Ian Lance Taylor, Keith Randall, golang-dev
Hi Milind,

first of all: Thanks again for your great work on the perf counter proposal. I hope you'll get more feedback and review on it soon.

More inline:

On 26. Jun 2020, at 05:08, Milind Chabbi <mil...@uber.com> wrote:

I am the author of "[3] https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md" and I have a detailed understanding of setThreadCPUProfiler and setProcessCPUProfiler in Go.

setProcessCPUProfiler sets up the signal handler, and setThreadCPUProfiler arms an itimer; it is called once by each OS thread.
This is correct in the following way: if two threads are running, setitimer() will be called twice and two itimer sampling engines will be active and will account for 200% CPU utilization.

Are you sure? In my testing, calling setitimer from two threads has no impact on the sampling frequency. I think there can only be one active setitimer per process. I think Ian's response is based on the same assumption.

That being said, it's easy to get confused about this. E.g. if you have two CPU intense threads, and a single setitimer(ITIMER_PROF, ...), you'll see signals arrive at twice the rate of the configured interval. That's because you end up utilizing 2s of CPU time for every 1s of wall clock time.

If you want, I can post some sample code that I used to verify this. The man page is a bit vague on what happens if you try to set up multiple itimers, but I think the *old_value API variant implies that later calls overwrite previous timers.

Cheers
Felix


Milind Chabbi

Jun 26, 2020, 12:27:27 PM
to Ian Lance Taylor, Keith Randall, Felix Geisendoerfer, golang-dev
I am the author of "[3] https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md" and I have a detailed understanding of setThreadCPUProfiler and setProcessCPUProfiler in Go.

setProcessCPUProfiler sets up the signal handler, and setThreadCPUProfiler arms an itimer; it is called once by each OS thread.
This is correct in the following way: if two threads are running, setitimer() will be called twice and two itimer sampling engines will be active and will account for 200% CPU utilization.
This is analogous to how Unix commands such as TOP show CPU utilization.

That said, setitimer() is broken in Go; that is one important reason why I made the aforementioned proposal.
setitimer() can deliver the signal to any thread; the examples shown throughout the aforementioned proposal show how inaccurate and imprecise the results are with the itimer.
This is nothing specific to Go, any profiler relying solely on ITIMER suffers from this problem. 

Using ITIMER as a sampling engine suffers from inaccuracy and imprecision due to three reasons:
1. Bias: ITIMER samples can be delivered to any thread.
2. Sample size: the resolution of OS timers is low.
3. Measurement skid: the signal is not guaranteed to be delivered exactly at the timer expiry, but some time after the expiry.

The proposal has been in the review stage for a very long time, but an implementation is available for you to take a test drive. The proposal makes use of hardware performance counters, which are far more accurate and can be made to deliver signals precisely to the eventing thread. Moreover, you can sample not only CPU cycles but also more interesting events such as CPU cache misses, branch mispredictions, and TLB misses, to name just a few, which help pinpoint the root causes of performance problems.


-Milind

 




Chris Hines

Jun 26, 2020, 1:59:35 PM
to golang-dev
Felix,

I was able to run this test today. I have access to a 56 CPU machine that mostly isn't doing anything. Here is the CPU info:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping:              1
CPU MHz:               1200.703
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              3999.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

Since it has 56 CPUs I extended the test program to use 40 goroutines instead of 10 for a broader sample size.

I built the program with Go 1.14.4:

Here are the results for builtin pprof:

(pprof) top40
Showing nodes accounting for 2.12mins, 99.82% of 2.13mins total
Dropped 10 nodes (cum <= 0.01mins)
      flat  flat%   sum%        cum   cum%
  0.11mins  5.37%  5.37%   0.11mins  5.37%  main.f32
  0.10mins  4.51%  9.88%   0.10mins  4.51%  main.f5
  0.09mins  4.23% 14.11%   0.09mins  4.23%  main.f37
  0.09mins  4.20% 18.31%   0.09mins  4.20%  main.f10
  0.09mins  4.00% 22.32%   0.09mins  4.00%  main.f9
  0.08mins  3.88% 26.19%   0.08mins  3.88%  main.f26
  0.08mins  3.60% 29.79%   0.08mins  3.61%  main.f35
  0.07mins  3.13% 32.92%   0.07mins  3.13%  main.f11
  0.07mins  3.10% 36.02%   0.07mins  3.10%  main.f40
  0.07mins  3.06% 39.09%   0.07mins  3.06%  main.f27
  0.06mins  3.01% 42.09%   0.06mins  3.01%  main.f16
  0.06mins  2.98% 45.08%   0.06mins  2.98%  main.f17
  0.06mins  2.94% 48.01%   0.06mins  2.94%  main.f6
  0.06mins  2.84% 50.86%   0.06mins  2.84%  main.f22
  0.06mins  2.80% 53.65%   0.06mins  2.80%  main.f20
  0.06mins  2.77% 56.43%   0.06mins  2.77%  main.f2
  0.06mins  2.76% 59.18%   0.06mins  2.76%  main.f8
  0.06mins  2.71% 61.89%   0.06mins  2.71%  main.f28
  0.05mins  2.54% 64.43%   0.05mins  2.55%  main.f19
  0.05mins  2.47% 66.90%   0.05mins  2.48%  main.f4
  0.05mins  2.24% 69.14%   0.05mins  2.24%  main.f7
  0.05mins  2.21% 71.35%   0.05mins  2.21%  main.f24
  0.05mins  2.15% 73.50%   0.05mins  2.15%  main.f36
  0.04mins  2.11% 75.61%   0.04mins  2.11%  main.f38
  0.04mins  2.03% 77.64%   0.04mins  2.03%  main.f30
  0.04mins  1.98% 79.62%   0.04mins  2.00%  main.f34
  0.04mins  1.97% 81.59%   0.04mins  1.98%  main.f12
  0.04mins  1.94% 83.54%   0.04mins  1.96%  main.f39
  0.04mins  1.86% 85.40%   0.04mins  1.86%  main.f33
  0.04mins  1.83% 87.22%   0.04mins  1.83%  main.f31
  0.04mins  1.78% 89.00%   0.04mins  1.79%  main.f13
  0.04mins  1.65% 90.66%   0.04mins  1.67%  main.f14
  0.03mins  1.60% 92.25%   0.03mins  1.60%  main.f18
  0.03mins  1.26% 93.51%   0.03mins  1.27%  main.f1
  0.03mins  1.24% 94.75%   0.03mins  1.25%  main.f23
  0.02mins  1.13% 95.88%   0.02mins  1.14%  main.f3
  0.02mins  1.11% 96.99%   0.02mins  1.12%  main.f29
  0.02mins  1.08% 98.07%   0.02mins  1.08%  main.f21
  0.02mins  0.89% 98.96%   0.02mins  0.89%  main.f25
  0.02mins  0.86% 99.82%   0.02mins  0.86%  main.f15

Here are the results for fgprof v0.9.0:

(pprof) top50
Showing nodes accounting for 272.74s, 100% of 272.76s total
Dropped 9 nodes (cum <= 1.36s)
      flat  flat%   sum%        cum   cum%
   266.08s 97.55% 97.55%    266.08s 97.55%  runtime.asyncPreempt2
     6.66s  2.44%   100%      6.66s  2.44%  runtime.gopark
         0     0%   100%      6.66s  2.44%  main.f1
         0     0%   100%      6.66s  2.44%  main.f10
         0     0%   100%      6.66s  2.44%  main.f11
         0     0%   100%      6.64s  2.43%  main.f12
         0     0%   100%      6.66s  2.44%  main.f13
         0     0%   100%      6.66s  2.44%  main.f14
         0     0%   100%      6.66s  2.44%  main.f15
         0     0%   100%      6.66s  2.44%  main.f16
         0     0%   100%      6.66s  2.44%  main.f17
         0     0%   100%      6.65s  2.44%  main.f18
         0     0%   100%      6.66s  2.44%  main.f19
         0     0%   100%      6.65s  2.44%  main.f2
         0     0%   100%      6.64s  2.43%  main.f20
         0     0%   100%      6.65s  2.44%  main.f21
         0     0%   100%      6.66s  2.44%  main.f22
         0     0%   100%      6.64s  2.43%  main.f23
         0     0%   100%      6.66s  2.44%  main.f24
         0     0%   100%      6.66s  2.44%  main.f25
         0     0%   100%      6.66s  2.44%  main.f26
         0     0%   100%      6.66s  2.44%  main.f27
         0     0%   100%      6.65s  2.44%  main.f28
         0     0%   100%      6.64s  2.43%  main.f29
         0     0%   100%      6.66s  2.44%  main.f3
         0     0%   100%      6.66s  2.44%  main.f30
         0     0%   100%      6.66s  2.44%  main.f31
         0     0%   100%      6.66s  2.44%  main.f32
         0     0%   100%      6.66s  2.44%  main.f33
         0     0%   100%      6.62s  2.43%  main.f34
         0     0%   100%      6.66s  2.44%  main.f35
         0     0%   100%      6.66s  2.44%  main.f36
         0     0%   100%      6.66s  2.44%  main.f37
         0     0%   100%      6.66s  2.44%  main.f38
         0     0%   100%      6.66s  2.44%  main.f39
         0     0%   100%      6.66s  2.44%  main.f4
         0     0%   100%      6.66s  2.44%  main.f40
         0     0%   100%      6.66s  2.44%  main.f5
         0     0%   100%      6.66s  2.44%  main.f6
         0     0%   100%      6.66s  2.44%  main.f7
         0     0%   100%      6.66s  2.44%  main.f8
         0     0%   100%      6.66s  2.44%  main.f9
         0     0%   100%      6.66s  2.44%  main.main
         0     0%   100%      6.66s  2.44%  main.run
         0     0%   100%    266.08s 97.55%  runtime.asyncPreempt
         0     0%   100%      6.66s  2.44%  runtime.goparkunlock
         0     0%   100%      6.66s  2.44%  runtime.main
         0     0%   100%      6.66s  2.44%  runtime.semacquire1
         0     0%   100%      6.66s  2.44%  sync.(*WaitGroup).Wait
         0     0%   100%      6.66s  2.44%  sync.runtime_Semacquire

Hope that helps,
Chris

Milind Chabbi

Jun 26, 2020, 2:09:32 PM
to Felix Geisendoerfer, Ian Lance Taylor, Keith Randall, golang-dev
Felix,
You are correct about setitimer() behavior -- calling it more than once makes no difference.
I tried it by changing the runtime code here to invoke setitimer() only once vs. by each OS thread.
The results produced were exactly the same.

The only use of calling setThreadCPUProfiler (and hence setitimer()) on each OS thread is to start the process-wide timer as soon as the first thread reaches a scheduling point after profiling is requested (i.e., SetCPUProfileRate is called) and to stop the process-wide timer as soon as the first thread reaches a scheduling point after profiling is stopped (i.e., SetCPUProfileRate(0) is called).

-Milind

Felix Geisendoerfer

Jun 28, 2020, 5:51:06 AM
to Chris Hines, golang-dev
Hi Chris,

thank you so much for running the test and even adjusting it to suit your crazy machine : ).

From the numbers, it looks like fgprof is far more accurate than pprof on Milind's goroutine test.

This is very exciting, but I suspect fgprof is still vulnerable to measurement skid which is addressed by Milind's proposal.

Another thing that I want to analyze is the overhead of fgprof. Your output for fgprof looks like it might have really slowed down the program compared to pprof. I'm currently working on a tool to run reproducible overhead/accuracy analysis comparisons between the various profilers, I'll post a link here once it's ready.

Cheers
Felix



Felix Geisendoerfer

Jun 28, 2020, 6:07:05 AM
to Milind Chabbi, Ian Lance Taylor, Keith Randall, golang-dev
Hi Milind,

On 26. Jun 2020, at 20:09, Milind Chabbi <mil...@uber.com> wrote:

Felix,
You are correct about setitimer() behavior -- calling it more than once makes no difference.
I tried it by changing the runtime code here to invoke setitimer() only once vs. by each OS thread.
The results produced were exactly the same.

Thanks for confirming!

The only use of calling setThreadCPUProfiler (and hence setitimer()) on each OS thread is to start the process-wide timer as soon as the first thread that reaches the scheduling point after profiling is requested (i.e., SetCPUProfileRate is called) and stop the process-wide timer as soon as the first thread reaches the scheduling point after profiling is stopped (i.e., SetCPUProfileRate is called).

Not sure I follow; I think calling setThreadCPUProfiler() from each thread actually ends up delaying timers.

As far as I can tell, the first `setitimer` call is directly made by runtime.StartCPUProfile() via the following call chain:

StartCPUProfile -> SetCPUProfileRate -> setcpuprofilerate() -> setThreadCPUProfiler() -> setitimer()

I agree that setThreadCPUProfiler() is also called by the runtime.execute() scheduling point again for each thread, but I think these calls will end up delaying the initial timer firings because they cancel the already active timer.

So I'm not sure I can see any use case for calling setThreadCPUProfiler() from every thread. I think it only has (small) negative impacts on profiling.

@Ian: I think the impact is very minor, but do you think it's worth sending a patch upstream that removes the per-thread setitimer() calls?

Cheers
Felix

Ian Lance Taylor

Jun 28, 2020, 6:13:05 PM
to Felix Geisendoerfer, Milind Chabbi, Keith Randall, golang-dev
On Sun, Jun 28, 2020 at 3:06 AM Felix Geisendoerfer <haim...@gmail.com> wrote:
>
> @Ian: I think the impact is very minor, but do you think it's worth sending a patch upstream that remove the per-thread setitimer() calls?

Thanks, I already did that in https://golang.org/cl/240003. We should
be able to land that for the 1.16 release.

Ian