> This looks like what was happening on the Plan 9 builder. There, the
> problem was that stoptheworld called notetsleep called semasleep to sleep
> for 100us, but in the tsemacquire system call that rounded down to 0ms,
> so it was hard-spinning on the cpu instead of sleeping and therefore
> rescheduling. The fix was to round up so that some real sleeping (and
> therefore rescheduling) always happens in that situation. GOMAXPROCS>1
> would exacerbate this.
That was my first hypothesis, but I can't see it. It looks like there's
actual sleeping going on. Also, the problem seems specific to the kernel
version. I can see it on older kernels, but on the latest kernel from
Joyent I can't trigger this problem at all.
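One way to check that real sleeping happens, in the same style as the
measurements below, would be to quantize the nanosleep durations; a
sketch (untested):
dtrace -n 'syscall::nanosleep:entry /pid==$target/ { self->ts =
timestamp } syscall::nanosleep:return /self->ts/ { @time[tid] =
quantize(timestamp - self->ts); self->ts = 0; }' -c "./runtime.test
-test.cpu=4 -test.run TestStackGrowth$ runtime"
If the sleeps were rounding down to zero, the distribution would pile
up in the bottom buckets.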
FWIW, this is on the "bad" machine:
: oos:runtime; time runtime.test -test.cpu=4 -test.run
TestStackGrowth$ runtime
PASS
real 0m31.928s
user 0m4.615s
sys 0m0.393s
: oos:runtime;
And this is on the "good" machine:
real 0m6.064s
user 0m3.954s
sys 0m0.592s
The faulty machine doesn't have higher user time, and the ratio of the two
user times is very close to the ratio of the performance of the two CPUs.
So I ran this:
dtrace -n 'syscall:::entry /pid==$target/ { @num[probefunc] =
count(); }' -c "./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime"
And compared the output on the bad Solaris machine to the good Solaris
machine. This is the bad machine:
dtrace: description 'syscall:::entry ' matched 231 probes
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (29.70 seconds)
PASS
dtrace: pid 13974 has exited
getpid 1
getrlimit 1
rexit 1
brk 2
priocntlsys 3
write 3
sysconfig 4
lwp_continue 8
lwp_create 8
uucopy 8
schedctl 9
sigaltstack 9
setcontext 10
lwp_sigmask 25
mmap 26
sigaction 68
nanosleep 5601
yield 36150
lwp_park 48428
And this is the good machine:
dtrace: description 'syscall:::entry ' matched 233 probes
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (5.30 seconds)
PASS
dtrace: pid 95944 has exited
close 1
getpid 1
getrlimit 1
open 1
read 1
rexit 1
brk 2
write 3
sysconfig 4
priocntlsys 5
lwp_continue 9
lwp_create 9
uucopy 9
schedctl 10
sigaltstack 10
setcontext 11
lwp_sigmask 28
mmap 28
sigaction 68
nanosleep 13993
lwp_park 73866
yield 151500
The good machine yields a lot more than the bad machine. This could be
either because the program state is different (for whatever reason),
or because the program state is the same but libc yields differently.
Let's test this. We'll look at lwp_park and yield, and aggregate the
stacks. We'll use this:
dtrace -n 'syscall::lwp_park:entry, syscall::yield:entry
/pid==$target/ { @num[ustack()] = count(); }' -c "./runtime.test
-test.v -test.cpu=4 -test.run TestStackGrowth$ runtime"
This is the bad machine:
libc.so.1`yield+0xa
libc.so.1`sema_post+0x1fc
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
3141
libc.so.1`__lwp_park+0x17
libc.so.1`sema_reltimedwait+0x28
libc.so.1`sem_reltimedwait_np+0x4b
runtime.test`runtime.asmsysvicall6+0x50
4028
libc.so.1`__lwp_park+0x17
libc.so.1`sema_wait+0x13
libc.so.1`sem_wait+0x38
runtime.test`runtime.asmsysvicall6+0x50
20658
libc.so.1`__lwp_unpark+0x14
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
24678
libc.so.1`yield+0xa
runtime.test`runtime.asmsysvicall6+0x50
33669
And this is the good machine:
libc.so.1`yield+0xa
libc.so.1`sema_post+0x1fc
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
175
libc.so.1`__lwp_park+0x17
libc.so.1`sema_reltimedwait+0x28
libc.so.1`sem_reltimedwait_np+0x4b
runtime.test`runtime.asmsysvicall6+0x50
4090
libc.so.1`__lwp_park+0x17
libc.so.1`sema_wait+0x13
libc.so.1`sem_wait+0x38
runtime.test`runtime.asmsysvicall6+0x50
32882
libc.so.1`__lwp_unpark+0x14
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
36964
libc.so.1`yield+0xa
runtime.test`runtime.asmsysvicall6+0x50
162616
So: the good machine hits the sem_post path that eventually yields about
twenty times less often; it calls sem_reltimedwait_np about the same
number of times; sem_post and sem_wait are called somewhat more often on
the good machine, but not by much; and yield is called five times more
often on the good machine.
Now we know something in the program calls yield more often. All of
these stacks bottom out at runtime.asmsysvicall6 because every libc call
in the Solaris runtime goes through that trampoline, so for the Go-level
callers we have to consult the runtime source: yield is called directly
only by runtime·osyield, and sem_post only by runtime·semawakeup.
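If we wanted to confirm that from the live process rather than from the
source, the pid provider could count those runtime functions directly.
A sketch (untested; it assumes dtrace sees the Go symbols under these
names):
dtrace -n 'pid$target::runtime.osyield:entry,
pid$target::runtime.semawakeup:entry { @num[probefunc] = count(); }'
-c "./runtime.test -test.cpu=4 -test.run TestStackGrowth$ runtime"
The counts should line up with the direct yield and sem_post totals
above.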
This syscall data is interesting, but we still don't know exactly where
we waste time. Instead of guessing, we can measure (procsystime is part
of the DTraceToolkit):
/opt/DTT/procsystime -e ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime
This is the bad machine:
/opt/DTT/procsystime -eT ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (27.24 seconds)
PASS
Elapsed Times for command ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime,
SYSCALL TIME (ns)
getpid 1672
getrlimit 3374
systeminfo 3375
close 3855
sysconfig 6766
brk 14064
sigaltstack 17449
setcontext 20898
priocntlsys 24026
open 25306
stat 33366
lwp_continue 34503
resolvepath 42401
lwp_sigmask 42990
schedctl 48513
mmapobj 50641
write 73168
getcwd 78745
sigaction 91300
memcntl 155442
lwp_create 195676
uucopy 222416
mmap 251470
yield 166226654
nanosleep 51001870920
lwp_park 190318570199
TOTAL: 241488109189
And this is the good machine:
04757c51-b5c2-4d23-9a69-1e9e305bc4da:runtime$ pfexec
/opt/DTT/procsystime -eT ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (5.05 seconds)
PASS
Elapsed Times for command ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime,
SYSCALL TIME (ns)
getpid 1882
getrlimit 3183
read 4265
systeminfo 5092
sysconfig 10346
brk 11545
close 12225
sigaltstack 17543
lwp_continue 24542
stat 39306
setcontext 40303
resolvepath 49826
lwp_sigmask 50834
write 55122
schedctl 55252
getcwd 64392
open 67225
mmapobj 78281
sigaction 86023
priocntlsys 130477
lwp_create 159157
uucopy 181476
mmap 314897
memcntl 439203
yield 448265812
nanosleep 12854501047
lwp_park 30347023675
TOTAL: 43651692931
You might wonder why the total times are much greater than the real
times. That is because this tool measures time spent in system calls
across all threads, and there are many threads in a Go process. The
lwp_park numbers are especially skewed because many threads in a Go
process are parked almost all the time. On the bad machine, for example,
lwp_park accounts for roughly 190s of a 32s run, so on average about six
threads were sitting in lwp_park at any given moment.
I suspect all threads are created early, they live for the whole process
lifetime, and only 4 threads are doing anything; the rest are parked.
I'm not insane enough to run with this hypothesis without testing it,
so let's test it:
dtrace -qn 'syscall::lwp_create:entry, syscall::lwp_exit:entry
/pid==$target/ { printf("%d %d %s\n", timestamp, tid, probefunc) }' -c
"./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$
runtime"
This is the bad machine:
=== RUN TestStackGrowth-4
22270995985559 1 lwp_create
22271148086316 1 lwp_create
22271148201956 3 lwp_create
22271866554369 4 lwp_create
22271877971811 1 lwp_create
22271878464099 5 lwp_create
22271890776999 7 lwp_create
22271984380961 4 lwp_create
And this is the good machine:
11721043519613302 1 lwp_create
11721043603792260 1 lwp_create
11721043603941641 3 lwp_create
11721044386989497 4 lwp_create
11721044388789803 1 lwp_create
11721044392657155 6 lwp_create
11721044417728140 7 lwp_create
11721044916077749 4 lwp_create
11721044916149965 8 lwp_create
I wasn't very wrong, but I wasn't perfectly accurate either. Threads
are created early and live long, but they are created in two steps.
First a bunch of threads is created; then, after a delay of about 0.1s
on the bad machine and 0.5s on the good machine, another set of threads
is created. On the bad machine this second set has only one thread; on
the good machine two threads are created almost simultaneously. What
are these threads? The obvious suspect is that printing this line
=== RUN TestStackGrowth-4
could create a thread, but that doesn't explain why two threads are
created on the good machine. Let's disable that write (by dropping
-test.v) to eliminate unknowns.
Bad machine:
23198516208356 1 lwp_create
23198678977029 1 lwp_create
23198679085298 3 lwp_create
23199387572555 4 lwp_create
23199388103588 4 lwp_create
23199388764763 5 lwp_create
23199390491253 5 lwp_create
23199450598099 8 lwp_create
Good machine:
11722471357043810 1 lwp_create
11722471491789019 1 lwp_create
11722471491917276 3 lwp_create
11722472251577077 4 lwp_create
11722472251859397 4 lwp_create
11722472252279168 5 lwp_create
11722472282499772 5 lwp_create
11722472440879247 6 lwp_create
There is still a ~0.15s delay, but it's consistent across machines, and
at least now they create the same number of threads! Unfortunately, we
can't afford to drop -v and still use procsystime, because then we can't
time the tests precisely. We can't wrap procsystime in time(1) either,
because dtrace initialization takes long enough to invalidate the
timings on the good machine (I tried). The best strategy is to abandon
procsystime and use dtrace directly. Let's do this.
We'll just measure lwp_park block time and plot it. We use this:
dtrace -n 'syscall::lwp_park:entry /pid==$target/ { self->ts =
timestamp } syscall::lwp_park:return /self->ts/ { @time[tid] =
quantize(timestamp - self->ts); self->ts = 0; } ' -c "./runtime.test
-test.cpu=4 -test.run TestStackGrowth$ runtime"
We get these nice plots (fixed-width font required; values are
nanoseconds, bucketed in powers of two). For the bad machine:
1
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@ 910
4096 |@@@@@@@@@@@@ 1630
8192 |@ 100
16384 |@@ 229
32768 |@@@@@@ 743
65536 |@@@@@ 736
131072 |@@ 268
262144 |@ 100
524288 | 6
1048576 | 1
2097152 | 0
4194304 | 0
8388608 |@@@@ 515
16777216 |@ 134
33554432 | 7
67108864 | 7
134217728 | 1
268435456 | 1
536870912 | 0
1073741824 | 1
2147483648 | 0
4294967296 | 0
8589934592 | 1
17179869184 | 0
2
value ------------- Distribution ------------- count
4096 | 0
8192 | 4
16384 | 8
32768 | 10
65536 | 23
131072 |@ 53
262144 |@@@@@ 396
524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2347
1048576 |@ 40
2097152 | 7
4194304 | 2
8388608 | 5
16777216 | 0
3
value ------------- Distribution ------------- count
8589934592 | 0
17179869184 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
34359738368 | 0
4
value ------------- Distribution ------------- count
512 | 0
1024 | 2
2048 |@@@@@@@@ 2149
4096 |@@@@@@@@@@@@@@@@@@ 4619
8192 |@@@@ 1140
16384 |@@@@ 1028
32768 |@@@ 826
65536 | 113
131072 | 25
262144 | 45
524288 | 29
1048576 | 15
2097152 | 2
4194304 | 4
8388608 | 88
16777216 |@ 242
33554432 | 41
67108864 | 17
134217728 | 14
268435456 | 7
536870912 | 3
1073741824 | 0
5
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@ 231
4096 |@@@@@@@@@@@@@@@ 1499
8192 |@ 110
16384 |@ 125
32768 |@@@ 328
65536 |@@ 197
131072 |@@@ 356
262144 | 42
524288 | 22
1048576 | 50
2097152 | 4
4194304 | 1
8388608 |@@@@@@@@ 857
16777216 |@@ 251
33554432 | 15
67108864 | 6
134217728 | 0
268435456 | 0
536870912 | 1
1073741824 | 0
2147483648 | 1
4294967296 | 1
8589934592 | 0
6
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 451
4096 |@@@@@@@@@@@@@@@@ 3030
8192 |@ 179
16384 |@@ 315
32768 |@@@ 646
65536 |@@ 327
131072 |@@@ 557
262144 |@@ 342
524288 |@ 135
1048576 | 45
2097152 | 10
4194304 | 2
8388608 |@@@@@@@ 1373
16777216 |@@ 331
33554432 | 25
67108864 | 15
134217728 | 4
268435456 | 0
7
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@@@@ 1181
4096 |@@@@@@@@@@@@@@@@ 3972
8192 |@@ 414
16384 |@@ 555
32768 |@@@@ 874
65536 |@ 305
131072 |@@ 513
262144 |@ 221
524288 |@ 128
1048576 | 6
2097152 | 0
4194304 | 2
8388608 |@@@@@ 1246
16777216 |@ 285
33554432 | 28
67108864 | 7
134217728 | 5
268435456 | 1
536870912 | 0
8
value ------------- Distribution ------------- count
512 | 0
1024 | 3
2048 |@@@@@@@ 1631
4096 |@@@@@@@@@@@ 2327
8192 |@@ 394
16384 |@@@ 617
32768 |@@ 397
65536 |@@@ 568
131072 |@@@ 590
262144 |@ 311
524288 | 85
1048576 | 8
2097152 | 0
4194304 | 0
8388608 |@@@@@@@ 1496
16777216 |@ 294
33554432 | 13
67108864 | 12
134217728 | 3
268435456 | 3
536870912 | 0
1073741824 | 1
2147483648 | 0
9
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 3
4096 |@@@@@@@@@ 16
8192 |@ 1
16384 |@ 1
32768 |@@ 3
65536 | 0
131072 |@@@@@ 9
262144 |@ 1
524288 |@@@ 5
1048576 | 0
2097152 | 0
4194304 | 0
8388608 |@@@@@@@@@@@@@@@ 28
16777216 |@@ 4
33554432 | 0
67108864 |@ 1
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 |@ 1
34359738368 | 0
For the good machine:
1
value ------------- Distribution ------------- count
512 | 0
1024 | 9
2048 |@@@@@@@@@ 1533
4096 |@@@@@@@@@@@@ 1896
8192 |@@ 284
16384 |@@ 375
32768 |@@@@ 675
65536 |@@@@@ 890
131072 |@ 203
262144 |@ 100
524288 |@@ 353
1048576 |@ 130
2097152 | 19
4194304 | 17
8388608 | 6
16777216 | 1
33554432 | 1
67108864 | 1
134217728 | 2
268435456 | 0
536870912 | 2
1073741824 | 1
2147483648 | 0
2
value ------------- Distribution ------------- count
4096 | 0
8192 | 15
16384 | 42
32768 |@ 64
65536 |@ 147
131072 |@@@@@@@ 769
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3325
524288 |@@ 270
1048576 | 13
2097152 | 0
4194304 | 0
8388608 | 1
16777216 | 0
3
value ------------- Distribution ------------- count
2147483648 | 0
4294967296 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
8589934592 | 0
4
value ------------- Distribution ------------- count
256 | 0
512 | 3
1024 | 27
2048 |@@@@@@@@@ 2661
4096 |@@@@@@@@ 2475
8192 |@ 353
16384 |@@@ 849
32768 |@@@@@ 1461
65536 |@@@@@@@ 2068
131072 |@ 425
262144 | 136
524288 |@@@ 894
1048576 |@ 302
2097152 | 29
4194304 | 32
8388608 | 60
16777216 | 12
33554432 | 2
67108864 | 1
134217728 | 0
268435456 | 1
536870912 | 0
5
value ------------- Distribution ------------- count
512 | 0
1024 | 9
2048 |@@@@@@@ 2047
4096 |@@@@@@@@@@@@@@ 4371
8192 |@ 438
16384 |@@@ 1008
32768 |@@@@@ 1387
65536 |@@@@@ 1562
131072 |@ 321
262144 | 92
524288 |@@ 747
1048576 | 110
2097152 | 31
4194304 | 20
8388608 | 8
16777216 | 6
33554432 | 0
67108864 | 2
134217728 | 5
268435456 | 0
6
value ------------- Distribution ------------- count
512 | 0
1024 | 100
2048 |@@@@@@@@@@@@@@@@@@@@@ 9123
4096 |@@@@@@@ 3155
8192 |@@@ 1363
16384 |@@@ 1478
32768 |@ 626
65536 |@ 352
131072 | 92
262144 | 83
524288 | 121
1048576 |@ 394
2097152 |@ 621
4194304 | 36
8388608 | 6
16777216 | 0
7
value ------------- Distribution ------------- count
512 | 0
1024 | 10
2048 |@@@@@@@ 316
4096 |@@@@ 190
8192 |@ 62
16384 |@@ 111
32768 |@@@@@@ 254
65536 |@@@@@@@ 317
131072 |@@@@ 160
262144 | 22
524288 |@@@@@@ 275
1048576 |@ 50
2097152 | 2
4194304 | 3
8388608 | 3
16777216 | 4
33554432 | 3
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 1
4294967296 | 0
8
value ------------- Distribution ------------- count
512 | 0
1024 | 14
2048 |@@@@@ 1271
4096 |@@@@@@@@@@@@@ 3326
8192 |@ 253
16384 |@@@ 853
32768 |@@@@@@ 1414
65536 |@@@@@@@ 1755
131072 |@ 374
262144 | 127
524288 |@@@ 657
1048576 |@ 164
2097152 | 25
4194304 | 19
8388608 | 11
16777216 | 4
33554432 | 1
67108864 | 5
134217728 | 1
268435456 | 1
536870912 | 0
9
value ------------- Distribution ------------- count
512 | 0
1024 | 8
2048 |@@@@@@@@@ 2014
4096 |@@@@@@@@@ 2115
8192 |@ 300
16384 |@@@ 615
32768 |@@@@@ 1201
65536 |@@@@@@@ 1677
131072 |@@ 352
262144 |@ 118
524288 |@@@ 615
1048576 |@ 241
2097152 | 29
4194304 | 17
8388608 | 7
16777216 | 2
33554432 | 3
67108864 | 1
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 1
2147483648 | 0
Compare the plots for threads 1, 4, 5, 6, 7, 8. Let's take thread 7
for example:
bad:
131072 |@@ 513
262144 |@ 221
524288 |@ 128
1048576 | 6
2097152 | 0
4194304 | 2
8388608 |@@@@@ 1246
16777216 |@ 285
33554432 | 28
good:
131072 |@@@@ 160
262144 | 22
524288 |@@@@@@ 275
1048576 |@ 50
2097152 | 2
4194304 | 3
8388608 | 3
16777216 | 4
33554432 | 3
In general, the statistics for time spent parked match at the low end,
but the bad machine shows a second "cluster" of park times, centered
around 10ms. I don't know what that 10ms is yet.
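A possible next step, sketched but untested: keep only the parks that
block for more than ~4ms and aggregate their stacks, to see who is
behind that cluster:
dtrace -n 'syscall::lwp_park:entry /pid==$target/ { self->ts =
timestamp } syscall::lwp_park:return /self->ts && timestamp - self->ts >
4194304/ { @slow[ustack()] = count(); } syscall::lwp_park:return
/self->ts/ { self->ts = 0; }' -c "./runtime.test -test.cpu=4 -test.run
TestStackGrowth$ runtime"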
--
Aram Hăvărneanu