> This looks like what was happening on the Plan 9 builder. There, the
> problem was that stoptheworld called notetsleep called semasleep to sleep
> for 100us, but in the tsemacquire system call that rounded down to 0ms,
> so it was hard-spinning on the cpu instead of sleeping and therefore
> rescheduling. The fix was to round up so that some real sleeping (and
> therefore rescheduling) always happens in that situation. GOMAXPROCS>1
> would exacerbate this.
That was my first hypothesis, but I can't see it. It looks like there's
actual sleeping going on. Also, the problem seems specific to the kernel
version. I can see it on older kernels, but on the latest kernel from
Joyent I can't trigger this problem at all.
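One way to check that real sleeping happens, in the same style as the
measurements below, would be to quantize the nanosleep durations; a
sketch (untested):
dtrace -n 'syscall::nanosleep:entry /pid==$target/ { self->ts =
timestamp } syscall::nanosleep:return /self->ts/ { @time[tid] =
quantize(timestamp - self->ts); self->ts = 0; }' -c "./runtime.test
-test.cpu=4 -test.run TestStackGrowth$ runtime"
If the sleeps were rounding down to zero, the distribution would pile
up in the bottom buckets.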
FWIW, this is on the "bad" machine:
: oos:runtime; time runtime.test -test.cpu=4 -test.run
TestStackGrowth$ runtime
PASS
real 0m31.928s
user 0m4.615s
sys 0m0.393s
: oos:runtime;
And this is on the "good" machine:
real 0m6.064s
user 0m3.954s
sys 0m0.592s
The faulty machine doesn't have higher user time, and the ratio of the two
user times is very close to the ratio of the performance of the two CPUs.
So I ran this:
dtrace -n 'syscall:::entry /pid==$target/ { @num[probefunc] =
count(); }' -c "./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime"
And compared the output on the bad Solaris machine to the good Solaris
machine. This is the bad machine:
dtrace: description 'syscall:::entry ' matched 231 probes
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (29.70 seconds)
PASS
dtrace: pid 13974 has exited
getpid 1
getrlimit 1
rexit 1
brk 2
priocntlsys 3
write 3
sysconfig 4
lwp_continue 8
lwp_create 8
uucopy 8
schedctl 9
sigaltstack 9
setcontext 10
lwp_sigmask 25
mmap 26
sigaction 68
nanosleep 5601
yield 36150
lwp_park 48428
And this is the good machine:
dtrace: description 'syscall:::entry ' matched 233 probes
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (5.30 seconds)
PASS
dtrace: pid 95944 has exited
close 1
getpid 1
getrlimit 1
open 1
read 1
rexit 1
brk 2
write 3
sysconfig 4
priocntlsys 5
lwp_continue 9
lwp_create 9
uucopy 9
schedctl 10
sigaltstack 10
setcontext 11
lwp_sigmask 28
mmap 28
sigaction 68
nanosleep 13993
lwp_park 73866
yield 151500
The good machine yields a lot more than the bad machine. This could be
either because the program state is different (for whatever reason),
or because the program state is the same but libc yields differently.
Let's test this. We'll look at lwp_park and yield, and aggregate the
stacks. We'll use this:
dtrace -n 'syscall::lwp_park:entry, syscall::yield:entry
/pid==$target/ { @num[ustack()] = count(); }' -c "./runtime.test
-test.v -test.cpu=4 -test.run TestStackGrowth$ runtime"
This is the bad machine:
libc.so.1`yield+0xa
libc.so.1`sema_post+0x1fc
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
3141
libc.so.1`__lwp_park+0x17
libc.so.1`sema_reltimedwait+0x28
libc.so.1`sem_reltimedwait_np+0x4b
runtime.test`runtime.asmsysvicall6+0x50
4028
libc.so.1`__lwp_park+0x17
libc.so.1`sema_wait+0x13
libc.so.1`sem_wait+0x38
runtime.test`runtime.asmsysvicall6+0x50
20658
libc.so.1`__lwp_unpark+0x14
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
24678
libc.so.1`yield+0xa
runtime.test`runtime.asmsysvicall6+0x50
33669
And this is the good machine:
libc.so.1`yield+0xa
libc.so.1`sema_post+0x1fc
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
175
libc.so.1`__lwp_park+0x17
libc.so.1`sema_reltimedwait+0x28
libc.so.1`sem_reltimedwait_np+0x4b
runtime.test`runtime.asmsysvicall6+0x50
4090
libc.so.1`__lwp_park+0x17
libc.so.1`sema_wait+0x13
libc.so.1`sem_wait+0x38
runtime.test`runtime.asmsysvicall6+0x50
32882
libc.so.1`__lwp_unpark+0x14
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
36964
libc.so.1`yield+0xa
runtime.test`runtime.asmsysvicall6+0x50
162616
So: the good machine hits the sem_post path that eventually yields about
twenty times less often; it calls sem_reltimedwait_np about the same
number of times; sem_post and sem_wait are called somewhat more often on
the good machine, but not by much; and yield is called five times more
often on the good machine.
Now we know something in the program calls yield more often. All of
these stacks bottom out at runtime.asmsysvicall6 because every libc call
in the Solaris runtime goes through that trampoline, so for the Go-level
callers we have to consult the runtime source: yield is called directly
only by runtime·osyield, and sem_post only by runtime·semawakeup.
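If we wanted to confirm that from the live process rather than from the
source, the pid provider could count those runtime functions directly.
A sketch (untested; it assumes dtrace sees the Go symbols under these
names):
dtrace -n 'pid$target::runtime.osyield:entry,
pid$target::runtime.semawakeup:entry { @num[probefunc] = count(); }'
-c "./runtime.test -test.cpu=4 -test.run TestStackGrowth$ runtime"
The counts should line up with the direct yield and sem_post totals
above.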
This syscall data is interesting, but we still don't know exactly where
we waste time. Instead of guessing, we can measure (procsystime is part
of the DTraceToolkit):
/opt/DTT/procsystime -e ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime
This is the bad machine:
/opt/DTT/procsystime -eT ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (27.24 seconds)
PASS
Elapsed Times for command ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime,
SYSCALL TIME (ns)
getpid 1672
getrlimit 3374
systeminfo 3375
close 3855
sysconfig 6766
brk 14064
sigaltstack 17449
setcontext 20898
priocntlsys 24026
open 25306
stat 33366
lwp_continue 34503
resolvepath 42401
lwp_sigmask 42990
schedctl 48513
mmapobj 50641
write 73168
getcwd 78745
sigaction 91300
memcntl 155442
lwp_create 195676
uucopy 222416
mmap 251470
yield 166226654
nanosleep 51001870920
lwp_park 190318570199
TOTAL: 241488109189
And this is the good machine:
04757c51-b5c2-4d23-9a69-1e9e305bc4da:runtime$ pfexec
/opt/DTT/procsystime -eT ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (5.05 seconds)
PASS
Elapsed Times for command ./runtime.test -test.v -test.cpu=4 -test.run
TestStackGrowth$ runtime,
SYSCALL TIME (ns)
getpid 1882
getrlimit 3183
read 4265
systeminfo 5092
sysconfig 10346
brk 11545
close 12225
sigaltstack 17543
lwp_continue 24542
stat 39306
setcontext 40303
resolvepath 49826
lwp_sigmask 50834
write 55122
schedctl 55252
getcwd 64392
open 67225
mmapobj 78281
sigaction 86023
priocntlsys 130477
lwp_create 159157
uucopy 181476
mmap 314897
memcntl 439203
yield 448265812
nanosleep 12854501047
lwp_park 30347023675
TOTAL: 43651692931
You might wonder why the total times are much greater than the real
times. That is because this tool measures time spent in system calls
across all threads, and there are many threads in a Go process. The
lwp_park numbers are especially skewed because many threads in a Go
process are parked almost all the time. On the bad machine, for example,
lwp_park accounts for roughly 190s of a 32s run, so on average about six
threads were sitting in lwp_park at any given moment.
I suspect all threads are created early, they live for the whole process
lifetime, and only 4 threads are doing anything; the rest are parked.
I'm not insane enough to run with this hypothesis without testing it,
so let's test it:
dtrace -qn 'syscall::lwp_create:entry, syscall::lwp_exit:entry
/pid==$target/ { printf("%d %d %s\n", timestamp, tid, probefunc) }' -c
"./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$
runtime"
This is the bad machine:
=== RUN TestStackGrowth-4
22270995985559 1 lwp_create
22271148086316 1 lwp_create
22271148201956 3 lwp_create
22271866554369 4 lwp_create
22271877971811 1 lwp_create
22271878464099 5 lwp_create
22271890776999 7 lwp_create
22271984380961 4 lwp_create
And this is the good machine:
11721043519613302 1 lwp_create
11721043603792260 1 lwp_create
11721043603941641 3 lwp_create
11721044386989497 4 lwp_create
11721044388789803 1 lwp_create
11721044392657155 6 lwp_create
11721044417728140 7 lwp_create
11721044916077749 4 lwp_create
11721044916149965 8 lwp_create
I wasn't very wrong, but I wasn't perfectly accurate either. Threads
are created early and live long, but they are created in two steps.
First a bunch of threads is created; then, after a delay of about 0.1s
on the bad machine and 0.5s on the good machine, another set of threads
is created. On the bad machine this second set has only one thread; on
the good machine two threads are created almost simultaneously. What
are these threads? The obvious suspect is that printing this line
=== RUN TestStackGrowth-4
could create a thread, but that doesn't explain why two threads are
created on the good machine. Let's disable that write (by dropping
-test.v) to eliminate unknowns.
Bad machine:
23198516208356 1 lwp_create
23198678977029 1 lwp_create
23198679085298 3 lwp_create
23199387572555 4 lwp_create
23199388103588 4 lwp_create
23199388764763 5 lwp_create
23199390491253 5 lwp_create
23199450598099 8 lwp_create
Good machine:
11722471357043810 1 lwp_create
11722471491789019 1 lwp_create
11722471491917276 3 lwp_create
11722472251577077 4 lwp_create
11722472251859397 4 lwp_create
11722472252279168 5 lwp_create
11722472282499772 5 lwp_create
11722472440879247 6 lwp_create
There is still a ~0.15s delay, but it's consistent across machines, and
at least now they create the same number of threads! Unfortunately, we
can't afford to drop -v and still use procsystime, because then we can't
time the tests precisely. We can't wrap procsystime in time(1) either,
because dtrace initialization takes long enough to invalidate the
timings on the good machine (I tried). The best strategy is to abandon
procsystime and use dtrace directly. Let's do this.
We'll just measure lwp_park block time and plot it. We use this:
dtrace -n 'syscall::lwp_park:entry /pid==$target/ { self->ts =
timestamp } syscall::lwp_park:return /self->ts/ { @time[tid] =
quantize(timestamp - self->ts); self->ts = 0; } ' -c "./runtime.test
-test.cpu=4 -test.run TestStackGrowth$ runtime"
We get these nice plots (fixed-width font required; values are
nanoseconds, bucketed in powers of two). For the bad machine:
1
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@ 910
4096 |@@@@@@@@@@@@ 1630
8192 |@ 100
16384 |@@ 229
32768 |@@@@@@ 743
65536 |@@@@@ 736
131072 |@@ 268
262144 |@ 100
524288 | 6
1048576 | 1
2097152 | 0
4194304 | 0
8388608 |@@@@ 515
16777216 |@ 134
33554432 | 7
67108864 | 7
134217728 | 1
268435456 | 1
536870912 | 0
1073741824 | 1
2147483648 | 0
4294967296 | 0
8589934592 | 1
17179869184 | 0
2
value ------------- Distribution ------------- count
4096 | 0
8192 | 4
16384 | 8
32768 | 10
65536 | 23
131072 |@ 53
262144 |@@@@@ 396
524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2347
1048576 |@ 40
2097152 | 7
4194304 | 2
8388608 | 5
16777216 | 0
3
value ------------- Distribution ------------- count
8589934592 | 0
17179869184 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
34359738368 | 0
4
value ------------- Distribution ------------- count
512 | 0
1024 | 2
2048 |@@@@@@@@ 2149
4096 |@@@@@@@@@@@@@@@@@@ 4619
8192 |@@@@ 1140
16384 |@@@@ 1028
32768 |@@@ 826
65536 | 113
131072 | 25
262144 | 45
524288 | 29
1048576 | 15
2097152 | 2
4194304 | 4
8388608 | 88
16777216 |@ 242
33554432 | 41
67108864 | 17
134217728 | 14
268435456 | 7
536870912 | 3
1073741824 | 0
5
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@ 231
4096 |@@@@@@@@@@@@@@@ 1499
8192 |@ 110
16384 |@ 125
32768 |@@@ 328
65536 |@@ 197
131072 |@@@ 356
262144 | 42
524288 | 22
1048576 | 50
2097152 | 4
4194304 | 1
8388608 |@@@@@@@@ 857
16777216 |@@ 251
33554432 | 15
67108864 | 6
134217728 | 0
268435456 | 0
536870912 | 1
1073741824 | 0
2147483648 | 1
4294967296 | 1
8589934592 | 0
6
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 451
4096 |@@@@@@@@@@@@@@@@ 3030
8192 |@ 179
16384 |@@ 315
32768 |@@@ 646
65536 |@@ 327
131072 |@@@ 557
262144 |@@ 342
524288 |@ 135
1048576 | 45
2097152 | 10
4194304 | 2
8388608 |@@@@@@@ 1373
16777216 |@@ 331
33554432 | 25
67108864 | 15
134217728 | 4
268435456 | 0
7
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@@@@ 1181
4096 |@@@@@@@@@@@@@@@@ 3972
8192 |@@ 414
16384 |@@ 555
32768 |@@@@ 874
65536 |@ 305
131072 |@@ 513
262144 |@ 221
524288 |@ 128
1048576 | 6
2097152 | 0
4194304 | 2
8388608 |@@@@@ 1246
16777216 |@ 285
33554432 | 28
67108864 | 7
134217728 | 5
268435456 | 1
536870912 | 0
8
value ------------- Distribution ------------- count
512 | 0
1024 | 3
2048 |@@@@@@@ 1631
4096 |@@@@@@@@@@@ 2327
8192 |@@ 394
16384 |@@@ 617
32768 |@@ 397
65536 |@@@ 568
131072 |@@@ 590
262144 |@ 311
524288 | 85
1048576 | 8
2097152 | 0
4194304 | 0
8388608 |@@@@@@@ 1496
16777216 |@ 294
33554432 | 13
67108864 | 12
134217728 | 3
268435456 | 3
536870912 | 0
1073741824 | 1
2147483648 | 0
9
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 3
4096 |@@@@@@@@@ 16
8192 |@ 1
16384 |@ 1
32768 |@@ 3
65536 | 0
131072 |@@@@@ 9
262144 |@ 1
524288 |@@@ 5
1048576 | 0
2097152 | 0
4194304 | 0
8388608 |@@@@@@@@@@@@@@@ 28
16777216 |@@ 4
33554432 | 0
67108864 |@ 1
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 |@ 1
34359738368 | 0
For the good machine:
1
value ------------- Distribution ------------- count
512 | 0
1024 | 9
2048 |@@@@@@@@@ 1533
4096 |@@@@@@@@@@@@ 1896
8192 |@@ 284
16384 |@@ 375
32768 |@@@@ 675
65536 |@@@@@ 890
131072 |@ 203
262144 |@ 100
524288 |@@ 353
1048576 |@ 130
2097152 | 19
4194304 | 17
8388608 | 6
16777216 | 1
33554432 | 1
67108864 | 1
134217728 | 2
268435456 | 0
536870912 | 2
1073741824 | 1
2147483648 | 0
2
value ------------- Distribution ------------- count
4096 | 0
8192 | 15
16384 | 42
32768 |@ 64
65536 |@ 147
131072 |@@@@@@@ 769
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3325
524288 |@@ 270
1048576 | 13
2097152 | 0
4194304 | 0
8388608 | 1
16777216 | 0
3
value ------------- Distribution ------------- count
2147483648 | 0
4294967296 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
8589934592 | 0
4
value ------------- Distribution ------------- count
256 | 0
512 | 3
1024 | 27
2048 |@@@@@@@@@ 2661
4096 |@@@@@@@@ 2475
8192 |@ 353
16384 |@@@ 849
32768 |@@@@@ 1461
65536 |@@@@@@@ 2068
131072 |@ 425
262144 | 136
524288 |@@@ 894
1048576 |@ 302
2097152 | 29
4194304 | 32
8388608 | 60
16777216 | 12
33554432 | 2
67108864 | 1
134217728 | 0
268435456 | 1
536870912 | 0
5
value ------------- Distribution ------------- count
512 | 0
1024 | 9
2048 |@@@@@@@ 2047
4096 |@@@@@@@@@@@@@@ 4371
8192 |@ 438
16384 |@@@ 1008
32768 |@@@@@ 1387
65536 |@@@@@ 1562
131072 |@ 321
262144 | 92
524288 |@@ 747
1048576 | 110
2097152 | 31
4194304 | 20
8388608 | 8
16777216 | 6
33554432 | 0
67108864 | 2
134217728 | 5
268435456 | 0
6
value ------------- Distribution ------------- count
512 | 0
1024 | 100
2048 |@@@@@@@@@@@@@@@@@@@@@ 9123
4096 |@@@@@@@ 3155
8192 |@@@ 1363
16384 |@@@ 1478
32768 |@ 626
65536 |@ 352
131072 | 92
262144 | 83
524288 | 121
1048576 |@ 394
2097152 |@ 621
4194304 | 36
8388608 | 6
16777216 | 0
7
value ------------- Distribution ------------- count
512 | 0
1024 | 10
2048 |@@@@@@@ 316
4096 |@@@@ 190
8192 |@ 62
16384 |@@ 111
32768 |@@@@@@ 254
65536 |@@@@@@@ 317
131072 |@@@@ 160
262144 | 22
524288 |@@@@@@ 275
1048576 |@ 50
2097152 | 2
4194304 | 3
8388608 | 3
16777216 | 4
33554432 | 3
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 1
4294967296 | 0
8
value ------------- Distribution ------------- count
512 | 0
1024 | 14
2048 |@@@@@ 1271
4096 |@@@@@@@@@@@@@ 3326
8192 |@ 253
16384 |@@@ 853
32768 |@@@@@@ 1414
65536 |@@@@@@@ 1755
131072 |@ 374
262144 | 127
524288 |@@@ 657
1048576 |@ 164
2097152 | 25
4194304 | 19
8388608 | 11
16777216 | 4
33554432 | 1
67108864 | 5
134217728 | 1
268435456 | 1
536870912 | 0
9
value ------------- Distribution ------------- count
512 | 0
1024 | 8
2048 |@@@@@@@@@ 2014
4096 |@@@@@@@@@ 2115
8192 |@ 300
16384 |@@@ 615
32768 |@@@@@ 1201
65536 |@@@@@@@ 1677
131072 |@@ 352
262144 |@ 118
524288 |@@@ 615
1048576 |@ 241
2097152 | 29
4194304 | 17
8388608 | 7
16777216 | 2
33554432 | 3
67108864 | 1
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 1
2147483648 | 0
Compare the plots for threads 1, 4, 5, 6, 7, 8. Let's take thread 7
for example:
bad:
131072 |@@ 513
262144 |@ 221
524288 |@ 128
1048576 | 6
2097152 | 0
4194304 | 2
8388608 |@@@@@ 1246
16777216 |@ 285
33554432 | 28
good:
131072 |@@@@ 160
262144 | 22
524288 |@@@@@@ 275
1048576 |@ 50
2097152 | 2
4194304 | 3
8388608 | 3
16777216 | 4
33554432 | 3
In general, the statistics for time spent parked match at the low end,
but the bad machine shows a second "cluster" of park times, centered
around 10ms. I don't know what that 10ms is yet.
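A possible next step, sketched but untested: keep only the parks that
block for more than ~4ms and aggregate their stacks, to see who is
behind that cluster:
dtrace -n 'syscall::lwp_park:entry /pid==$target/ { self->ts =
timestamp } syscall::lwp_park:return /self->ts && timestamp - self->ts >
4194304/ { @slow[ustack()] = count(); } syscall::lwp_park:return
/self->ts/ { self->ts = 0; }' -c "./runtime.test -test.cpu=4 -test.run
TestStackGrowth$ runtime"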
--
Aram Hăvărneanu