Distributed go testing (running "go test" as a client)

308 views
Skip to first unread message

Craig Silverstein

unread,
Oct 23, 2020, 7:01:11 PM10/23/20
to golang-nuts
As a way of distributing our go-test run across multiple machines, we are exploring a client-server architecture.  The way we'd like it to work is that a "go test client" runs on the client machine(s), and repeatedly does the following:
    1) make an http GET request to "http://<server>/more_tests"
      -- the http response is a list of go package specs (e.g. `./web/`, `./web/internal`)
    2) run the tests in those specs via the equivalent of `go test`
    3) make another request to /more_tests
    4) when /more_tests responds with an empty list, meaning no more tests, the client emits the test status/summary information
      -- it would also be ok to emit the test status/summary for each run, rather than just at the end

I could put this client code "above" go test, and (2) could literally be just exec-ing `go test <package>`.  But in my experimenting the overhead of calling `go test` is quite high: `go test -count 1 <pkg1>; go test -count 1 <pkg2>` is 50% slower than `go test -count1 <pkg1> <pkg2>`.

So I'd like to put the client code "inside" go test somehow.  But I don't have a good idea how to do that, especially in a way that doesn't depend on go internals.  Alternately, if there is some way to reduce the overhead of calling `go test` -- I don't know where all the time is going -- that would be a good solution too.  Any ideas/suggestions?

Thanks!
craig

Shulhan

unread,
Oct 24, 2020, 7:00:14 AM10/24/20
to Craig Silverstein, golang-nuts
Based on my understanding, my suggestion is you should separate between Go tools (the one that execute "go test") and your test tools (the one that fetch which packages need to tests, run the actual test, and emits the result back to server), not embedded inside the *_test.go files, for the following reason: "go test" should be able to run without depends on test server.

Assume that we created a tool called "go-distributed-test" that client executed on the root of repository,

$ go-distributed-test /path/to/go-module

Client: GET http://<server>/more_tests
Server: pkgA,pkgA/B,pkgC => pkgs
Client:
FOR EACH pkg in pkgs DO
(1) result := "go test "+ $pkg;
(2) POST /$pkg/$result
DONE

On step (1), you can read and store the program exit status with their output (stdout and stderr).

I hope that helps. I am not going ask _why_ but seems like you got an interesting project to build :)


Regards,
Shulhan

Craig Silverstein

unread,
Oct 24, 2020, 11:22:32 AM10/24/20
to Shulhan, golang-nuts
> Client:
> FOR EACH pkg in pkgs DO
> (1) result := "go test "+ $pkg;
> (2) POST /$pkg/$result
> DONE

As I mentioned above: we can certainly do things this way, but there's
a big performance hit to invoking `go test` repeatedly.

To be clear, I'm not suggesting changing `go test`. My first idea was
to write our own tool, that calls out to the API in
https://golang.org/pkg/testing/ to do the actual mechanics of testing.
But that API doesn't seem to be designed to allow such a thing -- as
one example, the `go test` commandline tool seems to do a lot of work
itself, that our tool would need to duplicate -- and it's unclear how
practical it is to force it. There may well be better ideas we don't
know enough to even think of, though.

craig

Shulhan

unread,
Oct 24, 2020, 11:46:07 AM10/24/20
to csil...@khanacademy.org, golang-nuts


> On 24 Oct 2020, at 22.21, Craig Silverstein <csil...@khanacademy.org> wrote:
>
>> Client:
>> FOR EACH pkg in pkgs DO
>> (1) result := "go test "+ $pkg;
>> (2) POST /$pkg/$result
>> DONE
>
> As I mentioned above: we can certainly do things this way, but there's
> a big performance hit to invoking `go test` repeatedly.

Oh, I see. Have you try turn off vet doing test? Probably there are other build/test flags that can also be turned off which I am not aware to speeding test up.

>
> To be clear, I'm not suggesting changing `go test`. My first idea was
> to write our own tool, that calls out to the API in
> https://golang.org/pkg/testing/ to do the actual mechanics of testing.
> But that API doesn't seem to be designed to allow such a thing -- as
> one example, the `go test` commandline tool seems to do a lot of work
> itself, that our tool would need to duplicate -- and it's unclear how
> practical it is to force it. There may well be better ideas we don't
> know enough to even think of, though.

Another solution that I can think of is by building each package tests binaries in server and distributed it to test machines, so the client does not need to re-build (file read/write, compile, linking, and so on).

--
Shulhan

Craig Silverstein

unread,
Oct 25, 2020, 11:33:27 PM10/25/20
to golang-nuts
> But in my experimenting the overhead of calling `go test` is quite high: `go test -count 1 <pkg1>; go test -count 1 <pkg2>` is 50% slower than `go test -count1 <pkg1> <pkg2>`.

Oops, I forgot to take parallelism into account.  Once I do, it drops to 30% slower (which is still significant):

% time go test ./pkg/...
[...]
97.100u 9.512s 0:34.22 311.5% 0+0k 31728+2640816io 79pf+0w

% time sh -c 'find pkg -type d | parallel -iFOO go test -count 1 ./FOO'
[...]
127.725u 18.722s 0:44.26 330.8% 0+0k 164592+2509424io 443pf+0w

I'm curious why it's so much slower to run the tests one at a time than all at once.  What work is being redone? -- I'd think anything slow would have been cached.

craig

Shulhan

unread,
Oct 26, 2020, 7:18:00 AM10/26/20
to Craig Silverstein, golang-nuts


> On 26 Oct 2020, at 10.32, Craig Silverstein <csil...@khanacademy.org> wrote:
>
> > But in my experimenting the overhead of calling `go test` is quite high: `go test -count 1 <pkg1>; go test -count 1 <pkg2>` is 50% slower than `go test -count1 <pkg1> <pkg2>`.
>
> Oops, I forgot to take parallelism into account. Once I do, it drops to 30% slower (which is still significant):
>
> % time go test ./pkg/...
> [...]
> 97.100u 9.512s 0:34.22 311.5% 0+0k 31728+2640816io 79pf+0w
>
> % time sh -c 'find pkg -type d | parallel -iFOO go test -count 1 ./FOO'
> [...]
> 127.725u 18.722s 0:44.26 330.8% 0+0k 164592+2509424io 443pf+0w
>
> I'm curious why it's so much slower to run the tests one at a time than all at once. What work is being redone? -- I'd think anything slow would have been cached.
>

You realize that the flag "-count 1" disable the cache, right?

Also, why not using the "-p" flag if you want it to be parallel?

Craig Silverstein

unread,
Oct 26, 2020, 11:46:33 AM10/26/20
to Shulhan, golang-nuts
> You realize that the flag "-count 1" disable the cache, right?

Well, it disables *a* cache. The docs don't make it clear, but it
seems like it only disables the test-cache. I don't see why it would
affect the build caches -- build time being my theory as to why the
tests are slower. But maybe it does? Or maybe the extra time isn't
being spent in building after all.

> Also, why not using the "-p" flag if you want it to be parallel?

I should have done!

craig

Ian Lance Taylor

unread,
Oct 27, 2020, 1:04:27 AM10/27/20
to csil...@khanacademy.org, Shulhan, golang-nuts
On Mon, Oct 26, 2020 at 8:46 AM Craig Silverstein
<csil...@khanacademy.org> wrote:
>
> > You realize that the flag "-count 1" disable the cache, right?
>
> Well, it disables *a* cache. The docs don't make it clear, but it
> seems like it only disables the test-cache. I don't see why it would
> affect the build caches -- build time being my theory as to why the
> tests are slower. But maybe it does? Or maybe the extra time isn't
> being spent in building after all.

Hi Craig.

That's correct: -count 1 disables the test cache but not the build
cache. But it does take some time to check all the source files and
verify that the build cache is correct. Also, you can use "go test
-x" to see the commands that it is running. Perhaps the build cache
is being invoked after all. If there are commands being run, you can
get some idea of how long they take by using "go test
-toolexec=/usr/bin/time -x".

Ian

Craig Silverstein

unread,
Oct 27, 2020, 1:30:32 PM10/27/20
to golang-nuts
Thanks Ian, that was very helpful.

It looks like ~50% of the time difference is due to `go vet` calls.  I didn't explore that any further we don't need a vet-run for our use case; I just disabled it.

According to my analysis of the `/usr/bin/time` output, the rest of the time difference is entirely due to linking.   (I was also able to confirm that running the tests one at a time didn't cause any extra compiling to be done, so that was already a useful result.)  Here is a sample of the difference:

-- AS PART OF `go test [flags] ./pkg/...`:
/usr/bin/time /usr/local/go/pkg/tool/linux_amd64/link -o $WORK/b674/web.test -importcfg $WORK/b674/importcfg.link -s -w -buildmode=exe -buildid=xlo1rhZ4fSqnq2f0090T/zrKinvuitz15kdis6cgV/TiXPK6tw8c4z-r2XpAzE/xlo1rhZ4fSqnq2f0090T -extld=gcc /home/csilvers/.cache/go-build/04/042d0fef4f6befffaefb8b14240f8d40a40632e7cc3157ae895ab0800369c448-d
0.56user 0.07system 0:00.59elapsed 108%CPU (0avgtext+0avgdata 161812maxresident)k

-- AS PART OF `go test [flags] ./pkg/web`:
/usr/bin/time /usr/local/go/pkg/tool/linux_amd64/link -o $WORK/b001/web.test -importcfg $WORK/b001/importcfg.link -s -w -buildmode=exe -buildid=xlo1rhZ4fSqnq2f0090T/zrKinvuitz15kdis6cgV/TiXPK6tw8c4z-r2XpAzE/xlo1rhZ4fSqnq2f0090T -extld=gcc /home/csilvers/.cache/go-build/04/042d0fef4f6befffaefb8b14240f8d40a40632e7cc3157ae895ab0800369c448-d
2.69user 0.20system 0:03.29elapsed 88%CPU (0avgtext+0avgdata 535280maxresident)k

The commands are identical, well with the possible exception of the contents of importcfg.link, but when I examined them they looked identical as well.

The main difference is user-time, and I don't know how to get more insight into that.  I tried using `ltrace`, but the linker doesn't have the symbol tables it needs.

I did run with `-toolexec=strace -f -c`, and it indicated that the time-difference is due to `futex` calls: between them, the 32 linker calls (for the 32 packages) spent 12.5 seconds in futex() for the ./pkg/... run, and 20 seconds in futex() for the individual-package runs.  This accounts for almost all the difference in the test-run times.  I don't know what to make of that though.

craig

Ian Lance Taylor

unread,
Oct 27, 2020, 1:45:49 PM10/27/20
to Craig Silverstein, golang-nuts
On Tue, Oct 27, 2020 at 10:30 AM Craig Silverstein
<csil...@khanacademy.org> wrote:
>
> Thanks Ian, that was very helpful.
>
> It looks like ~50% of the time difference is due to `go vet` calls. I didn't explore that any further we don't need a vet-run for our use case; I just disabled it.
>
> According to my analysis of the `/usr/bin/time` output, the rest of the time difference is entirely due to linking. (I was also able to confirm that running the tests one at a time didn't cause any extra compiling to be done, so that was already a useful result.) Here is a sample of the difference:
>
> -- AS PART OF `go test [flags] ./pkg/...`:
> /usr/bin/time /usr/local/go/pkg/tool/linux_amd64/link -o $WORK/b674/web.test -importcfg $WORK/b674/importcfg.link -s -w -buildmode=exe -buildid=xlo1rhZ4fSqnq2f0090T/zrKinvuitz15kdis6cgV/TiXPK6tw8c4z-r2XpAzE/xlo1rhZ4fSqnq2f0090T -extld=gcc /home/csilvers/.cache/go-build/04/042d0fef4f6befffaefb8b14240f8d40a40632e7cc3157ae895ab0800369c448-d
> 0.56user 0.07system 0:00.59elapsed 108%CPU (0avgtext+0avgdata 161812maxresident)k
>
> -- AS PART OF `go test [flags] ./pkg/web`:
> /usr/bin/time /usr/local/go/pkg/tool/linux_amd64/link -o $WORK/b001/web.test -importcfg $WORK/b001/importcfg.link -s -w -buildmode=exe -buildid=xlo1rhZ4fSqnq2f0090T/zrKinvuitz15kdis6cgV/TiXPK6tw8c4z-r2XpAzE/xlo1rhZ4fSqnq2f0090T -extld=gcc /home/csilvers/.cache/go-build/04/042d0fef4f6befffaefb8b14240f8d40a40632e7cc3157ae895ab0800369c448-d
> 2.69user 0.20system 0:03.29elapsed 88%CPU (0avgtext+0avgdata 535280maxresident)k
>
> The commands are identical, well with the possible exception of the contents of importcfg.link, but when I examined them they looked identical as well.
>
> The main difference is user-time, and I don't know how to get more insight into that. I tried using `ltrace`, but the linker doesn't have the symbol tables it needs.
>
> I did run with `-toolexec=strace -f -c`, and it indicated that the time-difference is due to `futex` calls: between them, the 32 linker calls (for the 32 packages) spent 12.5 seconds in futex() for the ./pkg/... run, and 20 seconds in futex() for the individual-package runs. This accounts for almost all the difference in the test-run times. I don't know what to make of that though.

The linker is definitely slow, though it got faster in 1.15 and will
get faster again in the future 1.16 (I don't know which version you
are running).

The build cache only applies to object files. Fully linked
executables are not cached. Or, more precisely, they are not put into
the build cache. If you run "go build -o x PKG" for some main package
PKG, and x already exists, and is fully up to date, then it will not
be relinked. But test binaries are not normally saved anywhere.

That said, I don't know if this will help for your use case, but you
can use "go test -c PKG" to build the test binary for PKG. Then you
can just run the resulting PKG.test file directly to run the tests.
But if you want to distribute the test building time as well as the
test running time, that may not help. But it might help to save the
test binaries across runs, and use "go test -c -o saved-test".

Ian



> On Monday, October 26, 2020 at 10:04:27 PM UTC-7 Ian Lance Taylor wrote:
>>
>> On Mon, Oct 26, 2020 at 8:46 AM Craig Silverstein
>> <csil...@khanacademy.org> wrote:
>> >
>> > > You realize that the flag "-count 1" disable the cache, right?
>> >
>> > Well, it disables *a* cache. The docs don't make it clear, but it
>> > seems like it only disables the test-cache. I don't see why it would
>> > affect the build caches -- build time being my theory as to why the
>> > tests are slower. But maybe it does? Or maybe the extra time isn't
>> > being spent in building after all.
>>
>> Hi Craig.
>>
>> That's correct: -count 1 disables the test cache but not the build
>> cache. But it does take some time to check all the source files and
>> verify that the build cache is correct. Also, you can use "go test
>> -x" to see the commands that it is running. Perhaps the build cache
>> is being invoked after all. If there are commands being run, you can
>> get some idea of how long they take by using "go test
>> -toolexec=/usr/bin/time -x".
>>
>> Ian
>
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/69498b58-7907-48ab-b721-4790db8f983bn%40googlegroups.com.

Craig Silverstein

unread,
Oct 27, 2020, 1:54:18 PM10/27/20
to golang-nuts
On Tuesday, October 27, 2020 at 10:45:49 AM UTC-7 Ian Lance Taylor wrote:
The linker is definitely slow, though it got faster in 1.15 and will
get faster again in the future 1.16 (I don't know which version you
are running).

We're on go 1.13.1.

It's not so much the linker speed I'm concerned with, it's why is the linker faster (in aggregate) when being called 32 times via `go test ./pkg/...` vs being called 32 times via `go test <one package>` 32 times.  I can understand linker times fluctuating a bit from run to run for Reasons, but this aggregate result is very consistent.  It's surprising to me, though may not be worth looking into if the linker has changed a lot since 1.13.

In any case, turning off go vet was a useful speedup in itself; the remaining time difference may be small enough to allow the project to proceed.  I'd love more insight into the weird linker-time behavior if anyone has it!, but we're not blocked on that anymore.  Thanks again for your help.

craig
Reply all
Reply to author
Forward
0 new messages