What happened:
The new k8s.io/kubernetes/cmd/kube-apiserver/app/testing.StartTestServerOrDie function (introduced in #46865) returns a teardown function which doesn't cleanly shut down the test server. This results in the accumulation of goroutines and log spam, which prevents effective use of the test server across multiple test functions within the same process.

What you expected to happen:
Calling the teardown function should gracefully terminate everything that started up when StartTestServerOrDie was called.
How to reproduce it (as minimally and precisely as possible):
Using the following sample integration test code:
```go
import (
	"fmt"
	"runtime"
	"testing"

	apitesting "k8s.io/kubernetes/cmd/kube-apiserver/app/testing"
)

func TestTeardown(t *testing.T) {
	_, tearDown := apitesting.StartTestServerOrDie(t)
	tearDown()

	stack := make([]byte, 8196)
	size := 0
	for {
		size = runtime.Stack(stack, true)
		if size < len(stack) {
			break
		}
		stack = make([]byte, len(stack)*2)
	}
	fmt.Printf("%s\n", string(stack[0:size]))
}
```
After tearDown() returns, there are several lingering goroutines which forever attempt to maintain etcd connections to the terminated etcd instance. Here's an example stack dump:
https://gist.github.com/ironcladlou/52b3e3306948db3943b426c70ce7f85b
Among all the etcd connection goroutines, you'll notice lingering Cacher instances (created because of the default EnableWatchCache storage setting) which appear to be trying to hold watches, and a configuration_manager (which may or may not hold connections; I'm not sure yet). This seems to indicate that various components started during apiserver setup aren't actually shutting down.
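If the watch cache really is the source of the lingering Cacher goroutines, one way to test that theory is to start the server with the watch cache disabled and compare the leftovers. This is only a sketch: it assumes the 4-argument StartTestServerOrDie form and framework.SharedEtcd shown in the updated reproducer later in this thread, plus the kube-apiserver --watch-cache flag.

```go
// Sketch (assumptions labeled above): disable the watch cache so no Cachers
// are created, then look at what survives teardown.
package apiserver

import (
	"runtime"
	"testing"

	apitesting "k8s.io/kubernetes/cmd/kube-apiserver/app/testing"
	"k8s.io/kubernetes/test/integration/framework"
)

func TestTeardownWithoutWatchCache(t *testing.T) {
	s := apitesting.StartTestServerOrDie(t, nil,
		[]string{"--watch-cache=false"}, // assumed flag: skip Cacher creation
		framework.SharedEtcd())
	s.TearDownFn()
	t.Logf("goroutines after teardown: %d", runtime.NumGoroutine())
}
```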
Anything else we need to know?:
This is important for enabling integration testing of custom resource garbage collection (#47665).
Environment:
Kubernetes version (use kubectl version): master (as of 088141c)
Kernel (e.g. uname -a):

/cc @sttts @caesarxuchao @deads2k @liggitt @kubernetes/sig-api-machinery-bugs
/kind bug
/sig api-machinery
/cc @frobware
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.

/lifecycle stale
/remove-lifecycle stale
We don't use this testing framework in the service-catalog, but we have a similar problem.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten
/lifecycle frozen
Still an issue. Updated reproducer:

```go
package apiserver

import (
	"math"
	"runtime"
	"testing"
	"time"

	apitesting "k8s.io/kubernetes/cmd/kube-apiserver/app/testing"
	"k8s.io/kubernetes/test/integration/framework"
)

func TestCleanTeardown(t *testing.T) {
	etcdConfig := framework.SharedEtcd()

	// Watch goroutines for 5 seconds before starting the server to get a high-water mark
	before := 0
	for i := 0; i < 500; i++ {
		if goroutines := runtime.NumGoroutine(); goroutines > before {
			before = goroutines
		}
		time.Sleep(10 * time.Millisecond)
	}
	beforeStack := make([]byte, 1024*1024)
	size := runtime.Stack(beforeStack, true)
	beforeStack = beforeStack[:size]

	s := apitesting.StartTestServerOrDie(t, nil, nil, etcdConfig)
	s.TearDownFn()

	// Watch goroutines for 5 seconds after stopping the server to get a low-water mark
	after := math.MaxInt32
	for i := 0; i < 500; i++ {
		if goroutines := runtime.NumGoroutine(); goroutines < after {
			after = goroutines
		}
		time.Sleep(10 * time.Millisecond)
	}

	if after > before {
		afterStack := make([]byte, 1024*1024)
		size := runtime.Stack(afterStack, true)
		afterStack = afterStack[:size]
		t.Logf("before:\n%s", string(beforeStack))
		t.Logf("after:\n%s", string(afterStack))
		t.Errorf("expected %d or fewer, got %d", before, after)
	} else {
		t.Logf("expected %d or fewer, got %d", before, after)
	}
}
```
```
go test ./test/integration/apiserver -run TestCleanTeardown -v
    teardown_test.go:47: expected 6 or fewer, got 60
```
Seems much better than before; not sure if this is within the threshold for considering it resolved.
```
go test ./test/integration/apiserver -run TestCleanTeardown -v
    teardown_test.go:47: expected 9 or fewer, got 10
```
It's a lot closer, definitely. Can we identify the particular goroutine that's still hanging and make sure it's in the list of specifically recorded exceptions to goroutine checking (these should be enumerated in the integration test leak detection code, I think)?

At the very least, updating this to point at the specific goroutines would help clarify the remaining work, and then we can decide how important the remaining ones are.
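One way to point at the specific goroutines, as a sketch: the standard runtime/pprof "goroutine" profile with debug=2 prints one full stack per goroutine after teardown, which makes it easy to match the leftovers against the exception list. Test server setup below mirrors the reproducer above.

```go
// Sketch: dump every goroutine that survives teardown (name, creator, file:line).
package apiserver

import (
	"os"
	"runtime/pprof"
	"testing"

	apitesting "k8s.io/kubernetes/cmd/kube-apiserver/app/testing"
	"k8s.io/kubernetes/test/integration/framework"
)

func TestDumpLeakedGoroutines(t *testing.T) {
	s := apitesting.StartTestServerOrDie(t, nil, nil, framework.SharedEtcd())
	s.TearDownFn()
	// debug=2 prints full stacks per goroutine, like an unrecovered panic.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
}
```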
Hanging goroutine:
```
goroutine 156 [select]:
k8s.io/kubernetes/test/utils/ktesting.withTimeout.func2()
	/kubernetes/test/utils/ktesting/contexthelper.go:56 +0x9a
created by k8s.io/kubernetes/test/utils/ktesting.withTimeout in goroutine 155
	/kubernetes/test/utils/ktesting/contexthelper.go:55 +0x20b
```
From 63aa261: "The new TContext interface combines a normal context and the testing interface, then adds some helper methods. The context gets canceled when the test is done, but that can also be requested earlier via Cancel."
Doesn't seem to be in the goroutine checking exception list.
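For context, a minimal sketch of the general pattern behind a helper like withTimeout (not the actual ktesting implementation; names and structure here are assumptions): a goroutine selects on the parent context and a "test finished" signal, then cancels the derived context. If neither fires for the lifetime of the process, that goroutine sits in the select forever, which is consistent with the stack above.

```go
// Illustrative sketch only; not the real k8s.io/kubernetes/test/utils/ktesting code.
package sketch

import "context"

// withTimeout derives a context that is canceled when the test finishes or
// the parent context is canceled.
func withTimeout(parent context.Context, testDone <-chan struct{}) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		// This is the kind of goroutine shown in the stack trace: it blocks
		// in select until the test finishes or the parent is canceled.
		select {
		case <-testDone: // closed via t.Cleanup in the real helper
		case <-parent.Done():
		}
		cancel()
	}()
	return ctx, cancel
}
```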
@pohly has better insights on the ktesting plumbing
/assign
Ah, the reproducer in #49489 (comment) runs its checks before the things that hooked into test cleanup have run... I think the ktesting context is cleaned up properly as part of that cleanup.
I think GoleakCheck / EtcdMain goleak detection covers the remaining goroutines, so we can probably close this issue.
Indeed. We now use goleak in each test's TestMain via EtcdMain. I cleaned up some more leaks a while back and recorded the remaining known exceptions with comments on goleak.IgnoreTopFunction:

https://github.com/kubernetes/kubernetes/blob/7f4f1c29a97de6712d341bf3d724fb122c828772/test/integration/framework/etcd.go#L254-L297
I think those exceptions are acceptable, so we could close this issue.
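For reference, a sketch of that wiring in an integration test package's TestMain, assuming framework.EtcdMain behaves as described in the comment above; the ignore list itself lives in the linked etcd.go.

```go
// Sketch: EtcdMain starts the shared etcd, runs the tests, and (per the
// comment above) runs the goleak check afterwards, skipping the known
// exceptions recorded with goleak.IgnoreTopFunction in
// test/integration/framework/etcd.go.
package apiserver

import (
	"testing"

	"k8s.io/kubernetes/test/integration/framework"
)

func TestMain(m *testing.M) {
	framework.EtcdMain(m.Run)
}
```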
Closed #49489 as completed.