PSA: stress timeouts and decreasing parallelism for TC stressrace builds


Andrei Matei

Sep 7, 2018, 3:20:05 PM
to CockroachDB, Nikhil Benesch
Comrades,

TL;DR: we're reducing the parallelism of the nightly stressrace build; we should all close the TC timeout issues for that build that are assigned to us and start fresh tonight.

For a while now (and continuing), we've had a continuous avalanche of test-timeout flake issues filed from the nightly TeamCity stress jobs - in particular the "stressrace" job.
This build stresses packages in isolation and currently uses a 40min timeout (for the whole package, not for individual tests).
In particular, the sql package sometimes has trouble staying within the 40min timeout. On my laptop, it runs under race in ~8min, and under stressrace (default parallelism of 8) it just timed out at 25min (not completely surprising, since an unscientific glance suggests the package uses only ~3 CPUs when it can).
Do these timeouts mean that our tests are slow / have high variance, or that the machine gets really overloaded and some processes starve? It could be the former (e.g. we've recently had 28395 causing high variance), but there are also indications of the latter - Peter seems to be looking at 29144 carefully and the explanation there seems to be overloaded machines causing slow server startup; I've also seen a particular test take 5m on one of these TC runs, but I can't repro anything close to that.

Going with the overload theory, Nikhil just sent out 29819, which reduces the stress parallelism from the current 8 (the default, i.e. the number of hardware threads) to 4. Thus, I think we should all just close all the 40min timeout issues (and the 30min ones, the older timeout) and start with a fresh slate tonight.

Separately, I've historically been frustrated by these timeout github issues because they're random and hard to parse - the test named in the issue title is just the random loser that happened to be running when the guillotine fell. In all the cases I've looked at, the named test had not been running for a particularly suspicious amount of time, so the automatic assignee of the issue is even more random than usual. To add insult to injury, it's not trivial to figure out, for a particular timed-out package run, which test(s) did take abnormal amounts of time. I'm going to attempt to do something about it by modifying the github issue poster script to parse the times of the various tests and choose a more legitimate test for the subject / report gluttonous tests in the issue. This would serve the same purpose as the Tests tab of other TC builds, which lists time per test (although it appears to be broken at the moment). Tell me if you have ideas / opinions on this.
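
To make the idea concrete, the kind of parsing I have in mind is just scraping the per-test lines out of the verbose test output that the stress run captures - the lines look like "--- PASS: TestFoo (12.34s)". A rough sketch (illustrative only; none of this is the actual issue poster script):

// slow_tests.go: a made-up sketch that reads verbose `go test` output from
// stdin, extracts per-test durations, and prints the ten slowest tests.
package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
    "sort"
    "strconv"
)

// Matches lines like "--- PASS: TestFoo (12.34s)" or "--- FAIL: TestBar (301.20s)".
var testResultRE = regexp.MustCompile(`^\s*--- (?:PASS|FAIL): (\S+) \(([0-9.]+)s\)`)

type testTime struct {
    name    string
    seconds float64
}

func main() {
    var times []testTime
    scanner := bufio.NewScanner(os.Stdin)
    scanner.Buffer(make([]byte, 1<<20), 1<<20)
    for scanner.Scan() {
        if m := testResultRE.FindStringSubmatch(scanner.Text()); m != nil {
            secs, err := strconv.ParseFloat(m[2], 64)
            if err != nil {
                continue
            }
            times = append(times, testTime{name: m[1], seconds: secs})
        }
    }
    // Slowest first: these are the candidates worth naming in the issue,
    // rather than whichever test the package timeout happened to interrupt.
    sort.Slice(times, func(i, j int) bool { return times[i].seconds > times[j].seconds })
    for i, tt := range times {
        if i == 10 {
            break
        }
        fmt.Printf("%-60s %7.1fs\n", tt.name, tt.seconds)
    }
}

Feeding it the log of a timed-out run should immediately show which tests, if any, were gluttonous.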

Vivek Menezes

Sep 7, 2018, 3:34:44 PM
to Andrei Matei, CockroachDB, Nikhil Benesch
Since these run at night, is there any value in subjecting them to a 40m timeout? My experience has been that this timeout is only valuable because sometimes a test deadlocks and needs the guillotine. But a much higher timeout like 60 or even 90m would catch such problems without wasting everyone's time.

Has a lower timeout ever been useful?


Andrei Matei

Sep 7, 2018, 3:41:41 PM
to Vivek Menezes, CockroachDB, Nikhil Benesch
Well, we've had issues of systemic variance (e.g. 28395, which I quoted before), and even absent variance, my opinion is that we should guard somewhat against general test slowdowns. So I think a timeout that can be hit by non-deadlocked tests is probably useful.
As another example, while trawling through some of these reports, I saw TestTruncateCompletion take 5 minutes (and that's one case where the test was technically successful and was not named in the issue). There might be a legit bug there, so again a timeout can prove useful - although that could also be addressed by looking at test run times explicitly rather than just through a global timeout, which is why I want to play a bit with generating some sort of report and seeing what info can be extracted from it.

Vivek Menezes

Sep 7, 2018, 7:23:36 PM
to Andrei Matei, CockroachDB, Nikhil Benesch
There is clearly no support in Go for checking whether a particular test is making progress. But we do call stopper.Stop() in a lot of tests, so I'm wondering if we should set a deadline on it being called for tests that use TestServer or TestCluster. That would cover a lot of tests.
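
Something like that wouldn't need any support from the testing package, I think - a plain timer would do. A rough sketch of the kind of helper I have in mind (the names are made up; nothing like this exists in the tree):

// watchdog.go: a made-up sketch of the idea above. A test that starts a
// TestServer/TestCluster would arm the watchdog right after the server comes
// up and disarm it next to its stopper.Stop() call. If the deadline passes
// first, we dump all goroutine stacks and exit, so the run fails loudly
// instead of hanging until the package-level timeout.
package watchdog

import (
    "fmt"
    "os"
    "runtime"
    "time"
)

// Watch arms a timer and returns a disarm function that the test must call
// (e.g. via defer) before the deadline elapses.
func Watch(testName string, deadline time.Duration) (disarm func()) {
    timer := time.AfterFunc(deadline, func() {
        buf := make([]byte, 1<<20)
        n := runtime.Stack(buf, true /* all goroutines */)
        fmt.Fprintf(os.Stderr, "%s did not stop its server within %s:\n%s\n",
            testName, deadline, buf[:n])
        os.Exit(2)
    })
    return func() { timer.Stop() }
}

In a test it would look roughly like defer Watch(t.Name(), 10*time.Minute)() next to where the server is started.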

Peter Mattis

Sep 8, 2018, 9:30:37 AM
to Vivek Menezes, Andrei Matei, CockroachDB, Nikhil Benesch
I think a longer timeout in and of itself is insufficient. The problem is not that the timeout is too short, but that on a severely overloaded machine cockroach might not make any progress at all. For example, I've seen evidence that on severely overloaded stressrace runs liveness heartbeats start to fail. Without liveness, nothing in the system makes progress. There is an argument that this should be fixed, but I think this overloaded configuration is unusual enough that it isn't something we should spend significant time on now.

PS Note that I'm talking about stressrace, which uses the Go race detector and thus already imposes a significant slowdown. Cockroach should handle overload situations gracefully, but stressrace is a particularly brutal one.



Andrei Matei

Sep 10, 2018, 2:18:45 PM
to Peter Mattis, Vivek Menezes, CockroachDB, Nikhil Benesch
Orthogonally (should I stop using this word?) to the discussion about what the package timeout should be, I've been working in https://reviewable.io/reviews/cockroachdb/cockroach/29987 on improving the reporting we get from the nightly stress runs when the timeouts do happen.
Starting tonight, there should be a slow-tests-report.txt in the artifacts of any failed TC stress job listing slow tests.
If a timeout is hit, the report is also included in the github issue that gets created. The issue also discriminates between two cases:
1) if the test that was running when the timeout hit had been running for longer than any other test, it is considered the culprit and the issue is assigned as per the old heuristics (whoever has the blame on the TestFoo() line, or something like that).
2) otherwise, the test is considered an innocent bystander and the issue doesn't name it. Instead, the issue just references the package and the bug is assigned to... me (see the sketch below).
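
In case it helps to see the rule spelled out, the decision boils down to something like this (a paraphrase with made-up names, not the actual code from the change):

package issues

import "time"

type slowTest struct {
    Name     string
    Duration time.Duration
}

// chooseCulprit: given the durations of the tests that finished before the
// package timeout fired, plus the test that was running (and for how long)
// when it fired, return the test to blame in the issue title, or "" if the
// running test looks like an innocent bystander - in which case the issue
// only names the package.
func chooseCulprit(runningTest string, runningFor time.Duration, finished []slowTest) string {
    for _, t := range finished {
        if t.Duration > runningFor {
            // Some test that already completed ran for longer than the one the
            // guillotine happened to catch; don't blame the bystander.
            return ""
        }
    }
    return runningTest
}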

Unrelatedly, if the stress issue reports suddenly stop working, I might know who did it.



