I think a longer timeout in and of itself is insufficient. The problem is not that the timeout is too short, but that on a severely overloaded machine cockroach might not make any progress at all. For example, I've seen evidence that on severely overloaded stressrace runs, liveness heartbeats start to fail. Without liveness, nothing in the system will make progress. There is an argument that this should be fixed, but I think this overloaded configuration is unusual enough that it shouldn't be something we spend significant time on now.

PS: Note that I'm talking about stressrace, which uses the Go race detector and thus already imposes a significant slowdown. Cockroach should handle overload situations gracefully, but stressrace is a particularly brutal one.
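To make the failure mode concrete, here's a minimal sketch in Go of the cascade (a toy model with made-up record layout and timing, not CockroachDB's actual liveness code):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // A toy liveness record: the node counts as live only while its
    // last heartbeat is newer than the expiration window.
    type liveness struct {
        mu            sync.Mutex
        lastHeartbeat time.Time
    }

    const expiration = 4500 * time.Millisecond // made-up window

    func (l *liveness) beat() {
        l.mu.Lock()
        defer l.mu.Unlock()
        l.lastHeartbeat = time.Now()
    }

    func (l *liveness) isLive() bool {
        l.mu.Lock()
        defer l.mu.Unlock()
        return time.Since(l.lastHeartbeat) < expiration
    }

    func main() {
        l := &liveness{lastHeartbeat: time.Now()}

        // Heartbeat loop: on a healthy machine each renewal lands well
        // inside the window. Under severe overload (say, stressrace at
        // parallelism 8 on an 8-thread box) this goroutine may simply
        // not get scheduled for several seconds, and the record expires.
        go func() {
            for range time.Tick(time.Second) {
                l.beat()
            }
        }()

        // Everything else gates its progress on liveness: once the
        // record is expired, leases can't be acquired and nothing moves.
        for i := 0; i < 5; i++ {
            time.Sleep(2 * time.Second)
            if !l.isLive() {
                fmt.Println("node considered dead; no progress anywhere")
            }
        }
    }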
On Fri, Sep 7, 2018 at 3:34 PM, Vivek Menezes <vivek....@gmail.com> wrote:
Since these run at night, is there any value in subjecting them to a 40m timeout? My experience has been that this timeout is only valuable because sometimes a test deadlocks and needs the guillotine, but a much higher timeout like 60 or even 90m would catch such problems without wasting everyone's time.
Has a lower timeout ever been useful?
On Fri, 7 Sep, 2018, 3:20 PM Andrei Matei, <and...@cockroachlabs.com> wrote:
Comrades,
TL;DR: we're reducing the parallelism of the nightly stressrace build; we should all close the TC timeout issues for that build that are assigned to us and start fresh tonight.

For a while, and continuing, we seem to have had a continuous avalanche of test timeout flake issues filed from the nightly TeamCity stress jobs, in particular the "stressrace" job. This build stresses packages in isolation and currently uses a 40min timeout (for the whole package, not for individual tests).
In particular, the sql package seems to have trouble staying within the 40min timeout sometimes. On my laptop, it runs under race in ~8min, and under stressrace (default parallelism of 8) it just timed out at 25min (not completely surprising, since an unscientific glance suggests that the package only uses ~3 CPUs when it can).
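For anyone unfamiliar with what the harness's parallelism means in practice, here's a rough sketch of the run-many-copies-of-the-test-binary-under-one-deadline model (the binary name and the bookkeeping are placeholders of mine; the real stress tool differs in the details):

    package main

    import (
        "context"
        "fmt"
        "os/exec"
        "time"
    )

    func main() {
        const parallelism = 8 // the stress default: one worker per hardware thread
        const packageTimeout = 40 * time.Minute

        // All workers for a package share a single deadline; when it
        // fires, whichever test happens to be running gets blamed.
        ctx, cancel := context.WithTimeout(context.Background(), packageTimeout)
        defer cancel()

        results := make(chan error)
        for i := 0; i < parallelism; i++ {
            go func() {
                for ctx.Err() == nil {
                    // Each worker runs the precompiled test binary in a
                    // loop. "./sql.test" is a placeholder name. With 8
                    // workers on an 8-thread box and each run wanting
                    // ~3 CPUs, the machine is heavily oversubscribed.
                    results <- exec.CommandContext(ctx, "./sql.test").Run()
                }
            }()
        }

        for {
            select {
            case <-ctx.Done():
                fmt.Println("package timeout hit")
                return
            case err := <-results:
                if err != nil {
                    fmt.Println("test binary run failed:", err)
                    return
                }
            }
        }
    }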
Do these timeouts mean that our tests are slow / have high variance, or that the machine gets really overloaded and some processes starve? It could be the former (e.g. we've recently had 28395 causing high variance), but there are also indications of the latter: Peter seems to be looking at 29144 carefully, and the explanation there seems to be overloaded machines causing slow server startup; I've also seen a particular test take 5m on one of these TC runs, but I can't repro anything close to that.
Going with the overload theory, Nikhil just sent out 29819, which reduces stress's parallelism from the current 8 (the default, equal to the number of hardware threads) to 4. Thus, I think we should all just close all the 40min timeout issues (and the 30min ones, from the older timeout) and start with a fresh slate tonight.
Separately, I've historically been frustrated because these timeout GitHub issues are random and hard to parse: the test named in the issue title is the random loser that happened to be running at the time the guillotine fell. In all the cases I've looked at, the named test had not been running for a particularly suspicious amount of time, so the automatic assignee of the issue is even more random than usual. To add insult to injury, it's not trivial to figure out, for a particular timed-out package run, which test(s) did take abnormal amounts of time.

I'm going to attempt to do something about it by modifying the GitHub issue poster script to parse the times of the various tests and choose a more legitimate test for the subject / report gluttonous tests in the issue. This would serve the same purpose as the Tests tab from other TC builds, which lists time per test (although it appears to be broken at the moment). Tell me if you have ideas / opinions on this.
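For concreteness, here's a rough sketch of the parsing I have in mind (the regexp and the surrounding plumbing are placeholders, not the final script): scan the timed-out package's log for the per-test duration lines that go test -v emits and surface the slowest ones.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
        "sort"
        "strconv"
    )

    // go test -v ends each top-level test with a line like:
    //   --- PASS: TestFoo (12.34s)
    // (or FAIL). Capture the name and the seconds.
    var testLineRE = regexp.MustCompile(`^--- (?:PASS|FAIL): (\S+) \(([\d.]+)s\)`)

    type testTime struct {
        name    string
        seconds float64
    }

    func main() {
        var times []testTime
        scanner := bufio.NewScanner(os.Stdin) // pipe the package's log in
        for scanner.Scan() {
            if m := testLineRE.FindStringSubmatch(scanner.Text()); m != nil {
                secs, _ := strconv.ParseFloat(m[2], 64) // regexp guarantees a number
                times = append(times, testTime{name: m[1], seconds: secs})
            }
        }

        // Slowest first: the top entries are the gluttons worth naming
        // in the issue, rather than whoever was running at the cutoff.
        sort.Slice(times, func(i, j int) bool { return times[i].seconds > times[j].seconds })
        for i := 0; i < len(times) && i < 5; i++ {
            fmt.Printf("%s\t%.2fs\n", times[i].name, times[i].seconds)
        }
    }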