[llvm-dev] False positive notifications around commit notifications

Philip Reames via llvm-dev

Sep 9, 2021, 6:18:18 PM

to llvm...@lists.llvm.org

I've been noticing a trend where more and more false-positive email notifications are sent out on valid commits. This is getting really problematic, as real signal is being lost in the noise. I've had several cases in the last few weeks where I did not see a "real" failure notice because it was buried in a bunch of false positives.

Let me run through a few sources of what I consider false positives, and suggest a couple things we could do to clean these up. Note that the recommendations here are entirely independent and we can adopt any subset.

Slow Try Bots

ex: "This revision was landed with ongoing or failed builds." on https://reviews.llvm.org/D109091

Someone - I'm not really sure who - enabled builds for all reviews, and this notice on landed commits. Given it's utterly routine to make a last few style fixes before landing an LGTMed change, I consider this notice complete noise. In practice, almost every review gets tagged this way. To be clear, there is value in being told about changes which don't build. The false positive part is only around the "ongoing" builds.

Recommendation: Disable this message for the "ongoing" build case, and if we can't, disable them entirely.

Flaky Builders

ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250

We have many build bots which are not entirely stable. It's gotten to the point where I *expect* failure notifications on literally every change I land. I've been trying to reach out to individual build bot owners to get issues resolved, and to their credit, most owners have been very responsive. However, we have enough builders that the situation isn't getting meaningfully better.

Recommendation: Introduce specific "test commits" whose only purpose is to run the CI infrastructure. Any builder which notifies of failure on such a commit (and only said commit) is disabled without discussion until human action is taken by the bot owner to re-enable. The idea here is to a) automate the process, and b) shift the responsibility of action to the bot owner for any flaky bot.

Note: By "disabled", I specifically mean that *notification* is disabled. Leaving it in the waterfall view is fine, as long as we're not sending out email about it.
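A minimal sketch of how the test-commit rule could be automated, in Python. The Builder record, the notify flag, and both function names are hypothetical; none of this is an existing buildbot or zorg API, and how the set of failing builders is collected is left out.

```python
from dataclasses import dataclass


@dataclass
class Builder:
    name: str
    notify: bool = True  # whether failure email goes to the blame list


def process_test_commit_results(builders, failed_names):
    """Turn off email for builders that failed on a no-op test commit.

    Returns the names of builders newly disabled. The builders keep
    running and stay visible in the waterfall; only email is disabled.
    """
    disabled = []
    for builder in builders:
        if builder.name in failed_names and builder.notify:
            builder.notify = False
            disabled.append(builder.name)
    return disabled


def reenable(builder):
    # Hypothetical manual step, taken by the bot owner after investigating.
    builder.notify = True
```

The point of the design is that disabling is automatic while re-enabling needs a human, which puts the cost of a flaky bot on its owner.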

Aside: It's really tempting to attempt to separate builders which are "still failing" (e.g. a rare configuration which has been broken for a few days) from "flaky" ones. I'd argue any bot notifying on a "still failing" case is buggy, and thus it's fine to treat them the same as a "flaky" bot.

Slow Builders and Redundant Notices

ex: https://lab.llvm.org/buildbot#builders/67/builds/4128

Occasionally, we have a bad commit land which breaks every (or nearly every) builder. That happens. If you happen to land a change just before or after it, you then get on the blame list for every slow running builder we have (since they tend to have large commit windows) if they happen to cycle before the fix is committed. This is particularly annoying since the root issue is likely fixed quickly, but due to cycle times on the builders, you may be getting emails for 24 hours to come.

Recommendation: Introduce a new requirement that "slow" builders (say, cycle time of > 30 minutes) either a) have a maximum commit window of ~15 commits, or b) use a staged builder model. Personally, I'd prefer the staged model, but the max commit window at least helps to limit the damage.

By "staged builder model", I mean that slow builders only build points in the history which have already been successfully built by one of the fast builders. This eliminates redundant build failures, at the cost of delaying the slow builder slightly. As long as the slow builder uses the "last good commit" as opposed to waiting until the current fast builder finishes, the delay should be very minimal for most commits.
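The "last good commit" selection can be sketched in a few lines of Python. The function name and the shape of the inputs (an ordered commit list plus a dict of fast-builder results) are assumptions for illustration, not something buildbot provides in this form.

```python
def last_good_commit(history, fast_results):
    """Pick the commit a slow builder should build next.

    history: commit ids ordered oldest to newest.
    fast_results: dict mapping commit id -> True (passed) or False
        (failed); commits the fast builder has not finished are absent.
    Returns the newest commit that already passed, or None.
    """
    for commit in reversed(history):
        if fast_results.get(commit) is True:
            return commit
    return None
```

Because commits that are still building are simply skipped, the slow builder never waits on the fast one; it just starts from a slightly older point.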

Philip

David Blaikie via llvm-dev

Sep 9, 2021, 6:39:29 PM

to Philip Reames, llvm...@lists.llvm.org

I think most of these discussions are readily agreed upon, but that sending a broad email like this is unlikely to reach the right people/result in action.

At least for the try bots stuff - finding the owner of that system, and seeing if it can be made to not put that notice in the emails.

Yes, the buildbot configuration is intended not to send mail on already-red. If you're seeing those, specifically looking at which bot/configuration, and fixing the configuration - that shouldn't be a per-bot issue, but a buildbot server configuration that's broken in some way.
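For reference, the intended "no mail on already-red" behaviour amounts to notifying only on a green-to-red transition. A minimal Python sketch, with a hypothetical function name and a plain list of pass/fail results standing in for real build history:

```python
def builds_that_send_mail(results):
    """Return indices of builds that should trigger fail-mail.

    results: list of booleans, oldest build first (True = passed).
    Only a green -> red transition notifies; a builder that is
    already red stays quiet until it has been green again.
    """
    notify = []
    previous_passed = True  # assumption: builder counts as green before its first build
    for index, passed in enumerate(results):
        if previous_passed and not passed:
            notify.append(index)
        previous_passed = passed
    return notify
```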

Slow builders - yeah, I'm up for having an upper bound on size of a blame list. I wonder if that could be implemented in the buildbot config itself - if the blame list is over a certain size, just don't send mail. That way we don't even have to classify bots - move them between groups if they become slower/faster - makes it easier for bot owners to fix the issue themselves - allocate more compute resources (faster system, or multiple systems running in parallel) and it'll naturally start sending fail-mail. But that does still leave the duplicate and very delayed results - slow builder with lots of machines dedicated could have small blame lists but still produce an answer hours later, possibly just redundant with some faster builder - hence a staged system would be much nicer instead of or in addition to.
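The blame-list cap could look roughly like this in Python. The function and parameter names are hypothetical and the threshold is deliberately left to the caller; this is a sketch of the policy, not of any existing buildbot option.

```python
def recipients_for_failure(blamelist, max_blamelist):
    """Return who gets fail-mail for a failed build.

    blamelist: commit authors in the failing build's commit window.
    max_blamelist: largest number of distinct authors we still mail.
    A build covering too many authors sends no mail at all.
    """
    authors = sorted(set(blamelist))
    if len(authors) > max_blamelist:
        return []
    return authors
```

Since the check depends only on the size of the blame list, a builder that gets faster hardware starts sending mail again without anyone reclassifying it.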

A staged building system would be great, but requires a bunch of work to build that no one's signed up to do as yet. I don't think anyone would object to such a thing being implemented. (Apple internally had/has a system that uses a staged build system that reused the build product of earlier builds, even - so like a baseline builder, then other builders that can consume that build to run stage 2 and another that consume the build to run test-suite, etc - so less duplicate, more throughput, and less noise - was hoping that's what the greendragon stuff would become, but doesn't seem like it has)

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

via llvm-dev

Sep 9, 2021, 9:36:20 PM

to dbla...@gmail.com, list...@philipreames.com, llvm...@lists.llvm.org

Maybe a round-table at the dev meeting? That might collect more of the relevant folks.

Among my immediate crowd, it’s a cause for (ironic) concern if no bot fail-mail shows up (it’s that rare). Did the commit actually go in? Did I accidentally push a development branch?

--paulr

Mehdi AMINI via llvm-dev

Sep 10, 2021, 2:37:40 PM

to Philip Reames, llvm...@lists.llvm.org

On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:

Slow Try Bots

ex: "This revision was landed with ongoing or failed builds." on https://reviews.llvm.org/D109091

Someone - I'm not really sure who - enabled builds for all reviews, and this notice on landed commits. Given it's utterly routine to make a last few style fixes before landing an LGTMed change

I do such "few style fixes", but I don't re-upload a revision before landing, so I don't see this "false positive" in general.

What I frequently see is that the pre-merge config is broken for some other reason, and that's quite annoying. One aspect of the issue is that there is no buildbot tracking the pre-merge configuration so it can be broken without notification (there is a buildkite job tracking it, but buildkite does not support blamelist notifications).

, I consider this notice complete noise. In practice, almost every review gets tagged this way. To be clear, there is value in being told about changes which don't build. The false positive part is only around the "ongoing" builds.

By "staged builder model", I mean that slow builders only build points in the history which have already been successfully built by one of the fast builders. This eliminates redundant build failures, at the cost of delaying the slow builder slightly. As long as the slow builder uses the "last good commit" as opposed to waiting until the current fast builder finishes, the delay should be very minimal for most commits.

Does buildbot support staged builders? That would really be ideal indeed!

If we could also disable notification to the blamelist when it is larger than 5, that'd be great!

Cheers,

--

Mehdi

Philip Reames via llvm-dev

Sep 10, 2021, 2:53:10 PM

to Mehdi AMINI, llvm...@lists.llvm.org

On 9/10/21 11:36 AM, Mehdi AMINI wrote:

On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:

Slow Try Bots

ex: "This revision was landed with ongoing or failed builds." on https://reviews.llvm.org/D109091

Someone - I'm not really sure who - enabled builds for all reviews, and this notice on landed commits. Given it's utterly routine to make a last few style fixes before landing an LGTMed change

I do such "few style fixes", but I don't re-upload a revision before landing, so I don't see this "false positive" in general.

I don't explicitly upload the final patch either, but something in the close automation does.

What I frequently see is that the pre-merge config is broken for some other reason, and that's quite annoying. One aspect of the issue is that there is no buildbot tracking the pre-merge configuration so it can be broken without notification (there is a buildkite job tracking it, but buildkite does not support blamelist notifications).

Hm, maybe I misinterpreted the cause of these entirely? Your explanation sounds plausible as well.

If your explanation is correct, that would lean strongly to the "just disable" option.

Do you know who to contact about this? (i.e. Who owns the automation here? Or where is the appropriate code to adjust?)

, I consider this notice complete noise. In practice, almost every review gets tagged this way. To be clear, there is value in being told about changes which don't build. The false positive part is only around the "ongoing" builds.

By "staged builder model", I mean that slow builders only build points in the history which have already been successfully built by one of the fast builders. This eliminates redundant build failures, at the cost of delaying the slow builder slightly. As long as the slow builder uses the "last good commit" as opposed to waiting until the current fast builder finishes, the delay should be very minimal for most commits.

Does buildbot support staged builders? That would really be ideal indeed!

If we could also disable notification to the blamelist when it is larger than 5, that'd be great!

I'll be honest here and say I don't know what buildbot natively supports. Even if it doesn't, there are "easy" process workarounds to achieve the same effect. Just as an example (i.e. definitely not proposing this as a technical solution to be implemented right now), we could introduce a new branch in git called e.g. "buildbot-tracking-slow" and have a specific fast builder do a fast-forward merge from main into this branch. All "slow" builders would simply follow this branch and not main.
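To make the branch idea concrete, here is a rough Python sketch of the step the fast builder would run after a green build. The function names and the "origin" remote are assumptions; the git behaviour relied on is that a non-forced push is only accepted when it is a fast-forward of the remote branch.

```python
import subprocess


def advance_tracking_branch_cmd(good_sha, branch="buildbot-tracking-slow",
                                remote="origin"):
    """Build the git command that moves the tracking branch to good_sha.

    No --force is passed, so git rejects the push unless it is a
    fast-forward; the tracking branch can only ever move forward.
    """
    return ["git", "push", remote, f"{good_sha}:refs/heads/{branch}"]


def advance_tracking_branch(good_sha, **kwargs):
    # Hypothetical hook run by the fast builder once good_sha built cleanly.
    subprocess.run(advance_tracking_branch_cmd(good_sha, **kwargs), check=True)
```

Slow builders would then poll "buildbot-tracking-slow" instead of main, so they only ever see commits that already built somewhere.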

If we get consensus that this is the right approach, I am willing to put some of my own time into figuring out how to implement this. For my own volunteer time, I'd probably start with the flaky bot test commit piece just because that's much easier to do manually first and then automate, and because I find them personally more annoying.

Your point about disabling notification on a blamelist larger than 5 seems reasonable to me, but I'd definitely consider "build chunks of no more than N commits" and "build arbitrary sets, but only notify if less than M people" as distinct possibilities to be evaluated independently.

David Blaikie via llvm-dev

Sep 10, 2021, 5:03:18 PM

to Philip Reames, Mikhail Goncharov, Christian Kühnel, llvm...@lists.llvm.org

The folks on the Infrastructure Working Group: https://foundation.llvm.org/docs/infrastructure-wg/ might have some context on this (I think Christian Kuhnel is part of that & he's on this list).

On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:

On 9/10/21 11:36 AM, Mehdi AMINI wrote:

On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:

Slow Try Bots

ex: "This revision was landed with ongoing or failed builds." on https://reviews.llvm.org/D109091

Someone - I'm not really sure who - enabled builds for all reviews, and this notice on landed commits. Given it's utterly routine to make a last few style fixes before landing an LGTMed change

I do such "few style fixes", but I don't re-upload a revision before landing, so I don't see this "false positive" in general.

I don't explicitly upload the final patch either, but something in the close automation does.

What I frequently see is that the pre-merge config is broken for some other reason, and that's quite annoying. One aspect of the issue is that there is no buildbot tracking the pre-merge configuration so it can be broken without notification (there is a buildkite job tracking it, but buildkite does not support blamelist notifications).

Hm, maybe I misinterpreted the cause of these entirely? Your explanation sounds plausible as well.

If your explanation is correct, that would lean strongly to the "just disable" option.

Do you know who to contact about this? (i.e. Who owns the automation here? Or where is the appropriate code to adjust?)

Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it was set up by https://reviews.llvm.org/p/goncharov/ (added them to the "to" line on this email). Also looks like Kuhnel ( https://reviews.llvm.org/H576 ) has something to do with it, so I've added them too.

, I consider this notice complete noise. In practice, almost every review gets tagged this way. To be clear, there is value in being told about changes which don't build. The false positive part is only around the "ongoing" builds.

By "staged builder model", I mean that slow builders only build points in the history which have already been successfully built by one of the fast builders. This eliminates redundant build failures, at the cost of delaying the slow builder slightly. As long as the slow builder uses the "last good commit" as opposed to waiting until the current fast builder finishes, the delay should be very minimal for most commits.

Does buildbot support staged builders? That would really be ideal indeed!

If we could also disable notification to the blamelist when it is larger than 5, that'd be great!

I'll be honest here and say I don't know what buildbot natively supports. Even if it doesn't, there are "easy" process workarounds to achieve the same effect. Just as an example (i.e. definitely not proposing this as a technical solution to be implemented right now), we could introduce a new branch in git called e.g. "buildbot-tracking-slow" and have a specific fast builder do a fast-forward merge from main into this branch. All "slow" builders would simply follow this branch and not main.

If we get consensus that this is the right approach, I am willing to put some of my own time into figuring out how to implement this. For my own volunteer time, I'd probably start with the flaky bot test commit piece just because that's much easier to do manually first and then automate, and because I find them personally more annoying.

There was some prototype buildbot based (so far as I can tell) staged builder setup years ago, it seems: https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's some commit history in the zorg repo that shows how it was configured.

Your point about disabling notification on a blamelist larger than 5 seems reasonable to me, but I'd definitely consider "build chunks of no more than N commits" and "build arbitrary sets, but only notify if less than M people" as distinct possibilities to be evaluated independently.

"chunks of no more than N commits" risks slower builders getting behind - presumably you'd need some catchup mechanism (run a build with all the available changes, but don't notify) if they just kept falling further and further behind.

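A rough Python sketch of "chunks of no more than N commits" with a catch-up mode. The function name and both thresholds are hypothetical; the idea is only that a builder which has fallen too far behind builds everything pending in one go and stays silent for that build.

```python
def next_build(pending, max_chunk, catchup_threshold):
    """Choose the commits for a slow builder's next build.

    pending: unbuilt commits, oldest first.
    max_chunk: normal upper bound on commits per build (N).
    catchup_threshold: backlog size above which we give up on small
        chunks, build the whole backlog at once, and send no mail.
    Returns (commits_to_build, notify).
    """
    if not pending:
        return [], False
    if len(pending) > catchup_threshold:
        return list(pending), False  # catch-up build, no fail-mail
    return list(pending[:max_chunk]), True
```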
- Dave

Mehdi AMINI via llvm-dev

Sep 10, 2021, 8:58:25 PM

to David Blaikie, llvm...@lists.llvm.org, Mikhail Goncharov

Slightly off-topic, but relevant for slow builders: switching to using `ccache` made a huge difference for one of our builders ( https://lab.llvm.org/buildbot/#/builders/61 ); it'll occasionally spend 25-30 min but otherwise is mostly taking only ~4min. Maybe we could advise this more widely as well?

--

Mehdi

David Blaikie via llvm-dev

Sep 10, 2021, 9:02:17 PM

to Mehdi AMINI, llvm...@lists.llvm.org, Mikhail Goncharov

Sounds alright. I'd love to see incremental builds be made reliable enough that builders would depend on them - not that that would completely invalidate benefits of ccache, but presumably a lot of the benefit comes when doing clean builds (& I get why clean builds are desirable - changes to CMake, etc, don't always work perfectly without a clean rebuild today)

Florian Hahn via llvm-dev

Sep 22, 2021, 5:45:35 AM

to Philip Reames, llvm...@lists.llvm.org

Hi Philip,

On Sep 9, 2021, at 23:18, Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:
Flaky Builders
ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
We have many build bots which are not entirely stable. It's gotten to the point where I *expect* failure notifications on literally every change I land. I've been trying to reach out to individual build bot owners to get issues resolved, and to their credit, most owners have been very responsive. However, we have enough builders that the situation isn't getting meaningfully better.
Recommendation: Introduce specific "test commits" whose only purpose is to run the CI infrastructure. Any builder which notifies of failure on such a commit (and only said commit) is disabled without discussion until human action is taken by the bot owner to re-enable. The idea here is to a) automate the process, and b) shift the responsibility of action to the bot owner for any flaky bot.

Thanks for raising this issue! My experience matches what you are describing. The false positive rate for me seems to be at least 10 false positives due to flakiness to 1 real failure.

I think it would be good to have some sort of policy spelling out the requirements for having notification enabled for a buildbot, with a process that makes it easy to disable flaky bots until the owners can make them more stable. It would be good if notifications could be disabled without requiring contacting/interventions from individual owners, but I am not sure if that’s possible with buildbot.

Cheers,

Florian

Martin Storsjö via llvm-dev

Sep 22, 2021, 5:50:40 AM

to Florian Hahn, llvm...@lists.llvm.org

On Wed, 22 Sep 2021, Florian Hahn via llvm-dev wrote:

> Thanks for raising this issue! My experience matches what you are
> describing. The false positive rate for me seems to be at least 10 false
> positives due to flakiness to 1 real failure.
> I think it would be good to have some sort of policy spelling out the
> requirements for having notification enabled for a buildbot, with a process
> that makes it easy to disable flaky bots until the owners can make them more
> stable. It would be good if notifications could be disabled without
> requiring contacting/interventions from individual owners, but I am not sure
> if that’s possible with buildbot.

Another aspect is that some tests can be flakey - they might work
seemingly fine in local testing but start showing up as timeouts/spurious
failures when run in a CI/buildbot setting. And due to their flakiness,
it's not evident when the breakage is introduced, but over time, such
flakey tests/setups do add up, to the situation we have today.

// Martin

Nemanja Ivanovic via llvm-dev

Oct 6, 2021, 7:08:23 AM

to Martin Storsjö, llvm...@lists.llvm.org

I wonder if it would be possible to make some recommendations for improvements based on data rather than our collective anecdotal experience. Much as anyone else, I feel that the vast majority of the failure emails I get are not related, but I would have a lot of trouble quantifying it any better than a "gut feeling".

Would it be possible to somehow acquire historical data from buildbots to help identify things that can improve? Perhaps:

- Bot failures where none of the commits were reverted before the bot went back to green

- For those failures, collect the test cases that failed - those might be flaky test cases if they show up frequently and/or on multiple bots

- For bots that have many such instances (especially with different test cases every time), perhaps the bot itself is somehow flaky
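The three bullets above could be prototyped along these lines in Python, assuming the historical data has already been exported into simple records. The record layout (builder, failed_tests, reverted) and the function names are hypothetical, and "no blamed commit was reverted" is only a heuristic: a real failure that was fixed forward would be counted too.

```python
from collections import Counter


def unreverted_failures(episodes):
    """Failures where no blamed commit was reverted before the bot went green."""
    return [e for e in episodes if not e["reverted"]]


def suspect_tests(episodes, min_count=2):
    """Tests that fail in several unreverted episodes: possibly flaky tests."""
    counts = Counter()
    for episode in unreverted_failures(episodes):
        counts.update(set(episode["failed_tests"]))
    return {test: n for test, n in counts.items() if n >= min_count}


def suspect_builders(episodes, min_count=2):
    """Builders with several unreverted failures: possibly flaky bots."""
    counts = Counter(e["builder"] for e in unreverted_failures(episodes))
    return {builder: n for builder, n in counts.items() if n >= min_count}
```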

This is definitely an annoying problem that has significant consequences (real failures being missed due to many false failures), but it is a difficult problem to solve.

David Blaikie via llvm-dev

Oct 11, 2021, 1:56:59 PM

to Nemanja Ivanovic, llvm...@lists.llvm.org

Here's a fun one: https://lab.llvm.org/buildbot/#/builders/164/builds/3428 - a buildbot failure with a single blame (me) - but I hadn't committed in the last few days, so I was confused. Turns out it's from a change committed 3 months ago - and the failure is a timeout.

Given the number of buildbot timeout false positives, I honestly wouldn't be averse to saying timeouts shouldn't produce fail-mail & are the responsibility of buildbot owners to triage. I realize we can actually submit code that leads to timeouts, but on balance that seems rare compared to the number of times it's a buildbot configuration issue instead. (though open to debate on that for sure)
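As a sketch of that routing rule in Python: timeouts go to the bot owner only, everything else goes to the blame list. The function name and the "timeout" label are hypothetical, and how a failure gets classified as a timeout is assumed to happen elsewhere.

```python
def fail_mail_recipients(failure_kind, blamelist, owner):
    """Decide who is mailed about a failed build.

    failure_kind: e.g. "timeout", "compile", "test" (labels are assumptions).
    blamelist: commit authors in the failing build.
    owner: contact address of the builder's owner.
    """
    if failure_kind == "timeout":
        return [owner]  # owner triages; committers are not mailed
    return sorted(set(blamelist))
```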

Michael Kruse via llvm-dev

Oct 11, 2021, 3:07:55 PM

to David Blaikie, llvm...@lists.llvm.org

On Mon, Oct 11, 2021 at 12:57, David Blaikie via llvm-dev
<llvm...@lists.llvm.org> wrote:

> Here's a fun one: https://lab.llvm.org/buildbot/#/builders/164/builds/3428 - a buildbot failure with a single blame (me) - but I hadn't committed in the last few days, so I was confused. Turns out it's from a change committed 3 months ago - and the failure is a timeout.
>
> Given the number of buildbot timeout false positives, I honestly wouldn't be averse to saying timeouts shouldn't produce fail-mail & are the responsibility of buildbot owners to triage. I realize we can actually submit code that leads to timeouts, but on balance that seems rare compared to the number of times it's a buildbot configuration issue instead. (though open to debate on that for sure)

Wow, that bot does not collapse buildrequests and is indeed 3 months
behind due to not being fast enough to keep up with LLVM's commit
rate. Even if the bot was reliable, getting notified 3 months later
isn't useful.
From the wildly varying duration the test step takes (5 - 33 minutes;
not the build step, it is doing incremental builds), I assume that the
worker is running other things in parallel, maybe another worker, such
that the buildjob sometimes is starving and causing the timeout. IMHO
buildbots should not run other heavy jobs in parallel.

Michael

Chris Lattner via llvm-dev

Oct 12, 2021, 12:53:52 PM

to Michael Kruse, llvm...@lists.llvm.org

> On Oct 11, 2021, at 12:06 PM, Michael Kruse via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> On Mon, Oct 11, 2021 at 12:57, David Blaikie via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>> Here's a fun one: https://lab.llvm.org/buildbot/#/builders/164/builds/3428 - a buildbot failure with a single blame (me) - but I hadn't committed in the last few days, so I was confused. Turns out it's from a change committed 3 months ago - and the failure is a timeout.
>>
>> Given the number of buildbot timeout false positives, I honestly wouldn't be averse to saying timeouts shouldn't produce fail-mail & are the responsibility of buildbot owners to triage. I realize we can actually submit code that leads to timeouts, but on balance that seems rare compared to the number of times it's a buildbot configuration issue instead. (though open to debate on that for sure)
>
> Wow, that bot does not collapse buildrequests and is indeed 3 months
> behind due to not being fast enough to keep up with LLVM's commit
> rate. Even if the bot was reliable, getting notified 3 months later
> isn't useful.
> From the wildly varying duration the test step takes (5 - 33 minutes;
> not the build step, it is doing incremental builds), I assume that the
> worker is running other things in parallel, maybe another worker, such
> that the buildjob sometimes is starving and causing the timeout. IMHO
> buildbots should not run other heavy jobs in parallel.

I agree with David re: timeouts should only go to the owner of the bot.

Separately, is the arc-builder builder actually useful? Should we remove it?

-Chris

Philip Reames via llvm-dev

Oct 12, 2021, 2:20:30 PM

to Chris Lattner, Michael Kruse, llvm...@lists.llvm.org

At a minimum, we should suppress notification for any builder 3 months
behind ToT. Removal might be a step too far - maybe the bot owner is
monitoring for their own purposes - but it definitely should not be
notifying.
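A staleness check of this kind is small enough to sketch in Python. The function name and the max_lag parameter are hypothetical; the commit timestamp is assumed to come from the revision the builder is currently building.

```python
from datetime import datetime, timedelta


def should_notify(commit_time, now, max_lag):
    """Mail the blame list only if the built commit is recent enough.

    commit_time: when the commit under test landed.
    now: current time (passed in so the check is easy to test).
    max_lag: how far behind ToT a builder may be and still send mail.
    """
    return now - commit_time <= max_lag
```

The builder keeps running and reporting to the waterfall either way; only the email is suppressed once it lags too far.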

Philip Reames via llvm-dev

Oct 28, 2021, 4:56:45 PM

to Florian Hahn, llvm...@lists.llvm.org

https://reviews.llvm.org/D112755 adds the first pieces of some documented policy around build bot expectations. It does not address the point you raise as the intent was to be a minimal documentation of existing practice, and thus hopefully be non-controversial, but assuming this moves forward, I plan to revisit this topic in its own review.

