[llvm-dev] Responsibilities of a buildbot owner


Stella Stamenova via llvm-dev

Jan 8, 2022, 3:06:49 PM
to llvm-dev

Hey all,

 

I have a couple of questions about what the responsibilities of a buildbot owner are. I’ve been maintaining a couple of buildbots for lldb and mlir for some time now, and I thought I had a pretty good idea of what is required based on the documentation here: “How To Add Your Build Configuration To LLVM Buildbot Infrastructure” (LLVM 13 documentation).

 

My understanding was that there are some things that are *expected* of the owner. Namely:

  1. Make sure that the buildbot is connected and has the right infrastructure (e.g. the right version of Python, or tools, etc.). Update as needed.
  2. Make sure that the build configuration is one that is supported (e.g. supported flavor or cmake variables). Update as needed.

 

There are also a couple of things that are *optional*, but nice to have:

  1. If the buildbot stays red for a while (where “a while” is completely subjective), figure out the patch or patches that are causing an issue and either revert them or notify the authors, so they can take action.
  2. If someone is having trouble investigating a failure that only happens on the buildbot (or the buildbot is a rare configuration), help them out (e.g. collect logs if possible).

 

Up to now, I’ve not had any issues with this, and the community has been very good at fixing issues with builds and tests when I point them out, or, more often than not, without me having to do anything but the occasional test re-run and software update (like this one, for example: D114639, “Raise the minimum Visual Studio version to VS2019”). lldb has some tests that are flaky because of the nature of the product, so there is some noise, but mostly things work well and everyone seems happy.

 

I’ve recently run into a situation that makes me wonder whether there are other expectations of a buildbot owner that are not explicitly listed in the llvm documentation. Someone reached out to me some time ago to express their unhappiness with the flakiness of some of the lldb tests and demanded that I either fix them or disable them. I let them know that some tests are known to be flaky, that my expectation is that it is not my responsibility to fix all such issues, and that the community would be very happy to have their contribution in the form of a fix or a change to disable the tests. I didn’t get a response from this person, but I did disable a couple of particularly flaky tests since it seemed like the nice thing to do.

 

The real excitement happened yesterday when I received an email that *the build bot had been turned off*. This same person reached out to the powers that be (without letting me know) and asked them explicitly to silence it *without my active involvement* because of the flakiness.

 

I have a couple of issues with this approach but perhaps I’ve misunderstood what my responsibilities are as the buildbot owner. I know it is frustrating to see a bot fail because of flaky tests and it is nice to have someone to ask to resolve them all – is that really the expectation of a buildbot owner? Where is the line between maintenance of the bot and fixing build and test issues for the community?

 

I’d like to understand what the general expectations are and if there are things missing from the documentation, I propose that we add them, so that it is clear for everyone what is required.

 

Thanks,

-Stella

 

Philip Reames via llvm-dev

Jan 8, 2022, 4:01:42 PM
to Stella Stamenova, llvm-dev

Stella,

Thank you for raising the question.  This is a great discussion for us to have publicly.

So folks know, I am the individual Stella mentioned below.  I'll start with a bit of history so that everyone's on the same page, then dive into the policy question.

My general take is that buildbots are only useful if failure notifications are generally actionable.  A couple months back, I was on the edge of setting up mail filter rules to auto-delete a bunch of bots because they were regularly broken, and decided I should try to be constructive first.  In the first wave of that, I emailed a couple of bot owners about things which seemed like false positives. 

At the time, I thought it was the bot owner's responsibility to not be testing a flaky configuration.  I got a bit of pushback on that from a couple of sources - Stella was one - and put that question on hold. 

In the meantime, I've been working with Galina to document existing practice where we could, and to try to identify best practices on setting up bots.  These changes have been posted publicly, and reviewed through the normal process.  We've been deliberately trying to stick to non-controversial stuff as we improve the docs.  I've been actively reaching out to bot owners to gather feedback in this process, but Stella had not yet been one of them.

Separately, this week I noticed a bot which was repeatedly toggling between red and green.  I forget the exact ratio, but in the recent build history, there were multiple transitions, seemingly unrelated to the changes being committed.  I emailed Galina asking her to address it, and she removed the buildbot until it could be moved to the staging buildmaster, addressed, and then restored.  I left Stella off the initial email.  Sorry about that, no ill intent, just written in a hurry. 

Now, transitioning into a bit of policy discussion...

From my conversations with existing bot owners, there is a general agreement that bots should only be notifying the community if they are stable enough.  There's honest disagreement on what the bar for stable enough is, and disagreement about exactly whose responsibility addressing new instability is.  (To be clear, I'd separate instability from a clear deterministic breakage caused by a commit - we have a lot more agreement on that.)

My personal take is that for a bot to be publicly notifying, "someone" needs to take the responsibility to backstop the normal revert to green process.  This "someone" can be developers who work in a particular area, the bot owner, or some combination thereof.  I view the responsibility of the bot config owner as being the person responsible for making sure that backstopping is happening.  Not necessarily by doing it themselves, but by having the contacts with developers who can, and following up when the normal flow is not working.

In this particular example, we appear to have a bunch of flaky lldb tests.  I personally know absolutely nothing about lldb.  I have no idea whether the tests are badly designed, the system they're being run on isn't yet supported by lldb, or if there's some recent code bug introduced which causes the failure.  "Someone" needs to take the responsibility of figuring that out, and in the meantime spamming developers with inactionable failure notices seems undesirable. 

For context, the bot was disabled until it could be moved to the staging buildmaster.  Moving to staging is required (currently) to disable developer notification.  In the email from Galina, it seems clear that the bot would be fine to move back to production once the issue was triaged.  This seems entirely reasonable to me. 

Philip

p.s. One thing I'll note as a definite problem with the current system is that a lot of this happens in private email, and it's hard to share so that everyone has a good picture of what's going on.  It makes miscommunications all too easy.  Last time I spoke with Galina, we were tentatively planning to start using github issues for bot operation matters to address that, but as that was in the middle of the transition from bugzilla, we deferred and haven't gotten back to it yet.

p.p.s. The bot in question is https://lab.llvm.org/buildbot/#/builders/83 if folks want to examine the history themselves. 


Mehdi AMINI via llvm-dev

Jan 8, 2022, 8:15:29 PM
to Philip Reames, llvm-dev
Hi,

First: thanks a lot, Stella, for being a bot owner and providing valuable resources to the community. The sequence of events is really unfortunate here, and thank you for bringing it to everyone's attention; let's try to improve our processes.

On Sat, Jan 8, 2022 at 1:01 PM Philip Reames via llvm-dev <llvm...@lists.llvm.org> wrote:


From my conversations with existing bot owners, there is a general agreement that bots should only be notifying the community if they are stable enough.  There's honest disagreement on what the bar for stable enough is, and disagreement about exactly whose responsibility addressing new instability is.  (To be clear, I'd separate instability from a clear deterministic breakage caused by a commit - we have a lot more agreement on that.)

My personal take is that for a bot to be publicly notifying, "someone" needs to take the responsibility to backstop the normal revert to green process.  This "someone" can be developers who work in a particular area, the bot owner, or some combination thereof.  I view the responsibility of the bot config owner as being the person responsible for making sure that backstopping is happening.  Not necessarily by doing it themselves, but by having the contacts with developers who can, and following up when the normal flow is not working.

In this particular example, we appear to have a bunch of flaky lldb tests.  I personally know absolutely nothing about lldb.  I have no idea whether the tests are badly designed, the system they're being run on isn't yet supported by lldb, or if there's some recent code bug introduced which causes the failure.  "Someone" needs to take the responsibility of figuring that out, and in the meantime spamming developers with inactionable failure notices seems undesirable. 


I generally agree with the overall sentiment. I would add that something worth differentiating is that the source of flakiness can come from the bot itself (flaky hardware / fragile setup) or from the test/codebase itself (what looks like a flaky bot may just be a deterministic ASAN failure).
Of course, from Philip's point of view it does not matter: the effect on the developer is similar, we get undesirable and unactionable notifications. From the maintenance flow, however, it matters in that the "someone" who has to take responsibility is often not the same group of folks.
Also, when encountering flaky tests, the best action may not be to disable the bot but instead to disable the test itself (and file a bug against the test owner...).

One more dimension that surfaces here is that practices and expectations may differ across subprojects: for example, the lldb folks may be used to having some flaky tests, but this bot triggers on changes to LLVM itself, where we may not expect any flakiness.
 

For context, the bot was disabled until it could be moved to the staging buildmaster.  Moving to staging is required (currently) to disable developer notification.  In the email from Galina, it seems clear that the bot would be fine to move back to production once the issue was triaged.  This seems entirely reasonable to me. 


Something quite annoying with staging is that it does not have (as far as I know) a way to continue to notify the buildbot owner. I don't really care about staging vs prod as much as having a mode to just "not notify the blame list" / "only notify the owner".

-- 
Mehdi

David Blaikie via llvm-dev

Jan 9, 2022, 9:07:19 PM
to Mehdi AMINI, llvm-dev
+1 to most of what Mehdi's said here - I'd love to see improvements in stability, though some rigid delegation of responsibility might help, rather than relying on developers to judge whether it's a flaky test or a flaky bot (that isn't always obvious - maybe it's only flaky on a particular configuration that that buildbot happens to test and the developer doesn't have access to - then which is it?). For example: if it's at all unclear, then the assumption is that it's always the test, or always the buildbot owner, with an expectation that the author or owner then takes responsibility for working with the other party to address the issue.

That all said, disabling individual tests risks no one caring enough to re-enable them, especially when the flakiness is found long after the change that introduced the test or the flakiness landed (usually the case - flakiness takes a while to become apparent). I don't really know how to address that issue. The "convenience" of disabling a buildbot is that the buildbot has other value (beyond the flaky test that was providing negative value), so buildbot owners have more motivation to get the bot back online - though I don't want to burden buildbot owners unduly either (because they'd eventually give up on them) :/ 

- Dave

Philip Reames via llvm-dev

Jan 10, 2022, 6:34:07 PM
to Stella Stamenova, llvm-dev, clay...@gmail.com, jin...@apple.com, jmol...@apple.com, ztu...@google.com

+CC lldb code owners 

This bot appears to have been restored to the primary buildmaster, but is failing something like 1 in 5 builds due to lldb tests which are flaky.

https://lab.llvm.org/buildbot/#/builders/83

Specifically, this test is the one failing:

commands/watchpoints/hello_watchlocation/TestWatchLocation.py

Can someone with LLDB context please either a) address the cause of the flakiness or b) disable the test?

Philip

p.s. Please restrict this sub-thread to the topic of stabilizing this bot.  Policy questions can be addressed in the other sub-threads to keep this vaguely understandable. 

Jim Ingham via llvm-dev

Jan 10, 2022, 8:26:34 PM
to Philip Reames, llvm-dev, ztu...@google.com
This situation is somewhat complicated by the fact that Zachary - the only listed code owner for Windows support - hasn’t worked on lldb for quite a while now.  Various people have been helping with the Windows port, but I’m not sure that there’s someone taking overall responsibility for the Windows port.

Greg may have access to a Windows system, but neither Jason nor I work on Windows at all.  In fact, I don’t think anybody listed in the Code Owner’s file for lldb does much work on Windows.  For the health of that port, we probably do need someone to organize the effort and help sort out this sort of thing.

Anyway, looking at the current set of bot failures for this Windows bot, I saw three basic classes of failures (besides the build breaks).

1) Watchpoint Support:

TestWatchLocation.py wasn’t the only or even the most common Watchpoint failure in these test runs.

For instance, in:

https://lab.llvm.org/buildbot/#/builders/83/builds/13600
https://lab.llvm.org/buildbot/#/builders/83/builds/13543

the failing test is TestWatchpointMultipleThreads.py.

On:

https://lab.llvm.org/buildbot/#/builders/83/builds/13579
https://lab.llvm.org/buildbot/#/builders/83/builds/13576
https://lab.llvm.org/buildbot/#/builders/83/builds/13565
https://lab.llvm.org/buildbot/#/builders/83/builds/13538

it’s TestSetWatchlocation.py.

On:

https://lab.llvm.org/buildbot/#/builders/83/builds/13550
https://lab.llvm.org/buildbot/#/builders/83/builds/13508

it’s TestWatchLocationWithWatchSet.py.

On:

https://lab.llvm.org/buildbot/#/builders/83/builds/13528

it’s TestTargetWatchAddress.py.

These are all in one way or another failing because we set a watchpoint, and expected to hit it, and did not.  In the failing tests, we do verify that we got a valid watchpoint back.  We just “continue” expecting to hit it and don't.  The tests don’t seem to be doing anything suspicious that would cause inconsistent behavior, and they aren’t failing on other systems.  It sounds more like the way lldb-server for Windows implements watchpoint setting is flakey in some way.
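
For reference, the failing pattern is roughly the following. This is an illustrative sketch rather than an in-tree test (the source file, variable name, and breakpoint comment are made up), but the API calls follow the lldb Python test suite:

import lldb
from lldbsuite.test.lldbtest import TestBase
from lldbsuite.test import lldbutil

class WatchpointPatternTestCase(TestBase):
    def test_hit_write_watchpoint(self):
        self.build()
        # Run to a breakpoint so we have a live process and frame.
        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
            self, "// break here", lldb.SBFileSpec("main.c"))

        frame = thread.GetFrameAtIndex(0)
        value = frame.FindVariable("global")
        error = lldb.SBError()
        wp = value.Watch(True, False, True, error)  # watch for writes
        # The watchpoint is reported as valid...
        self.assertTrue(error.Success() and wp.IsValid())

        process.Continue()
        # ...but on the flaky configurations the stop reason is sometimes not
        # eStopReasonWatchpoint, i.e. the watchpoint was silently missed.
        self.assertEqual(thread.GetStopReason(), lldb.eStopReasonWatchpoint)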

So these really are “tests correctly showing flakey behavior in the underlying code”.  We could just skip all these watchpoint tests, but we already have 268-some odd tests that are marked as skipIfWindows, most with annotations that some behavior or other is flakey on Windows.  It is not great for the platform support to just keep adding to that count, but if nobody is available to dig into the Windows watchpoint code, we probably need to declare Watchpoint support “in a beta state” and turn off all the tests for it.  But that seems like a decision that should be made by someone with more direct responsibility for the Windows port.

Does our bot strategy cover how to deal with incomplete platform support on some particular platform?  Is the only choice really just turning off all the tests that are uncovering flaws in the underlying implementation?

2) Random mysterious failure:

I also saw one failure here:

https://lab.llvm.org/buildbot/#/builders/83/builds/13513

functionalities/load_after_attach/TestLoadAfterAttach.py

In that one, lldb sets a breakpoint, confirms that the breakpoint got a valid location, then continues and runs to completion w/o hitting the breakpoint.  Again, that test is quite straightforward, and it looks like the underlying implementation, not the test, is what is at fault.

3) lldb-server for Windows test failures:

In these runs:

https://lab.llvm.org/buildbot/#/builders/83/builds/13594
https://lab.llvm.org/buildbot/#/builders/83/builds/13580
https://lab.llvm.org/buildbot/#/builders/83/builds/13550
https://lab.llvm.org/buildbot/#/builders/83/builds/13535
https://lab.llvm.org/buildbot/#/builders/83/builds/13526
https://lab.llvm.org/buildbot/#/builders/83/builds/13525
https://lab.llvm.org/buildbot/#/builders/83/builds/13511
https://lab.llvm.org/buildbot/#/builders/83/builds/13498
The failure was in the Windows’ lldb-server implementation here:

tools/lldb-server/tests/./LLDBServerTests.exe/StandardStartupTest.TestStopReplyContainsThreadPcs

And there were a couple more lldb-server test fails:

https://lab.llvm.org/buildbot/#/builders/83/builds/13527
https://lab.llvm.org/buildbot/#/builders/83/builds/13524
Where the failure is:

tools/lldb-server/TestGdbRemoteExpeditedRegisters.py

MacOS doesn’t use lldb-server, so I am not particularly familiar with it, and didn’t look into these failures further.

Jim

Stella Stamenova via llvm-dev

Jan 10, 2022, 8:47:54 PM
to Jim Ingham, Philip Reames, llvm-dev, ztu...@google.com

1) Watchpoint Support:

 

This morning I disabled a couple of the watchpoint tests that have been failing occasionally. I think there may be one or two more that fail as well, and we could disable those also. I am not sure whether the issue here is with watchpoint support or lldb-server. I actually think the issue is with lldb-server, but I haven’t worked on lldb in years (besides the buildbot), so I haven’t investigated in more detail. I think some of these tests became flaky recently (possibly since the upgrade to VS2019?).

 

2) Random mysterious failure:

 

I’ve noticed a class of failures in llvm, lld, clang, and lldb (mostly lldb and lld) that have to do with running multiple threads on Windows. I think the underlying issue is that code in the product as well as in the tests doesn’t account for the way Windows behaves with regard to new threads, and the order of events ends up being non-deterministic. The lld failure in particular was incredibly frustrating because it would only occur occasionally, never on a buildbot as far as I could tell, and the comments in the code seem to indicate that it should work (but it doesn’t): https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/lib/Support/Parallel.cpp. Ideally, someone familiar with Windows threading would address the issue across the board.

 

3) lldb-server for Windows test failures:

 

tools/lldb-server/tests/./LLDBServerTests.exe/StandardStartupTest.TestStopReplyContainsThreadPcs

 

This particular failure is definitely more recent (in the last couple of months) and I would hate for us to disable this test instead of having someone who works on lldb-server investigate.

 

Thanks,

-Stella

Pavel Labath via llvm-dev

Jan 11, 2022, 6:32:42 AM
to Stella Stamenova, Jim Ingham, Philip Reames, llvm-dev, ztu...@google.com
I am afraid I too have to say that I believe the real problem here is
the lack of active developers with interest in/commitment to the windows
port of lldb. While I appreciate having Stella's windows buildbot
around, and it prevents windows from bitrotting completely, it would
take a much more active involvement to resolve the multitude of systemic
issues affecting windows support. Like, if we tried to apply the current
llvm support policy guidelines to the windows (host-side, at least)
support code, I don't think it would even meet the criteria for
inclusion in the peripheral tier (active sub-community).

Now for something slightly more constructive:

While I am not familiar with the windows-specific parts of the
watchpoint code, I think I can say without exaggerating that I have a
*lot* of experience in fixing flaky tests. That experience tells me that
flaky watchpoint tests are often/usually caused by factors outside lldb
(due to watchpoints being a global, scarce hardware resource).
Virtualization is particularly tricky here -- every virtualization
technology that I've tried has had (at some point in time at least) a
watchpoint-related bug. The problem described here sounds a lot like the
issue I observed on Google Compute Engine, which could also miss some
watchpoints "randomly". So, if this bot is running in any kind of a
virtualized environment, the first thing I'd do is check whether the
issue happens on physical hardware.

Related to that, I also want to mention that we have the ability
to skip categories of tests in lldb. All the watchpoint tests are
(should be) annotated by the watchpoint category, and so you can easily
skip all of them, either by hard-disabling the category for windows in
the source code (if this is an lldb issue) or externally through the
buildbot config (if this is due to the bot environment =>
LLDB_TEST_USER_ARGS="--skip-category watchpoint").
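
To make that concrete, here is a rough sketch of how a test can carry the
watchpoint category and/or a hard Windows skip; the decorator names follow
lldbsuite.test.decorators, but the test itself is purely illustrative:

from lldbsuite.test.decorators import add_test_categories, skipIfWindows
from lldbsuite.test.lldbtest import TestBase

class ExampleWatchpointTestCase(TestBase):
    # Skipped whenever the "watchpoint" category is disabled, e.g. via
    # LLDB_TEST_USER_ARGS="--skip-category watchpoint" in the bot config.
    @add_test_categories(["watchpoint"])
    # Alternative: hard-skip just on Windows hosts in the source code.
    @skipIfWindows
    def test_example(self):
        self.build()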

hope that helps,
pl


Philip Reames via llvm-dev

Jan 11, 2022, 12:23:11 PM
to Pavel Labath, Stella Stamenova, Jim Ingham, llvm-dev, ztu...@google.com

Would it be reasonable to recommend that all of our windows bots testing
lldb add this flag?  Or maybe even check something in so that all builds
default to not running these tests on Windows? The former would make
sense if we primarily think this is virtualization related, the latter if
we think it's more likely a code problem.

I noticed last night that we have a couple of other windows bots which
seem to be hitting the same false positives.  Much lower frequencies,
but it does seem this is not specific to the particular bot.

Otherwise, it seems like our only option (per current policy) is to
disable lldb testing on windows bots entirely, and I really hate to do
that.

Pavel Labath via llvm-dev

Jan 11, 2022, 12:45:49 PM
to Philip Reames, Stella Stamenova, Jim Ingham, llvm-dev, ztu...@google.com

If that question was meant for me, then my answer is yes. I think those
tests should be disabled regardless of the cause. I actually tried to
say the same thing, but I may not have succeeded in getting it across.
Stella, can you share what kind of environment that bot is running in?

> I noticed last night that we have a couple of other windows bots which
> seem to be hitting the same false positives.  Much lower frequencies,
> but it does seem this is not specific to the particular bot.

Hmm.. do you have a link to those bots or something? Stella's bot is the
only windows (lldb) bot I am aware of and I'd be surprised if there were
more of them.

Philip Reames via llvm-dev

Jan 11, 2022, 12:51:31 PM
to Pavel Labath, Stella Stamenova, Jim Ingham, llvm-dev, ztu...@google.com
I went back and checked.  Turns out I was wrong here.  I had a couple of
build failures with similar messages, but they were from this bot.

Stella Stamenova via llvm-dev

Jan 11, 2022, 12:59:38 PM
to Pavel Labath, Philip Reames, Jim Ingham, llvm-dev, ztu...@google.com
The windows lldb bot is running on a Hyper-V virtual machine, so it would make sense that if watchpoints don't work correctly in virtual environments they would be failing there. On the rare occasion I've had to run these tests locally, I have also seen them fail though, so that's not the only source of issues.

Since I disabled the couple of tests yesterday, there's only one watchpoint test that is still failing randomly. One option would be to disable just this test and let the remaining few watchpoint tests continue to run on Windows (I prefer this option since some tests would continue to run). Alternatively, all the watchpoint tests can be skipped via the category flag, but in that case, I'd like us to undo the individual skips.

I did notice while going through the watchpoint tests to see what is still enabled on Windows, that the same watchpoint tests that are disabled/failing on Windows are disabled on multiple other platforms as well. The tests passing on Windows are also the ones that are not disabled on other platforms. A third option would be to add a separate category for the watchpoint tests that don't run correctly everywhere and use that to disable them instead. This would be a more generic way to disable the tests instead of adding multiple `skipIf` statements to each test.

Thanks,
-Stella

Pavel Labath via llvm-dev

Jan 11, 2022, 1:31:10 PM
to Stella Stamenova, Philip Reames, Jim Ingham, llvm-dev
On 11/01/2022 18:59, Stella Stamenova wrote:
> The windows lldb bot is running on a Hyper-V virtual machine, so it would make sense that if watchpoints don't work correctly in virtual environments they would be failing there. On the rare occasion I've had to run these tests locally, I have also seen them fail though, so that's not the only source of issues.
>
> Since I disabled the couple of tests yesterday, there's only one watchpoint test that is still failing randomly. One option would be to disable just this test and let the remaining few watchpoint tests continue to run on Windows (I prefer this option since some tests would continue to run). Alternatively, all the watchpoint tests can be skipped via the category flag, but in that case, I'd like us to undo the individual skips.

For better or worse, you're currently the most (only?) interested person
in keeping windows host support working, so I think you can manage the
windows skips/fails in any way you see fit. The rest of us are mostly
interested in having green builds. :)

Hyper-V is _not_ among the virtualization systems I've tried using with
lldb, so I cannot conclusively say anything about it (though I still
have my doubts).

>
> I did notice while going through the watchpoint tests to see what is still enabled on Windows, that the same watchpoint tests that are disabled/failing on Windows are disabled on multiple other platforms as well. The tests passing on Windows are also the ones that are not disabled on other platforms. A third option would be to add a separate category for the watchpoint tests that don't run correctly everywhere and use that to disable them instead. This would be a more generic way to disable the tests instead of adding multiple `skipIf` statements to each test.

On non-x86 architectures, watchpoints tend to be available only on
special (developer) hardware or similar (x86 is the outlier in having
universal support), which is why these tests tend to accumulate various
annotations. However, I don't think we need to solve this problem (how
to skip the tests "nicely") here...

pl

Greg Clayton via llvm-dev

Jan 11, 2022, 7:41:53 PM
to Pavel Labath, Stella Stamenova, Jim Ingham, llvm-dev
Does windows use lldb-server by default or does it use ProcessWindows? ProcessWindows is the native process debugger, and lldb-server is the way we want debugging to work. If we look at ProcessWindows.cpp:

static bool ShouldUseLLDBServer() {
  llvm::StringRef use_lldb_server = ::getenv("LLDB_USE_LLDB_SERVER");
  return use_lldb_server.equals_insensitive("on") ||
         use_lldb_server.equals_insensitive("yes") ||
         use_lldb_server.equals_insensitive("1") ||
         use_lldb_server.equals_insensitive("true");
}

void ProcessWindows::Initialize() {
  if (!ShouldUseLLDBServer()) {
    static llvm::once_flag g_once_flag;

    llvm::call_once(g_once_flag, []() {
      PluginManager::RegisterPlugin(GetPluginNameStatic(),
                                    GetPluginDescriptionStatic(),
                                    CreateInstance);
    });
  }
}

We can see it is enabled if LLDB_USE_LLDB_SERVER is set to "on", "yes", "1", or "true". If this is not set, then lldb is using the built-in ProcessWindows.cpp native process plug-in, which I believe was never fully fleshed out and had issues.

Can someone verify if we are testing with ProcessWindows or lldb-server on the build bot?

Stella Stamenova via llvm-dev

Jan 12, 2022, 12:06:57 PM
to Greg Clayton, Pavel Labath, Jim Ingham, llvm-dev
> Can someone verify if we are testing with ProcessWindows or lldb-server on the build bot?

Since I didn't set LLDB_USE_LLDB_SERVER on the buildbot itself and this is not in the zorg configuration, the buildbot is using ProcessWindows.

I've never tried setting LLDB_USE_LLDB_SERVER to on when running the tests, so I am not sure what to expect from the results though. If I have time, I'll try it out locally this week to see what happens.
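
For anyone who wants to try the same experiment, a minimal local sketch might look like this (the build directory name and the assumption of a ninja tree with the usual check-lldb target are mine, not how the bot is configured):

import os
import subprocess

# Force the lldb-server path on Windows, per the ShouldUseLLDBServer() check
# shown above; any of "on", "yes", "1", "true" works.
env = dict(os.environ, LLDB_USE_LLDB_SERVER="on")
subprocess.run(["ninja", "-C", "build", "check-lldb"], env=env, check=True)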


Galina Kistanova via llvm-dev

Jan 12, 2022, 10:33:40 PM
to David Blaikie, llvm-dev
Hello everyone,

In continuation of the Responsibilities of a buildbot owner thread.

First of all, thank you very much for being buildbot owners! This is much appreciated.
Thank you for bringing good points to the discussion.

It is expected that buildbot owners own bots which are reliable, informative and helpful to the community.

Effectively that means that if a problem is detected by a builder and it is hard to pinpoint the reason for the issue and a commit to blame, a buildbot owner is naturally on the escalation path. Someone has to get to the root of the problem and fix it one way or another (by reverting the commit, or by proposing a patch, or by working with the author of the commit which introduced the issue). In the majority of cases someone takes care of an issue. But sometimes it takes a buildbot owner to push. Every buildbot owner does this from time to time.

Hi Mehdi,

> Something quite annoying with staging is that it does not have (as far as I know) a way
> to continue to notify the buildbot owner.

You mentioned this recently in one of the reviews. With https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826 in place, you can add the tag "silent" to your production builder, and it will not send notifications to the blame list. You can set the exact notifications you want in the master/config/status.py for that builder. Hope this helps you.
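
For illustration, tagging a builder looks roughly like this in a generic buildbot master config; all names below (builder, worker, step) are placeholders rather than the real LLVM setup, which lives in llvm-zorg where the linked change adds the "silent" handling:

from buildbot.plugins import steps, util, worker

c = BuildmasterConfig = {}
c['workers'] = [worker.Worker("example-windows-worker", "example-password")]

lldb_factory = util.BuildFactory()
lldb_factory.addStep(steps.ShellCommand(name="check-lldb",
                                        command=["ninja", "check-lldb"]))

c['builders'] = [
    util.BuilderConfig(
        name="example-lldb-windows",
        workernames=["example-windows-worker"],
        factory=lldb_factory,
        # The "silent" tag is what suppresses blame-list notifications.
        tags=["lldb", "silent"],
    ),
]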

I do not want to have the staging even able to send emails. We debug and test many things there, including notifications, and there is always a risk of spam.

Thanks

Galina

Mehdi AMINI via llvm-dev

Jan 13, 2022, 12:19:43 AM
to Galina Kistanova, llvm-dev
On Wed, Jan 12, 2022 at 7:33 PM Galina Kistanova <gkist...@gmail.com> wrote:

Hi Mehdi,

> Something quite annoying with staging is that it does not have (as far as I know) a way
> to continue to notify the buildbot owner.

You mentioned this recently in one of the reviews. With https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826 in place, you can add the tag "silent" to your production builder, and it will not send notifications to the blame list. You can set the exact notifications you want in the master/config/status.py for that builder. Hope this helps you.

Fantastic! I'll use this for the next steps for my bots (when I get back to it, I slacked on this recently...) :)

We may also use this on flaky bots in the future?

Thanks,

-- 
Mehdi 

Galina Kistanova via llvm-dev

Jan 13, 2022, 2:24:36 AM
to Mehdi AMINI, llvm-dev
> We may also use this on flaky bots in the future?

Yes, we may.
Or we may try to do our best to fix them. :)

Moving workers to the staging temporarily to investigate and address an issue is fine. Gives a bit more elbow room for experimenting, as we can apply experimental patches there, restart the staging as needed and often, and so on. Which is not the case with the production. It does not take much effort to move a worker between the staging and the production areas - a simple edit of the buildbot.tac file and a worker restart.

Tagging a builder "silent" means there is a designated person or a team who is actively fixing the detected issues or acting as a proxy to handle the blame list. This could be a way to deal with flaky bots, indeed, assuming there is somebody taking care of those builders, not just a way to skip the annoyance and keep the status quo.

By the way, thanks everyone for the constructive and polite discussion! It seems we are going to have a more stable and informative Windows LLDB builder.

Galina

Stella Stamenova via llvm-dev

Jan 13, 2022, 4:42:04 PM
to Galina Kistanova, Mehdi AMINI, David Blaikie, Philip Reames, llvm-dev

There are a couple of things on this thread that sound nice in general, but have not been clarified either in the discussion or in the documentation. Since the devil is in the details, I’d like to see us agree on the details and then have them added to the documentation.

 

At the end of the day, there should be no surprises in the process, and everything that can be quantified should be.

 

We want to encourage people to be responsible code and buildbot owners, not discourage them from contributing at all.

 

> It is expected that buildbot owners own bots which are reliable, informative and helpful to the community.

 

In my experience, every buildbot has occasional “flakiness” – be it because of code failures that don’t happen every time or because of connectivity issues, etc. Some bots are also often broken not because of any flakiness, but because with the large number of commits, there are bound to be failures.

 

So what makes a bot not reliable enough? Some percentage of builds failing? Some percentage of false positives? Does it vary per project or is there a single expectation for all of llvm?

 

I think it makes sense to say that false positives above a certain threshold make a buildbot not reliable enough and the threshold should be documented. It also makes sense to say that failures above a certain threshold make a bot not reliable enough – if the codebase is fragile enough that most commits cause breaks, it is possible that a reliable buildbot for it cannot exist.

 

>  "someone" needs to take the responsibility to backstop the normal revert to green process.

 

As Mehdi pointed out earlier, depending on the root cause of the failure, either the buildbot owner or a code owner may be better suited to addressing it. Philip’s argument is that, at the end of the day, it is always the buildbot owner if a code owner hasn’t come forward. It makes sense to have someone who is ultimately responsible, and it also makes sense that everyone needs to be given time and notice to act on the failures.

 

There has also been some mention of different ways to “silence” a buildbot – either by turning it off entirely and waiting for a bot owner to reconnect it to staging or production, or by tagging it as “silent”. In my experience, there’s a huge difference between using the “silent” tag and turning a bot off. In the first case, the bot owners will continue to receive notifications and the builds will continue to run. Even if the bot is red already, there’s some chance that new commits that add breaks will be possible to figure out by looking at the logs either by other interested parties, or by the bot owners themselves. When a bot is turned off for any period of time, there’s nothing that can be used to determine when new failures were checked in (aside from local builds, so many local builds) and it can be incredibly painful to track down. I think bots should only be forcefully turned off very rarely and when nothing else can be done and with plenty of notice.

 

So then, what is the flow when a bot starts having issues? I would propose that it be something like this:

 

  1. Code owners have to address issues in X amount of time.
  2. If the code owners have failed to address the situation, it falls to the buildbot owners. Perhaps at the beginning or in the middle of this period, the bot owners get an email that says: “Hey, so and so, we’re close to tagging the bot “silent”, can you have a look?”
  3. If both the code owners and the buildbot owners have failed to address the situation, the bot gets tagged “silent”. The buildbot owner gets notified that this happened and the notification spells out how much longer they have before the bot gets turned off.
  4. If both the code owners and the buildbot owners have failed to address the situation for some time longer, the bot gets turned off.

 

Each of these steps should be allowed a pre-determined amount of time. A few hours? A few days? Ideally, each of the transitions (but definitely 2->3->4) comes with a notification. If it were possible for a bot to be moved to staging automatically, we could even have an extra step where it gets moved to staging before it gets turned off. I don’t think that’s currently possible though.

 

> The main problem with flaky tests is random false blames. People get annoyed and stop paying attention to failures on a particular builder, and other builders as well, arguing that build bot in general is not reliable.

 

Galina made a good point to me that people get annoyed by failures and stop paying attention to all buildbots. I can see how flaky tests/bots contribute to the general ignoring of the buildbots, but I would argue that the root cause is the sheer volume of build breaks that are not the fault of a committer. The few times I’ve made commits to llvm, for example, I’ve always gotten at least one email about a break that was unrelated to my change (because my changes are perfect, thank you very much). This larger problem of build breaks is much harder to address than flaky bots or tests, but I think would improve the health of llvm & friends significantly more (and in the meantime, we could tolerate some “flakiness”).

 

Thanks,

-Stella

via llvm-dev

unread,
Jan 13, 2022, 5:09:03 PM1/13/22
to sti...@microsoft.com, gkist...@gmail.com, joke...@gmail.com, dbla...@gmail.com, list...@philipreames.com, llvm...@lists.llvm.org
Stella wrote:
> The few times I’ve made commits to llvm, for example, I’ve always
> gotten at least one email about a break that was unrelated to my
> change (because my changes are perfect, thank you very much). This
> larger problem of build breaks is much harder to address than flaky
> bots or tests, but I think would improve the health of llvm &
> friends significantly more (and in the meantime, we could tolerate
> some “flakiness”).

This is consistent enough that if I don’t get a bot email, I wonder
if my “git push” failed. 😊 The project very much needs a functioning
pre-commit sanity check of some kind. What we have now is Phabricator
running something that basically always fails, making it largely
useless. But that is straying from the topic of bot-owner
responsibilities.

On that topic, however, I would like to request a way to actively get
help with a bot failure. A little while ago I tried to commit a patch
to lit, which after a couple of tries, passed everywhere except *one*
test on *one* bot. I asked for help on llvm-dev and got no reply.
The patch is reverted and remains on a back burner because I couldn’t
get help. At some point I will try again, but I suspect there won’t
be a way to solve the problem without the active help of the bot owner,
whoever it is. How do you find a bot owner, anyway?

Thanks,
--paulr

Philip Reames via llvm-dev

unread,
Jan 13, 2022, 5:27:07 PM1/13/22
to Stella Stamenova, Galina Kistanova, Mehdi AMINI, David Blaikie, llvm-dev


On 1/13/22 1:41 PM, Stella Stamenova wrote:

There are a couple of things on this thread that sound nice in general, but have not been clarified either in the discussion or in the documentation. Since the devil is in the details, I’d like to see us agree on the details and then have them added to the documentation.

 

At the end of the day, there should be no surprises in the process and everything that can be should be quantified.

 

We want to encourage people to be responsible code and buildbot owners, not discourage them from contributing at all.

 

> It is expected that buildbot owners own bots which are reliable, informative and helpful to the community.

 

In my experience, every buildbot has occasional “flakiness” – be it because of code failures that don’t happen every time or because of connectivity issues, etc. Some bots are also often broken not because of any flakiness, but because with the large number of commits, there are bound to be failures.

 

So what makes a bot not reliable enough? Some percentage of builds failing? Some percentage of false positives? Does it vary per project or is there a single expectation for all of llvm?

 

I think it makes sense to say that false positives above a certain threshold make a buildbot not reliable enough and the threshold should be documented. It also makes sense to say that failures above a certain threshold make a bot not reliable enough – if the codebase is fragile enough that most commits cause breaks, it is possible that a reliable buildbot for it cannot exist.

This is a hard thing to specify, but I'm going to take a shot at some draft wording.

We generally expect that publicly notifying builders are stable - meaning they do not report failures unless those failures are related to the commit being built.  Note that our requirement here is specific to notification, not the existence of the builder on the waterfall. 

In general, we expect a buildbot to be able to report an average of no more than one false positive failure per day.   We will sometimes allow bots with higher failure rates due to special circumstances - e.g. unstable hardware combined with limited hardware availability for a platform - but these exceptions are just that: exceptions.  They need to be widely discussed before such a bot is allowed to notify, and the build config must make it apparent to casual users that the bot may be unstable. 
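
To make that concrete, here is one way the rate could be measured from the outside. A rough sketch against the public Buildbot REST API (the builder id is a placeholder, the exact query parameters may need tweaking, and telling a genuine false positive from a real break still takes human judgment):

import requests

# Estimate how often a builder has failed recently, using the public
# Buildbot v2 REST API at lab.llvm.org.  The builder id is a placeholder;
# look it up on the waterfall.  Note this counts *all* failures, not just
# false positives.
BASE = "https://lab.llvm.org/buildbot/api/v2"
BUILDER_ID = 123  # placeholder

resp = requests.get(f"{BASE}/builders/{BUILDER_ID}/builds",
                    params={"limit": 100, "order": "-number"})
resp.raise_for_status()
builds = [b for b in resp.json()["builds"] if b["complete"]]
failures = sum(1 for b in builds if b["results"] == 2)  # 2 == FAILURE
print(f"{failures} of the last {len(builds)} completed builds failed")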


 

>  "someone" needs to take the responsibility to backstop the normal revert to green process.

 

As Mehdi pointed out earlier, the root cause of the failure might mean that either the buildbot owner or a code owner is better suited to addressing it. Philip’s argument is that, at the end of the day, it is always the buildbot owner if a code owner hasn’t come forward. It makes sense to have someone who is ultimately responsible, and it also makes sense that everyone needs to be given time and notice to act on the failures.

 

There has also been some mention of different ways to “silence” a buildbot – either by turning it off entirely and waiting for a bot owner to reconnect it to staging or production, or by tagging it as “silent”. In my experience, there’s a huge difference between using the “silent” tag and turning a bot off. In the first case, the bot owners continue to receive notifications and the builds continue to run. Even if the bot is already red, there’s some chance that new commits which introduce breaks can still be tracked down from the logs, either by other interested parties or by the bot owners themselves. When a bot is turned off for any period of time, there’s nothing that can be used to determine when new failures were checked in (aside from local builds, so many local builds), and tracking them down can be incredibly painful. I think bots should only be forcefully turned off very rarely, when nothing else can be done, and with plenty of notice.

I completely agree.  Up until this thread, I was not aware of an option to silence a buildbot on the main builder.  In fact, it looks like that mechanism only exists as of the 8th of this month.  Now that we have it, we should definitely use it rather than disabling a bot entirely.

This needs to be integrated into the docs.  I'll take that action item.


 

So then, what is the flow when a bot starts having issues? I would propose that it be something like this:

 

  1. Code owners have to address issues in X amount of time.
  2. If the code owners have failed to address the situation, it falls to the buildbot owners. Perhaps at the beginning or in the middle of this period, the bot owners get an email that says: “Hey, so and so, we’re close to tagging the bot “silent”, can you have a look?”
  3. If both the code owners and the buildbot owners have failed to address the situation, the bot gets tagged “silent”. The buildbot owner gets notified that this happened and the notification spells out how much longer they have before the bot gets turned off.
  4. If both the code owners and the buildbot owners have failed to address the situation for some time longer, the bot gets turned off.

 

Each of these steps should be allowed a pre-determined amount of time. A few hours? A few days? Ideally, each of the transitions (but definitely 2->3->4) comes with a notification. If it were possible for a bot to be moved to staging automatically, we could even have an extra step where it gets moved to staging before it gets turned off. I don’t think that’s currently possible though.

Now that we have a silence mechanism, I think we can split our policy into two pieces.

Part 1 - When do we silence a bot

Part 2 - When do we disable a bot

I think we can afford to have a long and involved process for part 2.  Once a bot is silenced, it doesn't have much cost to keep around, and we basically only need to handle the abandoned bot problem.

The majority of our focus can be on when we silence a bot.  Here I would argue pretty strongly for a different default: we should silence and un-silence bots cheaply.

Here's some suggested wording:

If you believe a bot to be unstable, please file a github issue describing the situation.  Please either add the bot owner as the assignee or email the bot owner directly.  If the instability is frequent - say more than 1 build in 10 - please send a change for review which silences the builder.

As a bot owner, you are expected to address reported instability.  If you can't do so promptly, please silence the bot.  Once you're ready to unsilence the bot, post a change for review which does so and describes the action taken to stabilize the bot. 

(Obviously, this needs to be expanded a bit.)

 

> The main problem with flaky tests is random false blames. People get annoyed and stop paying attention to failures on a particular builder, and other builders as well, arguing that build bot in general is not reliable.

 

Galina made a good point to me that people get annoyed by failures and stop paying attention to all buildbots.

More immediately, people set up mail rules to ignore bots.  I know of multiple people who have these, and was on the edge of doing so myself.  This means that a bot which is spammy effectively only harasses new contributors, which is, ah, less than ideal.

I can see how flaky tests/bots contribute to the general ignoring of the buildbots, but I would argue that the root cause is the sheer volume of build breaks that are not the fault of a committer. The few times I’ve made commits to llvm, for example, I’ve always gotten at least one email about a break that was unrelated to my change (because my changes are perfect, thank you very much). This larger problem of build breaks is much harder to address than flaky bots or tests, but I think would improve the health of llvm & friends significantly more (and in the meantime, we could tolerate some “flakiness”).

I will note that I and Galina have been actively working on attempts to stabilize our existing infrastructure.  There's active work on trying to add mechanisms (e.g. silencing, staged builders, and maximum batch sizes) to cut down on the problem.  Please don't let "it's hard" become an argument that we should ignore the problem.

Also, while, yes, many of our failures are bad changes, I think this makes up a minority of all failure notices.  I haven't checked anything other than my own trash folder, but that's certainly what I see.  The biggest contributors are blatantly unstable bots and unreasonably slow batched builders.

David Blaikie via llvm-dev

unread,
Jan 13, 2022, 5:36:55 PM1/13/22
to Paul Robinson, llvm-dev
On Thu, Jan 13, 2022 at 2:08 PM <paul.r...@sony.com> wrote:
Stella wrote:
> The few times I’ve made commits to llvm, for example, I’ve always
> gotten at least one email about a break that was unrelated to my
> change (because my changes are perfect, thank you very much). This
> larger problem of build breaks is much harder to address than flaky
> bots or tests, but I think would improve the health of llvm &
> friends significantly more (and in the meantime, we could tolerate
> some “flakiness”).

This is consistent enough that if I don’t get a bot email, I wonder
if my “git push” failed. 😊 The project very much needs a functioning
pre-commit sanity check of some kind.  What we have now is Phabricator
running something that basically always fails, making it largely
useless.  But that is straying from the topic of bot-owner
responsibilities.

On that topic, however, I would like to request a way to actively get
help with a bot failure. A little while ago I tried to commit a patch
to lit, which after a couple of tries, passed everywhere except *one*
test on *one* bot.  I asked for help on llvm-dev and got no reply.
The patch is reverted and remains on a back burner because I couldn’t
get help.  At some point I will try again, but I suspect there won’t
be a way to solve the problem without the active help of the bot owner,
whoever it is.  How do you find a bot owner, anyway?

You should be able to get the email address for the buildbot owner and email them directly (probably good to also include llvm-dev, though). With the move to Discourse, maybe you can tag the buildbot owner in a post? It would be good to make sure all buildbot owners are on Discourse and readily identifiable from the owner info in buildbot. If they don't reply/help after some time, I think it's reasonable to consider the configuration unsupported, silence the buildbot, etc.

via llvm-dev

unread,
Jan 13, 2022, 5:59:15 PM1/13/22
to dbla...@gmail.com, llvm...@lists.llvm.org
>>  How do you find a bot owner, anyway?

> You should be able to get the email address for the buildbot
> owner & email them directly

I agree fully; but the unanswered question is, how?
I'm sure it's not hard; I'm equally sure it's not obvious,
after clicking around randomly on the buildbot page for a
couple of minutes.

Stella Stamenova via llvm-dev

unread,
Jan 13, 2022, 6:01:21 PM1/13/22
to paul.r...@sony.com, dbla...@gmail.com, llvm...@lists.llvm.org
You need to go to the worker for the buildbot like so:

https://lab.llvm.org/buildbot/#/workers/55

It will give you the name and email for the buildbot owner.
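
For what it's worth, the same information is exposed through the buildbot REST API, so the lookup can also be scripted. A rough sketch in Python, assuming the standard Buildbot v2 endpoints as served at lab.llvm.org (field names may vary slightly by Buildbot version):

import requests

# List every worker and the admin/contact recorded for it, using the
# public Buildbot v2 REST API at lab.llvm.org.
resp = requests.get("https://lab.llvm.org/buildbot/api/v2/workers")
resp.raise_for_status()
for worker in resp.json()["workers"]:
    info = worker.get("workerinfo") or {}
    admin = (info.get("admin") or "<no admin recorded>").strip()
    print(f'{worker["name"]}: {admin}')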

Thanks,
-Stella




Mehdi AMINI via llvm-dev

unread,
Jan 13, 2022, 6:04:47 PM1/13/22
to Paul Robinson, llvm-dev
On Thu, Jan 13, 2022 at 2:59 PM <paul.r...@sony.com> wrote:
>>  How do you find a bot owner, anyway?

> You should be able to get the email address for the buildbot
> owner & email them directly

I agree fully; but the unanswered question is, how?
I'm sure it's not hard; I'm equally sure it's not obvious,
after clicking around randomly on the buildbot page for a
couple of minutes.

I think it is associated with the worker and not the build (another reason why the owner of the bot isn't a good contact in many cases - a "linux worker" can build/test many builds spanning projects/configs):

[Screenshot: Screen Shot 2022-01-13 at 3.02.06 PM.png]

-- 
Mehdi




 

Galina Kistanova via llvm-dev

unread,
Jan 13, 2022, 6:05:27 PM1/13/22
to Robinson, Paul, LLVM Dev
Hi Stella,

> This larger problem of build breaks is much harder to address than flaky bots or
> tests, but I think would improve the health of llvm & friends significantly more (and
> in the meantime, we could tolerate some “flakiness”).

There is some work in progress which will hopefully improve this.

Hi Paul,

> How do you find a bot owner, anyway?


Basically, everywhere you can see worker detailed information in the buildbot WebUI, it shows who administers that host.

You may want to contact the owner of that *one* bot and see if it is possible to validate your patch before you commit.

If an owner does not respond it might mean those builders are unsupported. Usually owners are very good at helping.

Thanks

Galina

Stella Stamenova via llvm-dev

unread,
Jan 13, 2022, 11:38:33 PM1/13/22
to Pavel Labath, Galina Kistanova, Jonas Devlieghere, Jim Ingham, llvm-dev
I had a chat with Jonas earlier today and one of the things that came out was that we actually have three separate suites of tests in lldb:
- shell
- unit
- api

The category that causes the most pain in general, including on the Windows lldb bot, is the API tests. The shell tests are very stable and so are all (but one) of the unit tests.

Since, as Pavel pointed out, there's not a very active community for lldb on Windows, one thing we could do is run only the shell and unit test suites on the Windows buildbot and drop the API tests. This would allow us to prevent complete bit rot by providing relatively good coverage while at the same time removing the most unstable tests from the buildbot. Then we could dispense with having to disable individual API tests when they show instability on Windows.
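
In build-system terms, the idea is roughly the following (a sketch only, assuming the usual check-lldb-shell / check-lldb-unit lit targets and a build directory named "build" - not necessarily how the actual patch implements it):

import subprocess

# Run only the stable lldb suites instead of the full check-lldb target.
# "build" is a placeholder for the bot's build directory.
for target in ("check-lldb-shell", "check-lldb-unit"):
    subprocess.run(["ninja", "-C", "build", target], check=True)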

I drafted a patch that would do that (with the assumption that everyone would be on board):
https://reviews.llvm.org/D117267

Let me know if you disagree with this course of action or have any other concerns.

Thanks,
-Stella


via llvm-dev

unread,
Jan 14, 2022, 9:24:00 AM1/14/22
to gkist...@gmail.com, paul.r...@sony.com, llvm...@lists.llvm.org

Thanks for the info—the distinction between builders and workers was not clear in my mind.

--paulr

Omair Javaid via llvm-dev

unread,
Jan 14, 2022, 6:12:55 PM1/14/22
to Stella Stamenova, Jim Ingham, llvm-dev
Hi Stella,

This is in reference to my email on lldb-dev about setting up an LLDB Windows on Arm64 buildbot. We are currently working on setting up an Arm64 bot that will run only unit tests and shell tests. However, in the future we are going to take up maintenance of LLDB on Windows Arm64 and hope to run a full-featured testsuite on our buildbots. Meanwhile, as Python API support is a very important LLDB feature, not running the API tests will result in an incremental pile of Windows-specific failures, which will increase the engineering effort required for stabilising LLDB on Windows. I have suggested reducing the number of parallel API tests on Windows to see if it reduces the amount of noise generated by flaky tests.


In case it doesn't work, I'll take up ownership of the Windows x64 buildbot as well and try to keep the noise down, similar to what I do for the Linux Arm/Arm64 LLDB bots.

Thanks!

Omair Javaid
www.linaro.org

Stella Stamenova via llvm-dev

unread,
Jan 14, 2022, 6:27:40 PM1/14/22
to Omair Javaid, clay...@gmail.com, Jim Ingham, llvm-dev

Thanks Omair!

 

I’ll wait for your change to go in and we can evaluate what else might need to happen afterwards.

 

I’ve been running some local tests with `LLDB_USE_LLDB_SERVER` set to 1 and that appears to have made them more stable locally. I think we should consider defaulting to using lldb-server on Windows instead of the other way around. @Greg Clayton do you happen to know why it defaults to not using lldb-server?
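
For reference, the local runs mentioned above were roughly along these lines (a sketch; the build directory name is a placeholder):

import os
import subprocess

# Re-run the lldb API tests with LLDB_USE_LLDB_SERVER=1 in the environment.
# "build" is a placeholder for the local build directory.
env = dict(os.environ, LLDB_USE_LLDB_SERVER="1")
subprocess.run(["ninja", "-C", "build", "check-lldb-api"], env=env, check=True)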

 

Thanks,

-Stella

Greg Clayton via llvm-dev

unread,
Jan 14, 2022, 6:56:01 PM1/14/22
to Stella Stamenova, Jim Ingham, llvm-dev

On Jan 14, 2022, at 3:27 PM, Stella Stamenova <sti...@microsoft.com> wrote:

Thanks Omair!
 
I’ll wait for your change to go in and we can evaluate what else might need to happen afterwards.
 
I’ve been running some local tests with `LLDB_USE_LLDB_SERVER` set to 1 and that appears to have made them more stable locally. I think we should consider defaulting to using lldb-server on Windows instead of the other way around. @Greg Clayton do you happen to know why it defaults to not using lldb-server?

I do not, but the golden path that we really want people to follow is to use lldb-server to debug things. This allows remote debugging to work well in all cases instead of being just some avenue that no one tests.

Benefits of using lldb-server:
- Mac and Linux have been using it since the beginning, and ProcessGDBRemote is the best supported process plug-in as it has seen many different GDB remote clients and served multiple architectures really well
- We can get a packet log for tests to see what actually went wrong. When using ProcessWindows, unless we have logging on every API call and event that is generated, we have no hope of figuring any issues out. Anyone can enable a log with “log enable -f /tmp/packets.txt gdb-remote packets” and send that to someone to help figure out issues
- Dynamic register information is transferred and allows the logs to be even more useful since we know all of the registers from the register context detection packets
- Makes remote debugging possible and it works really well.

So I would highly suggest switching over to using lldb-server permanently if possible, and I would like to see the ProcessWindows class go away in the future. The main reason is that we will be able to see what is going on by checking the lldb-server logs when we have a flaky test. I would be happy to help figure out issues on Windows if I can see the packet log for a flaky test where we have one log that passes the test and one that fails it. I am quite good at looking at these logs and figuring out what is going wrong. With ProcessWindows and absolutely no logging, we have no hope of figuring any buildbot issue out unless we can reliably reproduce the issue. Also, we have a TON of testing on the lldb-server debugging path since 99% of all LLDB users use it (either lldb-server, or debugserver for Darwin (macOS, iOS, tvOS, watchOS)).
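
If it helps, the same packet log can also be switched on programmatically, e.g. from a test harness. A minimal sketch using the lldb Python SB API (the log path is just an example):

import lldb

# Enable GDB-remote packet logging via the SB API.  The log file path is
# only an example; point it somewhere the harness can collect.
lldb.SBDebugger.Initialize()
debugger = lldb.SBDebugger.Create()
debugger.HandleCommand("log enable -f /tmp/packets.txt gdb-remote packets")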


So a big vote to enable this and, if all goes well, remove the ProcessWindows class and always use lldb-server from here on out.