RFC: New LLVM integrate process


Stella Laurenzo

Sep 20, 2023, 10:34:05 AM
to iree-discuss
I've been experimenting with integrate processes over the last couple of weeks and am ready to propose a next step towards automating ours. I've got some very hacky scripts that have been helping me do this for the past ~day or so, and I think with a bit of refinement, it can be a workable process for us.

The basic process is encapsulated by this loop:

while true; do
  echo "***********************************************************"
  echo " INTEGRATE ITERATION"
  echo "***********************************************************"

  ./auto_integrate.py status

  echo "***** PERFORMING BUILD AT CURRENT REVISION *****"
  ./build_and_validate.sh

  echo "***** ADVANCING *****"
  while true; do
    # Capture the exit code without tripping `set -e`.
    set +e
    ./auto_integrate.py next
    rc="$?"
    set -e
    if [ "$rc" = "99" ]; then
      echo "At ToT. Waiting..."
      sleep 300
      continue  # Retry rather than falling through to the failure path.
    fi
    if [ "$rc" = "0" ]; then
      echo "Successful advance. Giving a beat..."
      sleep 10
      break
    fi
    echo "Could not advance."
    exit 1
  done
done
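For anyone who would rather drive this from Python, here is a minimal sketch of the inner "advance" step. It assumes the intent is to keep retrying while at ToT (exit code 99), stop on success (exit code 0), and give up on anything else. The function name `advance` and its parameters are hypothetical and not part of the actual scripts; the command runner and sleep function are passed in so the logic can be exercised without touching git.

```python
from typing import Callable

AT_TOT = 99   # exit code the loop above treats as "already at top of tree"
ADVANCED = 0  # exit code the loop above treats as a successful advance


def advance(run_next: Callable[[], int],
            sleep: Callable[[float], None],
            wait_s: float = 300,
            settle_s: float = 10) -> bool:
    """Return True after a successful advance, False on any other exit code."""
    while True:
        rc = run_next()
        if rc == AT_TOT:
            # Nothing new upstream yet: wait, then try again.
            sleep(wait_s)
            continue
        if rc == ADVANCED:
            # Give things a beat before the next build.
            sleep(settle_s)
            return True
        # Any other exit code needs a human.
        return False
```

In real use, `run_next` could be something like `lambda: subprocess.run(["./auto_integrate.py", "next"]).returncode` and `sleep` could be `time.sleep`.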

In this process, the goal is to produce, with a minimal number of human interrupts, an integrate PR at a specific LLVM commit representing the next major break to the project (either an API break or an egregious compiler issue). The project members then gang up on that PR and fix the issues by whatever means necessary, land it, and the automation continues its search. In the unlikely event that LLVM HEAD produces no interrupts for an extended period of time, we can do a time-based integrate after a few days. The idea is that between the LLVM commits of two consecutive integrate PRs, it is highly likely that LLVM can be cleanly bisected without source-level compilation failures, and that each integrate PR is focused on fixing one issue at a time, which is usually very easy to farm out (and often, in isolation, very mechanical).

There are five parts to this where we can tweak things:
  1. Definition of affecting change: Currently `auto_integrate.py {status|next}` considers all MLIR commits as "affecting" and will report how far from true-HEAD we are at each step. This could be extended to include milestone commits from other projects, etc. This is presently producing on the order of 20-40 affecting commits per day that need to be validated (although the true number of underlying commits to the LLVM repo can be quite large).
  2. What build and validation steps are done: In the example above, we do an ASAN/ASSERTS release build of just the core compiler/runtime with standard options and then build "all" followed by "iree-test-deps". This is intentionally *non-comprehensive*: it is meant to be incremental and quick, and to break only on hard compilation failures and egregious compiler bugs (e.g. ASAN uncleanness). Since the result of the process is a trivially bisectable range, it is assumed that when the full integrate is done with all testing, a human can pinpoint any more advanced failure cases. The quick build also excludes exotic and unsupported features that are easily patched in a real integrate (e.g. bazel build files, CMake errors for less frequently used platforms).
  3. How breaks are handled: This part currently has no automation and it is just me. I've had three breaks so far. Yesterday, we made it through a couple dozen affecting commits before an API break. On the first, I exported the branch and posted on Discord. Kunwar fixed it overnight (my time) and I clicked submit in the morning. The second was a trivial fix, so I just made it in situ and continued. The third I am evaluating now. I think this part can be trivially automated to export the branch(es) to GH, raise a PR and post an update on Discord for attention. The simple loop can be extended to always pause when such an integrate PR is open.
  4. How patches are carried on llvm-project: Currently, I'm not doing anything special here beyond checking if the branch is clean with respect to upstream. If it is, then the script does a `git reset --hard` and if not, it does a `git pull --rebase` to advance to the desired commit. Unclean branches require a human to be aware of the patch stack we are carrying locally to ensure that we retire them at an appropriate time. Also, having carried patches can cause the automation to fail to rebase, requiring human intervention. In practice, if we are primarily carrying committed patches from the future or reverts that will land to upstream, it mostly works out: when the integrate advances past the patch, it is dropped automatically from our carried set. When this breaks, though, only a human can untangle it.
  5. What about other LLVM dependent projects: stablehlo and torch-mlir also depend on LLVM. In practice, the parts of those that we use are much more tolerant than IREE of LLVM commit changes. I don't have a ratio right now because divide by zero isn't a number. Suffice it to say we need to bump dependencies in three situations: a) when an LLVM API break affects them, requiring them to advance, b) when we require features from them, or c) when some other exotic breakage happens, requiring us to carry a local patch. In all cases, this is in the realm of humans/other automation and not implicated here.
In practice, implementing this flow will involve a greater number of semi-automated integrates around breakages, clearly defined bisect ranges, and more scoped triage activities. The automation helps handle the repetitive parts so that humans can take over for real issues.
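To make items 1 and 4 above a bit more concrete, here is a rough Python sketch of the two decisions involved: which upstream commit is the next "affecting" one, and which git command advances the local branch. Everything here is an assumption for illustration. The function names are hypothetical, "MLIR commit" is approximated as "touches a path under `mlir/`", and the carried-patch case uses `git rebase <target>` as a stand-in for the `git pull --rebase` mentioned above, since the exact arguments the script uses are not given.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# Assumption: a commit counts as "affecting" if it touches any of these paths.
AFFECTING_PREFIXES = ("mlir/",)


def is_affecting(paths: Iterable[str],
                 prefixes: Sequence[str] = AFFECTING_PREFIXES) -> bool:
    """True if any changed path falls under one of the watched prefixes."""
    return any(p.startswith(tuple(prefixes)) for p in paths)


def next_affecting(commits: Sequence[Tuple[str, List[str]]]) -> Optional[str]:
    """Pick the first affecting commit.

    `commits` is an oldest-first list of (sha, changed_paths) pairs for the
    upstream commits after the current revision. Returns None at ToT.
    """
    for sha, paths in commits:
        if is_affecting(paths):
            return sha
    return None


def advance_commands(carried_patches: int, target: str) -> List[List[str]]:
    """Choose how to move the local llvm-project branch to `target`."""
    if carried_patches == 0:
        # Clean with respect to upstream: just jump to the target commit.
        return [["git", "reset", "--hard", target]]
    # Carrying local patches: replay them on top of the target commit.
    # (Stand-in for `git pull --rebase`; this step can fail and need a human.)
    return [["git", "rebase", target]]
```

Extending the definition of "affecting" (for example, to the bazel overlay) would then just be a matter of passing a longer prefix tuple such as `("mlir/", "utils/bazel/")`.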

Thoughts? I was planning to run this way as I evolve the local scripts for a bit and then rope someone else in to help.
- Stella

Jacques Pienaar

Sep 20, 2023, 10:46:30 AM
to Stella Laurenzo, iree-discuss
And this would just be functional tests, correct? Effectively we end up with known sections of "green", functionality-wise, and then fixes. How will cherry picks be handled and recorded? Or will the intention be to keep this clean with an LLVM upstream rev?

-- Jacques 


Stella Laurenzo

Sep 20, 2023, 10:54:32 AM
to Jacques Pienaar, iree-discuss
On Wed, Sep 20, 2023 at 7:46 AM Jacques Pienaar <jpie...@google.com> wrote:
And this would just be functional tests correct? Effectively end up with known sections of "green" functionality wise and then fixes. How will cherry picks be handled and recorded? Or will the intention be to keep this clean with an LLVM upstream rev?

Yes, it is light functional testing when smoke testing a new commit (currently very light but can tweak).

Meant to mention ideas for future refinement:

  1. When we take in a commit, if there is a branch with some name like "llvm-patch-{commit}" available, we could pull that in automatically.
  2. There isn't an intention to keep a clean state with LLVM upstream, as that is not always possible. By keeping the train moving better, though, it should limit the amount of functionality-related cherry-picking we need to do, and I'd like the steady state to be that we carry reverts and landed patches from the future as needed. The thought is that those are pretty easy to reason about and they get dropped automatically by rebase when no longer needed. As an example: a common case is that something is landed in LLVM and then reverted XX patches later. In order to keep the train moving, the integrator may just choose to pull the revert into the local stream and let us otherwise keep moving linearly. Then XX patches later, it would get dropped automatically.
Both of these things require a certain amount of human judgement and can break/need intervention when the unexpected happens. I think the goal should be to let the script do the simplest thing and then give a human the tools to untangle it when everything goes bad.
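As a rough illustration of the two refinements above, here is a small Python sketch. It assumes the branch naming is literally "llvm-patch-" followed by the commit hash, and it models "dropped automatically by rebase" as simple set membership: a carried patch goes away once its upstream counterpart is part of the commit we have integrated to. The function names are hypothetical, and in practice the branch list would come from git rather than being passed in.

```python
from typing import Iterable, List, Optional, Sequence


def patch_branch_for(commit: str,
                     branches: Iterable[str],
                     prefix: str = "llvm-patch-") -> Optional[str]:
    """Return the patch branch to pull in for `commit`, if one exists."""
    wanted = prefix + commit
    return wanted if wanted in set(branches) else None


def still_carried(carried: Sequence[str],
                  integrated: Iterable[str]) -> List[str]:
    """Carried upstream commits (reverts, patches from the future) that are
    not yet part of the integrated history and so must stay in our stack."""
    done = set(integrated)
    return [sha for sha in carried if sha not in done]
```

For example, if we carry a revert that lands upstream a few commits ahead of us, `still_carried` keeps returning it until the integrate advances past that commit, at which point it falls out of the list with no human action.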

Brehler, Marius

Sep 21, 2023, 4:27:58 PM
to Stella Laurenzo, iree-discuss
Hey Stella,

Thanks for your RFC. Just to make sure that I got it right: The script works through each MLIR commit one by one? If so, this is quite resource hungry, but it makes sense if the main goal is to keep human intervention to a minimum.

I was actually about to write that I am not too confident regarding the "What about other LLVM dependent projects" part, but I just realized that you have switched Torch-MLIR from tensorflow/mlir-hlo to openxla/stablehlo. That's awesome, as in my experience, finding green LLVM commits that worked for both tensorflow/mlir-hlo and openxla/stablehlo was troublesome. Hence, I don't see any real issues with the proposed integrate process.

Best, Marius

From: iree-d...@googlegroups.com <iree-d...@googlegroups.com> on behalf of Stella Laurenzo <ste...@nod-labs.com>
Sent: Wednesday, 20 September 2023 16:33
To: iree-discuss <iree-d...@googlegroups.com>
Subject: [iree-discuss (public)] RFC: New LLVM integrate process
 

Stella Laurenzo

Sep 21, 2023, 4:33:37 PM
to Brehler, Marius, iree-discuss


On Thu, Sep 21, 2023, 1:27 PM Brehler, Marius <marius....@iml.fraunhofer.de> wrote:
Hey Stella,

Thanks for your RFC. Just to make sure that I got it right: The script works through each MLIR commit one by one? If so, this is quite resource hungry, but it makes sense if the main goal is to keep human intervention to a minimum.

I've just been running it on my workstation in the background with ccache and haven't really noticed. It only stops on MLIR and bazel changes, so it gets pretty good incrementality. Also, with no real deadline, it is fine to give it fewer resources: it is not a large number of commits. I was able to clear about half a week's backlog in a few hours.

If we ever get underwater, we can take a bigger jump if we need to.


I was actually about to write that I am not too confident regarding the "What about other LLVM dependent projects", but I just realized that you have switched Torch-MLIR from tensorflow/mlir-hlo to openxla/stablehlo. Thats awesome, as I experienced finding green LLVM commits that worked for both tensorflow/mlir-hlo and openxla/stablehlo was troublesome. Hence, I don't see any real issues with the proposed integrate process.

Yeah, exactly. In my experience, the way it is now requires a patch or intervention a time or two a month, which is completely reasonable. The old way was just constant churn.