I've been experimenting with integrate processes over the last couple of weeks and am ready to propose a next step towards automating ours. I've got some very hacky scripts that have been helping me do this for the past ~day or so, and I think that, with a bit of refinement, they can become a workable process for us.
The basic process is encapsulated by this loop:

    while true; do
      echo "***********************************************************"
      echo " INTEGRATE ITERATION"
      echo "***********************************************************"
      ./auto_integrate.py status
      echo "***** PERFORMING BUILD AT CURRENT REVISION *****"
      ./build_and_validate.sh
      echo "***** ADVANCING *****"
      while true; do
        # Capture the exit code of `next` without tripping `set -e`.
        set +e
        ./auto_integrate.py next
        rc="$?"
        set -e
        if [ "$rc" = "99" ]; then
          echo "At ToT. Waiting..."
          sleep 300
          # Retry the advance rather than falling through to the failure path.
          continue
        fi
        if [ "$rc" = "0" ]; then
          echo "Successful advance. Giving a beat..."
          sleep 10
          break
        fi
        echo "Could not advance."
        exit 1
      done
    done
In this process, the goal is to produce, with as few human interrupts as possible, an integrate PR at the specific LLVM commit that represents the next major break to the project (either an API break or an egregious compiler issue). The project members then gang up on that PR, fix the issues by whatever means necessary, and land it, after which the automation continues its search. In the unlikely event that LLVM HEAD produces no interrupts for an extended period of time, we can do a time-based integrate after a few days. The idea is that within the LLVM commit range covered by an integrate PR, it is highly likely that LLVM can be cleanly bisected without source-level compilation failures, and that each integrate PR is focused on fixing one issue at a time, which is usually very easy to farm out (and often, in isolation, very mechanical).
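The policy in the paragraph above can be written down as a small decision function. This is only a sketch, not something the current scripts contain: `Action`, `next_action`, and the `max_quiet_days` default of 3 are hypothetical names and values picked for illustration (the post only says "a few days").

```python
from enum import Enum


class Action(Enum):
    ADVANCE = "advance"              # keep stepping to the next affecting commit
    OPEN_BREAK_PR = "open_break_pr"  # build/validation failed: raise an integrate PR
    OPEN_TIMED_PR = "open_timed_pr"  # nothing broke for a while: integrate anyway


def next_action(build_ok: bool, quiet_days: float,
                max_quiet_days: float = 3.0) -> Action:
    """Pick the next step of the integrate loop.

    build_ok: whether build and validation passed at the current LLVM commit.
    quiet_days: days since the last integrate PR landed.
    max_quiet_days: hypothetical threshold for a time-based integrate.
    """
    if not build_ok:
        return Action.OPEN_BREAK_PR
    if quiet_days >= max_quiet_days:
        return Action.OPEN_TIMED_PR
    return Action.ADVANCE
```

A break always takes priority over the timer, so a time-based integrate is only raised from a commit that already built and validated cleanly.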
There are five parts to this where we can tweak things:
- Definition of affecting change: Currently `auto_integrate.py {status|next}` considers all MLIR commits to be "affecting" and reports how far from true HEAD we are at each step. This could be extended to include milestone commits from other projects, etc. This presently produces on the order of 20-40 affecting commits per day that need to be validated (although the true number of underlying commits to the LLVM repo can be quite large).
- What build and validation steps are done: In the example above, we do an ASAN/ASSERTS release build of just the core compiler/runtime with standard options and then build "all" followed by "iree-test-deps". This is intentionally *non-comprehensive*: it is meant to be incremental and quick, and to break on hard compilation failures and egregious compiler bugs (e.g. ASAN uncleanness). Since the result of the process is a trivially bisectable range, it is assumed that when the full integrate is done with all testing, a human can pinpoint any more advanced failure cases. It also excludes exotic and unsupported features that are easily patched in a real integrate (e.g. Bazel build files, CMake errors for less frequently used platforms).
- How breaks are handled: This part currently has no automation; it is just me. I've had three breaks so far. Yesterday, we made it through a couple dozen affecting commits before an API break. On the first, I exported the branch and posted on Discord. Kunwar fixed it overnight (my time) and I clicked submit in the morning. The second was a trivial fix, so I just made it in situ and continued. The third I am evaluating now. I think this part can be trivially automated to export the branch(es) to GH, raise a PR, and post an update on Discord for attention. The simple loop can be extended to always pause while such an integrate PR is open.
- How patches are carried on llvm-project: Currently, I'm not doing anything special here beyond checking whether the branch is clean with respect to upstream. If it is, the script does a `git reset --hard`; if not, it does a `git pull --rebase` to advance to the desired commit. Unclean branches require a human to be aware of the patch stack we are carrying locally, to ensure that we retire patches at an appropriate time. Carried patches can also cause the automation to fail to rebase, requiring human intervention. In practice, if we are primarily carrying committed patches from the future or reverts that will land upstream, it mostly works out: when the integrate advances past the patch, it is dropped automatically from our carried set. When this breaks, though, only a human can untangle it.
- What about other LLVM-dependent projects: stablehlo and torch-mlir also depend on LLVM. In practice, the parts of those that we use are much more tolerant of LLVM commit changes than IREE is. I don't have a ratio right now because divide by zero isn't a number. Suffice it to say we need to bump dependencies in three situations: a) when an LLVM API break affects them, requiring them to advance, b) when we require features from them, or c) when some other exotic breakage requires us to carry a local patch. In all cases, this is in the realm of humans/other automation and not implicated here.
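Two of the knobs above, the definition of an affecting change and the way the llvm-project checkout is advanced, are simple enough to sketch. This is illustrative only and not taken from `auto_integrate.py`: `is_affecting`, `advance_command`, and `AFFECTING_PREFIXES` are hypothetical names, "MLIR commit" is assumed to mean "touches a path under `mlir/`", and a plain `git rebase <target>` stands in for the `git pull --rebase` mentioned above, assuming the target commit has already been fetched.

```python
from typing import Iterable, List, Tuple

# Assumption: a commit is an "MLIR commit" if it touches any path under mlir/.
# Milestone paths from other projects could be added to this tuple.
AFFECTING_PREFIXES: Tuple[str, ...] = ("mlir/",)


def is_affecting(changed_paths: Iterable[str],
                 prefixes: Tuple[str, ...] = AFFECTING_PREFIXES) -> bool:
    """Return True if any changed path falls under an affecting prefix."""
    return any(path.startswith(prefixes) for path in changed_paths)


def advance_command(carried_patches: int, target: str) -> List[str]:
    """Build the git command that moves the checkout to `target`.

    carried_patches: number of local commits that are not in upstream,
    e.g. the output of `git rev-list --count <upstream>..HEAD`.
    """
    if carried_patches == 0:
        # Clean branch: nothing to preserve, so jump straight to the target.
        return ["git", "reset", "--hard", target]
    # Unclean branch: replay the carried patches on top of the target.
    # This can fail with conflicts, which is where a human has to step in.
    return ["git", "rebase", target]
```

Keeping both as pure functions means the policy can be unit tested without a checkout; the caller would feed `is_affecting` the output of something like `git diff-tree --no-commit-id --name-only -r <commit>`.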
In practice, implementing this flow will involve a greater number of semi-automated integrates around breakages, clearly defined bisect ranges, and more scoped triage activities. The automation helps handle the repetitive parts so that humans can take over for real issues.
Thoughts? I was planning to run this way as I evolve the local scripts for a bit and then rope someone else in to help.
- Stella