Turning down ARM64 GitHub runners

Scott Todd

Jul 19, 2024, 11:54:05 AM
to iree-discuss
Hi all,

We'll be turning off the ARM64 continuous integration builds for both the smaller 'runtime' build/test job and the larger 'all' (compiler) build/test job that have been running on PRs and each merged commit. Securing and maintaining runner VMs has been costly and low runner availability has stalled checks on PRs in the past few weeks.

Note: this is not unique to ARM64 - we're also trying to shed some CI complexity and load with auxiliary x86 Linux jobs, Android jobs, A100 GPU jobs, etc. We'll be focusing on the project foundations (build packages -> run tests in various configurations, i.e. 'pkgci'), then we can add back CI jobs over time as resources become available. We are still committed to supporting each of these platforms/configurations; we just need to revamp how CI resources are managed.

Please reach out if you have questions or would like to help with resourcing or maintenance.

Andrzej Warzyński

Jul 19, 2024, 1:10:09 PM
to iree-discuss
Hi Scott,

This is obviously unfortunate from our (Arm) perspective, but totally understandable. How can we help? :)

I can't really promise anything just yet, but am happy to ask around and see whether we could provide an AArch64 runner. For that to happen, I'd need to better understand the requirements:
  • storage,
  • pipeline exec frequency,
  • pipeline exec duration.
I've been meaning to extract this information myself (by studying the current set-up), but haven't had the cycles to do it. Any sort of indication would help. While we care about both mobile and cloud, we'll probably look at cloud solutions (i.e. some rather powerful AArch64 configs).

Our budget is limited, but I feel that we should make sure that we build and test IREE on AArch64, even if that's just a single nightly job. Would you be able to point me in the right direction? And, more importantly, is that something that IREE will support?

Thank you,
-Andrzej

Scott Todd

Jul 19, 2024, 2:15:33 PM
to Andrzej Warzyński, iree-discuss
I think we'd be looking at either
  1. Using GitHub-hosted Arm runners, which are in public beta: https://github.blog/2024-06-03-arm64-on-github-actions-powering-faster-more-efficient-build-systems/ . These are not free, even at lower core counts (yet?): https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners#specifications-for-general-larger-runners. We'd also need to get iree-org back onto a Team or Enterprise GitHub plan to even trial those (moving from Alphabet to LF changed our status there). Not needing to manage the runners directly would be immensely helpful, if we can figure out billing and such.
  2. Using self-hosted Arm runners, either with the existing Google Cloud Platform setup or (my slight preference) provided somehow by your group.

In either case, overall infrastructure instability and scaling issues lately have me thinking we should aim for nightly builds or on-demand ('workflow_dispatch' instead of 'push' / 'pull_request' events) for builds that either don't have free runners provided by GitHub or don't have sufficient capacity to run on every commit 24/7.
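
As a minimal sketch of that idea (not the actual IREE workflow), a workflow limited to nightly and on-demand runs could declare only 'schedule' and 'workflow_dispatch' triggers. The file name, workflow name, cron time, runner labels, and build script below are placeholders chosen for illustration:

```yaml
# Hypothetical file: .github/workflows/ci_linux_arm64_nightly.yml
name: CI - Linux arm64 (nightly, sketch)

on:
  # Nightly run; 09:00 UTC is an arbitrary example time.
  schedule:
    - cron: "0 9 * * *"
  # Allows on-demand runs started manually (GitHub UI, CLI, or API).
  workflow_dispatch:

jobs:
  build_and_test:
    # Assumed labels for a self-hosted Arm runner.
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        # Placeholder script, not a real path in the repository.
        run: ./build_and_test.sh
```

With no 'push' or 'pull_request' entry under 'on', the job never runs per commit, so low runner availability cannot stall PR checks. Scheduled runs use the latest commit on the default branch.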

As for requirements, we have some full compiler builds running in between 45 minutes and 5 hours (depending on cache hits and runner platform) on standard GitHub-hosted runners with these specs: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories . The main CPU jobs we use for x86 Linux builds and tests run on n1-standard-96 VMs: https://cloud.google.com/compute/docs/general-purpose-machines#n1_machine_types . Those can build the full project with a cold cache in 15 minutes. Stella or Jacques may have statistics for commit / CI run frequency as an average / upper bound, but I think we should still aim for nightly instead at this point.






Stella Laurenzo

Jul 19, 2024, 2:19:44 PM
to Scott Todd, Andrzej Warzyński, iree-discuss
We should also scrub through what features get enabled for all of these builds. We're currently doing kitchen sink builds, even of optional/rarely used platforms and compiler backends, for everything.



--
- Stella

Scott Todd

Jul 19, 2024, 9:21:37 PM
to Stella Laurenzo, Andrzej Warzyński, iree-discuss
The GCP arm64 runners are back online for now, but we shouldn't plan on them staying up consistently.

I also have more ideas for how to structure the workflows so each platform can set its own triggers and be more easily turned on/off based on infrastructure status. I still need to develop more of a plan for self-hosted vs GitHub-hosted runners to service those jobs/workflows though. I'll continue to follow up next week.

Andrzej Warzyński

Jul 22, 2024, 11:26:03 AM
to iree-discuss
Thanks Scott,

I think that keeping AArch64 nightly builds + on-demand would probably cover most important cases. It would be super helpful if we could keep that.

I do want to offer some hardware for IREE for self-hosted Arm runners, but sadly our team managing such requests has quite a backlog. That would be some relatively large cloud node (build times comparable to your reference nodes). I'm trying to figure out whether there's scope for me to do it myself (there are certain "processes" involved).

I need a bit more time to understand our internal constraints. If I get the "go ahead", that would most likely only cover nightly + on-demand. Unless there aren't that many pre-commit jobs?

Thank you,
-Andrzej

ste...@laurenzo.org

Jul 22, 2024, 5:06:34 PM
to iree-discuss
(repost - I sent this 3 days ago but it didn't end up on list and only went to Andrzej)

Hey Andrzej, speaking for myself, I definitely don't want AArch64 excluded. But the infra is crumbling in ways that Scott has been trying to use extraordinary measures to forestall. I finally gave him the go ahead to cut the things that aren't working and involve the community vs spending all of his time firefighting. So absolutely, this is not a "we're not going to support" kind of thing but a "we need help". As you say, a little can go a long way, and I'm confident that some practical things like you are suggesting will pay off.

Technically, the existing AArch64 runners were set up on Google infra that is not really amenable to third party maintenance. So we need to shut that down. But, assuming there are some resources, it shouldn't be hard at all to attach an additional bot -- especially if running at a low frequency like you suggest.

Marius

Jul 26, 2024, 4:48:13 AM
to Scott Todd, iree-discuss
I know that this wouldn't be a direct replacement, but would it be a suitable alternative to cross-compile samples / tests for ARM64 and execute those with Renode or maybe even an FVP (Fixed Virtual Platform)?

Jacques Pienaar

Jul 26, 2024, 9:11:45 AM
to Marius, Scott Todd, iree-discuss
I think Andrzej's suggestion probably provides >80% of the value at a much lower cost. As Stella phrased it, it is not about supporting or not, but about needing resources to support*. To me, that makes it a development methodology question: how to enable lower-cost testing of supported platforms for partners. That workflow for supported processors means there may be a delay (~2 days) between a change landing and an issue being seen, and the core team would be willing to act on such an issue as if a presubmit or postsubmit had just failed.

I don't think we need to get too creative beyond that. If this works, then I think it ends up being a good/cheap template too**.

-- Jacques

* Running on Google infra had a cost advantage here; the tradeoff in moving things is that it is believed to enable a more sustainable flow for more partners.
** I was going to joke that we should run all these presubmits on a new iPad Pro so we can even check SME ... it actually sounds fun to set up a PoC, no maintaining etc

Scott Todd

Jul 29, 2024, 7:25:26 PM
to iree-discuss

Updates on my end:
  • I've been continuing to refactor our workflows so that jobs can use individually configurable event triggers and, in certain cases, be disabled directly from the GitHub UI.
  • I also drafted some documentation for how we use GitHub Actions that I hope can help navigate these sorts of situations going forward: https://github.com/iree-org/iree/pull/18035.

Andrzej Warzyński

Jul 30, 2024, 12:13:42 PM
to iree-discuss
Thanks Scott!

I don't have much to share myself. Sadly, the hardware provider that we used to use for public CI is ramping down, so I'm looking at other options. Separately, our dedicated CI team hinted that they won't be available to help till Autumn :( (they have a long backlog).

I'm exploring other options and trying to expedite this, but it looks like it might be a few months before we can offer some runners.

-Andrzej 

Scott Todd

Aug 29, 2024, 5:48:52 PM
to iree-discuss
Another update: we're making progress on switching from GCP GitHub runners to a new cluster (likely on Azure, but also considering AWS): https://github.com/iree-org/iree/issues/18238 . We're focusing on the majority of the builds needed for core project development (Linux x86_64) before the long tail of builds that would be nice to support (Linux arm64, Windows, CUDA GPUs, etc.), so on our current trajectory we may lose the current arm64 VMs without a ready replacement.

Andrzej Warzyński

Aug 30, 2024, 3:50:03 AM
to iree-discuss
Please also migrate base-arm64.Dockerfile :) Hopefully that's not too much extra work? IIUC, these images are used for testing and to build Python packages for the corresponding platform? So, we will need it if we want to have an AArch64 runner? (even if we are to lose one in the short term)

There's been some positive "traffic" internally and I might have some good news next month, but don't want to promise anything just yet. 

-Andrzej

Scott Todd

Sep 3, 2024, 12:04:12 PM
to Andrzej Warzyński, iree-discuss
It's used for testing. A different image is used for building the Python packages: quay.io/pypa/manylinux_2_28_aarch64 (see code here). It shouldn't be too much work; we just need to figure out:
  • Which deps are needed and how to install (the new repo uses apt for more packages and has some different install scripts)
  • How to manage the qemu-aarch64 binary currently stored at https://storage.googleapis.com/iree-shared-files/qemu-aarch64 . Steps for how to build that were posted on Discord. We could re-host it somewhere, check it in to the docker images repo (with some license?), build it from source, or find another way to install it.
  • How to test new images (could use github actions or spin up an arm64 VM for testing)

Andrzej Warzyński

Sep 30, 2024, 12:17:44 PM
to iree-discuss
Good news from our end - this is now available for IREE: https://gitlab.arm.com/tooling/gha-runner-docs :)

I will continue the discussion under https://github.com/iree-org/iree/issues/18238

-Andrzej