## Summary

I propose that we remove the "Benchmark" and "Benchmark Large" workflows from the iree-org/iree repository. These workflows flag possible benchmark improvements/regressions on pull requests and generate the data behind
https://perf.iree.dev/, which I also propose we either let go stale or take offline.
This is at least our third iteration of benchmarking infrastructure, and I think it's time to start fresh with a new one. Redesigning infrastructure like this is a healthy part of a project's lifecycle.
## Background

The benchmark suites are documented at
https://iree.dev/developers/performance/benchmark-suites/.
These code paths control "the benchmark CI":
This infrastructure runs (or has run) automated benchmarks of these models:
- DeeplabV3
- MobileSSD
- PoseNet
- MobileBERT
- MobileNet
- PersonDetect
- EfficientNet
- VIT
- UNet
- Clip
- BERT
- Falcon
- ResNet
- T5
Specific subsets of those models (and associated microbenchmarks) are benchmarked on ARM CPU, x86_64 CPU, RISC-V CPU, CUDA GPU, Adreno GPU, and Mali GPU. Benchmarks are run both on PRs (opt-in) and on each push to 'main'. Historical results from runs on 'main' are tracked at
https://perf.iree.dev. We measure both execution latency and select compilation statistics (e.g. dispatch count, dispatch size, compilation time). We used to run benchmarks both with and without tracing, with traces archived in cloud storage for developer analysis.
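To make the latency measurements concrete, here is a minimal sketch of how a single data point could be collected with iree-benchmark-module, assuming a module already compiled with iree-compile. The module path, entry function, and device are hypothetical placeholders, and the Google Benchmark output flags are assumed to be passed through by the tool.

```python
# Minimal sketch of collecting one latency measurement. Paths, the entry
# function name, and the device are hypothetical placeholders.
import json
import os
import subprocess
import tempfile


def run_latency_benchmark(module_path: str, function: str, device: str) -> float:
    """Runs iree-benchmark-module once and returns the mean reported wall time."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "results.json")
        subprocess.run(
            [
                "iree-benchmark-module",
                f"--module={module_path}",
                f"--function={function}",
                f"--device={device}",
                # A real run would also pass --input= flags matching the
                # function signature.
                # iree-benchmark-module is built on Google Benchmark, so the
                # standard output-capture flags should be available.
                f"--benchmark_out={out_path}",
                "--benchmark_out_format=json",
            ],
            check=True,
        )
        with open(out_path) as f:
            results = json.load(f)
    # "real_time" is reported in the unit named by each entry's "time_unit".
    times = [b["real_time"] for b in results["benchmarks"]]
    return sum(times) / len(times)


if __name__ == "__main__":
    # Hypothetical invocation against a locally compiled MobileBERT module.
    print(run_latency_benchmark("mobilebert.vmfb", "main", "local-task"))
```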
This infrastructure is heavily based on Google Cloud Platform (GCP): runners are hosted on GCP, files are passed between workflows and stored for later access in GCS, and machines use Docker images stored in GCR. At one point we were uploading 50GB of benchmark artifacts for every benchmark run, with no TTL on those files, so the costs can be quite high if not monitored closely.
The website uses
https://github.com/google/dana, an unmaintained project that we forked to add some features:
https://github.com/google/dana/tree/iree.
## Rationale for removal

As part of moving IREE into the LF AI & Data Foundation, we have been auditing our infrastructure configurations, ongoing infrastructure costs, cloud project permissions, and related items.
As it currently exists, the benchmarking infrastructure incurs significant costs without matching returns. Specifically, the cloud runners will need to be migrated off of GCP soon, and the in-tree code is blocking broader infrastructure cleanup efforts. Furthermore, no benchmark series have been added in about 8 months, and the "Benchmark Large" workflow has been failing for multiple weeks with few complaints.
## Plans for replacement

While out of scope for this specific proposal, I do have some general ideas for how to design the next iteration of benchmarking infrastructure:
- Create a separate repository for benchmark configurations (decoupled from the core project), like the iree-org/iree-comparative-benchmark and nod-ai/SHARK-TestSuite repositories
- Source models from Hugging Face, TF Hub/Kaggle, ONNX model collections, etc. instead of mirrors in the iree-model-artifacts GCS bucket
- Use local caches on persistent runners for model sources, and only upload for archival the files that are specifically useful to developers
- Define schemas for benchmark series (model × compiler options × runtime options × hardware × statistics to track), avoiding CMake or Bazel for this (see the sketch after this list)
- Derive benchmarks from test suites, so all benchmarked models are also tested for correctness/accuracy
- Optimize ergonomics for local development workflows and automated CI workflows
- Add new benchmarks for currently popular model families (e.g. transformer models, llama)
- Find new benchmark runners (AMD has a few, e.g. MI250 / MI300 datacenter GPU machines)
- Track results in a database that can be queried / filtered / etc. independent of the website UI
- Backfill results if possible
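To make the schema and database ideas above a bit more concrete, here is a minimal sketch, assuming a Python-based configuration layer and a plain SQLite store. All of the names (BenchmarkSeries, the results table, the example series and flags) are hypothetical illustrations rather than an existing or planned API; the point is only that series definitions and results can live outside of CMake/Bazel and be queried independently of any website UI.

```python
# Hypothetical sketch of a benchmark series schema and a queryable results
# store; none of these names exist in IREE today.
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSeries:
    """One series: model x compiler options x runtime options x hardware."""
    model: str                           # e.g. a Hugging Face repo id
    compiler_flags: tuple = ()           # iree-compile flags for this series
    runtime_flags: tuple = ()            # runtime / benchmark tool flags
    hardware: str = "x86_64-cpu"         # device class the series runs on
    statistics: tuple = ("latency_ms",)  # which statistics to track

    def series_id(self) -> str:
        # Stable identifier used as the key when storing results.
        return "/".join([self.model, self.hardware, *self.compiler_flags])


def record_result(db: sqlite3.Connection, series: BenchmarkSeries,
                  commit: str, statistic: str, value: float) -> None:
    """Appends one data point so results can be queried/filtered directly."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(series_id TEXT, commit_sha TEXT, statistic TEXT, value REAL)")
    db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
               (series.series_id(), commit, statistic, value))
    db.commit()


# Example: define one series and record a latency sample for a commit.
series = BenchmarkSeries(
    model="google/mobilebert-uncased",
    compiler_flags=("--iree-hal-target-backends=llvm-cpu",),
    hardware="x86_64-cpu",
)
db = sqlite3.connect("benchmarks.db")
record_result(db, series, commit="abc1234", statistic="latency_ms", value=12.3)
```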
## Proposed removal timeline

1. ASAP: Disable workflows
2. ASAP: Delete code
3. < 1 month: Archive historical data somewhere
4. >= 3 months (?): Take https://perf.iree.dev offline
## Alternative plans considered

Keep the benchmarks running for at least compilation statistics?

- This would reduce the usage of self-hosted runners but would still incur all of the same code maintenance costs.
- This may be possible and fairly inexpensive. I'm not sure how much utility we'd get out of it, though.

Build a new system for benchmarking before turning down the existing one?

- I'd certainly prefer to transition this way, but the ongoing maintenance and budget costs are quite steep.