RFC: Turning down 'benchmark' workflows and perf.iree.dev

Scott Todd

Aug 7, 2024, 1:40:53 PM
to iree-discuss
Summary

I propose we remove the "Benchmark" and "Benchmark Large" workflows from the iree-org/iree repository. These workflows flag possible benchmark improvements/regressions on pull requests and generate the data behind https://perf.iree.dev/, which I also propose we either let go stale or take offline.

This is at least our third iteration of benchmarking infrastructure and I think it's time to start fresh with a new iteration. Redesigning infrastructure like this is a healthy part of a project's lifecycle.

Background

The benchmark suites are documented at https://iree.dev/developers/performance/benchmark-suites/.

These code paths control "the benchmark CI":
This infrastructure runs (or has run) automated benchmarks of these models:
  • DeeplabV3
  • MobileSSD
  • PoseNet
  • MobileBERT
  • MobileNet
  • PersonDetect
  • EfficientNet
  • VIT
  • UNet
  • Clip
  • BERT
  • Falcon
  • ResNet
  • T5
Specific subsets of those models (and associated microbenchmarks) are benchmarked on ARM CPU, x86_64 CPU, RISC-V CPU, CUDA GPU, Adreno GPU, and Mali GPU. Benchmarks are run both on PRs (opt-in) and on each push to 'main'. Historical results from runs on 'main' are tracked at https://perf.iree.dev. We measure both execution latency and select compilation statistics (e.g. dispatch count, dispatch size, compilation time). We used to run benchmarks both with and without tracing, with traces archived in cloud storage for developer analysis.

This infrastructure is heavily based on Google Cloud Platform (GCP): runners are hosted on GCP, files are passed between workflows and stored for later access in GCS, and machines use Docker images stored in GCR. At one point we were uploading 50GB of benchmark artifacts for every benchmark run, with no TTL on those files, so the costs can be quite high if not monitored closely.

The website uses https://github.com/google/dana, an unmaintained project that we forked to add some features: https://github.com/google/dana/tree/iree.

Rationale for removal

As part of moving IREE into the LF AI & Data Foundation, we have been auditing our infrastructure configurations, ongoing infrastructure costs, cloud project permissions, and related items.

As it currently exists, the benchmarking infrastructure is incurring significant costs without matching returns. Specifically, the cloud runners will need to be migrated off of GCP soon, and the in-tree code is blocking major infrastructure cleanup efforts. Furthermore, no benchmark series have been added in about 8 months, and the "Benchmark Large" workflow has been failing for multiple weeks without many complaints.

Plans for replacement

While out of scope for this specific proposal, I do have some general ideas for how to design the next iteration of benchmarking infrastructure:
  • Create a separate repository for benchmark configurations (decoupled from the core project), like the iree-org/iree-comparative-benchmark and nod-ai/SHARK-TestSuite repositories
  • Source models from Hugging Face, TF Hub/Kaggle, public ONNX model repositories, etc. instead of from mirrors in the iree-model-artifacts GCS bucket
  • Use local caches on persistent runners for model sources, only upload files for archival that are specifically useful for developers
  • Define schemas for benchmark series (model x compiler options x runtime options x hardware x statistics to track) without relying on CMake or Bazel (see the sketch after this list)
  • Derive benchmarks from test suites, so all benchmarked models are also tested for correctness/accuracy
  • Optimize ergonomics for local development workflows and automated CI workflows
  • Add new benchmarks for currently popular model families (e.g. transformer models, Llama)
  • Find new benchmark runners (AMD has a few, e.g. MI250 / MI300 datacenter GPU machines)
  • Track results in a database that can be queried / filtered / etc. independent of the website UI
  • Backfill results if possible
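
To make the schema bullet above concrete, here is a minimal sketch (in Python) of what a series definition could look like as plain data. Every class name, flag value, model URL, and runner label below is a placeholder for illustration, not a concrete proposal:

    # Hypothetical sketch: a benchmark series as plain data
    # (model x compiler options x runtime options x hardware x statistics),
    # defined without CMake or Bazel. All names and values are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class BenchmarkSeries:
        name: str                        # Stable identifier used to track history.
        model_source: str                # e.g. a Hugging Face / Kaggle / ONNX URL.
        compiler_flags: tuple[str, ...]  # Passed to iree-compile.
        runtime_flags: tuple[str, ...]   # Passed to iree-benchmark-module.
        hardware: str                    # Runner label, e.g. "mi300-linux".
        tracked_statistics: tuple[str, ...] = ("latency_ms", "dispatch_count")

    # Example series (placeholder model URL and runner label):
    RESNET50_ROCM = BenchmarkSeries(
        name="resnet50_fp16_rocm_mi300",
        model_source="https://huggingface.co/example/resnet50",
        compiler_flags=("--iree-hal-target-backends=rocm",),
        runtime_flags=("--device=hip",),
        hardware="mi300-linux",
    )

A small runner script could then expand each series into iree-compile / iree-benchmark-module invocations and emit results as JSON rows for whatever database ends up tracking them.
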
Proposed removal timeline

1. ASAP: Disable workflows
2. ASAP: Delete code
3. < 1 month: Archive historical data somewhere
4. >= 3 months (exact timing TBD): Take perf.iree.dev offline

Alternative plans considered

Keep at least the compilation statistics benchmarks running?
  • This would reduce the usage of self-hosted runners but would still incur all the same code maintenance costs.
Build a new system for benchmarking before turning down the existing one?
  • I'd certainly prefer to transition this way, but the ongoing maintenance and budget costs are quite steep.
Keep perf.iree.dev online with historical data?
  • This may be possible and fairly inexpensive. I'm not sure how much utility we'd get out of it though.

Scott Todd

Aug 7, 2024, 2:57:22 PM
to iree-discuss
Here is a PR deleting the associated code (and a few related scripts, docs, tests, etc. that got coupled together with the benchmarks): https://github.com/iree-org/iree/pull/18144. I'll wait to proceed with that until at least a few days have passed and we've had time to discuss here.

Scott Todd

Aug 12, 2024, 12:39:29 PM
to iree-discuss
Based on some discussion on Discord, it seems like developers are still getting value out of the "comp-stats" (compilation statistics) and CPU benchmark series. I don't see a way to keep those running on the current infrastructure without significant effort, but it's probably worth seeing what it would take to give similar signal through some new scripts/workflows. My preference would still be to turn down the current infrastructure ASAP to save on costs and unblock other infrastructure refactoring (scripts, Dockerfiles, workflows, etc.).
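
For what it's worth, a minimal standalone comp-stats script could start out looking something like the sketch below. It assumes the --iree-scheduling-dump-statistics-file / --iree-scheduling-dump-statistics-format flags used by the current tooling remain available and that a JSON output format is supported; treat the flag names and the output schema as assumptions to verify:

    # Rough sketch: compile one module and capture its compilation statistics.
    # Flag names are assumed from the current benchmark tooling; verify them
    # against iree-compile --help before relying on this.
    import json
    import subprocess
    import sys

    def compile_with_stats(mlir_path: str, vmfb_path: str, stats_path: str) -> dict:
        subprocess.run(
            [
                "iree-compile",
                mlir_path,
                "-o", vmfb_path,
                "--iree-hal-target-backends=llvm-cpu",
                f"--iree-scheduling-dump-statistics-file={stats_path}",
                "--iree-scheduling-dump-statistics-format=json",
            ],
            check=True,
        )
        with open(stats_path) as f:
            return json.load(f)

    if __name__ == "__main__":
        stats = compile_with_stats(sys.argv[1], "module.vmfb", "stats.json")
        # Key names depend on the statistics schema, so just echo everything.
        print(json.dumps(stats, indent=2))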

Stella Laurenzo

Aug 12, 2024, 2:00:51 PM
to Scott Todd, iree-discuss
I can't seem to respond to Google Groups anymore, so I have no idea if this is going to the list.

I think we should work out how to dump those statistics in a more aggregated form from the current test jobs and then make them more accessible if needed.

Keeping the current stuff running isn't really open for feedback: the infra behind it is going away. But we can fill the holes that leaves with other things to avoid disruption.
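
As one strawman (the file layout and names here are made up for illustration), the aggregation could just be a post-processing step that merges whatever per-module statistics files the test jobs already emit into a single summary artifact:

    # Hypothetical sketch: merge per-module statistics JSON files (assumed to
    # live under compile-stats/, one file per compiled module) into one summary.
    import json
    from pathlib import Path

    def aggregate(stats_dir: str, summary_path: str) -> None:
        summary = {}
        for path in sorted(Path(stats_dir).glob("*.json")):
            with open(path) as f:
                summary[path.stem] = json.load(f)  # keyed by module name
        with open(summary_path, "w") as f:
            json.dump(summary, f, indent=2)

    if __name__ == "__main__":
        aggregate("compile-stats", "compile-stats-summary.json")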

Cullen Rhodes

Aug 14, 2024, 3:49:46 AM
to iree-discuss
Thanks for the RFCs to improve infrastructure, Todd!

I've recently started building some internal infrastructure for benchmarking after a long hiatus on this stuff, so these are quite timely. I just wanted to ask if you've considered using LLVM's LNT for tracking performance? I've taken inspiration from the llvm-test-suite, which uses Lit as the test runner (I noticed you mentioned pytest in your other RFC, which looks interesting and is something I wouldn't have considered) and LNT. LNT supports uploading and comparing perf traces, which I'm going to try for IREE.

Scott Todd

Aug 14, 2024, 12:28:22 PM
to iree-discuss
I haven't personally thought too deeply yet about the specific tools available for replacing what we have today. LNT and https://github.com/llvm/llvm-test-suite/ both look quite interesting, thanks for mentioning them. At one point we also wanted to use https://grafana.com/ (or similar tools) for the UI.

I'm planning on merging https://github.com/iree-org/iree/pull/18144 today to delete the existing infra.
