Background and Details
As you hopefully know, as a project we're suffering from elevated infrastructure costs on GCP. While definitely not the dominant factor, scalability-oriented tests are a non-negligible part of those costs (~17.5% during the Jan-Feb'23 period).
We have already visibly reduced the costs by:
- removing 100-node presubmits
- removing or reducing the frequency of a couple of less important tests
However, there is a push to temporarily reduce our costs even further - until our overall project budget is back under control.
While as SIG Scalability we would really like to avoid this, thinking from the wider project perspective we are considering further reducing the frequency of the 5k-node tests, in the worst case to running both of them only once a week.
And now the main question comes - it's extremely important to us that these tests remain release-informing, i.e. that we won't accidentally release a new version that doesn't pass them.
Would such a (temporary) reduction in how frequently those tests run still allow us to achieve that?
We're obviously committed to debugging and fixing them ASAP if a regression happens. However, the time to find a regression (if it affects only one of those tests) will now extend to up to a week, and debugging time can also grow, due to potentially more changes to go through or harder fixes (e.g. a quick revert may no longer be possible after a week or more). So the question is whether we would still have your support with that?
Looking forward to hearing your answer (and we would be happy to join a meeting to discuss it).
On behalf of SIG Scalability,
wojtek