Frequency of scalability tests

Wojciech Tyczyński

Mar 10, 2023, 9:36:14 AM
to kubernetes-sig-release, kubernetes-sig-scale, Tim Hockin, Benjamin Elder, Justin Santa Barbara, Maciej Borsz, Jacek Kaniuk
 Hi Release Team,

TL;DR: Would temporarily decreasing the frequency of the 5k-node scalability tests to once per week (due to our project-level budget shortage for GCP infrastructure) still allow scalability to be treated the way it has been so far, i.e. as very important from the release POV?

 Background and Details
  As you hopefully know, as a project we're facing higher-than-budgeted costs for our infrastructure on GCP. While definitely not dominant, scalability-oriented tests are a non-negligible part of those costs (~17.5% during the Jan-Feb '23 period).

  We have already visibly reduced the costs by:
- removing 100-node presubmits
- removing or adjusting the frequency of a couple of less important tests
- decreasing the frequency of the 5k-node tests: from running both 5k-node-correctness and 5k-node-performance daily to running each of them every second day (alternating)

  However, we have a push to temporarily reduce our costs further - until our overall project budget is under control.

  While as SIG Scalability we would really like to avoid this, thinking from the wider project perspective we are considering further reducing the 5k-node tests, in the worst case to running each of 5k-node-correctness and 5k-node-performance once per week (so some 5k-node test runs every 3.5 days).
  Now the main question: it's extremely important to us that these tests remain important from the release POV (remain release-informing) and that we don't accidentally release a new version that is not passing them.
 Would a (temporary) situation in which those tests run that infrequently still allow us to achieve this?

 We're obviously committed to debugging and fixing them ASAP if a regression happens, but the time to find a regression (if it affects only one of those tests) will now extend to one week, and debugging time can also grow because there may be more changes to go through or harder fixes (e.g. a quick revert may no longer be possible after a week or more). So the question is whether we would still have your support with that.
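
 The interval arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the two jobs are evenly staggered, the duration of a run itself is ignored, and the function name `detection_windows` is hypothetical (it is not part of any test tooling).

```python
from fractions import Fraction


def detection_windows(period_days, num_jobs=2):
    """Return (gap between consecutive runs of any job,
    worst-case wait until one specific job runs again), in days.

    Assumes num_jobs jobs, each run once per period_days and evenly
    staggered; run duration is ignored (a simplification).
    """
    period = Fraction(period_days)
    gap_between_any_runs = period / num_jobs  # some 5k-node test runs this often
    worst_case_single_job = period            # regression visible in only one job
    return gap_between_any_runs, worst_case_single_job


# Compare the alternating every-second-day schedule with the weekly proposal.
for label, period in [("every second day", 2), ("once per week", 7)]:
    gap, worst = detection_windows(period)
    print(f"{label}: a 5k-node run every {float(gap)} days; "
          f"single-test regression exercised within {float(worst)} days")
```

 Under these assumptions the weekly schedule gives a run of some 5k-node test every 3.5 days, while a regression that affects only one of the two tests can take up to a week to be exercised.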

  Looking forward to hearing your answer (and would be happy to join some meeting to discuss it).

  On behalf of SIG Scalability
wojtek

Davanum Srinivas

Mar 10, 2023, 9:47:41 AM
to Wojciech Tyczyński, kubernetes-sig-release, kubernetes-sig-scale, Tim Hockin, Benjamin Elder, Justin Santa Barbara, Maciej Borsz, Jacek Kaniuk
Wojtek,

To strike a balance until we can get some of the other things to make room in our budget:
- keep the current release guarantees and the job schedule as-is between code freeze and final release (looks like 4 weeks for 1.27); we can reduce the frequency during the rest of the cycle
- even in the rest of the cycle, when we do see a problem, we should be OK to bump up the frequency to help get to the bottom of it

I'd request us to look at the new AWS infra as well. Am hoping that Shyam and others from AWS can help stand up some of the scalability tests there.

thanks,
Dims

--
Davanum Srinivas :: https://twitter.com/dims

Tied 2theather

Mar 12, 2023, 9:33:59 PM
to Wojciech Tyczyński, kubernetes-sig-release, kubernetes-sig-scale, Tim Hockin, Benjamin Elder, Justin Santa Barbara, Maciej Borsz, Jacek Kaniuk
If there is anything I can do for you SIG people, I have some time on my hands. I'm not very familiar with this and not sure what I should do, but I can learn quickly and would like to become a lot more familiar, so please let me know.


Shiming Zhang

Mar 16, 2023, 6:23:46 AM
to kubernetes-sig-release
Maybe we can simulate some nodes with KWOK to replace the real ones: https://kwok.sigs.k8s.io/

Shyam Jeedigunta

Mar 21, 2023, 3:12:12 PM
to kubernetes-sig-release
I brought this up here at EKS last week and we have buy-in from the leadership on making a longer-term commitment towards scale-testing Kubernetes. The goal is to help load-balance the costs ($$$) and expertise (developer time) of running and managing these tests and resolving the issues we catch with them.
I hope this comes as good news both for the project's sustainability and for GCP, which has been the torchbearer for scale-testing so far. It should also help put some of those AWS credits to good use.

Over the next few months we'll figure out a plan and ramp up the test setup and related infra. Happy to discuss this more in the upcoming SIG-scale/SIG-release meeting.

Thanks,
Shyam

Benjamin Elder

Mar 21, 2023, 3:22:57 PM
to Shyam Jeedigunta, kubernetes-sig-release
Thank you Shyam and EKS/AWS.

I think SIG Testing and Infra will also be very interested to help figure out the details, versus previously working with Scale to cut costs for this year as an unfortunate stopgap.

Shiming: Thank you for the suggestion. SIG Scale actually has had tools to simulate fake nodes for a long time now, but they're not suitable for all tests and to run a very large number of fake nodes we actually need a moderately large real cluster to host them.

Benjamin Elder

Mar 21, 2023, 3:24:39 PM
to Shyam Jeedigunta, kubernetes-sig-testing, kubernetes-sig-k8s-infra, kubernetes-sig-release

Wojciech Tyczyński

Mar 22, 2023, 8:05:47 AM
to Shyam Jeedigunta, kubernetes-sig-release
 Thanks Shyam!
 I'm happy to also meet ad hoc, in addition to our regular meetings, to discuss this if that helps. Let's make this happen!

Shyam Jeedigunta

Mar 24, 2023, 6:16:03 PM
to kubernetes-sig-release
Thanks Ben and Wojtek. Will need guidance from sig-testing and k8s-infra for sure.
I've cut this issue to start tracking the work - https://github.com/kubernetes/test-infra/issues/29139.