Performance tests and handling flakiness


Mani Sarkar

Feb 16, 2020, 9:52:09 AM
to mechanica...@googlegroups.com
Hi all

I have been recently writing performance tests and each time I reach a milestone I come across slightly new challenges.

At first it was capturing the baselines and then pinning the tests to the new performance numbers.

But then the question arises: how do we check whether our tests are telling us the right thing when the underlying system, the implementation, or both have an element of flakiness?

Do you run them a few times and take the average, or do you run them a few times and treat the test as passing if it passes a set minimum number of times, and as failed otherwise?

I'm sure many of you have come across this situation: you have optimised a system and want to regression-proof it, ensuring it tells you when the underlying implementation has genuinely regressed due to some change.

It's not cool if the performance tests randomly fail on CI/CD or on a local machine.

I just want to know how everyone else does it, and what you think of the above.

Regards 
Mani
--
@theNeomatrix369  |  Blogs: https://medium.com/@neomatrix369  | @adoptopenjdk @graalvm @graal @truffleruby  |  Github: https://github.com/neomatrix369  |  Slideshare: https://slideshare.net/neomatrix369 | LinkedIn: https://uk.linkedin.com/in/mani-sarkar

Don't chase success, rather aim for "Excellence", and success will come chasing after you!

Mani Sarkar

Feb 25, 2020, 7:13:33 AM
to mechanica...@googlegroups.com
I'll add some more context and intent to my original query.

Let's say existing functionality in some application is slow:

- we profile and detect the functions that are slow (takes ~30 minutes to finish)
- we improve them and see the speed benefits (brought it down to ~3 minutes)
- we benchmark the old timings measured for each of these functionalities (say they are individual methods)
- we then use those numbers to compare the performance of the improved functionalities
- so a test assertion would look like assertThat(currentSpeed, isLessOrEqualTo(expectedSpeed))
- expectedSpeed is pinned to fixed values, approximately 10% of the original slow timing

Although, as we know, these currentSpeed values can be spiky/flaky at times (with some small variation).
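The pinned-threshold check described above could be sketched roughly as follows. All names here (PinnedPerfCheck, measureMillis, withinBudget, the budget value) are hypothetical illustrations, not from the original post or any real codebase:

```java
/**
 * Minimal sketch of a pinned-threshold performance check.
 * The budget is pinned to a fixed value, in the spirit of the post's
 * "expectedSpeed is approximately 10% of the original slow timing".
 */
public class PinnedPerfCheck {

    // Hypothetical budget in milliseconds (a scaled-down stand-in for
    // the post's "30 minutes brought down to ~3 minutes" example).
    static final long EXPECTED_SPEED_MILLIS = 200;

    /** Time a single run of the workload in milliseconds. */
    static long measureMillis(Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** The assertion from the post: currentSpeed <= expectedSpeed. */
    static boolean withinBudget(long currentMillis, long expectedMillis) {
        return currentMillis <= expectedMillis;
    }
}
```

A single-sample check like this is exactly what becomes flaky: one spiky run pushes currentMillis over the budget and the test fails with no real regression.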

One way I have reduced the number of failing tests is to take values from multiple runs, average them, and then compare the average with expectedSpeed - this has given much better results. (I was also advised to use the standard deviation if necessary - I haven't applied that yet.)
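The multiple-runs-then-average approach could be sketched like this; the class and method names are hypothetical, and the standard-deviation helper is included only because the post mentions it as a possible refinement:

```java
import java.util.Arrays;

/**
 * Sketch: compare the mean of several run timings against a pinned budget,
 * with a standard-deviation helper available for a wider tolerance band.
 */
public class AveragedPerfCheck {

    /** Mean of the sampled timings in milliseconds. */
    static double mean(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElse(Double.NaN);
    }

    /** Population standard deviation of the sampled timings. */
    static double stdDev(long[] samplesMillis) {
        double m = mean(samplesMillis);
        double variance = Arrays.stream(samplesMillis)
                .mapToDouble(s -> (s - m) * (s - m))
                .average().orElse(0.0);
        return Math.sqrt(variance);
    }

    /** Pass if the mean of N runs is within the pinned budget. */
    static boolean meanWithinBudget(long[] samplesMillis, double expectedMillis) {
        return mean(samplesMillis) <= expectedMillis;
    }
}
```

With this shape, a single spiky sample is dampened by the other runs, which matches the "much better results" observed.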

Does this sound like a regularly used method, or are there better ways to do this? I hope the context is clearer now.
--

Yannick Schpilka

Feb 26, 2020, 7:10:10 AM
to mechanical-sympathy
Hey,

I am mostly a passive follower of this group, but I think what you need is indeed multiple runs, then the average, standard deviation, and percentiles.
There isn't one value that gives all the answers.

But a p99.9 over a bunch of runs that fits within your target would likely indicate that you are good.
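A percentile check over a batch of run timings could be sketched as below, using the simple nearest-rank method; the class name and the choice of nearest-rank (rather than interpolation) are my assumptions, not from the thread:

```java
import java.util.Arrays;

/**
 * Sketch: nearest-rank percentile over a batch of run timings, so a test
 * can assert e.g. "p99.9 of all runs is within the target".
 */
public class PercentilePerfCheck {

    /** Nearest-rank percentile; p must be in (0, 100]. */
    static long percentile(long[] samplesMillis, double p) {
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Asserting on a high percentile (p99, p99.9) instead of the mean catches the tail behaviour that averaging hides, at the cost of needing enough runs for the tail to be meaningful.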