Performance tests and handling flakiness


Mani Sarkar

Feb 18, 2020, 12:45:27 PM
to behaviordriv...@googlegroups.com
Hi all,

(cross-posting from another list; the topic overlaps with the discussions held here, hence posting it here too)

I have recently been writing performance tests, and each time I reach a milestone I come across slightly new challenges.

At first it was capturing the baselines and then pinning the tests to the new performance numbers.

But then the question arises: how do we check whether our tests are telling us the right thing when the underlying system, or the implementation, or both have an element of flakiness?

Do you run them a few times and take the average, or do you run them a few times and, if they pass a set number of times, treat the test as passing, otherwise as failed?

I'm sure many of you have been in this situation: you have optimised a system and want to regression-proof it, and you want the tests to tell you when the underlying implementation has genuinely regressed due to some change.

It's not cool if the performance tests randomly fail on CI/CD or on a local machine.

I just want to know how everyone else does it, and what you think of the above.

Regards 
Mani
--
--
@theNeomatrix369  |  Blogs: https://medium.com/@neomatrix369  | @adoptopenjdk @graalvm @graal @truffleruby  |  Github: https://github.com/neomatrix369  |  Slideshare: https://slideshare.net/neomatrix369 | LinkedIn: https://uk.linkedin.com/in/mani-sarkar

Don't chase success, rather aim for "Excellence", and success will come chasing after you!

Mani Sarkar

Feb 25, 2020, 7:14:08 AM
to behaviordriv...@googlegroups.com
I'll add some more context and intent to my original query.

Let's say existing functionality in some application is slow:

- we profile and detect the functions that are slow (the functionality takes ~30 minutes to finish)
- we improve them and see speed benefits (we brought it down to ~3 minutes)
- we benchmark the old timings measured for each of these functionalities (say they are individual methods)
- we then use those numbers to compare against the performance of the improved functionalities
- so a test assertion would look like assertThat(currentSpeed, isLessOrEqualTo(expectedSpeed))
- expectedSpeed is pinned to fixed values, approximately 10% of the original slow timing

Although, as we know, these currentSpeed values can be spiky/flaky at times (by some small variation).

One way I have tried to bring down the number of failing tests is to take values from multiple runs, average them, and then compare the average with expectedSpeed - this has given much better results. (I was also advised to use the standard deviation if necessary - I haven't applied that yet.)
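
To make that concrete, here is a minimal sketch of the shape of such a test in Java, assuming JUnit 4 and Hamcrest; measureOnceMillis, EXPECTED_MILLIS, the number of runs and the one-standard-deviation headroom are placeholders for illustration rather than what we actually run:

    import static org.hamcrest.MatcherAssert.assertThat;
    import static org.hamcrest.Matchers.lessThanOrEqualTo;

    import java.util.stream.LongStream;

    import org.junit.Test;

    public class ImprovedFunctionalityPerfTest {

        private static final long EXPECTED_MILLIS = 3 * 60 * 1000; // pinned budget (placeholder)
        private static final int RUNS = 5;                         // number of samples to average

        @Test
        public void averageOfSeveralRunsStaysWithinBudget() {
            long[] samples = LongStream.range(0, RUNS)
                    .map(i -> measureOnceMillis())
                    .toArray();

            double mean = LongStream.of(samples).average().orElse(Double.MAX_VALUE);
            double stdDev = Math.sqrt(LongStream.of(samples)
                    .mapToDouble(s -> (s - mean) * (s - mean))
                    .average().orElse(0));

            // give the assertion one standard deviation of headroom to absorb small spikes
            assertThat(mean, lessThanOrEqualTo(EXPECTED_MILLIS + stdDev));
        }

        private long measureOnceMillis() {
            long start = System.nanoTime();
            // ... invoke the improved functionality here ...
            return (System.nanoTime() - start) / 1_000_000;
        }
    }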

Does this sound like a regularly used method, or are there better ways to do this? I hope the context is clearer now.

Regards,
Mani


George Dinwiddie

Feb 25, 2020, 10:28:16 AM
to behaviordriv...@googlegroups.com
Mani,


I'm not a fan of testing performance to thresholds. Not only do you get
the "flakiness" you're seeing from small variations (perhaps due to
other loads on the machine or network), but the change that puts you
over the threshold is probably not the biggest problem. It's just "the
last straw" that brought notice to the problem.

I recommend graphing performance figures over time. You could, if you
want, include a horizontal line indicating your target performance
measurement. More importantly, pay attention to the shape of the
performance curve. Where is there a sudden change in slope? What's
behind that?

If you go this route, you'll want to look at the graphs frequently, and
have conversations about what you see in them.
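
If it helps, here is a rough sketch of one low-tech way to capture those figures: have each perf run append a data point to a CSV that you can then graph. The class name, file location and column layout are just placeholders, not a recommendation of a particular tool:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.time.Instant;

    // Appends one "timestamp,scenario,millis" row per run to a CSV that any
    // graphing tool (spreadsheet, gnuplot, Grafana, ...) can plot over time.
    public class PerfTrendRecorder {

        private static final Path TREND_FILE = Path.of("perf-trend.csv"); // placeholder location

        public static void record(String scenario, long millis) throws IOException {
            String row = Instant.now() + "," + scenario + "," + millis + System.lineSeparator();
            Files.writeString(TREND_FILE, row,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        public static void main(String[] args) throws IOException {
            long start = System.nanoTime();
            // ... run the functionality being measured here ...
            record("improved-functionality", (System.nanoTime() - start) / 1_000_000);
        }
    }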

- George

--
----------------------------------------------------------------------
* George Dinwiddie * http://blog.gdinwiddie.com
Software Development http://www.idiacomputing.com
Consultant and Coach
----------------------------------------------------------------------

Andrew Premdas

Feb 25, 2020, 11:04:14 AM
to behaviordriv...@googlegroups.com
Adding a little to George's excellent answer:

Gary Bernhardt did some really interesting stuff measuring performance across a git repo's history in this screencast: https://www.destroyallsoftware.com/screencasts/catalog/history-spelunking-with-unix

This cast shows running a test against every commit on the master branch of a repository to get performance history and see when performance degrades/improves. There is a lot to learn in all his screencasts.
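
The same idea can be sketched very roughly in Java as below. The git commands are real, but run-perf-test.sh, the output format and the class name are placeholders, and the screencast itself does this far more elegantly in shell:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.List;
    import java.util.stream.Collectors;

    // Walks the commits on master (oldest first), checks each one out and times a
    // test command, printing one "commit,millis" line per commit.
    public class PerfHistory {

        public static void main(String[] args) throws IOException, InterruptedException {
            for (String commit : run("git", "rev-list", "--reverse", "master")) {
                run("git", "checkout", "--quiet", commit);
                long start = System.nanoTime();
                run("./run-perf-test.sh"); // placeholder for your own test command
                System.out.println(commit + "," + (System.nanoTime() - start) / 1_000_000);
            }
            // remember to check your original branch out again afterwards
        }

        // Runs a command in the current directory and returns its output as lines.
        private static List<String> run(String... command) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            List<String> output = new BufferedReader(new InputStreamReader(p.getInputStream()))
                    .lines().collect(Collectors.toList());
            p.waitFor();
            return output;
        }
    }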


--
You received this message because you are subscribed to the Google Groups "Behaviour Driven Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to behaviordrivendeve...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/behaviordrivendevelopment/db9073b6-5309-b042-8143-b463304dd424%40iDIAcomputing.com.


--
------------------------
Andrew Premdas

Mani Sarkar

Feb 25, 2020, 5:27:48 PM
to behaviordriv...@googlegroups.com
Thanks to both Andrew and George for your answers.

Feedback like this is super helpful, and I think I can make use of it right away. I will take a look at the video as well. Thanks for that.

Although my question would still be: how can we write automated tests that help alert us during CI/CD, or even on developers' local machine builds? Basically, letting the concerned parties know within a good scope of time that their work has potentially degraded the performance of the system (if that is the case).

Regarding what you mention about analysing the graph: would that not be similar to some statistical analysis of several runs? When the slope tilts towards a range of numbers, would that mean the performance is improving or degrading? Is this a reasonable way to check as well?


--
--
@theNeomatrix369  |  Blogs: https://medium.com/@neomatrix369  | @adoptopenjdk @graalvm @graal @truffleruby  |  Github: https://github.com/neomatrix369  |  Slideshare: https://slideshare.net/neomatrix369 | LinkedIn: https://uk.linkedin.com/in/mani-sarkar

Wlodek Krakowski [refactoring.pl]

Mar 2, 2020, 4:01:55 AM
to Behaviour Driven Development
Hi Mani,

I think there are 2 concerns here. 

Using "performance check" as part of build/local-build can be performed easily only once you find/watch for a possible bottleneck in our performance at a very small/local scale. It is tricky and the team might stop taking care (!) of it if the threshold is exceeded randomly from time to time.

My teams were using the Gatling tool as the basis of a separate "performance check project" that ran performance tests of the whole system once or twice a day. In such a case you can look at the performance slope going down/up as part of a weekly/biweekly retrospective and draw conclusions regularly.
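
For anyone who has not used Gatling, a minimal simulation can look roughly like the sketch below. This uses the newer Java DSL; the base URL, scenario and user numbers are placeholders, and a real performance check project would model proper user journeys and assertions:

    import static io.gatling.javaapi.core.CoreDsl.*;
    import static io.gatling.javaapi.http.HttpDsl.*;

    import io.gatling.javaapi.core.*;
    import io.gatling.javaapi.http.*;

    public class WholeSystemCheck extends Simulation {

        // base URL of the system under test -- placeholder
        HttpProtocolBuilder httpProtocol = http.baseUrl("https://system-under-test.example.com");

        // one representative user journey; a real check would cover more endpoints
        ScenarioBuilder scn = scenario("Daily whole-system check")
                .exec(http("home page").get("/"));

        {
            setUp(scn.injectOpen(atOnceUsers(20))).protocols(httpProtocol);
        }
    }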

All the best,
Włodek

Włodek Krakowski - IT Technical Trainer

Daniel Terhorst-North

Mar 9, 2020, 2:41:35 PM
to behaviordriv...@googlegroups.com

Hi Mani,

 

Although my question would still be: how can we write automated tests that help alert us during CI/CD, or even on developers' local machine builds? Basically, letting the concerned parties know within a good scope of time that their work has potentially degraded the performance of the system (if that is the case).

 

There are two different statements here. “Letting the concerned parties know within a good scope of time” is the goal (technically: noticing soon enough that rectifying the situation doesn’t noticeably impact your flow of value). “Alerting us during CI/CD or even developer local machine builds” is a local optimisation and may even be chasing ghosts, i.e. it isn’t reasonably solvable.

 

Having a separate perf stage that runs out-of-band of your usual build, either daily, hourly, or whatever works, is a great way to trend your performance over time. You will likely start with something simplistic for round-trip times, and get more sophisticated as the app and your knowledge increase.

 

Regarding what you mention about analysing the graph: would that not be similar to some statistical analysis of several runs? When the slope tilts towards a range of numbers, would that mean the performance is improving or degrading? Is this a reasonable way to check as well?

 

You can start just by checking by eye. Also if someone notices a sudden degradation in performance, you can check the graphs to see if that correlates. On one project someone (an internal user) noticed the app “felt more sluggish”. The team looked at the various performance graphs and saw a massive drop three months before that no one had noticed. Because they had the graph they were able to pinpoint the date the change probably happened, and then they checked the commits that day and found the unintended behaviour. It took a matter of hours to identify, isolate, fix, test and redeploy.

 

You can get fancy with automated analysis of the graphs, but the simplest approach is just to look at them every week or so.
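
If you do later want a little automation on top of the graphs, even something as naive as comparing the mean of the last few data points with the mean of the few before them will flag a sudden jump. A rough sketch, assuming a simple timestamp,scenario,millis CSV like the one sketched earlier in the thread; the window size and the 20% cut-off are arbitrary:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    // Very naive "sudden change" detector: compares the mean of the last few data
    // points in the trend file with the mean of the few before them.
    public class TrendCheck {

        public static void main(String[] args) throws IOException {
            List<Long> millis = Files.readAllLines(Path.of("perf-trend.csv")).stream()
                    .map(line -> Long.parseLong(line.split(",")[2])) // third column: duration
                    .collect(Collectors.toList());

            int window = 5;                          // arbitrary window size
            if (millis.size() < 2 * window) {
                return;                              // not enough history yet
            }

            double recent = mean(millis.subList(millis.size() - window, millis.size()));
            double before = mean(millis.subList(millis.size() - 2 * window, millis.size() - window));

            if (recent > before * 1.2) {             // 20% slower: arbitrary cut-off
                System.out.printf("Recent runs are %.2fx the previous trend%n", recent / before);
            }
        }

        private static double mean(List<Long> values) {
            return values.stream().mapToLong(Long::longValue).average().orElse(0);
        }
    }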

 

Kind regards,

Daniel
