Running performance tests regularly


drewgs...@gmail.com

Jan 2, 2021, 2:04:54 PM
to jackson-dev
Does anyone have experience running performance tests regularly? 
We have some benchmarks for Jackson, but they're only run manually right now and I'd like a way to track performance over time in more detail.

I've got a lightly loaded basement server that could host a VM for running the tests nightly, but I don't really know where to start beyond perhaps setting up Jenkins.

-Drew

Marshall Pierce

Jan 2, 2021, 4:26:20 PM
to jacks...@googlegroups.com
I don’t, but these articles by Mark Price, about his experience with performance testing at LMAX, come to mind:

https://epickrram.blogspot.com/2014/05/performance-testing-at-lmax-part-one.html
https://epickrram.blogspot.com/2014/07/performance-testing-at-lmax-part-two.html
https://epickrram.blogspot.com/2014/08/performance-testing-at-lmax-part-three.html

Could be something applicable in there at least conceptually, though the tools may have changed since 2014.

I’d start poking at what level of repeatability you can get from a VM. Before you wrestle with Jenkins, it’d be good to know if you can in fact tune a guest and host kernel in that setup to have at least one or two cores that won’t get pre-empted by the kernel, or throttled, or any of the other pernicious things that plague benchmark repeatability. Once you’ve got a pretty repeatable benchmark sans latency spikes via taskset or isolcpus or whatever, then perhaps getting Jenkins to run a workload with that setup might follow easily enough? Mark’s blog has more on that sort of kernel fiddling.
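
As a first sanity check (just a sketch on my part, assuming a Linux guest; the class name is made up, nothing Jackson-specific), you could have the benchmark JVM print its own affinity mask to confirm that taskset/isolcpus actually took effect:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Prints which CPUs this JVM is allowed to run on; on Linux the kernel
// exposes the affinity mask set by taskset/cgroups in /proc/self/status.
public class AffinityCheck {
    public static void main(String[] args) throws IOException {
        Files.readAllLines(Paths.get("/proc/self/status")).stream()
             .filter(line -> line.startsWith("Cpus_allowed_list:"))
             .forEach(System.out::println);   // e.g. "Cpus_allowed_list:  1" if pinned to core 1
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
    }
}

If that still prints the full core list after pinning, the isolation isn’t doing what you think it is.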

Also, while I’m here, I just saw https://www.morling.dev/blog/towards-continuous-performance-regression-testing/ pop up in my feed the other day. Could be interesting for future test writing.

drewgs...@gmail.com

Jan 3, 2021, 6:17:41 AM
to jackson-dev
On Saturday, January 2, 2021 at 4:26:20 PM UTC-5 marshall wrote:
I don’t, but these articles by Mark Price, about his experience with performance testing at LMAX, come to mind:

Ooh, yes, lots of great information in Mark's series, which is still relevant: https://epickrram.blogspot.com/2015/09/reducing-system-jitter.html

My basement server only has two cores and I'd rather not lose one for performance testing (isolating performance tests on their own CPU core is one of the most impactful changes described in the above article).  Perhaps a Raspberry Pi 4 would be a good solution to this—it's cheap enough that having it dedicated to this task isn't a significant wallet hit and its low power consumption is good in that respect as well.

I'll PoC with my existing machine first, but does that sound like a reasonable next step for more accurate results?

-Drew

Marshall Pierce

Jan 3, 2021, 1:10:55 PM
to jacks...@googlegroups.com


> On Jan 3, 2021, at 4:17 AM, drewgs...@gmail.com <drewgs...@gmail.com> wrote:
> ...
> My basement server only has two cores and I'd rather not lose one for performance testing (isolating performance tests on their own CPU core is one of the most impactful changes described in the above article). Perhaps a Raspberry Pi 4 would be a good solution to this—it's cheap enough that having it dedicated to this task isn't a significant wallet hit and its low power consumption is good in that respect as well.
>
> I'll PoC with my existing machine first, but does that sound like a reasonable next step for more accurate results?

SGTM. Arm64 will produce _different_ results than x64, but the point for performance regressions is simply to know if things change relative to yesterday’s test, so I think a Pi 4 is reasonable as long as it’s in a case with a hefty heat sink so it doesn’t downclock when it gets hot.

drewgs...@gmail.com

Jan 3, 2021, 3:30:18 PM
to jackson-dev
On Sunday, January 3, 2021 at 1:10:55 PM UTC-5 marshall wrote:
SGTM. Arm64 will produce _different_ results than x64, but the point for performance regressions is simply to know if things change relative to yesterday’s test, so I think a Pi 4 is reasonable as long as it’s in a case with a hefty heat sink so it doesn’t downclock when it gets hot.

Indeed, RPi4s really need cooling to maintain their highest clock speed. It would probably be good to check whether any throttling occurred during the test run.
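
Something like this could run right after the benchmark finishes (just a sketch, assuming Raspberry Pi OS with vcgencmd available; the class name is made up):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Asks the Pi firmware whether throttling has occurred since boot.
// `vcgencmd get_throttled` prints e.g. "throttled=0x50000"; bit 18 (0x40000)
// means "throttling has occurred since boot", so results should be discarded.
public class ThrottleCheck {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("vcgencmd", "get_throttled").start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String out = r.readLine();                   // e.g. throttled=0x50000
            long flags = Long.decode(out.split("=")[1]);
            if ((flags & 0x40000) != 0) {
                System.err.println("Throttling occurred; benchmark results are suspect: " + out);
                System.exit(1);
            }
        }
        p.waitFor();
    }
}

Since those bits are latched since boot, it would need a reboot (or a before/after comparison) between nightly runs, but it's a start.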

-Drew

Tatu Saloranta

Jan 3, 2021, 6:44:36 PM
to jacks...@googlegroups.com
This is something I have quite often thought would be really cool, but I never figured out exactly how to go about it. Would love to see something in this space.

Getting the tests to run is probably not super difficult (any CI system could trigger it), and it could also be limited to specific branches/versions for practical purposes.
There would no doubt be some challenges in this part too; the possible number of tests is actually huge (even for a single version): across formats, possible test cases, read/write, afterburner/none, string/byte source/target.
And having dedicated CPU resources would be a must for stable results.

To me, the big challenges seemed to be result processing and visualization: how to group test runs, and so on.
Jenkins plug-ins tend to be pretty bad (just IMO) at displaying meaningful breakdowns and trends; it is easy to create something to impress a project manager, but less so to produce something that shows the actual important trends.
But even without trends, it'd be essential to be able to compare more than one result set to see diffs between specific versions.

And of course, it would also be great not to require local resources but to use cloud platforms, if (and only if) they could provide fully static CPU resources (the tests fortunately do not use much I/O or network, or even much memory).

-+ Tatu +-





Drew Stephens

Jan 3, 2021, 7:10:07 PM
to Tatu Saloranta, jacks...@googlegroups.com
Agreed that visualization is the hard part and that the existing Jenkins options aren’t great.

I’ll start by getting the benchmarks project set up to run automatically, with a system (probably Jenkins) to do that running (probably just nightly on master & 2.12…I still haven’t run the whole thing to see how long it takes).

If I get all that sorted, we can have some ongoing results to figure out how to make some graphs from. A gnuplot graph of total runtime over time should be easy enough to generate, and we could make drill-downs for each test suite or some other simple dimensions that would be useful. Thereafter we can figure out how to present the many other dimensions, because you’re definitely right that we’ll want those to be able to really figure out where things have changed.
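
For the data side, something like this is what I have in mind (a rough sketch, assuming the benchmarks are run with JMH's JSON output, i.e. -rf json; the file names are placeholders). Fittingly, it uses Jackson to read the results:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.FileWriter;
import java.time.LocalDate;

// Flattens one night's JMH JSON output into "date,benchmark,score,unit" rows,
// appending to a long-running CSV that gnuplot can chart over time.
public class ResultToCsv {
    public static void main(String[] args) throws Exception {
        JsonNode results = new ObjectMapper().readTree(new File("jmh-result.json"));
        try (FileWriter csv = new FileWriter("trend.csv", true)) {   // append across nights
            for (JsonNode run : results) {                           // top level is an array
                csv.write(String.format("%s,%s,%.3f,%s%n",
                        LocalDate.now(),
                        run.path("benchmark").asText(),
                        run.path("primaryMetric").path("score").asDouble(),
                        run.path("primaryMetric").path("scoreUnit").asText()));
            }
        }
    }
}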

-Drew

Tatu Saloranta

Jan 3, 2021, 8:10:40 PM
to Drew Stephens, jacks...@googlegroups.com
On Sun, Jan 3, 2021 at 4:10 PM Drew Stephens <dr...@dinomite.net> wrote:
Agreed that visualization is the hard part and that the existing Jenkins options aren’t great.

I’ll start by getting the benchmarks project set up to run automatically, with a system (probably Jenkins) to do that running (probably just nightly on master & 2.12…I still haven’t run the whole thing to see how long it takes).

Yes, I hope some of the settings in "results-pojo-2.12-home.txt" (and others) help. Warmup times of ~5 seconds and a total runtime per test of something like 30-60 seconds (I think most are 5-second runs with 10 repeats, i.e. 50 seconds) seem to produce stable enough results.
I'm sure there is also some trade-off between low variability (longer runs) and frequent full test suite runs (shorter ones).
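
Roughly speaking (this is just a sketch of the shape, not the exact annotations in the repo; the benchmark class here is made up), that corresponds to JMH settings along these lines:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

// Approximately the run lengths described above: ~5s of warmup, then
// 10 measurement iterations of 5s each, i.e. ~50s measured per test.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 1, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class ExampleBench {
    @Benchmark
    public Object readPojo() {
        // actual deserialization call would go here
        return null;
    }
}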

The test suite actually has quite a few tests that are not included in the set I use. So if the "main test" (the one with "MediaItem") can be automated easily enough, it'd be possible to consider adding more alternate tests.

General test variations currently included are:

* Read / write (deser/ser)
* Different models (POJO, TreeNode ["Node"], Object ["Untyped"])
* Different formats (for some formats, only POJO)
* Afterburner / regular ("vanilla")

and then some limited variations just for JSON:

* "Wasteful" read/write: discard ObjectMapper after every iteration
* Alternate input sources -- DataInput, String (regular test always runs from byte[])

Some aspects that would be good to cover but aren't yet:

* Non-Java variants: Kotlin and Scala module (with / without Afterburner)
    - note: it is possible to run individual tests in profiler; I do this quite frequently myself -- could help find optimization targets
* Blackbird (replacement for Afterburner)
* (for JSON) with/without indentation?
* Tests for various annotations: basic tests use minimal annotations, and none (f.ex) use constructors for deserialization
 
If I get all that sorted, we can have some ongoing results to figure out how to make some graphs from. A gnuplot graph of total runtime over time should be easy enough to generate, and we could make drill-downs for each test suite or some other simple dimensions that would be useful. Thereafter we can figure out how to present the many other dimensions, because you’re definitely right that we’ll want those to be able to really figure out where things have changed.

That makes sense.

Another thing, related to trends: not sure if it is practical, but since the performance of released versions should not change a lot after release (except maybe on a different JDK), it might make sense to have separate runs for snapshots/branches and for releases:

1. For snapshots, frequent but shorter runs, to give a general idea; but also trends over time, to possibly spot performance changes
2. For released versions, longer runs trying to get stable "official" numbers after release? (in theory, also: multiple full runs, try to merge? Or pick the fastest run per test type)

These are just general ideas that may or may not make sense. But ones I've had over time.

Also: please let me know if and how I can help! This is one area that I am very excited about, and where automation could help a lot.

-+ Tatu +-

Carter Kozak

Jan 4, 2021, 6:37:40 PM
to jacks...@googlegroups.com
This recent blog post from Gunnar Morling describes an alternative approach better suited to pre-merge validation and CI systems; however, it’s another abstraction which may miss drastic performance changes. The upside is reproducibility where perf validation is otherwise flaky.

https://www.morling.dev/blog/towards-continuous-performance-regression-testing

I’ve not used the framework yet myself, but it may be useful in this case.
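
The underlying idea (asserting on stable proxy metrics such as allocation counts from a JFR recording, rather than on wall-clock time) can also be sketched with plain JDK classes. This is just an illustration of the concept, not JfrUnit's actual API, and the recording file name is made up:

import java.nio.file.Paths;

import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Sums allocation events from a JFR recording captured during a test run;
// the total can be compared against an expected baseline instead of timings.
public class AllocationCount {
    public static void main(String[] args) throws Exception {
        long totalBytes = 0;
        for (RecordedEvent event : RecordingFile.readAllEvents(Paths.get("test-run.jfr"))) {
            String name = event.getEventType().getName();
            if (name.equals("jdk.ObjectAllocationInNewTLAB")) {
                totalBytes += event.getLong("tlabSize");
            } else if (name.equals("jdk.ObjectAllocationOutsideTLAB")) {
                totalBytes += event.getLong("allocationSize");
            }
        }
        System.out.println("Allocated ~" + totalBytes + " bytes during the recorded run");
    }
}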

Best,
Carter Kozak


