Who can donate a (few) machines for benchmarks?

ymo

Mar 16, 2016, 12:00:38 PM
to mechanical-sympathy
Based on the recent Conversant benchmarks "discussion", I was wondering if anyone could donate hardware to run the Aeron benchmarks nightly. The benchmarks would be open for all to modify and compare.

This is quite a nice example of what one could achieve: https://www.techempower.com/benchmarks

If you have contacts at some of the hardware vendors like Intel/AMD/Amazon, or any of the bare-metal providers, it could be very interesting to get them to donate! Anyone???

Jonathan Yu

Mar 16, 2016, 9:48:35 PM
to mechanica...@googlegroups.com
It might also be a good idea to look at EC2 (TechEmpower runs some of their benchmarks there) and SoftLayer (where you can run bare-metal servers on an hourly basis). This would help ensure that the hardware is kept reasonably current. The problem with hardware donations is that they're unlikely to be recurring, so eventually your hardware becomes so outdated that the benchmark numbers are no longer useful.

If you do have to build your own hardware for some reason, look at the vendors producing Facebook's OpenCompute hardware.

--
Jonathan Yu (@jawnsy on LinkedIn, Twitter, GitHub, Facebook)
“Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.” — Samuel Beckett, Worstward Ho (1983) 

“In an adaptive environment, winning comes from adapting to change by continuously experimenting and identifying new options more quickly and economically than others. The classical strategist's mantra of sustainable competitive advantage becomes one of serial temporary advantage.” — Navigating the Dozens of Different Strategy Options (HBR)

Theodore Omtzigt

Apr 7, 2016, 8:09:12 AM
to mechanical-sympathy
We have some serious (Intel) gear that might be of interest to you, and I would love to help: our speciality is building hardware accelerators for critical applications, and this could be an interesting hunting ground for us.

ymo

Apr 8, 2016, 7:24:30 AM
to mechanical-sympathy
Just great! I wonder how you want to do this?

You can create the GitHub repository yourself if you want more control over the code running on your machines. FTR, I am not one of the maintainers of Aeron, but you could also ask people to submit benchmarks into the Aeron benchmarks and then run them nightly on your machine(s) once the pull requests are accepted. What would be your preference?

Regards.

Theodore Omtzigt

Apr 8, 2016, 10:39:07 AM
to mechanical-sympathy
Can you point me to the 'stack', so that I can familiarize myself with the workflow? For example, I need a quick education on the relationship between Aeron and the TechEmpower link you provided. It sounds like you have already thought about what would make a good setup, organization, and comparison. As messaging frameworks are core architectural elements in an enterprise application, their quality and feature set are typically more important than raw performance; case in point, many of the Java ESBs out there. That feature set would need to be reflected in any benchmarking effort somehow. Just thinking about how to (perf) test any retry behavior should give anyone pause. This will have to be a community effort, so a public repo is a given. Please give me a brain dump of what you have been thinking about so far.

ymo

Apr 11, 2016, 11:52:25 AM
to mechanical-sympathy
My current interest is in the normal best-case scenario. I understand that some people have "issues" with synthetic benchmarks. But for me, if I can prove that something behaves in a certain fashion when I remove all the white-noise factors that are prevalent in normal applications, then at least I have an upper bound on what I can expect the platform to be capable of. So in all the benchmarks, the ultimate motive is this simple question: "What is the upper bound I can expect, given a known platform and compute nodes?"

Here are a few things I think are worth investigating:

1) The application platform configuration, meaning how to segregate all the CPUs in the test via isolcpus or some other init-based means. The idea here is to get the kernel out of the way and deal only with pure application benchmarks, without contention from other applications running within the same box.

2) Compute-heavy workflow. The idea here is to come up with a queue benchmark implementation that allows compute-heavy workflows to be stitched together. We are not testing the actual calculations/computations; what we want to test is how throughput/latency is affected when you switch between different queue implementations for the exact same workflow/calculations (see the sketch after this list). The easiest example to start with is a sum of squares or a sum of integers. So the nodes of your workflow form a DAG, where each node computes a certain thing based on an input and produces a particular output that can be consumed by another compute node. The computations map to a collection of cores in this example.

3) I/O-heavy workflow. Obviously, once you get to a certain level of complexity, the biggest bottleneck becomes the I/O if you have a bunch of distributed compute nodes (as opposed to cores, as in 2). So the idea, again, is to consider a collection of nodes as a cluster and perform the same computations, but this time switching between different transport implementations. The ultimate goal is to find the best combination of queues (for communication between cores) and transports (for communication between nodes) given certain workloads.
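
To make point 2 concrete, here is a minimal sketch of what a single stage could look like, assuming a plain java.util.concurrent queue as a stand-in; the class name, message count, and queue choice are placeholders rather than part of any existing harness, and in a real run the producer/consumer threads would be pinned to the isolated cores from point 1 (e.g. isolcpus plus taskset) outside the JVM:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of point 2: the same sum-of-squares "workload" pushed through a
// swappable queue. A real harness would hide JCTools/Agrona/juc queues behind this seam.
public final class SumOfSquaresStageSketch {
    static final int MESSAGES = 1_000_000;

    public static void main(String[] args) throws InterruptedException {
        // The queue is the pluggable variable; ArrayBlockingQueue is only a baseline placeholder.
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(64 * 1024);

        Thread producer = new Thread(() -> {
            try {
                for (long i = 0; i < MESSAGES; i++) {
                    queue.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            long sum = 0;
            try {
                for (int i = 0; i < MESSAGES; i++) {
                    long v = queue.take();
                    sum += v * v;   // the fixed per-message computation
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("sum of squares = " + sum);
        });

        long start = System.nanoTime();
        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("throughput: %.0f msgs/sec%n", MESSAGES / seconds);
    }
}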

The nice thing I like about TechEmpower is that it showcases some frameworks, and the best implementation just emerges from the plethora of options in a "natural" fashion. Each platform comes with its own caveats, but you come out of the exercise with a big boost in making informed decisions ... I think )))

I know that synthetic benchmarks are what they are! However, I still think they at the very least teach us how we should test our code. Moreover, the added bonus for me would be the ability to replace/plug in my own workload as the computation. To draw an analogy, JMH provides a benchmark harness for micro-benchmarks. It would be nice if we could come up with an easy way of doing the same for benchmarking between cores and between nodes, and make the computation, queues, and transport "pluggable" variables in the equation. Maybe this is "just a dream"!
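
Purely as an illustration of that "pluggable" idea, the seams might look something like the hypothetical interfaces below; none of these types exist in JMH, Aeron, or JCTools, they only sketch where a harness could let you swap implementations:

import java.util.Queue;

// Hypothetical SPI sketch for a pluggable harness: the user supplies the computation,
// the harness swaps queue and transport implementations underneath it.
interface Computation<I, O> {
    O apply(I input);                      // e.g. sum of squares over the input
}

interface QueueFactory {
    <T> Queue<T> newQueue(int capacity);   // intra-box: JCTools, Agrona, juc queues, ...
}

interface Transport {
    void send(byte[] payload);             // inter-node: Aeron, DPDK-backed, Seastar-backed, ...
    int poll(MessageHandler handler);      // returns number of messages delivered
}

interface MessageHandler {
    void onMessage(byte[] payload, int length);
}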

ymo

Apr 11, 2016, 12:06:32 PM
to mechanical-sympathy
I forgot to answer the "stack" question. Here is the stack I am keen on:

In terms of queues:
1) Aeron/Agrona
2) JCTools
3) DPDK

In terms of transports:
1) Aeron
2) DPDK
3) Seastar

Gil Tene

Apr 12, 2016, 12:02:38 AM
to mechanica...@googlegroups.com
If you are looking at the set of "stacks" below (all of which are queues/transports), I would strongly encourage you to avoid repeating the mistakes of testing methodologies that focus entirely on max achievable throughput and then report some (usually bogus) latency stats at those max throughput modes. The TechEmpower numbers are a classic example of this in play, and while they do provide some basis for comparing a small aspect of behavior (what I call the "how fast can this thing drive off a cliff" comparison, or "pedal to the metal" testing), those results are not very useful for comparing load-carrying capacities for anything that actually needs to maintain some form of responsiveness SLA or latency-spectrum requirements.

Rules of thumb I'd start with (some simple DOs and DON'Ts):

1. DO measure max achievable throughput, but DON'T get focused on it as the main or single axis of measurement / comparison.

2. DO measure response time / latency behaviors across a spectrum of attempted load levels (e.g. at attempted loads between 2% and 100%+ of max established throughput).

3. DO measure the response time / latency spectrum for each tested load (even for max throughput, for which response time should grow linearly with test length, or the test is wrong). HdrHistogram is one good way to capture this information (see the sketch after this list).

4. DO make sure you are measuring response time correctly and labeling it right. If you also measure and report service time, label it as such (don't call it "latency").

5. DO compare response time / latency spectrum at given loads.

6. DO [repeatedly] sanity check and calibrate the benchmark setup to verify that it produces expected results for known forced scenarios. E.g. forced pauses of known size via ^Z or SIGSTOP/SIGCONT should produce the expected response time percentile levels. Attempting to load at >100% of achieved throughput should result in response time / latency measurements that grow with benchmark run length, while service time (if measured) should remain fairly flat well past saturation.

7. DON'T use or report standard deviation for latency. Ever. Except if you mean it as a joke.

8. DON'T use average latency as a way to compare things with one another. [use median or 90%'ile instead, if what you want to compare is "common case" latencies]. Consider not reporting avg. at all.

9. DON'T compare results of different setups or loads from short runs (< 20-30 minutes).

10. DON'T include process warmup behavior (e.g. 1st minute and 1st 50K messages) in compared or reported results. 
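
To make rule 3 concrete, a minimal recording loop using HdrHistogram might look like the sketch below; the request() stub, the run length, and the 10K/sec intended rate are placeholders, not a recommendation:

import java.util.concurrent.TimeUnit;

import org.HdrHistogram.Histogram;

// Sketch of recording a response-time spectrum with HdrHistogram (rule 3).
// request() and the intended rate are placeholders for whatever the harness drives.
public final class LatencyRecordingSketch {
    public static void main(String[] args) {
        // Track values up to 1 hour, in nanoseconds, with 3 significant digits.
        Histogram histogram = new Histogram(TimeUnit.HOURS.toNanos(1), 3);

        // Intended inter-request interval for a 10,000 requests/sec load plan.
        long expectedIntervalNanos = TimeUnit.SECONDS.toNanos(1) / 10_000;

        for (int i = 0; i < 1_000_000; i++) {
            long start = System.nanoTime();
            request();                                    // hypothetical operation under test
            long responseTime = System.nanoTime() - start;

            // If the load generator blocks on slow responses, this variant back-fills the
            // samples it failed to issue on schedule (a coordinated-omission correction).
            histogram.recordValueWithExpectedInterval(responseTime, expectedIntervalNanos);
        }

        // Print the percentile spectrum, scaled from nanoseconds to microseconds.
        histogram.outputPercentileDistribution(System.out, 1000.0);
    }

    private static void request() {
        // placeholder for the send/receive (or other operation) being measured
    }
}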

For some concrete visual examples of how one might actually compare the behaviors of different stack and load setups, I'm attaching some example charts (meant purely as an exercise in plotting and comparing results under varying setups and loads) that you could similarly plot if you choose to log your results using HdrHistogram logs.

As an example for #4 and #5, I'd look to plot the behavior of one stack under varying loads like this (comparing the latency behavior of the same Setup under varying loads):



And to compare two stacks under varying loads like this (comparing Setup A and Setup B latency behaviors at the same load):


Or like this (comparing Setup A and Setup B under varying loads):



-- Gil.

ymo

Apr 14, 2016, 8:09:31 AM
to mechanical-sympathy
Hi Gil.

I had a great time listening to your talks. I come from hardware network-testing equipment (code), where these methodologies are prevalent, and I absolutely love your talks. One minor comment I had is that I never understood why someone needs to "hack" around coordinated omission. To me, if your hardware is not able to keep up with your SUT (system under test), then you have one of these issues:
1) you either wrote bad test code, or
2) you need to buy more hardware!!!

Coordinated-omission "fixing", to me, is an excuse not to spend money on more hardware (which is so cheap nowadays) or not to fix the test code )))



On Tuesday, April 12, 2016 at 12:02:38 AM UTC-4, Gil Tene wrote:
If you are looking at the set of "stacks" below (all of which are queues/transports), I would strongly encourage you to avoid repeating the mistakes of testing methodologies that focus entirely on max achievable throughput and then report some (usually bogus) latency stats at those max throughput modes. The TechEmpower numbers are a classic example of this in play, and while they do provide some basis for comparing a small aspect of behavior (what I call the "how fast can this thing drive off a cliff" comparison, or "pedal to the metal" testing), those results are not very useful for comparing load-carrying capacities for anything that actually needs to maintain some form of responsiveness SLA or latency-spectrum requirements.

Rules of thumb I'd start with (some simple DOs and DON'Ts):

1. DO measure max achievable throughput, but DON'T get focused on it as the main or single axis of measurement / comparison.

I was going to treat latency as a target based on throughput: set up a certain load, then increase or decrease it via a feedback loop until the required (or at least a stable) latency is achieved. From a control-systems perspective, the hardware guys laugh when they see what kinds of control systems we have in software. Look what they did above[1] )))
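
A crude version of that feedback loop might look like the sketch below; measureP99AtRate() is a made-up hook into the harness, and the starting rate, step size, and latency target are arbitrary:

// Crude sketch of the feedback loop described above: walk the attempted load up or down
// until the observed 99th percentile settles under a target.
public final class LoadFeedbackSketch {
    public static void main(String[] args) {
        final double targetP99Micros = 500.0;    // required latency target (illustrative)
        double rate = 10_000.0;                  // attempted load, msgs/sec
        double step = rate / 2;

        for (int i = 0; i < 20 && step > 100; i++) {
            double p99 = measureP99AtRate(rate); // drive one timed interval at this rate
            System.out.printf("rate=%.0f msgs/sec -> p99=%.1f us%n", rate, p99);
            if (p99 > targetP99Micros) {
                rate -= step;                    // over the target: back off
            } else {
                rate += step;                    // under the target: push harder
            }
            step /= 2;                           // binary-search-style convergence
        }
        System.out.printf("highest rate meeting the target: roughly %.0f msgs/sec%n", rate);
    }

    private static double measureP99AtRate(double rate) {
        // placeholder: run the SUT at 'rate' for a fixed interval and return the
        // 99th percentile response time (e.g. from an HdrHistogram of that interval)
        return 0.0;
    }
}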
 

2. DO measure response time / latency behaviors across a spectrum of attempted load levels (e.g. at attempted loads between 2% and 100%+ of max established throughput).

 
3. DO measure the response time / latency spectrum for each tested load (even for max throughput, for which response time should grow linearly with test length, or the test is wrong). HdrHistogram is one good way to capture this information.
Great!
 

4. DO make sure you are measuring response time correctly and labeling it right. If you also measure and report service time, label it as such (don't call it "latency").

I hear you! For I/O, it would be nice to have a deterministic way of figuring out when a packet arrives at the NIC, on egress or ingress, without fiddling with the kernel. This would pinpoint how much of the latency is incurred by the network stack, which becomes important if one is trying to compare with DPDK, for example, where there is very minimal impairment/latency introduced by the stack.


5. DO compare response time / latency spectrum at given loads.

Great. 

 

6. DO [repeatedly] sanity check and calibrate the benchmark setup to verify that it produces expected results for known forced scenarios. E.g. forced pauses of known size via ^Z or SIGSTOP/SIGCONT should produce the expected response time percentile levels. Attempting to load at >100% of achieved throughput should result in response time / latency measurements that grow with benchmark run length, while service time (if measured) should remain fairly flat well past saturation.

Great!

7. DON'T use or report standard deviation for latency. Ever. Except if you mean it as a joke.
Surely, you must be joking ... and no, I know your name is not Shirley!
 

8. DON'T use average latency as a way to compare things with one another. [use median or 90%'ile instead, if what you want to compare is "common case" latencies]. Consider not reporting avg. at all.
Great!

9. DON'T compare results of different setups or loads from short runs (< 20-30 minutes).

Not sure I understand here. I was going to do pattern-based load, meaning X seconds of Y load followed by silence, then rinse and repeat with different values of X and Y over a very, very long time, and record latencies with HdrHistogram as the buckets.
 
10. DON'T include process warmup behavior (e.g. 1st minute and 1st 50K messages) in compared or reported results. 

Great!

Remko Popma

Apr 14, 2016, 12:44:00 PM
to mechanical-sympathy
Gil,

Thank you for these rules of thumb, very helpful! Also saw your recent InfoQ presentation, nice work!

I have a question: would you have any guidelines (or DOs and DON'Ts) on how to reduce the load?
Is Thread.currentThread().sleep() a good idea, or would this somehow skew the measurements? Is it better to do a busy-spin and/or some expensive calculation?
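
For context, the kind of pacing I have been considering looks roughly like the sketch below; it parks until an absolute deadline and busy-spins the last stretch instead of sleeping a fixed amount after each call. The 100-microsecond interval and logOneEvent() are placeholders:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Sketch of pacing load against an absolute schedule: park most of the way to each
// deadline, then busy-spin the final stretch.
public final class PacedLoadSketch {
    private static final long SPIN_THRESHOLD_NANOS = TimeUnit.MICROSECONDS.toNanos(50);

    public static void main(String[] args) {
        long intervalNanos = TimeUnit.MICROSECONDS.toNanos(100); // intended gap between operations
        long nextSendTime = System.nanoTime();

        for (int i = 0; i < 100_000; i++) {
            long now;
            while ((now = System.nanoTime()) < nextSendTime) {
                long remaining = nextSendTime - now;
                if (remaining > SPIN_THRESHOLD_NANOS) {
                    LockSupport.parkNanos(remaining - SPIN_THRESHOLD_NANOS);
                }
                // otherwise busy-spin the last ~50us for tighter timing
            }
            logOneEvent();
            // Advance by the intended interval, independent of how long the call itself took,
            // so a slow operation does not silently stretch the schedule.
            nextSendTime += intervalNanos;
        }
    }

    private static void logOneEvent() {
        // placeholder for the logging call (or other operation) being measured
    }
}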

I'm working to make Log4j 2 garbage-free during steady-state logging and am about to do a response-time/latency comparison, so this topic is very timely!
Any tips would be greatly appreciated.

Remko