Hi Daniel,
Thanks for the questions. I'll try to answer them, but I will preface this by saying that some of these questions would be best answered by the author of the load tool we're using, Wrk. That's Will Glozer, @wg on GitHub. He has been phenomenally helpful in answering questions and adapting the implementation of Wrk for our needs, such as by adding an HTTP pipelining mode.
You can find his GitHub repo here:
https://github.com/wg/wrk
1. With respect to errors, Wrk provides us with the following:
a. Read, write, and connect errors, about which I admittedly don't have a deep understanding. I figure, for example, that an inability to create a socket to the server would register as a connect error.
b. Non-2xx/3xx HTTP responses, which we assume are 500-series HTTP responses. To be clear, though, Wrk just identifies them as not being in the 2xx or 3xx range. See, for example, the last plaintext test for Ruby Rack in this raw file:
https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/results/i7/20130619104939/plaintext/rack-ruby/raw
Here is that output from Wrk:
      8 threads and 16384 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     1.70s     2.44s    11.45s   92.11%
        Req/Sec     8.01k     1.75k    15.23k   88.10%
      963243 requests in 15.01s, 178.27MB read
      Socket errors: connect 0, read 2129, write 0, timeout 79702
      Non-2xx or 3xx responses: 610
    Requests/sec:  64155.34
    Transfer/sec:     11.87MB
Note that we do not count timeout errors because, as @wg explains, these "timeouts" are simply a count of requests that took longer than 2,000ms (2s) to receive a response. For our bar charts, we're measuring the number of successful responses, not the amount of time required per response, so that 2s timeout threshold is not meaningful for us.
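For illustration only (this is not code from our toolset; the function name and regular expressions below are mine), the counts we use could be scraped from Wrk's plain-text summary roughly like this:

    import re

    def parse_wrk_summary(text):
        """Pull the counts we care about out of Wrk's plain-text summary.
        Timeouts are parsed here but deliberately ignored downstream."""
        requests = int(re.search(r"(\d+) requests in", text).group(1))

        # The socket-error and non-2xx lines may be absent when the counts
        # are zero, hence the zero defaults below.
        errors = {"connect": 0, "read": 0, "write": 0, "timeout": 0}
        m = re.search(r"Socket errors: connect (\d+), read (\d+), write (\d+), timeout (\d+)", text)
        if m:
            errors = dict(zip(errors, map(int, m.groups())))

        m = re.search(r"Non-2xx or 3xx responses: (\d+)", text)
        non_2xx_or_3xx = int(m.group(1)) if m else 0

        return requests, errors, non_2xx_or_3xx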
The bar charts render the number of "successful responses," computed as follows:
    Successful responses = Total requests - (non-2xx/3xx responses) - (read + write + connect errors)
    Errors (rightmost column) = (non-2xx/3xx responses) + (read + write + connect errors)
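Plugging in the numbers from the Rack plaintext run above, the arithmetic works out as follows (again, just an illustration, not our actual reporting code):

    requests = 963243              # "963243 requests in 15.01s"
    non_2xx_or_3xx = 610           # "Non-2xx or 3xx responses: 610"
    socket_errors = 0 + 2129 + 0   # connect + read + write; the 79702 timeouts are not counted

    successful = requests - non_2xx_or_3xx - socket_errors   # 960504
    errors = non_2xx_or_3xx + socket_errors                  # 2739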
2. The latency information we report is captured as-is from the Wrk output. For the above example, we would report 1,700ms (1.7s) average, 2,440ms stdev, 11,450ms maximum. @wg has put quite a bit of thought and tuning into the latency capture, but if you believe there is a potential miscalculation, he has also been very open to review and critique.
3. Can you explain your idea for framework tiers in a little more detail? I'm not sure I follow. Elsewhere, I have discussed an attempt to capture the popularity of a framework as an additional attribute. But I read your idea as something distinct from popularity.
Since the script automates testing all frameworks, running only a subset doesn't save us much effort. So I suspect what you're suggesting is that we could identify some frameworks as Tier 1, meaning a subset for which we would spend the time necessary to ensure their tests run successfully. Do I have that correct?
4. I agree about CPU, memory, and IO usage capture. We have a GitHub issue concerning that objective:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/108
With the changes I've outlined above, we aim to reduce the amount of effort required to accept pull requests, meaning we do not need to stop accepting more frameworks and variations. This should allow us to spend more time improving the benchmark tools themselves, including perhaps dealing with issue 108 at some point.
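To give a rough sense of what issue 108 might eventually involve, here is a minimal, purely hypothetical sketch (we have not settled on an approach or a library) of sampling CPU, memory, and disk IO with Python's psutil while a test runs:

    import psutil

    def sample_resources(duration_seconds=15, interval=1.0):
        """Collect coarse CPU, memory, and disk IO readings while a test runs.
        Hypothetical sketch only; not part of the benchmark toolset."""
        samples = []
        io_start = psutil.disk_io_counters()
        for _ in range(int(duration_seconds / interval)):
            samples.append({
                # cpu_percent blocks for `interval` seconds, so this loop paces itself.
                "cpu_percent": psutil.cpu_percent(interval=interval),
                "mem_percent": psutil.virtual_memory().percent,
            })
        io_end = psutil.disk_io_counters()
        io_totals = {
            "read_bytes": io_end.read_bytes - io_start.read_bytes,
            "write_bytes": io_end.write_bytes - io_start.write_bytes,
        }
        return samples, io_totals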
In short, we are making the changes I outlined so that we are less of a bottleneck in processing community contributions. It's incidental, but not very surprising, that most community contributions are test implementations.