Processing pull-requests, testing, and round rhythm

Brian Hauer

Jul 19, 2013, 5:43:15 PM
to framework-...@googlegroups.com
Thanks to numerous contributions from the community, this project now covers an enormous breadth of frameworks.  More frameworks are included than I even knew existed when we started the project.

The list of test implementations is now large enough that recent rounds have seen us spend a non-trivial amount of time troubleshooting framework tests that are not running properly.  Various complications of this sort delayed the last two rounds.  Rather than make delays routine, we'd prefer to change the strategy slightly.

I have hinted at this in comment threads elsewhere, but I want to capture it here on the mailing list.  For Rounds 7 and beyond, we would like to make the following changes:
  1. Work on improving the Python scripts to allow contributors to work on a single framework or a subset of the whole list.  Presently, the benchmark scripts assume that everyone wants to work with every single framework.  That was perhaps reasonable when we started the project with ~20 frameworks.  But now with more than 70 and counting, it means you're in for a several-hour setup process.  If you want to work on the Dart test, for example, the scripts should be improved to help you set up the Dart subdirectory and that's it.  No need to have node.js, Flask, Mono, Servlet, etc., etc. downloaded and installed in order to work on Dart.  (See the sketch after this list for a rough idea of what this could look like.)
  2. Similarly, we want to make sure the scripts are comfortable working in a single-machine configuration.  While we run our tests using two machines/instances, this can be a burden for contributors.  We've been very pleasantly surprised by the contributions--we didn't expect as many as we've received.  To help existing and future contributors, we want to be sure that it's possible to be productive on a single workstation or laptop.
  3. For logistics reasons, we need to be informal in scheduling rounds, but at the same time we'd like to establish a somewhat normal rhythm of a round roughly once per month.  Given that modest frequency, I feel that if a given test implementation fails to run ideally, the next round is never that far away, so the implementation can be fixed up for the next round (as opposed to delaying the current round).
  4. Considering the above, we'd like to reduce the amount of effort we put into making tests work.  Assuming we can make the tests easy for contributors to run and debug on their own hardware, we should be able to accept pull requests more liberally and allow the chips to fall where they may when the tests are run.  Put loosely, this is a merge-quickly and iterate round-over-round strategy.
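
As a rough illustration of points 1 and 2 (the directory layout, flags, and helper script below are hypothetical, not the current interface of our Python scripts), a subset-aware runner could look something like this:

import argparse
import os
import subprocess

def discover_tests(root="."):
    """Treat each subdirectory containing an install.sh as one framework test."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name, "install.sh"))
    )

def main():
    parser = argparse.ArgumentParser(description="Set up and run a subset of the framework tests")
    parser.add_argument("--test", nargs="*", default=None,
                        help="test names to include (default: all)")
    parser.add_argument("--single-machine", action="store_true",
                        help="run the load generator and the server on this machine")
    args = parser.parse_args()

    tests = discover_tests()
    if args.test is not None:
        tests = [t for t in tests if t in args.test]

    for name in tests:
        # Install only this framework's dependencies, then run its test.
        subprocess.check_call(["bash", os.path.join(name, "install.sh")])
        cmd = ["python", "run-test.py", name]      # hypothetical per-test runner
        if args.single_machine:
            cmd.append("--client-host=localhost")  # hypothetical flag
        subprocess.check_call(cmd)

if __name__ == "__main__":
    main()

With something along these lines, a command like "python run.py --test dart" would download and install only what the Dart test needs, and "--single-machine" would point the load generator at localhost rather than a second box.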

If you're watching the GitHub repository, you may have already noticed that we're adopting this strategy: many pending pull requests have been merged.

As usual, I'd love to hear any opinions or thoughts about this.  Thanks again to everyone for your contributions!

Daniel Theophanes

Jul 22, 2013, 11:14:40 AM
to framework-...@googlegroups.com
Regarding test failures: I think there should be more clarification of how failed connections affect the latency and rankings. For instance, the following questions come to mind when I see errors in a test:
 * If there are X errors, what is the total number of attempts? Would percentage be better?
 * Some errors appear to artificially reduce the latency measure. This should not happen.
--
I'd also suggest tiering frameworks. Mark a few that you always include: ones that are stable and easy to run. For instance, there are several Go and PHP frameworks. Maybe just one Go framework and one PHP framework would be Tier 1 in their respective groups.

Personally, I'd be more interested in seeing memory and CPU usage included than more frameworks, or X variants of essentially the same ones. And when I say memory, I mean how much memory is taken by the benchmark so that it can't be used elsewhere. In Windows this might be called private working set memory. If a VM runtime eats a bunch of it, the framework might only use a small portion of it, but that still means the OS can't use the memory the runtime eats elsewhere. It may also mean you can or can't run it easily on a constrained device (Raspberry Pi).
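
To make that concrete, here's a rough sketch (not part of the benchmark suite, just an illustration using psutil) of sampling a process's memory. The "uss" figure, unique set size, is the memory that would be returned to the OS if the process exited, which is close in spirit to the Windows private working set I mentioned:

import psutil

def memory_sample(pid):
    proc = psutil.Process(pid)
    info = proc.memory_full_info()   # may require elevated privileges on some platforms
    return {
        "rss_mb": info.rss / 2**20,  # resident set size (includes shared pages)
        "uss_mb": info.uss / 2**20,  # private memory the OS can't use elsewhere
    }

if __name__ == "__main__":
    import os
    print(memory_sample(os.getpid()))

Sampling each framework's server process like this over the course of a run would show how much memory the runtime really takes away from the machine.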

-Daniel

Brian Hauer

Jul 22, 2013, 12:06:08 PM
to framework-...@googlegroups.com
Hi Daniel,

Thanks for the questions.  I'll try to answer them, but I will preface this by saying that some of these questions would be best answered by the author of the load tool we're using, Wrk.  That's Will Glozer, @wg on GitHub.  He has been phenomenally helpful in answering questions and adapting the implementation of Wrk for our needs, such as by adding an HTTP pipelining mode.

You can find his GitHub repo here: https://github.com/wg/wrk

1. With respect to errors, Wrk provides us with the following:

a. Read, write, and connect errors, about which I admittedly don't have a deep understanding.  I figure, for example, an inability to create a socket to the server would register as a connect error.

b. Non-200/300 HTTP responses, which we assume are 500-series HTTP responses.  But to be clear, Wrk just identifies them as not 2xx or 3xx.  See, for example, the last plaintext test for Ruby Rack in this raw file:

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/results/i7/20130619104939/plaintext/rack-ruby/raw

Here is that output from Wrk:

  8 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency      1.70s    2.44s    11.45s   92.11%
    Req/Sec      8.01k    1.75k    15.23k   88.10%
  963243 requests in 15.01s, 178.27MB read
  Socket errors: connect 0, read 2129, write 0, timeout 79702
  Non-2xx or 3xx responses: 610
Requests/sec: 64155.34
Transfer/sec: 11.87MB

Note that we do not count timeout errors because, as @wg explains, these "timeouts" are simply a count of requests that took longer than 2,000ms (2s) to receive a response.  For our bar charts, we're measuring the number of successful responses, not the amount of time required per response, so the 2s timeout threshold is not meaningful for us.

The bar charts render the number of "successful responses," computed as such:

Successful responses = Total requests - (Non-200/300 responses) - (Read + Write + Connect errors)

Errors (rightmost column) = (Non-200/300 responses) + (Read + Write + Connect errors)
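
As a small worked example using the counters quoted above (typed in by hand rather than parsed from the raw file), here is how the bar-chart numbers fall out, along with the error percentage you asked about:

total_requests = 963243
non_2xx_3xx    = 610
read_errors    = 2129
write_errors   = 0
connect_errors = 0
# timeouts (79702) are intentionally excluded, per the note above

errors     = non_2xx_3xx + read_errors + write_errors + connect_errors
successful = total_requests - errors

print("Successful responses:", successful)                             # 960504
print("Errors (rightmost column):", errors)                            # 2739
print("Error percentage: %.2f%%" % (100.0 * errors / total_requests))  # 0.28%

The total number of attempts in that run is the 963,243 requests reported by Wrk, so a percentage could be derived from these same counters.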

2. The latency information we report is captured as-is from the Wrk output.  For the above example, we would report 1,700ms (1.7s) average, 2,440ms stdev, 11,450ms maximum.  @wg has put quite a bit of thought and tuning into the latency capture, but if you believe there is a potential miscalculation, he has also been very open to review and critique.

3. Can you explain your idea for framework tiers in a little more detail?  I'm not sure I follow.  Elsewhere, I have discussed an attempt to capture the popularity of a framework as an additional attribute.  But I read your idea as something distinct from popularity.

The scripts make testing all frameworks an automated process, so running a subset doesn't save us much effort.  I suspect what you're suggesting is that we could identify some frameworks as Tier 1, meaning a subset on which we would spend whatever time is necessary to ensure their tests run successfully.  Do I have that correct?

4. I agree about CPU, memory, and IO usage capture.  We have a GitHub issue concerning that objective: https://github.com/TechEmpower/FrameworkBenchmarks/issues/108

With the changes I've outlined above, we aim to reduce the amount of effort we put into accepting pull requests--meaning we do not need to stop accepting more frameworks and variations.  This should allow us to spend more time on improving the benchmark tools themselves, including perhaps dealing with issue 108 at some point.

In short, we are making the changes I outlined so that we are less of a bottleneck in processing community contributions.  It's incidental, but not very surprising, that most community contributions are test implementations.