Hi Adam,
Your question has the marks of a great conversation. :) Virtually no web sites in the world receive a million requests per second. That kind of request load is orders of magnitude higher than the load seen by the vast majority of the web.
The most successful sites I have worked on have had millions of users and, across a collection of servers, might process several thousand (dynamic) requests per second at peak load times. (It's important to note that I'm virtually never concerned about the delivery of static assets on a high-traffic site, since those are handled by a CDN or separate static servers.)
When we measure requests per second with a trivial response, such as the test run by Google and mimicked by us for the above blog entry, we intend for this to be a proxy for application performance. With any realistic response payload, it's all too easy to saturate gigabit Ethernet with a modestly high-performance server. Even a fairly small response of only 1,000 bytes means that roughly 80,000 requests per second will totally saturate a gigabit Ethernet connection.
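To make that arithmetic concrete, here is a quick back-of-the-envelope sketch. The ~560 bytes of header and framing overhead is my assumption for illustration (it is what closes the gap between the 125,000 requests/second you'd get from the payload alone and the ~80,000 figure), not a measured value:

```python
# Rough ceiling on requests/second before a 1,000-byte response saturates
# gigabit Ethernet. The overhead figure is an assumed value for illustration.
GIGABIT_BYTES_PER_SEC = 1_000_000_000 / 8  # 125,000,000 bytes per second
PAYLOAD_BYTES = 1_000                      # application response body
OVERHEAD_BYTES = 560                       # assumed HTTP headers + TCP/IP framing

wire_bytes_per_response = PAYLOAD_BYTES + OVERHEAD_BYTES
max_requests_per_sec = GIGABIT_BYTES_PER_SEC / wire_bytes_per_response
print(f"~{max_requests_per_sec:,.0f} requests/second saturates the link")
```

The point is that the ceiling is set by the network, not the server: past that rate, a faster framework simply cannot show its speed on a single gigabit link.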
The tests in our project are one or two steps closer to testing real-world application behaviors, but they too are proxies for real applications. Even the Fortunes test, which exercises the widest spectrum of framework and platform functions in our project, remains a very simple workload compared to most real-world applications.
For that reason, when I've been asked to describe the value in this data, I routinely suggest the following: if we are comfortable using these data as a proxy, and comfortable applying a very rough coefficient (say, 0.01 or 0.001) to represent the additional workload of our real-world applications, we can begin to map the data to something realistic.
For example, imagine I am evaluating two framework options that I am comfortable with for all other reasons (language, expressiveness, community, developer efficiency). Framework A hits 5,000 requests per second on Fortunes while Framework B reaches 500. If I believe my application will be about 100 times more complex than Fortunes, I can roughly compute that my application may enjoy either 50 requests per second (5,000 / 100) from Framework A or 5 requests per second (500 / 100) from Framework B. 50 might be acceptable for my use-case, but 5 is probably not. So I'd favor Framework A.
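That estimate can be sketched as a one-line calculation (the function name and the 100x complexity factor are just this scenario's assumptions, not anything measured):

```python
def estimated_app_rps(benchmark_rps: float, complexity_factor: float) -> float:
    """Scale a benchmark result down by how much heavier the real
    application's per-request work is expected to be."""
    return benchmark_rps / complexity_factor

# Framework A: 5,000 req/s on Fortunes; Framework B: 500 req/s.
# Assumed: my application does ~100x the work of Fortunes per request.
print(estimated_app_rps(5_000, 100))  # 50.0
print(estimated_app_rps(500, 100))    # 5.0
```

The coefficient is deliberately crude; its job is only to move the benchmark numbers into the same order of magnitude as a real application so the comparison becomes meaningful.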
In practice, I may then invest the time to build out a proof of concept and benchmark my specific application on A and B to confirm my hunch. The above assumes I'm evaluating only two options. But greater value comes from situations where I am pretty tolerant on the other variables, so I am open to considering a spread of options. I can't reasonably implement my proof of concept code on a dozen or more frameworks, so the proxy data gives me a hand in narrowing the field.
But circling back to raw requests-per-second figures measured in the millions: no, that alone is not interesting to most of the world, because without a bigger picture of how the system performs once your application code is in play, it's not easy to know whether you're dealing with a highly optimized web server paired with a sluggish application stack. Of course, in this case we know that Undertow is a JVM platform, so your application stack is going to be high-performance as well.
For this particular metric, it's a bit more like bragging rights.
Speaking big-picture, I feel we have more than adequate coverage of trivial tests in this project and future test types should be more complex operations. At some point, I want to have test types that compress even the top performers into the realm of hundreds of requests per second.