Round 14 Previews


Brian Hauer

unread,
Mar 22, 2017, 4:29:18 PM3/22/17
to framework-...@googlegroups.com
We've posted a first Preview of Round 14 for community review:

https://www.techempower.com/benchmarks/previews/round14/

This preview was captured on our ServerCentral hardware.

New for this round's preview is a differences chart that shows how Round 14 Preview 1 compares to Round 13 before it:

https://www.techempower.com/benchmarks/previews/round14/r13-vs-r14p1.html

Note that framework name changes cause false-positive reports of additions and subtractions (e.g., "added dancer" alongside "removed dancer-raw").  Aside from that, however, it has been useful for us to confirm that a majority of test implementations continue to run as they have before.  We are investigating and resolving some known issues with test implementations that failed to run in this Preview 1 and some that appear to have performed implausibly well (most likely indicating a defective measurement of an implementation error).

Our current expectation is that we will run and share at least one more preview prior to starting a final run for Round 14.  It's risky for me to make a timing prediction, but I would like to see a Preview 2 out before the end of March.

We're looking forward to any fixes that we can merge in prior to Round 14's final run.

As always, thank you to everyone who has contributed to the project!

Daniel Nicoletti

unread,
Mar 22, 2017, 4:37:55 PM3/22/17
to Brian Hauer, framework-benchmarks
Hi, I'm confused: why were the Cutelyst NGINX tests removed?



--
Daniel Nicoletti

KDE Developer - http://dantti.wordpress.com

Michael Hixson

unread,
Mar 22, 2017, 5:03:40 PM3/22/17
to Daniel Nicoletti, Brian Hauer, framework-benchmarks
Hi Daniel,

The cutelyst-nginx tests were not really removed. The chart says
"REMOVED" because they all failed in the preview run.

However, they all pass when I run them individually on our
ServerCentral hardware. Maybe one of the other tests that ran prior
did something naughty that caused them to break. In any case, we
should have actual numbers for those tests in Preview 2.

-Michael

Brian Hauer

unread,
Mar 22, 2017, 6:00:05 PM3/22/17
to framework-benchmarks
A quick follow-up: We've already identified a few anomalies in the Preview 1 results.  We believe that some were measured at a different duration from the rest and therefore a re-run of those should properly align them.  We've decided to not wait until Preview 2 to fix that up.

As with all previews, please interpret these results with caution.  We've given them a quick once-over, and we believe there are a few that need quick correction.  Ultimately, the purpose of the previews is to give all of us an opportunity to sanity check the results and identify obvious (or even not so obvious) problems before we finalize the round.

zloster

unread,
Mar 22, 2017, 6:20:54 PM3/22/17
to Brian Hauer, framework-benchmarks
Hi, thanks for the update.
Great work.
I've just noticed that Wicket is reported twice in all the
database-related tests, both times with "Did not complete".

Best regards,
zloster

Gelin Luo

unread,
Mar 22, 2017, 11:20:20 PM3/22/17
to framework-benchmarks
Great work! Thanks TechEmpower!

One interesting thing I've found is that Spark got "257,184" on the JSON serialization test but a poor "6,956" in the plaintext test ...

Daniel Nicoletti

unread,
Mar 22, 2017, 11:28:32 PM3/22/17
to Gelin Luo, framework-benchmarks
Can't we rename the plaintext test to "HTTP pipelining"? It keeps confusing people...


Brian Hauer

unread,
Mar 23, 2017, 1:13:46 PM3/23/17
to framework-benchmarks
To follow up on my message above, late yesterday we realized that some tests had been merged in with data collected at a different duration (four times as long), causing their results to appear four times higher than they should have been.  We apologize for this error.  We've re-run those tests and merged the corrections in place.  I've identified the resulting data as "Preview 1.1", and you can see it here:

https://www.techempower.com/benchmarks/previews/round14/

The following shows all changes between Round 13 Final and Round 14 Preview 1.1:

https://www.techempower.com/benchmarks/previews/round14/r13-vs-r14p1.1.html

And the following shows only the changes between yesterday's Preview 1 (which contained the errors cited above) and today's Preview 1.1:

https://www.techempower.com/benchmarks/previews/round14/r14p1-vs-r14p1.1.html

In summary, the affected frameworks were:
  • bottle-nginx-uwsgi
  • compojure
  • curacao (investigating new issue here)
  • ffead-cpp
  • flask-nginx-uwsgi
  • hapi
  • lapis
  • lwan
  • nodejs
  • silicon
  • start
  • uwsgi
  • wheezyweb-py3
  • wicket

With those corrections posted, I'd like to consider Preview 1 frozen so that we can move onto Preview 2.  If there are remaining errors, configuration problems, or otherwise, let's get them identified and fixed for Preview 2.

Aliaksandr Valialkin

unread,
Mar 23, 2017, 3:55:04 PM3/23/17
to framework-benchmarks
It looks like something is broken in the preview results for fasthttp: fasthttp-*-prefork is mentioned in the results, even though it was removed from the benchmarks four months ago in commit https://github.com/TechEmpower/FrameworkBenchmarks/commit/a2e309f4ba72e1230dac0b3c79adb0588a5be51a .

zloster

unread,
Mar 23, 2017, 5:25:00 PM3/23/17
to Brian Hauer, framework-benchmarks
On 2017-03-23 19:13, Brian Hauer wrote:
> With those corrections posted, I'd like to consider Preview 1 frozen
> so that we can move onto Preview 2. If there are remaining errors,
> configuration problems, or otherwise, let's get them identified and
> fixed for Preview 2.

I have concerns about the following test results:
Group 1:
ADDED revenj-jvm json 385,746 32
ADDED revenj-jvm plaintext 1,054,489 23
REMOVED Revenj.JVM json 555,923 2
REMOVED Revenj.JVM plaintext 1,195,782 16
servlet plaintext 1,156,604 846,840 -26.8% 17 30 -13
I made a PR for a Resin update some time ago:
https://github.com/TechEmpower/FrameworkBenchmarks/pull/2387
My local tests showed a small improvement for JSON. I have some
logs for the upgrade from 4.0.41 to 4.0.49
(https://github.com/zloster/logs/tree/master/FrameworkBenchmarks/resin-update),
but I don't have logs for the update to 4.0.48.
JSON: the preview results show slightly better numbers at the lower
concurrencies than R13, but the highest concurrency gives lower results
than R13.
Plaintext: since it runs at higher concurrencies, the hit is stronger.
Maybe Resin 4.0.48 has some problem?

Group 2:
dropwizard fortune 33,516 34,020 +1.5% 63 69 -6
dropwizard-postgres fortune 45,607 37,008 -18.9% 45 61 -16
The PostgreSQL version takes a big drop, and it shouldn't: the code of
the two tests is the same, and in the other DB read tests (DB and QUERY)
the PostgreSQL version has the advantage.

Gelin Luo

unread,
Mar 24, 2017, 7:46:35 PM3/24/17
to framework-benchmarks
I found that the fortunes test results are very different from my local tests (which I've run many times, with consistent results):

Framework       |  Local Test |  Round 14 (preview) |      Delta
-------------------------------------------------------------------
act-pgsql       |       22000 |               18618 |       -15%
act-mysql       |       16000 |               18635 |       +16%
act-mongo       |       22000 |               18167 |       -17%
dropwizard-pgsql|       13000 |               37008 |      +185%
dropwizard-mysql|       10000 |               34020 |      +240%
dropwizard-mongo|        7000 |               13632 |       +94%
light-java-pgsql|       47000 |              144723 |      +208%
jooby(mysql)    |       26000 |               37661 |       +45%
rapidoid-pgsql  |       35000 |               41506 |       +19%
Spring(mysql)   |        8000 |               24223 |      +203%

I would expect the delta (between local and Round 14 preview) to be roughly consistent across framework tests. However, it varies over a huge range, from -15% to +240%. Any idea how this can happen?


Nikolche Mihajlovski

unread,
Mar 30, 2017, 7:20:07 AM3/30/17
to framework-benchmarks
Hi,

I have some questions / comments regarding the ServerCentral test environment.

Despite the server having 40 physical cores (80 hardware threads), the wrk client is executed with "-t 32" (instead of "-t 80"), which is very strange.

What is the client machine configuration? I would expect it to be the same as the application server.

So, the wrk client becomes a bottleneck for the highly performing frameworks.

Finally, even for 32 threads the numbers are low in Plaintext.
This is what I get with Rapidoid using "c4.8xlarge" client and "c4.8xlarge" server on AWS:

wrk -d 15 -t 32 -c 512 -s pipeline.lua 'http://rapidoid:8080/plaintext' -- 16

Running 15s test @ http://rapidoid:8080/plaintext
  32 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.26ms    6.55ms 205.76ms   89.45%
    Req/Sec   163.39k    15.09k  355.23k    83.79%
  78241536 requests in 15.10s, 9.69GB read
Requests/sec: 5181332.55
Transfer/sec:    657.19MB

Regards,
Nikolche

Brian Hauer

unread,
Mar 30, 2017, 4:17:52 PM3/30/17
to framework-benchmarks
Hi Nikolche,

One of the other guys over here can give you accurate core counts, but I will quickly mention that the ServerCentral hardware is not homogeneous.  All previous rounds indeed used (nearly) identical hardware for the three roles: application server, database server, and load generator.  But the ServerCentral hardware is approximately as follows:
  • Application server: 40 physical cores, 80 HT cores
  • Database server: 8 cores
  • Load generator: 8 cores

This heterogeneous hardware configuration has indeed required some tweaks to the toolset, and we've found that 32 threads on wrk tends to yield the best results.

Brian Hauer

unread,
Mar 30, 2017, 4:19:29 PM3/30/17
to framework-benchmarks
A second preview of Round 14 is ready for review.  We've observed a few notable changes that we will be investigating, but we welcome any additional review and input that others can provide!

Round 14 Preview 2:

https://www.techempower.com/benchmarks/previews/round14/

Changes between Preview 1.1 and Preview 2:

https://www.techempower.com/benchmarks/previews/round14/r14p1.1-vs-r14p2.html

Thanks!

Daniel Nicoletti

unread,
Mar 30, 2017, 5:48:15 PM3/30/17
to Brian Hauer, framework-benchmarks
Wow, this is odd: how can cutelyst JSON go from 450k+ to 100k?
Can't you guys provide some CPU usage data too?
I'm afraid that with 40 cores and persistent connections things might not get well balanced, as my laptop with an i5 CPU can do 100k using a single thread/process...


Stuart Small

unread,
Mar 30, 2017, 5:59:33 PM3/30/17
to framework-benchmarks, teona...@gmail.com
I'm a little confused by a similar change.  I saw that tokio's JSON result went up 40%.  It doesn't look like the code changed, but I saw there was a commit upgrading the installed Rust version 3 hours ago.  Was that commit included in this run?

Michael Hixson

unread,
Mar 30, 2017, 8:41:27 PM3/30/17
to Daniel Nicoletti, Brian Hauer, framework-benchmarks
Hi Daniel,

The cutelyst RPS numbers from Preview 1.1 were inflated by 4x. There
was a similar issue with several frameworks, and we corrected most of
them (that was the change from Preview 1.0 to 1.1), but we missed
cutelyst. I'm partially responsible for that -- sorry!

For CPU usage, we do have output in CSV format from a tool called
dstat... see the various files named "stats" in our logs, like in
here:
http://tfb-logs.techempower.com/round-14/preview-2/cutelyst/json/

I don't know how to interpret that CSV though, to be honest.
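That said, assuming dstat's default CSV layout (a quoted header row naming the columns, with "idl" being the total-CPU idle percentage; other dstat options would shift the columns), something like this rough sketch should give an average CPU-busy figure from one of those files:

stats=stats   # one of the dstat CSV "stats" files from tfb-logs.techempower.com
hdr=$(grep -n '"idl"' "$stats" | head -1 | cut -d: -f1)   # line number of the column-name row
awk -F, -v hdr="$hdr" '
  NR == hdr { for (i = 1; i <= NF; i++) if ($i == "\"idl\"") idl = i; next }
  NR > hdr && idl && $idl ~ /[0-9]/ { sum += $idl; n++ }
  END { if (n) printf "average CPU busy: %.1f%%\n", 100 - sum / n }
' "$stats"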

-Michael


Daniel Nicoletti

unread,
Mar 30, 2017, 8:57:50 PM3/30/17
to Michael Hixson, framework-benchmarks, Brian Hauer
Right, but now I believe there's something really wrong with the results; they're simply not comparable with any local benchmarks.
In Round 13 there was a segfault in the threaded tests, and the results were still better.
For Round 14 I optimized the code a little more and added jemalloc, which improved results by 15%. And now there's epoll, which improved performance with many connections.

Locally, with all my desktop apps running and a single thread, I get a steady 100k on JSON; if I benchmark my Phenom II from my laptop over gigabit I get more than 300k (that was with the R13 version).

So there's something wrong here. I will try to look at that data, but I can't see how R13 can have better results after all the optimizations I did...


Daniel Nicoletti

unread,
Mar 30, 2017, 11:59:23 PM3/30/17
to Michael Hixson, framework-benchmarks, Brian Hauer
Ok, the data is clear: of 80 cores, only 9 were used during the tests.

The only way I can improve this is by writing an FD balancer that
does round-robin, so that with 256 connections we get closer to all cores
in use; still, it feels like there was some latency on the network.

Currently the OS will wake up all processes and the first to call accept()
gets the connection, meaning a process that is already busy processing
might call accept() and take on even more connections.

I've re-run my tests with 2 PCs and got to 200k with 4 threads.

As a side note, I think this server could be swapped with the DB one.
For example, if you look at the stats file of libreactor, it had 11 processes
running and around 40 cores in use, 40 idle; but fine tuning here is hard.

Besides doing round-robin on connection FDs, an important
tuning is pinning a thread/process to a single CPU.
Just my 2c.
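For illustration only (the binary name and flag below are made up, not from the repo): pinning one worker per core with taskset looks roughly like this. Balancing the accepted connections across those workers (SO_REUSEPORT or a round-robin accept loop) still has to happen inside the framework itself.

CORES=$(nproc)
for i in $(seq 0 $((CORES - 1))); do
    taskset -c "$i" ./app --single-worker &   # hypothetical worker binary, one pinned per core
done
wait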

Kamil Endruszkiewicz

unread,
Mar 31, 2017, 8:03:10 AM3/31/17
to framework-benchmarks
Hi,

Could we increase the warm-up times? It looks like the JIT for Java or PyPy doesn't have enough time to kick in.

zloster

unread,
Mar 31, 2017, 5:19:16 PM3/31/17
to Kamil Endruszkiewicz, framework-benchmarks
Hi Kamil,
On 2017-03-31 15:03, Kamil Endruszkiewicz wrote:
> Hi,
>
> Could we increase warm up times? It's look like JIT for java or PyPy
> has not enough time to kick in.

The benchmarks are quite heavy in the warmup phase. There are two phases
of warming up the given framework before the results are actually measured.
For example, let's check the vertx/json logs for R14 preview 2:
http://tfb-logs.techempower.com/round-14/preview-2/vertx/json/raw This
file contains the logs from the load generator (wrk).

First there is "Running Primer vertx" for 5 seconds, which results in
"384998 requests in 5.10s, 55.81MB read". After this there is "Running
Warmup vertx" for 15 seconds, which results in "7080289 requests in
15.10s, 1.00GB read".
Summing the two request counts gives a nice big number of actual method
calls: 7+ million.
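Judging from the raw logs, those two phases are plain wrk runs roughly like the following (Host/Accept headers omitted for brevity):

wrk -H 'Connection: keep-alive' -d 5 -c 8 --timeout 8 -t 8 http://TFB-server:8080/json     # Primer
wrk -H 'Connection: keep-alive' -d 15 -c 256 --timeout 8 -t 8 http://TFB-server:8080/json  # Warmup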

I'm not familiar with PyPy, so the following applies only to Java/JVM.
Currently the benchmarks are using Oracle JDK 8. Checking here:
http://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html#BABHDABI
we find the following:
> -Xcomp
> Forces compilation of methods on first invocation. By default, the
> Client VM (-client) performs 1,000 interpreted
> method invocations and the Server VM (-server) performs 10,000
> interpreted method invocations to gather information
> for efficient compilation. Specifying the -Xcomp option disables
> interpreted method invocations to increase
> compilation performance at the expense of efficiency.
> You can also change the number of interpreted method invocations
> before compilation using the -XX:CompileThreshold > option.

The 1,000/10,000 method-call thresholds are ONLY valid when NOT using the
'server' JVM, which enables 'tiered compilation' by default. Check for
"-XX:CompileThreshold=invocations" and "-XX:-TieredCompilation" in the
same document. The JVM's tiered compilation has a complex policy about
what, when, and how much to compile, but it should have done its work
after several million calls. Most of the frameworks explicitly enable
the server option. Those that don't enable it will still get the server
JVM based on the hardware specification (64-bit CPU and more than 2 GB
of RAM).

It is also easy to verify whether a JVM-based framework is affected by
JIT warmup. You can check out the benchmarks, benchmark the framework,
then modify the java command line with the "-Xcomp" option to force
compilation, and benchmark again. Then compare the numbers.
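A sketch of that comparison (hello.jar and the endpoint are placeholders for whatever the framework's setup.sh actually launches):

java -server -jar hello.jar &                          # default tiered JIT
APP=$!
wrk -d 15 -c 256 --timeout 8 -t 8 http://TFB-server:8080/json
kill $APP

java -server -Xcomp -jar hello.jar &                   # force compilation on first invocation
APP=$!
wrk -d 15 -c 256 --timeout 8 -t 8 http://TFB-server:8080/json
kill $APP
# A large gap between the two runs means warmup length matters for this framework.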

As a side note: I very recently compiled a list of the various JVM
arguments used by the frameworks. It's here:
https://github.com/zloster/FrameworkBenchmarks/issues/5#issuecomment-288467041

Best regards,
zloster




zloster

unread,
Mar 31, 2017, 5:38:23 PM3/31/17
to Brian Hauer, framework-benchmarks
Hi to the TFB team,
On 2017-03-30 23:19, Brian Hauer wrote:
> A second preview of Round 14 is ready for review. We've observed a
> few notable changes that we will be investigating, but we welcome any
> additional review and input that others can provide!

As a note to everyone who has refreshed the Vagrant environment
recently
(https://github.com/TechEmpower/FrameworkBenchmarks/commit/19f8f53e4c5c52014d4b1b66a0c8e701a015d9fa):

I have a problem with the new Vagrant environment. The bug is here:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/2645
After `vagrant up`, the first test that hits PostgreSQL fails. The
following tests are OK. Run it again and everything is fine.

Best regards,
zloster


Andy

unread,
Apr 1, 2017, 4:27:08 PM4/1/17
to framework-benchmarks
The results of this round seem really out of whack. 

For example:

Fortune:

ulib-postgres
Rd 13: 178k rps
Rd 14: 87k rps

urweb-postgres
Rd 13: 175k rps
Rd 14: 72k rps

start
Rd 13: 52k rps
Rd 14: 11k rps


Multiple Queries

stream
Rd 13: 12.5k rps
Rd 14: 4.5k rps

phalcon
Rd 13: 7.8k rps
Rd 14: 2.7k rps

redstone-mongodb
Rd 13: 11k rps
Rd 14: 3.4k rps



Single query

ulib-mongodb
Rd 13: 194k rps
Rd 14: 75k rps

urweb-postgres
Rd 13: 155k rps
Rd 14: 86k rps



Data updates

express-mongodb
Rd 13: 2.8k rps
Rd 14: 26 rps (yes 26, with 0 error rate!)

phoenix
Rd 13: 1.9k rps
Rd 14: 688 rps 


These are just a few that jumped out at me. Did anything significant change between Round 13 and 14? With data that varies this much, I'm not sure how reliable the benchmark results are.

zloster

unread,
Apr 2, 2017, 7:53:13 AM4/2/17
to Andy, framework-benchmarks
Hello Andy,

I've become curious and did some checks. See my notes below.
On 2017-04-01 23:27, Andy wrote:
> The results of this round seem really out of whack.
>
> For example:
>
> FORTUNE:
>
> ULIB-POSTGRES
> Rd 13: 178k rps
> Rd 14: 87k rps

I checked ulib. It had a flawed implementation that was not
following the rules for some tests. The issue is here:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/2546. This is
the commit:
https://github.com/TechEmpower/FrameworkBenchmarks/pull/2548/files You
can see yourself that quite a lot had changed in the implementation.

>
> urweb-postgres
>
> Rd 13: 175k rps
> Rd 14: 72k rps

About urweb: there are a few commits there:
https://github.com/TechEmpower/FrameworkBenchmarks/commits/master/frameworks/Ur/urweb
I don't see anything major in the recent commits. IMO the result
difference needs a closer look.

> START
>
> Rd 13: 52k rps
> Rd 14: 11k rps
>
> Multiple Queries
>
> STREAM
> Rd 13: 12.5k rps
>
> Rd 14: 4.5k rps
>
> PHALCON
> Rd 13: 7.8k rps
> Rd 14: 2.7k rps
>
> redstone-mongodb
>
> Rd 13: 11k rps
> Rd 14: 3.4k rps
>
> Single query
>
> ulib-mongodb
>
> Rd 13: 194k rps
> Rd 14: 75k rps

See my note above.

> urweb-postgres
>
> Rd 13: 155k rps
> Rd 14: 86k rps

See my note above.

> Data updates
>
> express-mongodb
>
> Rd 13: 2.8k rps
> Rd 14: 26 rps (yes 26, with 0 error rate!)
>
> PHOENIX
> Rd 13: 1.9k rps
>
> Rd 14: 688 rps

I've taken a look at the commit history of PHOENIX here:
https://github.com/TechEmpower/FrameworkBenchmarks/commits/master/frameworks/Elixir/phoenix.
One commit seems a good hint about the result in the 20-updates per
request test:
https://github.com/TechEmpower/FrameworkBenchmarks/commit/e440c284e47f6c7371d2e7386bfcb887c7431472#diff-59890a869218f0800bf33b0825ae8d36R15.
I don't know anything about Elixir/Phoenix, but it seems that the number
of concurrent real DB connections was lowered significantly. Quite a lot
of frameworks did a similar thing, because for Round 13 the physical DB
server was downgraded in terms of CPU cores. IMO this particular test
variation (20 updates per request) seems to favor bigger connection
pools. For example, undertow is using 256 DB connections for its pool:
https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Java/undertow/src/main/java/hello/Helper.java#L38-L40
Revenj.jvm doesn't even use a DB connection pool:
https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Java/revenj-jvm/src/main/java/hello/Context.java#L32-L41
The connection is kept in a thread-local variable and extracted when
needed:
https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Java/revenj-jvm/src/main/java/hello/UpdatesServlet.java#L19
revenj.jvm is using Resin as its servlet container, which means that the
number of DB connections will equal whatever number of request-processing
threads is configured.
One other commit will affect the Phoenix/JSON result:
https://github.com/TechEmpower/FrameworkBenchmarks/commit/96deafcf351ae0b7ae44ae51b292cfa829eab382
It seems like a fix for rules compliance.
And a third one:
https://github.com/TechEmpower/FrameworkBenchmarks/commit/a8fe9e59df779a6a396acf87e0586a09804b1a77
where even the author is not sure how the results will be affected.

> These are just a few that jumped out to me. Did anything significant
> changed between round 13 and 14? With data that varies this much I'm
> not sure how reliable are the benchmark results.
>

Indeed, sometimes the differences in results between rounds are quite
big, but with a little digging quite a lot of questions can be answered.
You could try to check what happened with some of the other frameworks
you've mentioned, and if something doesn't look right, open a bug and/or
pull request on GitHub. Recent examples:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/2589,
https://github.com/TechEmpower/FrameworkBenchmarks/issues/2581,
https://github.com/TechEmpower/FrameworkBenchmarks/issues/2579. And
yesterday I opened this one:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/2646 The
results and the code are public and accessible, so everyone is able to
verify them. An extra pair of eyes always helps to strengthen the
validity of the results.

As a side note and background information: there was a big effort to
clean up and fix problematic tests for Round 13/14. UrWeb is mentioned
there: https://github.com/TechEmpower/FrameworkBenchmarks/issues/2074 As
far as I remember, at that time there was a master branch (for Round 13)
and a round14 branch. Please correct me if I'm mistaken about this.

Cheers,
zloster

Daniel Nicoletti

unread,
Apr 2, 2017, 8:11:10 AM4/2/17
to Andy, framework-benchmarks
Hey Andy,

my framework got worse results too, even though I improved performance
by 20% locally. So I think there might be two issues:
- unreliable network
- connection/load distribution
This last item is something I'm working on in my framework. What I noticed
from the stats file is that only 9 processes were running at a time; on the
same test the top performer was using 11.
The more cores a CPU has, the lower the clock, which leaves wider room
for tuning.
With 80 cores and no scheduler tuning, chances are you get
results like this; setting CPU affinity helps a lot.
But even with each thread mapped to a single core, if you don't balance
new connections you might get the 256 connections distributed across only
a few cores.

This explains a bit of the issue:
http://www.glennklockwood.com/hpc-howtos/process-affinity.html

Tom Christie

unread,
Apr 5, 2017, 5:59:39 PM4/5/17
to framework-benchmarks, selfor...@gmail.com
Hi folks,

What's the likely timeline on Round 14 looking like at the moment?

I've got a Python framework, API Star, that I'm trying to add in time for the round. (See here)

It'd be super helpful to have an idea of how much time there might be remaining,
so I can consider adding extra test types to its matrix. (Right now it only includes the JSON & Plaintext test types.)

Many thanks!

  Tom

Brian Hauer

unread,
Apr 12, 2017, 6:25:58 PM4/12/17
to framework-benchmarks
Hello everyone!

We have posted a third preview for Round 14:

https://www.techempower.com/benchmarks/previews/round14/

This run completed on 2017-04-07.  You can find logs and differences versus Round 13 Final and Round 14 Preview 2 linked from the alert box above the results charts.

This will be the final preview for Round 14.  A run was started today that will complete some time on Friday.  So we'd like to have any remaining last minute PRs received and merged in by end of Thursday (tomorrow, 2017-04-13).  Two prior runs started before today had increased error rates due to a problem with apt-get, as seen below:



[chart of recent continuous benchmarking runs and their error counts]

Incidentally, Preview 3 is identified as "Continuous Benchmarking Run 2017-04-04" in the chart above.


We'll do an internal sanity check of the run that is expected to complete this Friday, and assuming it is mostly consistent with prior Previews, the run that begins on Friday will be collecting final results for Round 14.  We'll kick off a run in Azure to coincide with that ServerCentral run.


The continuous benchmarking is now performing with decent consistency (issues such as the one with apt-get notwithstanding).  We believe this will allow us to retain momentum and release Round 15 previews beginning immediately following Round 14.

Brian Hauer

unread,
Apr 12, 2017, 6:48:57 PM4/12/17
to framework-benchmarks
One small addendum: The "Differences" renderings we've added with the previews for Round 14 are not using the 20-iteration samples for the multi-query and updates test types, but rather the 1-iteration samples of those test types.

They should be using the 20-iteration samples to align with the results web site's rendering of the data.  We'll get this fixed up for future differences renderings.

green

unread,
Apr 12, 2017, 7:06:12 PM4/12/17
to framework-benchmarks, Joe Cincotta
Well, the results show some inconsistencies:

actframework-mongodb

* -11.0% in db
* -23.9% in query
* +33.0% in fortune

actframework-mysql

* +17.8% in db
* +0.4% in query
* -46.6% in fortune

actframework-pgsql

* -4.1% in db
* +76.6% in query
* -90.7% in fortune

I am absolutely lost with these numbers and cannot figure out anything that can explain the data.


green

unread,
Apr 12, 2017, 8:34:47 PM4/12/17
to Joe Cincotta, framework-benchmarks

On Thu, Apr 13, 2017 at 9:40 AM Joe Cincotta <j...@thinking.group> wrote:
focus on the Fortune benchmark. What exactly is that test trying to do? Can you share the test code link?

J
 
--
Joe Cincotta, Managing Director - Thinking.Group
Level 25, 88 Phillip St, Sydney NSW 2000 Australia


Daniel Nicoletti

unread,
Apr 12, 2017, 8:38:55 PM4/12/17
to green, framework-benchmarks, Joe Cincotta
There is a bug with the toolset!

$MAX_THREADS seems to be hardcoded to 8.
For example, uwsgi detected 80 cores but
created only 8 workers due to
--threads $MAX_THREADS

This explains why the server load only
had 9 processes running at maximum at the
same time.

Please fix this and do another preview;
a quick grep showed lots of frameworks
rely on this.


green

unread,
Apr 12, 2017, 9:02:36 PM4/12/17
to framework-benchmarks
Another question: which version of the code are you using for preview #3? I saw my changes were merged 9 days ago. However, I can't see the change in the test report; specifically, I changed the display name, but the test report is still using the old names.
[screenshot of the test report]

Michael Hixson

unread,
Apr 12, 2017, 10:20:17 PM4/12/17
to green, framework-benchmarks
The third preview used this version of the code:

so it included that change to the act framework metadata.  I think what happened is we didn't put the updated metadata on the results website yet.  (The metadata and performance data are in different files.)

-Michael


green

unread,
Apr 12, 2017, 10:27:56 PM4/12/17
to Michael Hixson, framework-benchmarks
Thanks Michael for the information.

Still I am struggling to understand the data. Any clue from your side?


Michael Hixson

unread,
Apr 12, 2017, 10:51:15 PM4/12/17
to green, framework-benchmarks
No, I don't have an explanation for why the act performance data changed like that.  For what it's worth, the results for act framework in the run that is happening right now on ServerCentral (started about 8 hours ago) are consistent with the most recent results from Preview 3.

-Michael


green

unread,
Apr 13, 2017, 12:24:11 AM4/13/17
to Michael Hixson, framework-benchmarks
Ok, thanks for letting me know. It's a very illogical change in the data, though, and I cannot reproduce the same result on my local machine.

I just grabbed the raw file from http://tfb-logs.techempower.com/round-14/preview-3/actframework-pgsql/fortune/ and created the following script based on it:

luog@luog-Satellite-P50-A:~/tmp/teb$ cat pgsql.sh 
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 5 -c 8 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 8 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 16 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 32 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 64 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 128 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes
 wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8 http://TFB-server:8080/pgsql/fortunes

And the result is completely different; here are a few of the final tests:
[screenshot: local wrk results]
Compared with the TFB test result:
[screenshot: TFB Preview 3 results]
The only difference is that I am testing from local to local. I really don't understand the massive deviation in the data on ServerCentral ...

Note that `TFB-server` points to my localhost in the /etc/hosts file.

Fredrik Widlund

unread,
Apr 13, 2017, 4:17:19 AM4/13/17
to Brian Hauer, framework-benchmarks
Hi,

The application server is clearly faster than the load generator. In a setup like this, the benchmark measures the load generator rather than the application server.

Kind regards,
Fredrik Widlund

--

Fredrik Widlund

unread,
Apr 13, 2017, 4:20:33 AM4/13/17
to Brian Hauer, framework-benchmarks
Hi,

I'm seeing very strange variations in the results that I find hard to understand. Could you please manually re-run a few of the top candidates in each section where the results vary greatly, to double-check that the benchmark environment gives consistent results?

Kind regards,
Fredrik Widlund



Tom Christie

unread,
Apr 13, 2017, 4:38:10 AM4/13/17
to framework-benchmarks
> we'd like to have any remaining last minute PRs received and merged in by end of Thursday (tomorrow, 2017-04-13). 

Test cases for "API Star" got merged in on the 7th (JSON and Plaintext) - I'd be curious to know if we'll make it into Round 14 as a new framework, or if there's a cut-off that we needed the framework to already be in for the preliminary rounds?

Either way will be looking forward to building out the remaining test cases in time for Round 15, and making sure that we're all set well in advance for that.

Thanks so much for your time!

  Tom :)

Daniel Nicoletti

unread,
Apr 13, 2017, 9:35:47 AM4/13/17
to green, framework-benchmarks, Joe Cincotta
So, benchmark.cfg.example has threads=8;
the value does not seem to be increased to 80
on the application server, and it is 8 on Travis,
which caused compile issues, since Travis
has a limited amount of RAM and make was
run with 8 parallel jobs, which used all
available memory.

Are we expected not to rely on MAX_THREADS,
or is it a value you will adjust?

Looking at the test implementations, it seems
everyone assumed MAX_THREADS would be the
number of CPU cores, and you get:

frameworks/Python/uwsgi/setup.sh:uwsgi --master -L -l 5000 --gevent 1000 --http :8080 --http-keepalive --http-processes $MAX_THREADS -p $MAX_THREADS -w hello --add-header "Connection: keep-alive" --pidfile /tmp/uwsgi.pid &

frameworks/Ur/urweb/setup_mysql.sh:./bench.exe -q -k -t ${MAX_THREADS} &

frameworks/Haskell/spock/setup.sh:${IROOT}/stack --allow-different-user exec spock-exe -- +RTS -A32m -N${MAX_THREADS} &

And many more including my own Cutelyst.
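As an interim workaround (a sketch only, not something that is merged anywhere), a setup.sh could ignore a suspiciously low MAX_THREADS and fall back to the real core count, e.g. for the uwsgi line above:

WORKERS=${MAX_THREADS:-$(nproc)}
[ "$WORKERS" -lt "$(nproc)" ] && WORKERS=$(nproc)   # don't trust a value smaller than the core count
uwsgi --master -L -l 5000 --gevent 1000 --http :8080 --http-keepalive \
      --http-processes "$WORKERS" -p "$WORKERS" -w hello \
      --add-header "Connection: keep-alive" --pidfile /tmp/uwsgi.pid &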

Best :)

--
Daniel Nicoletti

KDE Developer - http://dantti.wordpress.com

Julia Nething

unread,
Apr 13, 2017, 11:31:58 AM4/13/17
to framework-benchmarks
Hi Tom,
API Star was already merged in, so it'll be in the Round 14 final! If you have any other fixes, make sure to get them in by end of day today to make Round 14. Thanks for your contributions!

Nathan Brady

unread,
Apr 13, 2017, 12:11:10 PM4/13/17
to framework-benchmarks, green...@gmail.com, j...@pixolut.com
Hey Daniel,

Just an fyi that we will be moving away from the max_threads environment variable and will provide a cpu_count env variable instead. I've added a note in this PR https://github.com/TechEmpower/FrameworkBenchmarks/pull/2586

Brian Hauer

unread,
Apr 13, 2017, 1:37:16 PM4/13/17
to framework-benchmarks
Following up on the differences rendering, we've changed these for Preview 3 to show the 20-iteration samples for the multi-query and updates test types, so the numbers should now align with the default bar-chart rendering on the results web site.

Daniel Nicoletti

unread,
Apr 13, 2017, 2:05:20 PM4/13/17
to Nathan Brady, framework-benchmarks, Gelin Luo, Joe Cincotta
Awesome Nathan,

Are you going to do some find&replace or should I replace with
$CPU_COUNT already?

Adam Chlipala

unread,
Apr 13, 2017, 2:08:27 PM4/13/17
to framework-...@googlegroups.com
Let me just add my vote to consider this change a major issue. For
implementations trying to spawn as many threads as CPUs, they're
severely handicapped now, and each one needs to have someone craft a fix
manually. It probably makes sense for TE staff to search and replace as
Daniel suggests, as a better default than running old code with an
environment variable set to mean something different than it used to mean.

Nathan Brady

unread,
Apr 13, 2017, 2:08:35 PM4/13/17
to framework-benchmarks, nathan....@gmail.com, green...@gmail.com, j...@pixolut.com

As a part of that PR, I'll make sure to replace every framework that's currently using $MAX_THREADS. I won't be merging that until after Round 14 is final, so we'll have time to troubleshoot.
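Mechanically, the replacement is roughly the following (a sketch; the final variable name and the exact files touched are up to the PR):

grep -rl '\$MAX_THREADS' frameworks/ | xargs sed -i 's/\$MAX_THREADS/$CPU_COUNT/g'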

Adam Chlipala

unread,
Apr 13, 2017, 2:09:55 PM4/13/17
to framework-...@googlegroups.com
Sorry, just to be sure: you're saying that frameworks whose authors
didn't happen to notice this change and submit patches will be penalized
in Round 14, even though it should be a trivial fix across the whole
codebase?

Nathan Brady

unread,
Apr 13, 2017, 3:03:15 PM4/13/17
to framework-benchmarks
Sorry Adam for the misunderstanding. Round 13 was the first round in which the load generator and app server had a mismatched number of cores. I didn't realize this was the case. I'll be working on a fix for this now.

Adam Chlipala

unread,
Apr 13, 2017, 3:27:12 PM4/13/17
to framework-...@googlegroups.com
Ur/Web dropped a lot in performance from Round 13 to the Round 14
previews (with no code changes that I was aware of), so I'm betting some
other kind of change happened in the meantime, and changing the
environment-variable convention would explain it.

Max Gortman

unread,
Apr 13, 2017, 8:36:28 PM4/13/17
to framework-benchmarks
I'm now looking closely at the "tokio-minihttp" benchmarks. I see that for the JSON benchmark, for instance, there's a drop of about 100K req/sec in throughput from Preview 2 to Preview 3. I examined the dstat logs for both and have two observations (btw, a good tool to visualize dstat quickly: http://lamada.eu/dstat-graph/#):
- CPU usage is generally lower in preview 3 compared to preview 2
- CPU usage is overall low in both rounds.
Typically when I see this with async frameworks it means not enough load has been applied. What are your thoughts?

I also find "hyper" results in preview 3 suspicious: throughput is about the same for 64, 128 and 256 concurrency values (even degrading with increased concurrency) - ~389K req/sec. For comparison, I've ran the hyper JSON benchmark in Azure D3v2 VM (Ubuntu 16.04) and stressed it with wrk from F16 VM. Here are the results:
$ wrk -t 16 -c 256 -d 1m http://x.x.x.x:8080/json
Running 1m test @ http://x.x.x.x:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.98ms    6.60ms  99.33ms   97.97%
    Req/Sec    15.17k     2.59k   22.15k    82.25%
  14739331 requests in 1.02m, 2.06GB read
Requests/sec: 241225.18
Transfer/sec:     34.51MB
So a 4-core VM in Azure is pushing 60% of what the 40-core (80 HT) machine does.

zloster

unread,
Apr 14, 2017, 3:55:47 AM4/14/17
to Max Gortman, framework-benchmarks
Hello Max,

On 2017-04-14 03:36, Max Gortman wrote:
> I'm now looking at "tokio-minihttp" benchmarks closely. I see that for
> JSON benchmark for instance, there's a -100K req/sec in throughput
> from preview 2 to preview 3. I examined dstat logs for both and there
> are two observations (btw good tool to vis dstat quickly:
> http://lamada.eu/dstat-graph/#):
> - CPU usage is generally lower in preview 3 compared to preview 2
> - CPU usage is overall low in both rounds.
> Typically when I see this with async frameworks it means not enough
> load has been applied. What are your thoughts?
>
> I also find "hyper" results in preview 3 suspicious: throughput is
> about the same for 64, 128 and 256 concurrency values (even degrading
> with increased concurrency) - ~389K req/sec. For comparison, I've ran
> the hyper JSON benchmark in Azure D3v2 VM (Ubuntu 16.04) and stressed
> it with wrk from F16 VM. Here are the results:
> $ wrk -t 16 -c 256 -d 1m http://x.x.x.x:8080/json
> Running 1m test @ http://x.x.x.x:8080/json
> 16 threads and 256 connections
> Thread Stats Avg Stdev Max +/- Stdev
> Latency 1.98ms 6.60ms 99.33ms 97.97%
> Req/Sec 15.17k 2.59k 22.15k 82.25%
> 14739331 requests in 1.02m, 2.06GB read
> Requests/sec: 241225.18
> Transfer/sec: 34.51MB

Note 1:
I did a quick check of
http://tfb-logs.techempower.com/round-14/preview-3/hyper/out.txt. There
is one warning that is bothering me:
Setup hyper: install: installing component 'cargo'
Setup hyper: install: WARNING: failed to run ldconfig. this may
happen when not installing as root. run with --verbose to see the error

Since Rust is using a managed runtime, is it possible that this runtime
is not picking up some essential shared library due to the above warning?

Note 2:
In each test result folder there is a "raw" file - it contains the
output of the wrk client during the test run.

---------------------------------------------------------
Concurrency: 256 for hyper
wrk -H 'Host: localhost' -H 'Accept:
application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7'
-H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8
http://TFB-server:8080/json
---------------------------------------------------------

Running 15s test @ http://TFB-server:8080/json
  8 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.47ms    4.61ms 127.21ms   96.00%
    Req/Sec    43.44k    12.10k   79.08k    80.85%
  Latency Distribution
     50%  616.00us
     75%    1.00ms
     90%    1.63ms
     99%   20.25ms
  5185886 requests in 15.10s, 741.85MB read
Requests/sec: 343449.06
Transfer/sec:     49.13MB
STARTTIME 1491445267
ENDTIME 1491445282

Note the difference in the number of threads wrk is using: TFB - 8, your
test - 16.

Note 4:
The second warmup of hyper in the plaintext test is giving almost the
same number of RPS as the JSON test. Note that the warmup is NOT using
HTTP pipelining and the actual test is using it. See the raw wrk log
here:
http://tfb-logs.techempower.com/round-14/preview-3/hyper/plaintext/raw

---------------------------------------------------------
Running Warmup hyper
wrk -H 'Host: localhost' -H 'Accept:
text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7'
-H 'Connection: keep-alive' --latency -d 15 -c 16384 --timeout 8 -t 8
http://TFB-server:8080/plaintext
---------------------------------------------------------

Running 15s test @ http://TFB-server:8080/plaintext
  8 threads and 16384 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   140.66ms  249.34ms   6.97s    88.59%
    Req/Sec    41.08k     7.21k   71.85k    81.50%
  Latency Distribution
     50%   24.34ms
     75%  179.96ms
     90%  428.34ms
     99%  997.47ms
  4904617 requests in 15.06s, 608.06MB read
Requests/sec: 325613.71
Transfer/sec:     40.37MB

Having almost the same results in Plaintext and JSON suggests some
limitation in the configuration of the async framework. Take a look at
the Dropwizard results. It is using Jetty as the HTTP server and Jersey
for the REST APIs.
Dropwizard Plaintext: 136,773 rps
Dropwizard JSON: 120,728 rps

In the same tests, Jetty (a stripped-down implementation; the Jetty
server version is some old 9.3.x):
Jetty Plaintext: 513,764 rps
Jetty JSON: 306,299 rps

We see a significant difference between plaintext and JSON, as we should
expect, because the JSON test involves more operations before returning
the response.

Note 5:
Hyper is having problems with the "plaintext" test. This test is the
only one that uses HTTP pipelining.
When the wrk client enables pipelining, everything grinds to a halt. See
the raw file mentioned above.
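To reproduce that locally, the pipelined load is roughly the following (pipeline.lua comes from the TFB toolset, and the trailing 16 is the pipeline depth used for plaintext; the real run goes up to 16,384 connections):

wrk -H 'Connection: keep-alive' -d 15 -c 256 --timeout 8 -t 8 \
    -s pipeline.lua http://TFB-server:8080/plaintext -- 16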


--
Best regards,
zloster

green

unread,
Apr 14, 2017, 5:57:24 AM4/14/17
to framework-benchmarks, Michael Hixson
Alright, I finally get what happened. The latest version of the act-ebean plugin has a bug: it initializes a data source with 256 DB connections open but doesn't use it. Thus Ebean uses its own data source with only 2 connections available.

@Michael I will send a pull request today. I hope this version can make it into the final run...

Ilya Ryzhenkov

unread,
Apr 14, 2017, 7:55:37 AM4/14/17
to framework-benchmarks
Is there any (preliminary) schedule for the preview and final rounds? Will there be any more previews?

We just added the "ktor" framework via PR (https://github.com/TechEmpower/FrameworkBenchmarks/pull/2669) and could use any preliminary data for analysis.


Fredrik Widlund

unread,
Apr 14, 2017, 8:02:10 AM4/14/17
to Max Gortman, framework-benchmarks
I believe the application server resources are not saturated, since the bottleneck is the load generator node. The top contenders are both more optimized/resource-efficient and run on hardware that has a lot more CPU cycles available. It also seems to me that wrk varies in performance for some reason, which introduces an additional random/unknown factor.

Kind regards,
Fredrik Widlund


Ilya Ryzhenkov

unread,
Apr 14, 2017, 8:02:55 AM4/14/17
to framework-benchmarks
http://tfb-logs.techempower.com/round-14/preview-3/spark/plaintext/stats.json
It's an array of 3 empty objects, am I missing something?

Julia Nething

unread,
Apr 14, 2017, 9:11:16 AM4/14/17
to framework-benchmarks
Hi Ilya,
As announced by Brian yesterday, the third and last preview for Round 14 has already been released. The Round 14 final run will begin today. Because your ktor framework was already merged in, it will be included in the final run for Round 14. Thanks for your contributions!

zloster

unread,
Apr 14, 2017, 12:54:48 PM4/14/17
to Ilya Ryzhenkov, framework-benchmarks
Hi Ilya,

On 2017-04-14 15:02, Ilya Ryzhenkov wrote:
> http://tfb-logs.techempower.com/round-14/preview-3/spark/plaintext/stats.json
> It's an array of 3 empty objects, am I missing something?
Round 14 preview 1 has data in the stats.json file. Round 14 previews 2
and 3 don't have data there, so the TFB team should check.

As a workaround, use the CSV files exported from the dstat tool - they
seem OK (note: no file extension):
http://tfb-logs.techempower.com/round-14/preview-3/spark/plaintext/stats

Max Gortman <maxim....@gmail.com> gave a link to someone's webpage
which is able to visualise the data from the dstat file ->
http://lamada.eu/dstat-graph/#


--
Best regards,
zloster

Ilya Ryzhenkov

unread,
Apr 14, 2017, 1:00:27 PM4/14/17
to framework-benchmarks, ilya.ry...@gmail.com, mo...@edno.moe
I don't see any data here either: http://tfb-logs.techempower.com/round-14/preview-1/spark/plaintext/stats.json
For the CSV, yep, thanks - I just wanted to flag a potential problem with the JSON files here.

Michael Hixson

unread,
Apr 14, 2017, 1:18:01 PM4/14/17
to Ilya Ryzhenkov, framework-benchmarks, mo...@edno.moe
Yeah, the stats.json files are broken. Use the CSVs instead.

Here's what I concluded when I looked into this a week or so ago:

- The stats.json files are supposed to contain the same information
as the CSVs, except broken apart by concurrency level (or queries per
request).
- There's just one CSV covering all concurrency levels, so the
toolset has to try to figure out which parts of the CSV are associated
with each concurrency level.
- The toolset tries to do this by comparing the timestamps in the CSV
rows to the timestamps in the raw wrk output. Relevant part of code:
https://github.com/TechEmpower/FrameworkBenchmarks/blob/cee6ff033a95fa5b854212791d2a9ddefda91971/toolset/benchmark/framework_test.py#L694-L736
- This means the toolset is comparing timestamps from the application
server with timestamps from the client server.
- In ServerCentral, the clocks on those two machines are a few
minutes off from each other, so this approach doesn't work at all.

That's where I left it.
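
For anyone curious what that failure mode looks like, here is a minimal
sketch (not the actual toolset code; the timestamps and the skew are
invented for illustration) of matching server-side dstat rows to a
client-side wrk run window, and why a few minutes of clock skew leaves
every concurrency level with an empty object:

from datetime import datetime, timedelta

def rows_for_window(dstat_rows, start, end):
    """Return the dstat samples whose timestamp falls inside [start, end]."""
    return [r for r in dstat_rows if start <= r["time"] <= end]

# Hypothetical data: a 15-second wrk run timed on the client's clock, and
# per-second dstat samples timed on the server's clock, which is 3 minutes ahead.
skew = timedelta(minutes=3)
run_start = datetime(2017, 4, 10, 12, 0, 0)      # taken from the raw wrk output
run_end = run_start + timedelta(seconds=15)
dstat_rows = [{"time": run_start + skew + timedelta(seconds=i), "idl": 50.0}
              for i in range(15)]                 # taken from the server's dstat CSV

matched = rows_for_window(dstat_rows, run_start, run_end)
print(f"rows matched: {len(matched)}")  # 0 -> nothing to put in stats.json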

-Michael

Max Gortman

unread,
Apr 14, 2017, 3:13:19 PM4/14/17
to framework-benchmarks, maxim....@gmail.com, mo...@edno.moe
zloster,
Talking about the way wrk is run: "-t 8" is there because (I assume) the load generator has only 8 cores. I would argue that it is impossible to apply enough load from an 8-core machine when benchmarking high-performance frameworks on a 40/80-core server. That way we're not benchmarking the server but the load generator. If you're not stressing the server, it is hard to compare results for different frameworks without bringing resource consumption (CPU would likely be the prime resource) into the picture. I.e., we may have two frameworks showing the same throughput numbers while one uses up more CPU than the other.

Coming back to hyper, "failed to run ldconfig" is unlikely to affect anything. It's reported during the Rust setup and is shared by hyper, tokio-minihttp and the other Rust frameworks, and tokio's benchmarks worked fine. I tried running wrk with the same pipeline.lua script against my Azure VM (running the benchmark directly, not going through setup) and it works fine:
$ wrk -H 'Host: localhost' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8 http://x.x.x.x:8080/plaintext -s ~/pipeline.lua -- 16
  8 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.47ms   13.78ms 178.38ms   60.05%
    Req/Sec    45.50k    14.66k  176.08k    81.16%
  Latency Distribution
     50%   20.00ms
     75%   31.50ms
     90%   38.42ms
     99%   44.93ms
  5288124 requests in 14.62s, 655.61MB read
  Socket errors: connect 0, read 0, write 0, timeout 766
Requests/sec: 361818.60
Transfer/sec:     44.86MB

And the server is not stressed; increasing concurrency gives higher results (e.g. 417K req/sec with "-c 950"). Also, to counter your point on JSON vs plaintext, I ran wrk with pipelining against the JSON endpoint in my Azure setup and got slightly lower throughput (CPU is ~90% on the server in both cases):
$ wrk -H 'Host: localhost' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 950 --timeout 8 -t 8 http://x.x.x.x:8080/json -s ~/pipeline.lua -- 16
Running 15s test @ http://x.x.x.x:8080/json
  8 threads and 950 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    25.07ms   14.02ms  87.84ms   61.37%
    Req/Sec    51.44k    12.66k  236.97k    95.98%
  Latency Distribution
     50%   24.77ms
     75%   36.19ms
     90%   43.37ms
     99%   59.58ms
  5983728 requests in 14.68s, 855.98MB read
  Socket errors: connect 0, read 0, write 0, timeout 2832
Requests/sec: 407717.30
Transfer/sec:     58.32MB

I would argue that the significant difference between JSON and plaintext is due to differences between the benchmarks themselves, not necessarily the effort required on the server side to handle them.


Fredrik Widlund

unread,
Apr 14, 2017, 8:28:51 PM4/14/17
to Max Gortman, framework-benchmarks, mo...@edno.moe
As it is, the json/plaintext benchmarks are simply broken. I would just like to point out that this could easily be fixed by reversing the node roles in the json/plaintext tests, and having the 80-core node generate load while the candidates are tested on the smaller 16-core node. This should give higher results, avoid noise related to overloading wrk, and actually test the limits of the candidates.

The other tests are asymmetrical from a load point of view and I assume are not affected.

Kind regards,
Fredrik Widlund


Nikolche Mihajlovski

unread,
Apr 15, 2017, 4:17:55 AM4/15/17
to framework-benchmarks
I like Fredrik's idea to switch the app server and load generator machines, but I think it should be done for all tests. I wouldn't expect higher results, but they should be more realistic. That way the 8-core app server will align with the 8-core database node.

Another approach would be to replace the 40-core node with 2 x 20-core nodes. :)

Regards,
Nikolche

zloster

unread,
Apr 15, 2017, 10:09:33 AM4/15/17
to Max Gortman, framework-benchmarks
Hi again Max,

On 2017-04-14 22:13, Max Gortman wrote:
> zloster,
> Talking about the way wrk is ran: "-t 8" is there because (I assume)
> load generator has only 8 cores. I would argue that it is impossible
> to apply enough load while benchmarking high performance frameworks
> from 8 core machine when the server is 40/80 cores.
See below for details about the limits on 10 Gigabit Ethernet.

> That way we're not benchmarking the server but the load generator.

I understand that it is very important to know, and not hit, the load
generator's limits. It is easy to verify the limits of the load generator
machine. Here you can find a hint:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/1432#issuecomment-176902041
> Data from Round 11 shows that Wrk can generate at least 2.2 million
> non-pipelined requests per second and
> 6.8 million pipelined requests per second in the Peak environment. This
> is (we believe) the practical limit of the
> 10 gigabit Ethernet. In other words, we do not presently believe we are
> operating at the performance limits of the
> load generator.
These numbers are for the old, more powerful Peak environment. They also
align well with the quick calculations here:
https://groups.google.com/forum/#!topic/framework-benchmarks/ABe4XZ0Ws-M
The theoretical limit seems to be around 2.5 million non-pipelined
HTTP requests per second and 8.8 million pipelined. In practice it is
around 2.2 million non-pipelined and 7 million pipelined.
So if we want to be sure that the load generator is not the bottleneck
on a 10 Gb network, it should be able to achieve the above numbers.
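
As a rough cross-check of those figures, this is the kind of back-of-envelope
arithmetic involved; the per-request wire sizes below are guesses for
illustration, not measurements from the benchmark:

LINK_BPS = 10e9  # 10 gigabit Ethernet

def rps_ceiling(bytes_per_request_on_wire):
    """Upper bound on requests/sec if the link itself is the only limit."""
    return LINK_BPS / (bytes_per_request_on_wire * 8)

# Non-pipelined: a full request + response per transaction, roughly 500 bytes.
# Pipelined: tiny requests and batched responses, roughly 140 bytes each.
print(f"non-pipelined ceiling: {rps_ceiling(500):,.0f} req/s")   # ~2.5 million
print(f"pipelined ceiling:     {rps_ceiling(140):,.0f} req/s")   # ~8.9 million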

I haven't seen data for what is possible with the current Dell R710 (2x
4-Core E5520 CPUs) load generator. IMO this indeed needs verification
and measurement. See my thoughts about this:

A quick look at the logs and the results from Round 14 preview 3 for
Plaintext, focusing on the current top 5:
Framework            Peak RPS    Second warmup
(8 threads, 16384 connections, without http pipelining)
ulib 4,387,777 372,301
octane 4,164,263 329,538
rapidoid 4,128,648 433,102
tokio-minihttp 4,107,593 354,009
rapidoid-http-fast 4,003,742 349,302

Under 10% variance for the peak numbers. The warmup numbers are also very
close.

Let's take a look at the Plaintext with 16384 concurrency:
Framework              256        1024       4096       16384   (concurrency)
rapidoid-http-fast 4,003,742 2,858,038 3,219,736 2,660,987
rapidoid 4,128,648 3,014,073 3,264,176 2,646,354
octane 4,164,263 3,902,860 3,482,331 2,636,165
colossus 3,189,037 3,328,499 3,018,211 2,632,731
netty 1,567,994 3,196,335 3,115,748 2,609,901
libreactor 3,106,476 3,076,824 3,300,754 2,559,007
fasthttp-mysql-prefo 2,633,985 2,995,911 2,643,968 2,543,284
undertow 2,882,901 2,453,822 3,147,735 2,532,220
tokio-minihttp 4,107,593 2,612,418 2,722,007 2,524,006
ulib 4,387,777 2,884,581 3,044,467 2,514,053

10 framework implementations are above 2,500,000 RPS and within 6% of each
other. IMO these numbers do suggest the load generator is at its limits
(assuming everything else involved, like the network equipment, is OK). I'd
appreciate the comments of the TFB team and the contributors.
If this is indeed a load generator limitation, a better approach seems to be
to use the DB server as a second load generator for the Plaintext and JSON
tests. The proposed switching of the DB and the app server will produce a
picture similar to the cloud environments, but the results will be dominated
by how efficient the frameworks are when working with limited resources.

> If you're not stressing the server it is hard to compare results for
> different
> frameworks without throwing resource consumption (likely CPU would be
> prime resource) into the picture. I.e. we may have two frameworks
> showing the same throughput numbers while one uses up more CPU than
> the other.

IMO it will be REALLY hard to devise a test for a wide variety of
frameworks and languages that also has similar resource utilisation across
them. Have you thought about how such a test would be implemented?
And if you are stressing the server, how will you ensure that a given
participant is not limited by resource constraints? A major requirement for
a performance test is to stay away from known bottlenecks - I/O, memory
bandwidth, network bandwidth, CPU utilisation.

A better approach to comparison is to measure the latencies of the
responses at a given constant load that you are interested in, and only
then compare resource utilisation. See this sample result:
https://github.com/haywire/haywire#latency-distribution-with-coordinated-omission-at-35-million-requestssecond
Note that this is "wrk2", not "wrk", and the presented numbers include a
correction for "coordinated omission". The details are here:
https://github.com/giltene/wrk2
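
To illustrate what that correction does (a toy simulation with invented
numbers, not wrk2's actual code): requests are scheduled at a fixed rate,
and latency is measured from the intended send time, so a stall that blocks
the connection is charged to every request it delayed, not only to the one
that was in flight:

TARGET_RATE = 1000.0            # intended requests per second (assumed)
INTERVAL = 1.0 / TARGET_RATE

# Mostly fast responses, with one 200 ms stall in the middle.
service_times = [0.0005] * 50 + [0.200] + [0.0005] * 50

naive, corrected = [], []
next_scheduled = 0.0            # when each request should have been sent
free_at = 0.0                   # when the (single) connection becomes free

for st in service_times:
    actual_send = max(next_scheduled, free_at)   # a blocked connection sends late
    finish = actual_send + st
    naive.append(finish - actual_send)           # what a naive client records
    corrected.append(finish - next_scheduled)    # measured from the intended send time
    free_at = finish
    next_scheduled += INTERVAL

def pct(samples, p):
    s = sorted(samples)
    return s[int(p / 100 * (len(s) - 1))]

print(f"99th percentile, naive:     {pct(naive, 99) * 1000:.1f} ms")      # ~0.5 ms
print(f"99th percentile, corrected: {pct(corrected, 99) * 1000:.1f} ms")  # ~200 ms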
"timeout 766" - you are at some bottleneck. Socket timeouts...

> And server is not stressed, increasing concurrency gives higher
> results (e.g. 417K req/sec with "-c 950"). Also, to counter your point
> on JSON vs plaintext, I ran wrk with pipelining against JSON endpoint
> in my Azure setup and getting slightly less throughput (CPU is ~90% on
> server in both cases):
>
> $ wrk -H 'Host: localhost' -H 'Accept:
> application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7'
> -H 'Connection: keep-alive' --latency -d 15 -c 950 --timeout 8 -t 8
> http://x.x.x.x:8080
> /json -s ~/pipeline.lua -- 16
> Running 15s test @ http://x.x.x.x:8080/json
> 8 threads and 950 connections
> Thread Stats Avg Stdev Max +/- Stdev
> Latency 25.07ms 14.02ms 87.84ms 61.37%
> Req/Sec 51.44k 12.66k 236.97k 95.98%
> Latency Distribution
> 50% 24.77ms
> 75% 36.19ms
> 90% 43.37ms
> 99% 59.58ms
> 5983728 requests in 14.68s, 855.98MB read
> Socket errors: connect 0, read 0, write 0, timeout 2832
> Requests/sec: 407717.30
> Transfer/sec: 58.32MB

"timeout 2832" Here also you have stressed too much something. You are
entering overload situation.

> I would argue that the significant difference between JSON and
> plaintext is due to difference between benchmarks themselves, not
> necessarily the effort required on the server side to handle it.

Let's look at this from another angle: the Rust frameworks in the JSON
benchmark from R14 preview 3:
Framework         8        16       32       64       128      256   (concurrency)
tokio-minihttp 98,200 176,714 289,734 419,110 424,384 511,709
nickel 80,837 151,715 259,801 387,165 360,608 355,784
hyper 84,704 157,423 267,849 389,675 361,833 345,725
iron 67,716 130,834 233,618 348,153 346,977 326,982
rouille 25,109 25,447 25,271 24,959 24,255 23,795

Note how nickel, hyper and iron reach a plateau at 64 concurrency.
At higher concurrencies their results stay almost the same or drop slightly.
Note also that tokio-minihttp does not show this behaviour. So why is
tokio-minihttp scaling while the other four are not?

Now let's look at the bigger picture in the same benchmark (sorted by
the results at 256 concurrency):
Framework              8      16      32      64      128     256   (concurrency)
revenj.jvm 112033 189545 299618 382928 509360 665834
servlet 108602 187357 294883 403809 409323 645680
rapidoid-http-fast 99372 182104 292964 399717 416163 638119
undertow 86932 156803 264053 398772 453186 628794
colossus 76674 143996 247673 383180 389195 625942
ulib 115242 182878 294075 399803 435178 620660
s-server 93033 169084 269702 399324 409479 607658
duda i/o 87334 164603 272907 399291 474881 605754
wsgi 67311 129507 213696 341982 439234 589313
blaze 79016 147357 251775 374707 446628 582074
gemini 90782 157512 250925 342519 371832 579290
light-java 95886 172368 274862 390042 544715 578033
netty 98221 180774 291033 407164 434989 558601
grizzly 74665 134777 245294 360010 437142 552021
go-prefork 64694 119308 210777 324737 446720 525139
tokio-minihttp 98200 176714 289734 419110 424384 511709
api-hour+asyncio 40298 106757 187029 282175 420230 493771
vertx-web 83823 148342 238564 347104 428766 489933
vertx 83778 148450 242290 356352 442523 477911
falcon 48780 95150 173254 267762 379454 473483
fintrospect 47086 90970 170086 301889 389763 452617
finatra 53001 104914 200890 315022 383601 441293
finagle 51958 96369 181965 278473 324027 427877
falcon-py3 45290 84597 158548 257434 316949 422235
wheezy.web 44628 79460 161705 252741 348070 413868
ngx_mruby 60720 95878 143329 208442 299295 394047
nodejs 47538 94124 179110 289651 337821 384032
lapis 29755 53998 93423 163908 266700 374436
h2o 107559 190157 293460 366172 423523 373083
libreactor 116591 208652 318894 415287 604531 361348
rapidoid 99349 180073 290922 360646 548210 361135
wicket 78311 134439 212438 336266 329685 358091
nickel 80837 151715 259801 387165 360608 355784
bottle 38521 72705 134051 228086 306488 355209
hyper 84704 157423 267849 389675 361833 345725

There are 16 (sixteen) frameworks giving results above 500,000 RPS at
256 concurrency.
So the load generator seems to be able to deliver quite a lot more load
than the 390,000 RPS that hyper gives at its peak. Note that most of the
frameworks are able to scale with the concurrency, but some have
problems.
For example libreactor:
128 concurrency: 604,531
256 concurrency: 361,348
That is a big drop and it should be inspected.

Cheers,
zloster

> On Friday, April 14, 2017 at 12:55:47 AM UTC-7, zloster wrote:
>
>> Hello Max,
>>
>> On 2017-04-14 03:36, Max Gortman wrote:
>>> I'm now looking at "tokio-minihttp" benchmarks closely. I see that
>> for
>>> JSON benchmark for instance, there's a -100K req/sec in throughput
>>
>>> from preview 2 to preview 3. I examined dstat logs for both and
>> there
>>> are two observations (btw good tool to vis dstat quickly:
>>> http://lamada.eu/dstat-graph/# [1]):
>>> - CPU usage is generally lower in preview 3 compared to preview 2
>>> - CPU usage is overall low in both rounds.
>>> Typically when I see this with async frameworks it means not
>> enough
>>> load has been applied. What are your thoughts?
>>>
>>> I also find "hyper" results in preview 3 suspicious: throughput is
>>
>>> about the same for 64, 128 and 256 concurrency values (even
>> degrading
>>> with increased concurrency) - ~389K req/sec. For comparison, I've
>> ran
>>> the hyper JSON benchmark in Azure D3v2 VM (Ubuntu 16.04) and
>> stressed
>>> it with wrk from F16 VM. Here are the results:
>>> $ wrk -t 16 -c 256 -d 1m http://x.x.x.x:8080/json [2]
>>> Running 1m test @ http://x.x.x.x:8080/json [2]
>>> 16 threads and 256 connections
>>> Thread Stats Avg Stdev Max +/- Stdev
>>> Latency 1.98ms 6.60ms 99.33ms 97.97%
>>> Req/Sec 15.17k 2.59k 22.15k 82.25%
>>> 14739331 requests in 1.02m, 2.06GB read
>>> Requests/sec: 241225.18
>>> Transfer/sec: 34.51MB
>>
>> Note 1:
>> I've a quick check of:
>> http://tfb-logs.techempower.com/round-14/preview-3/hyper/out.txt
>> [3]. There
>> is one warning that is bothering me:
>> Setup hyper: install: installing component 'cargo'
>> Setup hyper: install: WARNING: failed to run ldconfig. this may
>>
>> happen when not installing as root. run with --verbose to see the
>> error
>>
>> Since rust is using managed runtime is it possible that this runtime
>> is
>> not picking some essential shared library due to the above warning?
>>
>> Note 2:
>> In each test result folder there is a "raw" file - it contains the
>> output of the wrk client during the test run.
>>
>> ---------------------------------------------------------
>> Concurrency: 256 for hyper
>> wrk -H 'Host: localhost' -H 'Accept:
>>
> application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7'
>>
>> -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8
>> http://TFB-server:8080/json [4]
>> ---------------------------------------------------------
>>
>> Running 15s test @ http://TFB-server:8080/json [4]
>> [5]
>>
>> ---------------------------------------------------------
>> Running Warmup hyper
>> wrk -H 'Host: localhost' -H 'Accept:
>>
> text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7'
>>
>> -H 'Connection: keep-alive' --latency -d 15 -c 16384 --timeout 8 -t
>> 8
>> http://TFB-server:8080/plaintext [6]
>> ---------------------------------------------------------
>>
>> Running 15s test @ http://TFB-server:8080/plaintext [6]
>>>> https://www.techempower.com/benchmarks/previews/round14/ [7] [1]
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>
>>> Groups "framework-benchmarks" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>> send
>>> an email to framework-benchm...@googlegroups.com.
>>> Visit this group at
>>> https://groups.google.com/group/framework-benchmarks [8] [2].
>>> For more options, visit https://groups.google.com/d/optout [9]
>> [3].
>>>
>>>
>>> Links:
>>> ------
>>> [1] https://www.techempower.com/benchmarks/previews/round14/ [7]
>>> [2] https://groups.google.com/group/framework-benchmarks [8]
>>> [3] https://groups.google.com/d/optout [9]
>>
>> --
>> Best regards,
>> zloster
>
> --
> You received this message because you are subscribed to the Google
> Groups "framework-benchmarks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to framework-benchm...@googlegroups.com.
> Visit this group at
> https://groups.google.com/group/framework-benchmarks [8].
> For more options, visit https://groups.google.com/d/optout [9].
>
>
> Links:
> ------
> [1] http://lamada.eu/dstat-graph/#
> [2] http://x.x.x.x:8080/json
> [3] http://tfb-logs.techempower.com/round-14/preview-3/hyper/out.txt
> [4] http://TFB-server:8080/json
> [5]
> http://tfb-logs.techempower.com/round-14/preview-3/hyper/plaintext/raw
> [6] http://TFB-server:8080/plaintext
> [7] https://www.techempower.com/benchmarks/previews/round14/
> [8] https://groups.google.com/group/framework-benchmarks
> [9] https://groups.google.com/d/optout

zloster

unread,
Apr 15, 2017, 10:16:57 AM4/15/17
to Nikolche Mihajlovski, framework-benchmarks
Hi Nikolche,
On 2017-04-15 11:17, Nikolche Mihajlovski wrote:
> I like Fredrik's idea to switch the app server and load generator
> machines, but I think it should be done for all tests. I wouldn't
> expect higher results, but they should be more realistic. That way the
> 8-core app server will align with the 8-core database node.

Please see my notes about this in another message in this thread:
https://groups.google.com/d/msg/framework-benchmarks/sQDY1uELRkY/io1s_K33BwAJ

> Another approach would be to replace the 40-core node with 2 x 20-core
> nodes. :)
> Regards,
> Nikolche
>

--
Best regards,
zloster

Brian Hauer

unread,
Apr 17, 2017, 3:02:13 PM4/17/17
to framework-benchmarks
Hi all,

Contrary to my earlier message, we are going to post a Preview 4 using the data from the run that completed over the weekend.  This is in light of some of the changes to concurrency levels in some frameworks.  However, it is still our intent to wrap up Round 14 as soon as possible and proceed very quickly into preview runs for Round 15.

Time permitting, I will follow up later today with Preview 4 details.

Brian Hauer

unread,
Apr 17, 2017, 3:08:32 PM4/17/17
to framework-benchmarks
Hi again,

I apologize.  I was mistaken about the timing of the changes.  We will need to wait until a currently-executing full run completes (approximately this Wednesday) in order to prepare a Preview 4.  Thank you for your patience!

Brian Hauer

unread,
Apr 19, 2017, 6:26:30 PM4/19/17
to framework-benchmarks
We just posted Preview 4 of Round 14.

https://www.techempower.com/benchmarks/previews/round14/

This includes (partial, see below) resolution for some frameworks that were using improper concurrency levels for the CPU core counts of the ServerCentral environment.  However, the results are peculiar at a glance, so we are going to investigate a bit and may do additional previews before finalizing.

In particular, the following four tests that were changed to use higher process concurrency saw their plaintext and JSON results worsen since Preview 3 while their database results improved.  We do not yet have a theory for why this would happen.

Bottle: https://www.techempower.com/benchmarks/previews/round14/r14p3-vs-r14p4.html#bottle-nginx-uwsgi.json

Cutelyst: https://www.techempower.com/benchmarks/previews/round14/r14p3-vs-r14p4.html#cutelyst.json

h2o: https://www.techempower.com/benchmarks/previews/round14/r14p3-vs-r14p4.html#h2o.json

Weppy: https://www.techempower.com/benchmarks/previews/round14/r14p3-vs-r14p4.html#weppy-nginx-uwsgi.json

It is my intent and expectation that continuous benchmarking will allow for more rapid progression of rounds in the future.  So although I feel delaying Round 14 for a bit until these peculiar results can be either explained or resolved is warranted at this time, I would prefer for future rounds to be more quickly frozen and completed, regardless of outstanding issues.  Once we can establish a routine of rounds being completed more quickly, I believe we will be more comfortable with "locking in" rounds despite any known issues, given that issues will be visible for a shorter period of time between rounds.

We also intend to gradually increase the test durations (presently at 15 seconds), and accordingly the full run duration (presently at ~72 hours), which may help address some of the observed volatility between runs in some frameworks' results.

Anton Kirilov

unread,
Apr 19, 2017, 7:51:22 PM4/19/17
to framework-benchmarks
Hi Brian,

I have a small correction to make: The concurrency level used by h2o in preview 4 was decreased, not increased; in fact, it is 5 times less than the one used in round 13 (while the result decreased by 39%). After I changed the code to use SO_REUSEPORT (right before preview 3), there was a regression in the JSON results, even though that change was supposed to make load balancing fairer, and hence improve the results. So, I started to suspect that the application was using way too many threads (FYI, the load average values are usually around 4, 7 and 14 respectively), and I decided to go in a different direction. Actually, I would like to try with only 8 threads/processes, but, unfortunately, I don't have access to the same or similar hardware environment, and it is too late in this round for further experiments, so I will just increase the thread count to a value that seems reasonable.

As for the database results, I also decreased the number of database connections, so there was a further regression - the fortunes test seems to benefit from having more connections, while the single and multiple queries seem to be fine with only 64. The significant improvement in the updates test is due to switching back to doing batch updates instead of a single transaction doing a read and a write.

Best wishes,
Tony

green

unread,
Apr 19, 2017, 8:06:25 PM4/19/17
to Brian Hauer, framework-benchmarks
Hi Brian,

Thanks for the post. Bad news for me, as my pull request (which fixed the data source issue) doesn't seem to fix the low performance on the fortunes test (pgsql and mysql). And the data (from https://www.techempower.com/benchmarks/previews/round14/r14p3-vs-r14p4.html) is still pretty much the same as preview 3:

[pasted image 1: preview 3 vs preview 4 comparison]
The fortunes performance is very hard to explain. To make it clear, here is the chart showing the inconsistency between Fortunes and the other tests:
[pasted image 2: Fortunes vs other test results]
As a comparison, here is the data for dropwizard (a similar framework) with all three databases:
[pasted image 3: dropwizard results]
So that should be the right trend, i.e. the fortunes result should track the db and queries results.

I wonder if there is any possibility for me to profile in that running environment to see where the bottleneck of the fortunes test on mysql and pgsql is?


Naoki INADA

unread,
Apr 20, 2017, 8:15:49 AM4/20/17
to framework-benchmarks
When talking about Python (and many other "prefork"-type runtimes), it's very difficult to tune
for both the extreme cases (e.g. "plaintext" and "json") and the realistic cases which have
external dependencies (e.g. "db", "fortune", etc.).

So the result of the 4th preview seems OK to me.
It shows realistic performance well.

Samuel Colvin

unread,
Apr 20, 2017, 12:15:02 PM4/20/17
to framework-benchmarks
There are still some severe problems with preview 4.

See fortunes for aiohttp vs. aiohttp-pg-raw: here aiohttp outperforms aiohttp-pg-raw by roughly 10%. That makes absolutely no sense.

Also, running wrk against the same two setups locally, the raw setup is ~60% faster.

There are other similar inconsistencies when comparing other frameworks, but none as simple to compare as these two cases.

Where does this error come from?

Samuel

Samuel Colvin

unread,
Apr 20, 2017, 1:18:34 PM4/20/17
to framework-benchmarks
Further to my previous message I've done some digging. Dstat shows that in the case of aiohttp-pg-raw most CPUs are idle most of the time and "load avg" is consistently way lower than for aiohttp. (Note the y-axis on the screenshot below)

It seems pretty likely to me that this is a problem with the server applying load - the reason aiohttp-pg-raw is processing fewer requests is that it's receiving fewer requests!

Could this be because aiohttp-pg-raw doesn't implement the json and plaintext tests (they would be duplicates of standard aiohttp), so the load server has been running for less time and isn't "up to speed" somehow?
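
For reference, this is roughly how an average idle figure can be pulled out
of those dstat CSVs (the column layout is assumed from how dstat usually
names its CSV header columns, and the path in the usage comment is
hypothetical):

import csv

def average_idle(path):
    """Mean of the 'idl' columns across all sample rows in a dstat CSV export."""
    idle_cols, samples = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not idle_cols and "idl" in row:        # the header row names the columns
                idle_cols = [i for i, name in enumerate(row) if name == "idl"]
            elif idle_cols and row:
                try:
                    samples.append(sum(float(row[i]) for i in idle_cols) / len(idle_cols))
                except (ValueError, IndexError):
                    continue                           # skip titles and partial rows
    return sum(samples) / len(samples) if samples else None

# Hypothetical usage against a downloaded log:
# print(average_idle("aiohttp-pg-raw/fortune/stats"))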

Max Gortman

unread,
Apr 20, 2017, 4:30:37 PM4/20/17
to framework-benchmarks
Brian,

I really think that the overall setup is not able to apply enough load to the server. Dstat's CPU sampling is the most telling metric alongside the observed throughput. Take two benchmarks that I'm interested in: tokio-minihttp and aspnetcore-linux. For plaintext, the former shows throughput at 3.5M req/sec and the latter 1.7M req/sec. One might think the difference is 2x, while in fact dstat's data shows that with tokio-minihttp the CPU is idle anywhere between 35-75% (across cores at the same time) during the most demanding benchmark run, while for aspnetcore the CPU cores are idle for 2-12%. With this in mind, it looks like either the benchmark becomes pointless at a certain level of throughput (around 3.5M, it seems, in the last round), or there's some system-level issue preventing us from squeezing more out of the setup. It's the same story with the JSON benchmark: 529K req/sec for tokio and 206K req/sec for aspnetcore, but again, for the former the CPU is idle 70+% of the time while for the latter it is idle <10%.

I would suggest lowering the number of cores available on the server by at least a factor of three, so that the rest of the setup more adequately matches the capabilities of the top-performing frameworks; they would then be able to become CPU-bound under the load that the load generator (+ network) is capable of producing.


Thanks,
Max

Fredrik Widlund

unread,
Apr 21, 2017, 3:22:14 AM4/21/17
to Max Gortman, framework-benchmarks
I think that consistent results between runs should be a requirement, at least for, say, the top 10 candidates. If there are anomalies, these should be looked at. If a candidate implementation is unstable, this should be noted and perhaps even its results removed, but if the benchmark implementation itself is unstable, that is a critical bug that potentially renders all results invalid.

Regarding the setup, the json and plaintext benchmarks are highly symmetrical. Once a session is established the server end will poll, read, process a request, construct a reply, and write. wrk will poll, read, process a reply, construct the next request, and write. It is basically a ping-pong game. The amount of data/number of packets in either direction is highly symmetrical. Both the client and the server end of the sessions are basically doing the same processing. The TCP/IP kernel stack processing is the same. Interface interrupts are the same. wrk is optimized, but not overly so. A well-performing client node will match a well-performing server node almost exactly in terms of resource utilization. Idle CPU cycles will be around zero on both nodes unless the network itself is saturated. This is why you really can't benchmark a (4-CPU) 80-core node with a (2-CPU) 16-core one. Since some of the server implementations are more aggressively optimized than wrk itself, the client node definitely needs to have _more_ resources than the server node to ensure that it does not itself become the bottleneck.

The other benchmarks are highly asymmetrical: the server end of the session will do a lot more work than the client end. Here it seems much less likely to me that the wrk process would become a bottleneck.

Kind regards,
Fredrik Widlund

Daniel Nicoletti

unread,
Apr 21, 2017, 4:44:09 AM4/21/17
to Fredrik Widlund, Max Gortman, framework-benchmarks
We all want reproducible results between runs.
I was certainly surprised when this didn't happen for my
own framework, but after investigating there are many
things we all might be forgetting to take into consideration.

First, forget about the load generator machine: it has
fewer cores but a higher CPU frequency. It would of course
be interesting to see its CPU load, but still, wrk's work is
a lot easier than what the server has to do.

Now, if it is going to open 80 persistent connections and
you have 80 threads, how is your application going
to balance that?

If they are all listening on the same bound socket, chances
are high that only a few workers will get most of the connections.

One option is to use SO_REUSEPORT, which I learned about in this
round thanks to the H2O framework. In my experience it might still
give some threads 2 connections, but it's the best option IMO.
Another would be writing your own connection balancer,
which isn't trivial.

If you don't have a connection balancer then forget about
getting reproducible results on an 80-HT-core machine; it's
VERY likely you will get most connections on a few threads
due to the OS scheduler.

Now, talking about the OS scheduler: the Linux one always tries
to be smart, but smart is not reproducible, and it might take
some time until it balances your 80 threads well across all the cores.
Added to this, the server is a NUMA machine,
which means some memory might be a little slower to access
from another CPU socket, and if you do that frequently you will
see performance degradation.

To help the OS scheduler you can set CPU affinity, so each
worker process or thread is bound to a single CPU. If you
use uwsgi there is a --cpu-affinity option that, if you set it to 1,
gives a 1 (process, not thread!) to 1 (CPU) mapping.

For my preforking tests I'm currently experimenting with a
different setup. Since CPU-bound tasks don't benefit much
from HT systems, I'm spawning 40 processes and setting
the CPU affinity of each to two cores; this way each "real" core
has only one worker, so the workers won't compete with each
other, with the added benefit of better connection balance:

For example, if you have 40 connections and 80 threads,
with an evenly balanced load the first 40
threads get 1 connection each; if worker threads 1 and 2
are bound to CPUs 0 and 1, which are the same "real"
core, you will get load on only 20 real cores.

So in theory 40 threads/processes bound to 2 CPU
cores each, evenly balanced, is what would give the best
throughput; in practice I'm not sure whether, with
256 connections, HT could help a bit...
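
For what it's worth, here is a bare-bones sketch of that pattern
(Python and Linux-only, not any framework's real setup code; the
worker count and port are arbitrary): fork one worker per chosen CPU,
pin it with sched_setaffinity, and let each worker accept on its own
SO_REUSEPORT listener so the kernel spreads connections across them.

import os
import socket

WORKERS = 4        # e.g. one per physical core you want to use
PORT = 8080

def worker(cpu):
    os.sched_setaffinity(0, {cpu})                    # pin this process to one CPU
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # per-worker listener
    srv.bind(("0.0.0.0", PORT))
    srv.listen(128)
    while True:
        conn, _ = srv.accept()                        # the kernel balances accepts
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()

for cpu in range(WORKERS):
    if os.fork() == 0:
        worker(cpu)
        os._exit(0)

for _ in range(WORKERS):
    os.wait()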

That's my 2c on what you can/need to do if
you want better and reproducible results.

Now, there would be a lot less room for
fine tuning if the application server were
the same as the load generator, due to fewer
CPUs, but still, having the DB server with 80
cores might require the same tuning to
try to distribute the load evenly.

So if you haven't done any fine tuning, have fun :)

Best,






--
Daniel Nicoletti

KDE Developer - http://dantti.wordpress.com

Fredrik Widlund

unread,
Apr 21, 2017, 5:29:12 AM4/21/17
to Daniel Nicoletti, Max Gortman, framework-benchmarks
> First, forget about the load generator machine: it has fewer cores but a
> higher CPU frequency. It would of course be interesting to see its CPU
> load, but still, wrk's work is a lot easier than what the server has to do.

I believe this is wrong, which can easily be validated by running two separate wrk processes simultaneously, one from the load node and one from the db node, and looking at whether the combined results improve, which would prove this claim. I believe they will.

Discussions about tuning are interesting, but perhaps for another forum. As a side note, my framework, libreactor, has used SO_REUSEPORT balancing, affinity and other strategies for a couple of years to improve throughput. libreactor's results in themselves are typically very reproducible and I am not worried about them in particular in this context.

Kind regards,
Fredrik Widlund



Steve Hu

unread,
Apr 21, 2017, 9:57:16 AM4/21/17
to framework-benchmarks
Hi Brian, 

Thanks for your hard work bringing this online. You guys are awesome. Could you please explain what has changed between preview 3 and preview 4? The absolute numbers for json and plaintext dropped a lot for light-java. Also, when will the optimisation window close? By looking at others' code, I think I can optimise my tests for better performance, but I don't know if you still accept pull requests for Round 14.

Thanks,

Steve

Framework   Test       Preview 3   Preview 4   Change   Rank
light-java  json       578,034     425,271     -26.4%   13 -> 19
light-java  plaintext  3,391,403   2,834,877   -16.4%   6 -> 8


Michael Hixson

unread,
Apr 21, 2017, 11:05:11 AM4/21/17
to Steve Hu, framework-benchmarks
Hi Steve,

There were two changes to the toolset/configuration:

1. In Preview 3, there was a MAX_THREADS environment variable available on the server, and its value was 8.  In Preview 4, this was replaced with a CPU_COUNT variable whose value was 80.  Since it appeared as though people using MAX_THREADS thought it represented the number of CPU cores on the server (in the past, it might have), we updated all the code we could find that was using MAX_THREADS to use CPU_COUNT instead.

2. In Preview 3, the client would run wrk with at most 8 threads (using the "-t 8" option).  In Preview 4, the client would use up to 32 threads.  You can compare the raw wrk output from the two previews to see what I mean:


The first change is totally new, while the second change restores the settings we had in Preview 1 and accidentally changed for Previews 2 and 3.
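
For a test implementation written in Python, the change roughly amounts to
something like this when sizing a worker pool (the fallback and the
one-worker-per-core choice are just assumptions for illustration, not part
of the toolset):

import multiprocessing
import os

# CPU_COUNT is set by the toolset on the app server (80 in Preview 4);
# fall back to the local core count when running outside the toolset.
cpu_count = int(os.environ.get("CPU_COUNT", multiprocessing.cpu_count()))
workers = cpu_count          # e.g. one worker per core; tune per framework
print(f"starting {workers} workers")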

-Michael


Anton Kirilov

unread,
Apr 21, 2017, 5:20:16 PM4/21/17
to framework-benchmarks, maxim....@gmail.com
Hi Fredrik,

A small remark first:

> wrk will poll, read, process a reply, construct a the next request, and write.
wrk needs to construct the request exactly once, and then it can reuse the same buffer for all future requests because in contrast with responses, requests always have the same content. So, strictly speaking, the client has less work to do than the server.

As for your main point, I don't believe that the client hardware configuration is the bottleneck right now, at least for the JSON test, which I looked into in slightly more detail. The reason is that even though I don't have access to the same hardware environment, I ran some experiments, and 2 wrk threads communicating with 2 h2o threads were pretty much able to saturate a gigabit Ethernet connection (achieving around 450,000 RPS) - adding more threads on either side didn't change the results, or led to regressions. While the machines that I used (with Haswell CPUs) were arguably better except for the network adapters (the other major difference being that I connected the nodes directly with a cable - no switch in the middle), I'd still conservatively estimate that 4 times more threads should be able to achieve at least 2 Gb/s. However, the 5 best performing frameworks in preview 4 are barely able to go over a gigabit (approximately 1.15 Gb/s is the maximum). So, it seems to me that there is either a configuration issue or a problem somewhere in the middle, as opposed to the number of CPU cores. In fact, if I am not mistaken, raising the number of client threads from 8 to 32 led to a performance regression (in absolute terms) - there were a couple of frameworks going over 600,000 RPS in preview 3, if I remember correctly.
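
For reference, the conversion behind that estimate looks roughly like this;
the per-response byte count on the wire is an assumption, not something
measured from the run:

def gbps(requests_per_second, bytes_per_response):
    return requests_per_second * bytes_per_response * 8 / 1e9

# e.g. ~665,000 req/s (about the best JSON result discussed in this thread)
# at an assumed ~215 bytes per response
print(f"{gbps(665_834, 215):.2f} Gb/s")   # ~1.15 Gb/s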

Best regards,
Tony

Fredrik Widlund

unread,
Apr 22, 2017, 4:10:23 AM4/22/17
to Anton Kirilov, framework-benchmarks, Max Gortman
Hi Anton,
 
A small remark first:
> wrk will poll, read, process a reply, construct a the next request, and write.
wrk needs to construct the request exactly once, and then it can reuse the same buffer for all future requests because in contrast with responses, requests always have the same content. So, strictly speaking, the client has less work to do than the server.

My point is that the flow is symmetrical, not identical. The client, wrk, can take shortcuts, but if you look at some of the "frameworks" in the benchmark they actually take similar shortcuts.

I do agree that there seems to be an unknown variable restricting the tests currently. However the amount of client node resources is definitely a bottleneck in itself.

Here are some tests done while writing this mail. These are for libreactor on three single-CPU (!) quad-core Xeon (E5-2623 v3 @ 3.00GHz) nodes on a 10GbE network.

1<-2 (running with the wrk setup that performs the best in this case)
# ./wrk -H 'Host: localhost' -H 'Accept:application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8 http://...:8080/json
Running 15s test @ http://...:8080/json

  8 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   269.68us  132.93us  16.02ms   84.77%
    Req/Sec    97.26k    27.69k  156.69k    64.32%
  Latency Distribution
     50%  242.00us
     75%  313.00us
     90%  410.00us
     99%  682.00us
  11660959 requests in 15.10s, 1.59GB read
Requests/sec: 772272.57
Transfer/sec:    107.53MB

Total: 772k rps.

1<-(2+3 running simultaneously, same command)

Total: 906k rps

Worth noting
  1. Even here, where client/server resources are the same, two client nodes generating load increase server performance
  2. These are on a single quad-core CPU; performance should to some degree scale with the number of cores, even if the CPU frequency is slightly higher
Kind regards,
Fredrik

Fredrik Widlund

unread,
Apr 22, 2017, 4:31:48 AM4/22/17
to Anton Kirilov, framework-benchmarks, Max Gortman
I will test with a machine on the same network with 2 CPUs of the same type as above; let's call it #4.

1) node 4 (2 cpu, 16 cores w ht) <- node 2 (1 cpu, 8 cores w ht)

# ./wrk -H 'Host: localhost' -H 'Accept:application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 8 http://...:8080/json
Running 15s test @ http://...:8080/json
  8 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   229.49us  245.88us  15.39ms   99.44%
    Req/Sec    95.84k    11.31k  189.42k    79.49%
  Latency Distribution
     50%  241.00us
     75%  298.00us
     90%  324.00us
     99%  396.00us
  11482964 requests in 15.10s, 1.56GB read
Requests/sec: 760481.84
Transfer/sec:    105.89MB

Total: 760k rps

Result does not scale with more server resources, indicating that the load generation is the bottleneck.

2) node 4 (2 cpu, 16 cores w ht) <- node 2+3 (1 cpu, 8 cores w ht, each, running simultaneously)
 
Total: 1292k rps

Proving that 
  1. Performance scales with number of cores
  2. wrk is a clear bottleneck in the first test
F

Nikolche Mihajlovski

unread,
Apr 22, 2017, 12:41:22 PM4/22/17
to framework-benchmarks
Hi all,

I believe the most important rule of measurement is: "measure several times".

Anything could happen during the benchmark (e.g. network problems). Measuring only once maximizes the risk of fake results. Increasing the test duration could only make the affected (wrong) results a bit more correct.

In the latest preview Rapidoid's results for json were cut in half without any changes in the code:

rapidoid-http-fast  json  638,119  318,549  -50.1%

A simple way to produce more stable results for all frameworks is to run the tests multiple times and select the best results.

Regards,
Nikolche

Anton Kirilov

unread,
Apr 22, 2017, 3:10:51 PM4/22/17
to framework-benchmarks, antonv...@gmail.com, maxim....@gmail.com
Hi Fredrik,

Don't get me wrong - I am not saying that the number of CPUs on the client machine could not be a bottleneck; in fact, 8 cores may very well not be enough for wrk to saturate a 10 Gb Ethernet connection (or maybe anything above 2-3 Gbit). My point is that unless the other, more important issue that seems to exist is addressed first in the ServerCentral environment, raising the number of client CPU cores is probably going to lead to a marginal improvement at best. Even the worst of your results are significantly better than those in the ServerCentral environment, so I'd say that you have a different bottleneck, and it is not surprising that you are able to benefit from more processors.

Best wishes,
Tony

Anton Kirilov

unread,
Apr 22, 2017, 3:21:28 PM4/22/17
to framework-benchmarks
Hi Nikolche,

It is not only Rapidoid that has been affected (well, maybe it has been affected the worst) - just look at the top performers, e.g. s-server (I haven't noticed any changes in the implementation or its dependencies). IMHO 32 client threads are too many, and I don't know what behaviour can be expected when the system is saturated like that.

As for your other suggestion - I don't think that there is any difference between several test runs and a single longer run with the same overall duration. Reporting just the best result seems wrong to me; in fact, we do get a good amount of information, i.e. average, maximum, and standard deviation, and with a little bit of digging in the logs, the latency distribution.

Best wishes,
Tony

Fredrik Widlund

unread,
Apr 22, 2017, 3:31:56 PM4/22/17
to Anton Kirilov, framework-benchmarks, Max Gortman
Hi Anton,

It's basically a numbers game in the JSON benchmark case. There is a baseline network-induced latency. If you are able to handle a request/reply transaction in 0.5 ms on average, you will of course handle 2,000 rps on one connection. 256 parallel connections give you about 500k rps. How connection concurrency scales depends on the implementation and on CPU cycles.
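
The arithmetic generalizes neatly (the numbers below are just the ones from
this example, not measurements):

def max_rps(connections, round_trip_seconds):
    """Ping-pong bound: each connection completes 1/RTT requests per second."""
    return connections / round_trip_seconds

print(f"{max_rps(1, 0.0005):>11,.0f} req/s on one connection at 0.5 ms")
print(f"{max_rps(256, 0.0005):>11,.0f} req/s on 256 connections at 0.5 ms")
print(f"{max_rps(1024, 0.0005):>11,.0f} req/s on 1024 connections at 0.5 ms")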

If I had to guess, without being able to troubleshoot the benchmarking environment, it would be that
1) network latency is higher
2) network latency varies over time for some reason (which would be the unknown variable)
3) load generation limits the results

(btw, libreactor faltered in the last preview, but that was due to the human factor)

Kind regards,
Fredrik



Nikolche Mihajlovski

unread,
Apr 22, 2017, 4:34:26 PM4/22/17
to framework-benchmarks
Hi Anton,

Let's imagine this scenario (benchmarking some framework 2 times with same parameters):

Test run 1: the network is terrible, we measure 100 req / sec.

Test run 2: the network is ideal, we measure 500 req / sec.

I believe 500 is the correct number, because it shows the framework at its best when the environment (e.g. the network) is at its best. For the "sub-optimal" results, we can't really tell what caused the degradation: the framework or the environment (it will usually be the environment).

On the other hand, a single run covering both test run 1 and 2 will show the average, 300 req / sec, so I don't agree that that's the correct number.

Regards,
Nikolche

Aliaksandr Valialkin

unread,
Apr 23, 2017, 2:58:01 AM4/23/17
to framework-benchmarks
I bet /json, /db and /fortune benchmarks are limited by network latency - data tables clearly show that the absolute majority of the best results stick to the maximum available 256 concurrent connections. Introducing 512 and 1024 connections for these benchmarks could mitigate the network latency.

The /plaintext benchmark is limited by wrk performance. 32 threads on an 8-core machine will always be slower than 8 threads. The best solution is to run wrk on a machine with more CPU cores. Another workaround is to increase the HTTP pipeline size from 16 to at least 32. This should reduce resource consumption on the wrk side and simultaneously increase the load on the server side.
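
To make the pipeline-depth suggestion concrete, here is a tiny raw-socket
illustration of what the depth argument to pipeline.lua changes (the host,
port and path are placeholders; you need a plaintext server listening there
to try it):

import socket

HOST, PORT = "127.0.0.1", 8080        # placeholder target
PIPELINE_DEPTH = 32                   # the suggested value instead of 16

request = (
    "GET /plaintext HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
).encode()

# Pipelining = writing N requests back-to-back before reading any response,
# so the client does one large write and one large read per batch instead of
# N small round trips.
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    sock.sendall(request * PIPELINE_DEPTH)
    sock.settimeout(1.0)
    received = b""
    try:
        while received.count(b"HTTP/1.1") < PIPELINE_DEPTH:
            chunk = sock.recv(65536)
            if not chunk:
                break
            received += chunk
    except socket.timeout:
        pass

print(f"responses received: {received.count(b'HTTP/1.1')}")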

Brian Hauer

unread,
Apr 26, 2017, 5:56:13 PM4/26/17
to framework-benchmarks
Hi Max,

The situation you describe was a known factor that we contended with at the transition from Round 12 to Round 13—the transition from the homogeneous Peak environment to the heterogeneous ServerCentral environment.  In the Peak environment, all of the machines were approximately the same in specification.  In our ServerCentral environment, the machines are not equivalent, so we would either use the "beefiest" of the machines as the application server, leaving a less capable machine to generate the load or vice-versa.  We elected to use the beefiest server as the application server.  This was done principally in support of the following notions:
  • The "Fortunes" test type is considered by many as the most important test type and we believe that the load generator should be able to keep pace with the application server given the demands on the application server for this test type.
  • We intend to add more computationally-intense test types in future.  Any tests that are computationally intense reduce the likelihood of the application server out-pacing the load generator.
  • We wanted to continue to be able to maximize the performance numbers across as broad a spectrum of test types and implementations as possible.  Bearing in mind that wrk is a very capable load generator, only the pinnacle-performing application servers would exceed the load generation capacity.

Also consider that we are still seeing a fairly well-distributed set of results across all test types.  Way back in Round 8, when we used our in-house 1-gigabit Ethernet network and i7 workstations, we observed results that reached a plateau at approximately 210,000 JSON responses per second due to the 1-gigabit network being saturated.


I appreciate your question and curiosity about the CPU being idle in the tokio-minihttp plaintext test, and I believe there is a point at which the application server should be able to out-pace the load generator for this test type, but I see the top-performer in plaintext is producing ~800,000 more responses per second and is being measured by the same load generator.  It is possible a portion of that idle CPU time is due to inefficiency in the framework or platform—inefficiency that if resolved could yield performance at least comparable to the top-performing framework.  Still, yes, at some point, the application server will likely be inevitably idle in plaintext due to the load generator running on less-capable hardware.


All that said, for the time being, we are going to retain the present machine roles for Round 14.  But we can consider a role-swap (perhaps rendered as an alternative "environment") in a future round.

Brian Hauer

unread,
Apr 26, 2017, 6:05:29 PM4/26/17
to framework-benchmarks, maxim....@gmail.com
I see that Fredrik has made a similar point to one I made in my reply to Max: the test types that are more demanding on the application server are unlikely to be limited by the load-generator's throughput (within reason, of course; we wouldn't want to run the load generator on a 1-core Atom CPU).

Fredrik also raises the variability that is observed across each run.  Two thoughts on that:
  • I'd like to gradually ramp up the duration of tests from the present 15 seconds to 30 seconds or perhaps 60 seconds.  This will cause the total suite run time to expand significantly, but it should smooth out the variability somewhat.  Some frameworks appear to have more volatility than others.  So while 15 seconds seems sufficient in most cases, it may not be sufficient to get a steady result from some frameworks.
  • As we mature the continuous benchmarking platform, we may also be able to do some smoothing.  For example, we could consider making a "Final" for the Round composed of the last N samples.  Combined with the above, however, this could make for some very long cycles.

I'm not certain of either of these ideas, but I'd love to hear your thoughts.
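
For what the "last N samples" idea might look like in practice, here is a
minimal sketch (the run history below is invented):

from statistics import median

def smoothed_result(samples, n=5):
    """Median of the last n continuous-benchmarking samples, ignoring missing runs."""
    recent = [s for s in samples[-n:] if s is not None]
    return median(recent) if recent else None

rps_history = [638_119, 610_002, None, 318_549, 601_772, 625_431]  # hypothetical
print(smoothed_result(rps_history))   # 605887.0 -> one bad run no longer defines the round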

Brian Hauer

unread,
Apr 26, 2017, 6:19:15 PM4/26/17
to framework-benchmarks
Hi Steve,

Off hand, I am not sure why light-java's results would be so varied between Preview 3 and Preview 4.  I assume nothing changed in your implementation, correct?

We want to wrap up Round 14 as soon as possible, so I would ask that additional optimizations be deferred at this point.  It is our intent to transition quickly into posting Round 15 previews.  We want to cut down the duration between rounds so that contributors feel less stress getting commits wrapped up before our (more or less arbitrary) deadlines.  If there were more such deadlines, more frequently, then we wouldn't be causing as much pain to people who miss an opportunity.  I don't want the next round to be half a year away.

Gelin Luo

unread,
May 9, 2017, 8:02:13 PM5/9/17
to framework-benchmarks
Hi TechEmpower, 

Can we get the results of another preview run? A few weeks have passed and things have been a little bit quiet recently ;-)

Cheers,
Green

Julia Nething

unread,
May 10, 2017, 11:47:09 AM5/10/17
to framework-benchmarks
Hi Green,
Sometimes it takes a bit of time to prep the data from the final round for the website, so we're working on that now. There will be no more previews for Round 14. But hopefully Round 15 will have quicker turnarounds because of continuous benchmarking :)

Samuel Colvin

unread,
May 10, 2017, 11:50:31 AM5/10/17
to framework-benchmarks
Thanks for the update. Do we have an ETA for the Round 14 final results?

Anton Kirilov

unread,
May 13, 2017, 11:27:39 AM5/13/17
to framework-benchmarks
Hi Nikolche,

Sorry for the late reply; I must have missed it before.

The reason I don't like just taking the best result is that IMHO we are interested in steady-state behaviour, while the best result (without doing any other analysis) might represent transient behaviour. As for the average number being problematic - well, that's why it is a bad idea to report only an average, or to use an arithmetic mean for that matter.

Best wishes,
Tony