NGINX/NXWEB/Golang Micro Benchmark


Alex Koch

Sep 28, 2013, 10:10:58 AM
to nx...@googlegroups.com
Hi,

As the subject says, this is a micro benchmark I ran ("hello world" type).

Load Runner and Server-Target Specs:
  • Intel(R) Xeon(R) CPU E7-8837 @ 2.67GHz
  • 16 cores | 256GB RAM
  • 1Gbit interface (both connected to the same switch backend)


Load-runner benchmark tool: 'wrk' (https://github.com/wg/wrk)


./wrk -t16 -c400 -d30s http://server-target


CASE1: NGINX-1.4.2
  • default compile
  • workers = 16
  • access_log off
  • Handler '/' => html/index.html
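
(For reference, a minimal nginx.conf equivalent to the settings above might look like this; the listen port and worker_connections value are assumptions, not taken from the test:)

worker_processes 16;

events {
    worker_connections 1024;
}

http {
    access_log off;

    server {
        listen 80;

        location / {
            root  html;
            index index.html;
        }
    }
}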


/ngx/sbin/nginx -V
nginx version: nginx/1.4.2
built by gcc 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
configure arguments: --prefix=/root/ngx/



Results:

Running 30s test
  16 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.38ms    4.45ms  18.57ms   88.26%                             
    Req/Sec     7.98k     1.51k   14.00k    68.89%
  3622643 requests in 29.99s, 2.86GB read
Requests/sec: 120788.29
Transfer/sec:     97.79MB


RES MEM ~ 22MB



CASE2: NGINX-1.4.2

  • disabled charset, ssi, userid, autoindex, rewrite, limit_req, limit_conn
  • workers = 16
  • access_log off
  • Handler '/' => html/index.html


/ngx/sbin/nginx -V
nginx version: nginx/1.4.2
built by gcc 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
configure arguments: --prefix=/root/ngx --without-http_charset_module --without-http_ssi_module --without-http_userid_module --without-http_autoindex_module --without-http_rewrite_module --without-http_limit_req_module --without-http_limit_conn_module


Results:

Running 30s test
  16 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev                            
    Latency     3.38ms    3.92ms  18.32ms   85.78%
    Req/Sec     7.96k     1.38k   15.00k    68.95%
  3622161 requests in 29.99s, 2.86GB read
Requests/sec: 120784.78
Transfer/sec:     97.79MB


RES MEM ~ 23KB


CASE3: NXWEB-3.2.0-dev


Results:

Running 30s test
  16 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.37ms  174.19us  13.86ms   97.89%
    Req/Sec    11.09k   580.27    13.00k    58.13%
  5039484 requests in 29.99s, 3.07GB read                                      
Requests/sec: 168026.84
Transfer/sec:    104.80MB


RES MEM ~ 375MB


CASE4: Golang (http://golang.org/)



Code:


package main

import (
    "flag"
    "net/http"
    "runtime"
)

var port = flag.String("port", "80", "Define what TCP port to bind to")
var root = flag.String("root", "/tmp/html", "Define the root filesystem path")

func main() {
    // Use all CPU cores; GOMAXPROCS defaulted to 1 in Go releases of this era.
    runtime.GOMAXPROCS(runtime.NumCPU())
    flag.Parse()
    // Serve the directory tree under *root; panic if the listener fails.
    panic(http.ListenAndServe(":"+*port, http.FileServer(http.Dir(*root))))
}
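
(Assuming the source file is saved as server.go, the server can be run with its defaults or with explicit flags, e.g.:)

go run server.go -port 80 -root /tmp/html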



Results:
Running 30s test
  16 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.33ms  817.59us  12.25ms   77.34%
    Req/Sec     7.92k   677.44    10.11k    63.18%
  3598502 requests in 29.99s, 2.67GB read
Requests/sec: 119981.11
Transfer/sec:     91.20MB


RES MEM ~ 37KB



---


NXWEB is a clear winner over NGINX in this micro-benchmark, even though the golang webserver gives it a run for its money. I have a question for the NXWEB devs: why 375MB of resident memory? The config is attached; have I misconfigured anything?


Furthermore, what is the baseline difference between NGINX's event loop and NXWEB's?



Thanks,


Alex





nxweb_config.json

Alex Koch

Sep 28, 2013, 10:13:57 AM
to nx...@googlegroups.com
Small correction:

CASE1: RES MEM is ~ 22KB, not 22MB.



Yaroslav

Sep 28, 2013, 11:36:18 AM
to nx...@googlegroups.com
Hi Alex,

Are you running the sample hello handler from the nxweb distribution package, serving a static HTML file, or running a Python script?

Are you using keep-alive mode?

I have just run the following tests with httpress on my machine (Intel Core i7-2600 CPU @ 3.40GHz, 4 cores with HT):

1) benchmark-inprocess is the fastest hello-style handler in nxweb:
% httpress -n5000000 -c400 -t8 -k http://localhost:8055/benchmark-inprocess -q
TOTALS:  400 connect, 5000000 requests, 4999876 success, 124 fail, 400 (400) real concurrency
TRAFFIC: 20 avg bytes, 150 avg overhead, 99997520 bytes, 749981400 overhead
TIMING:  9.664 seconds, 517344 rps, 85887 kbps, 0.8 ms avg req time

RES MEM 18m (as displayed by the top utility during the test)

2) serving index.htm from the www/root directory:
% httpress -n5000000 -c400 -t8 -k http://localhost:8055/ -q                   
TOTALS:  400 connect, 5000000 requests, 5000000 success, 0 fail, 400 (400) real concurrency
TRAFFIC: 15 avg bytes, 211 avg overhead, 75000000 bytes, 1055000000 overhead
TIMING:  10.595 seconds, 471893 rps, 104148 kbps, 0.8 ms avg req time

RES MEM 18m

3) /hello - this handler is not the simplest; it prints all request parameters using a printf-style function with HTML escaping:
% httpress -n5000000 -c400 -t8 -k http://localhost:8055/hello -q
TOTALS:  400 connect, 5000000 requests, 5000000 success, 0 fail, 400 (400) real concurrency
TRAFFIC: 511 avg bytes, 151 avg overhead, 2555000000 bytes, 755000000 overhead
TIMING:  12.466 seconds, 401069 rps, 259285 kbps, 1.0 ms avg req time

RES MEM 17m

4) /py - this is the Python handler:
% httpress -n50000 -c400 -t8 -k http://localhost:8055/py -q
TOTALS:  400 connect, 50000 requests, 49979 success, 21 fail, 400 (400) real concurrency
TRAFFIC: 682 avg bytes, 165 avg overhead, 34110567 bytes, 8246583 overhead
TIMING:  19.550 seconds, 2556 rps, 2115 kbps, 156.5 ms avg req time

RES MEM 28m

5) I have installed golang 1.0.2 and used the code you provided (changing only the port number and file path) to test its performance (with the same index.htm as in test #2 above):
% httpress -n1000000 -c400 -t8 -k http://localhost:8083/index.htm -q
TOTALS:  400 connect, 1000000 requests, 1000000 success, 0 fail, 400 (400) real concurrency
TRAFFIC: 15 avg bytes, 184 avg overhead, 15000000 bytes, 184000000 overhead
TIMING:  20.730 seconds, 48238 rps, 9374 kbps, 8.3 ms avg req time

RES MEM 18m

I am not familiar with the Go language, so I simply used the following command to run it:
go run server.go

My results are quite different from yours. Golang is much faster than Python but still quite far from pure C. Are you sure the performance of your benchmarking tool (wrk) is not the limiting factor? Try httpress: https://bitbucket.org/yarosla/httpress/wiki/Home

| what is the baseline difference between NGINX's event loop and NXWEB's?

There is no fundamental difference. NGINX is a versatile cross-platform server, while NXWEB is lightweight with limited functionality. NGINX is multi-process; NXWEB is multi-threaded.

Yaroslav

Alex Koch

Sep 28, 2013, 12:49:31 PM
to nx...@googlegroups.com
Hi Yaroslav,

I was initially running the 'hello' module, but I have now repeated the test using httpress, serving the index.html file from NXWEB.

1) nxweb

RES 418M

./httpress -n5000000 -c400 -t8 -k http://test1 -q

TOTALS:  400 connect, 5000000 requests, 5000000 success, 0 fail, 400 (400) real concurrency
TRAFFIC: 0 avg bytes, 154 avg overhead, 0 bytes, 770000000 overhead
TIMING:  9.776 seconds, 511439 rps, 76915 kbps, 0.8 ms avg req time

2) golang

RES (same as before)

./httpress -n5000000 -c400 -t8 -k http://test1 -q

TOTALS:  400 connect, 5000000 requests, 5000000 success, 0 fail, 400 (400) real concurrency
TRAFFIC: 612 avg bytes, 185 avg overhead, 3060000000 bytes, 925000000 overhead
TIMING:  42.310 seconds, 118173 rps, 91976 kbps, 3.4 ms avg req time 

3) nginx

RES (same as before)

./httpress -n5000000 -c400 -t8 -k http://edgitha -q
TOTALS:  50204 connect, 5000000 requests, 5000000 success, 0 fail, 400 (400) real concurrency
TRAFFIC: 612 avg bytes, 236 avg overhead, 3060000000 bytes, 1184750960 overhead
TIMING:  41.555 seconds, 120322 rps, 99753 kbps, 3.3 ms avg req time

I have attached a pmap output of the nxweb process. I noticed that it allocates the ~415MB as RES right on start, and the connection overhead while running the benchmark is small.

Each anon mapping is 10M. I would be surprised if nxweb were making all these allocations itself, but I have not debugged it. Any ideas? The system nxweb was compiled on is running CentOS 6.4 (Linux 2.6.32-279.22.1.el6.x86_64 x86_64).
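
(For reference, the 10M anon mappings can be counted straight from the pmap output; <pid> is nxweb's process id, and the second column of pmap -x is the mapping size in KB:)

pmap -x <pid> | awk '$2 == 10240' | wc -l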

Thanks,

Alex


nxweb-pmap

Yaroslav

Sep 28, 2013, 1:11:45 PM
to nx...@googlegroups.com
Could you please email nxweb's error_log?

Alex Koch

Sep 28, 2013, 1:39:26 PM
to nx...@googlegroups.com
The error log only contains the following:

2013-09-28 19:35:40 1 [50136:0x7f989d787700]: === LOG OPENED ===
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: nxweb binding :80 for http
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: can't find filter 'image' specified for routing record #1, filter record #2
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: NXWEB startup: pid=50137 net_threads=16 pg=4096 short=2 int=4 long=8 size_t=8 evt=56 conn=2064 req=360 td=2584 max_fd=64000 max_core=0
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: handler hello [1000] registered for url: /h
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: handler sendfile [2000] registered for url: /
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: module cache [0] successfully initialized
2013-09-28 19:35:40 1 [50137:0x7f989d787700]: using default request dispatcher



Yaroslav

Sep 28, 2013, 2:16:06 PM
to nx...@googlegroups.com
When I run nxweb under valgrind (simply starting and stopping it), it reports:
total heap usage: 20,825 allocs, 16,897 frees, 9,873,589 bytes allocated

Just about 10M altogether. That is for nxweb compiled with ImageMagick, GnuTLS and Python; I think these contribute significantly to the above figure.

The problem seems to be related to the number of threads and to how the OS allocates memory to them. I am attaching my nxweb's memory map; I am on Ubuntu 12.10.

As nxweb starts it launches network threads, one per CPU core. Each network thread launches 16 worker threads by default, so on your 16-core server there will be 272 threads (16 network + 256 worker) right after start. And you have 271 anon blocks of size 10240KB. These look like the threads' stacks.

I have 8 network threads, meaning 136 threads total (8 network + 128 worker). And indeed there are 136 anon blocks of size 8192KB (exactly what my ulimit -s reports for stack size) in my pmap. The difference is in how much of each allocation goes to RES memory: on Ubuntu most thread stacks contribute only 8K to RES, while on CentOS they tend to contribute as much as 2048K per stack.
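
(If the per-thread stack reservation is indeed the culprit, one quick way to check is to lower the stack limit in the shell before starting nxweb; 512 here is an arbitrary example, and glibc uses RLIMIT_STACK as the default pthread stack size:)

ulimit -s 512    # stack size in KB for new threads
# then start nxweb from the same shell and re-check pmap/RES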

In order to reduce memory usage you can decrease the number of network threads (NXWEB_MAX_NET_THREADS in nxweb_config.h) as well as the number of worker threads per network thread (NXWEB_START_WORKERS_IN_QUEUE in nx_workers.h). You might not need worker threads at all, but I am not sure you can harmlessly reduce them to zero; that needs checking.
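
(As a sketch, assuming the usual #define form for these compile-time constants; the values are purely illustrative:)

/* nxweb_config.h */
#define NXWEB_MAX_NET_THREADS 4          /* cap network threads below core count */

/* nx_workers.h */
#define NXWEB_START_WORKERS_IN_QUEUE 2   /* workers pre-started per network thread */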

pmap-x-18952.txt

Yaroslav

Sep 28, 2013, 3:56:48 PM
to nx...@googlegroups.com
I have just committed a change reducing NXWEB_START_WORKERS_IN_QUEUE to zero. There will be no worker threads until the server starts using them. In many cases there is no need for worker threads at all; of the provided modules, only Python uses them.

Alex Koch

Sep 29, 2013, 5:54:10 AM
to nx...@googlegroups.com
*sweet* - on the topic of worker threads: for filesystem I/O that cannot use "sendfile", do you recommend using worker threads, or is it fine to simply do short buffered reads/writes without blocking the network thread?

Also, if doing blocking I/O, is it better to start more network threads and do the blocking I/O in the network thread, or fewer network threads and more worker threads? In the end it should not be much different, right? I am just curious what makes more sense from an architecture perspective. For example, nodejs handles all network I/O in one thread and dispatches blocking I/O to a pool of threads:

(1 network thread) => (worker threads)

while NXWEB, AFAIK, is (multiple network threads) => (worker threads).



Yaroslav

Sep 29, 2013, 11:48:16 AM
to nx...@googlegroups.com
On Sun, Sep 29, 2013 at 1:54 PM, Alex Koch <alex.k...@outlook.com> wrote:
| *sweet* - on the topic of worker threads: for filesystem I/O that cannot use "sendfile", do you recommend using worker threads, or is it fine to simply do short buffered reads/writes without blocking the network thread?


There are situations when sendfile cannot be used: when the content has to be processed before it is sent out, e.g. SSL-encrypted, gzipped, or parsed as a template. Nxweb has means of dealing with these situations: it reads small files into memory and mmaps larger ones, always on a network thread. For regular disk filesystems this should not be a problem.
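
(As a generic illustration of the mmap path - a sketch, not nxweb's actual code, with minimal error handling:)

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only so its content can be gzipped, encrypted
 * or template-processed before being sent out. */
static void *map_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after close */
    if (data == MAP_FAILED) return NULL;
    *len = (size_t)st.st_size;
    return data;
}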

Using worker threads has its costs (mutexing, signalling, etc.). There is also currently a limitation for workers in nxweb: they cannot stream data; a worker must complete its job before nxweb starts sending the response to the client.
 
| Also, if doing blocking I/O, is it better to start more network threads and do the blocking I/O in the network thread, or fewer network threads and more worker threads? In the end it should not be much different, right?

For asynchronous event-driven architectures, doing blocking I/O on a network thread is prohibited; always use worker threads for that. Worker threads can be created dynamically, while the number of network threads is fixed. While a network thread is blocked, the server does not work, so you would need to create more network threads, which leads us to the classical threaded server model. If your server only performs blocking tasks, that model might be more appropriate - no need for async mechanics.
 
| I am just curious what makes more sense from an architecture perspective. For example, nodejs handles all network I/O in one thread and dispatches blocking I/O to a pool of threads:

| (1 network thread) => (worker threads)

| while NXWEB, AFAIK, is (multiple network threads) => (worker threads).

Nxweb uses several network threads to keep all CPU cores busy, i.e. to improve performance.

JavaScript, on the other hand, is single-threaded by design; it simply cannot use more than one network thread.

Alex Koch

Oct 1, 2013, 2:47:34 PM
to nx...@googlegroups.com
| Nxweb has the means of dealing with these situations: read into memory for small files, mmap for larger ones

Even if the small files are on NFS rather than on local disk?



Yaroslav

Oct 4, 2013, 3:31:40 AM
to nx...@googlegroups.com
I have not tried it with NFS, so I am not sure about the best method for it.