The configuration parameters for the loop are:
[{ip, "0.0.0.0"}, {port, 8000}, {backlog, 256}, {max, 16384}, {nodelay, true}, {acceptor_pool_size, 256}]
with an older mochiweb this was the configuration:
[{ip, "0.0.0.0"}, {port, 8000}, {backlog, 200}, {max, 16384}]
The issue I'm seeing is that with the old mochiweb I was able to handle well over 3000 req/sec on a single server.
With the new one, although during a test it was able to handle even more, after I reach 1000/1200 req/sec I start getting timeouts.
I could think that it's in my code but even the simple ping page fails to answer and there's nothing going on in it, it immediately returns pong.
I don't see anything particular in the erlang logs that could make me think about an internal timeout between erlang processes due to my code, I do see a lot of cases where the client disconnected from the server and when I try to send a response it fails in gen_tcp:send().
with a log like this in sasl:
=CRASH REPORT==== 4-Nov-2011::00:00:03 ===
crasher:
initial call: mochiweb_acceptor:init/3
pid: <0.12016.168>
registered_name: []
exception exit: enotconn
in function mochiweb_request:send/2
in call from mochiweb_request:start_raw_response/2
in call from mochiweb_request:respond/2
in call from dyno_web:loop/1
in call from mochiweb_http:headers/5
ancestors: [dyno_web,dyno_sup,<0.66.0>]
messages: []
links: [<0.89.0>,#Port<0.4613065>]
dictionary: [{random_seed,{1320,1117,9871}},
{mochiweb_request_path,
"/view/5108159f8eebac4d22a074c1ca82f8b1"},
{mochiweb_request_cookie,
[{"foobar","42a98512d10520496684b53c0b29a938"}]}]
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 5122
neighbours:
This problem unfortunately seem to happen only in production so for now I've switched the load back to the older servers. Anyone can help me shed any light on this?
--
Valentino Volonghi
http://www.adroll.com
> --
> You received this message because you are subscribed to the Google Groups "MochiWeb" group.
> To post to this group, send email to moch...@googlegroups.com.
> To unsubscribe from this group, send email to mochiweb+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mochiweb?hl=en.
>
>
The "new" loop is at the bottom of the mail, the one running on the older mochiweb is a bit different and not practical to put it here, however it's essentially based on the same code, dispatch is simply using the erlang function dispatch (like I was doing in the older loop) and Path is a list of path segments.
I haven't tried with the default configuration options. I'm using the very latest version of erlang R14B04 and the platform is Ubuntu 11.10 on EC2.
loop(MochiReq) ->
Req = dyno:create(mochiweb, MochiReq),
Path = Req:path(),
case dynospatch:dispatch(Path, Req) of
no_dispatch_match ->
Req:not_found(),
stats:sample({unknown}, Req:start_time());
not_implemented ->
Req:respond({501, [{"Content-Type", "text/plain"}],
<<"Not yet implemented">>}),
stats:sample({unknown}, Req:start_time());
{M, F, A} ->
try erlang:apply(M, F, A) of
{ok, Logged} ->
stats:sample({M, F, Logged}, Req:start_time());
_ ->
stats:sample({M, F, nil}, Req:start_time())
catch
exit:closed ->
Microsecs = timer:now_diff(erlang:now(), Req:start_time()),
error_logger:error_report(
"web request closed too early after: ~p usecs", [Microsecs]);
Type:What ->
Report = ["web request failed",
{path, Path},
{type, Type},
{what, What},
{req, Req},
{trace, erlang:get_stacktrace()}],
error_logger:error_report(Report),
Req:respond({500, [{"Content-Type", "text/plain"}],
"request failed, sorry\n"}),
stats:sample({M, F, error}, Req:start_time())
end;
_What ->
Req:respond({501, [{"Content-Type", "text/plain"}],
<<"Not yet implemented">>}),
stats:sample({unknown}, Req:start_time())
end.
Also if useful I'd add that the parameters I use to start the VM are:
-pa ebin deps/*/ebin +K true +A 64 +W w -smp enable -env ERL_MAX_PORTS 65536
-boot start_sasl -sname dyno_dev -s inets -s timer -s dyno
and
{kernel, [{error_logger, {file, "/mnt/ebs/adrollserver/logs/adserver-kernel.log"}},
{inet_default_listen_options, [{nodelay,true},{sndbuf,16384},{recbuf,4096}]},
{inet_default_connect_options, [{nodelay,true}]}]}
and these are sysctl for linux:
kernel.printk = 4 4 1 7
fs.inotify.max_user_watches = 524288
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_max_tw_buckets = 360000
net.ipv4.tcp_max_syn_backlog = 262144
net.core.netdev_max_backlog = 30000
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535
net.core.somaxconn = 262144
fs.file-max = 173273
With the ancient mochiweb, are all settings and the environment
exactly the same?
> The first thing I would try is to see what happens if you don't tweak
> nodelay and the buffer sizes. We're not seeing anything like what you
> are, although our setup is a little different (CentOS, R14B01, 64-bit,
> not virtualized).
>
> With the ancient mochiweb, are all settings and the environment
> exactly the same?
environment is ubuntu 10.10 and behaved smoothly since 8.04. The rest of the parameters to the erlang vm are the same.
I'm gonna try using the defaults and probably getting rid of one module of the server that could be responsible for extra load.
Then I could also try using the older mochiweb.
> I think using {acceptor_pool_size, 256} is probably madness. The
> default of 16 is almost certainly more than enough. It doesn't control
> how many concurrent requests you can handle, just how many processes
> are blocking on gen_tcp:accept/2. The old version of mochiweb is more
> similar to having an acceptor_pool_size of 1 or maybe 2. It's a bit
> harder to compare because the new mochiweb uses the acceptor process
> to handle the request rather than starting a new process as soon as
> the accept happens.
Fair enough, I'll try to test under production dark load or something like that.
I'm also gonna remove the fixed buffer sizes and let linux determine that.
> Fair enough, I'll try to test under production dark load or something like that.
>
> I'm also gonna remove the fixed buffer sizes and let linux determine that.
Ok… I've done a few more tests and found out something interesting…
Behind nginx 1.0.5 the system is a lot slower (half the speed), however without nginx in front, even though the system is faster, after a few rounds of testing I start getting timeouts and connection resets… These are of course symptoms of too many open sockets and so on. Although it would be interesting to understand if this is a problem that doesn't affect nginx because it's too slow or because it does something else.
I have all sorts of options set to avoid the exhaust of sockets and ports such as:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_max_tw_buckets = 360000
net.ipv4.tcp_max_syn_backlog = 262144
net.core.netdev_max_backlog = 30000
net.ipv4.ip_local_port_range = 1024 65535
net.core.somaxconn = 262144
fs.file-max = 173273
this in limits.conf:
* hard nofile 999999
* hard memlock unlimited
I call ulimit -n before starting the server process and nginx is setup with
worker_rlimit_nofile 20000;
events {
worker_connections 26000;
use epoll;
}
On one side nginx is definitely slowing down the system but allows the benchmark to finish, on the other side if I hit mochiweb directly the benchmark doesn't finish and starts timing out down the line.
For example the number of connections in TIME_WAIT is wildly different, under mochiweb it goes up to 11000 while under nginx it never really crosses 5000.
However it's not really possible to lower the TIME_WAIT even more than 10 seconds.
I'm gonna try and test a different a nginx version and an older linux version or just a different one.
Here are some of the numbers:
Behind nginx:
Server Software: nginx/1.0.5
Server Hostname: foobar.amazonaws.com
Server Port: 80
Document Path: /perf
Document Length: 35 bytes
Concurrency Level: 1000
Time taken for tests: 59.983 seconds
Complete requests: 100000
Failed requests: 86
(Connect: 0, Receive: 0, Length: 86, Exceptions: 0)
Write errors: 0
Non-2xx responses: 86
Total transferred: 45640768 bytes
HTML transferred: 4061752 bytes
Requests per second: 1667.14 [#/sec] (mean)
Time per request: 599.829 [ms] (mean)
Time per request: 0.600 [ms] (mean, across all concurrent requests)
Transfer rate: 743.06 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 13 119.4 2 3035
Processing: 3 547 1296.4 166 15026
Waiting: 3 547 1296.3 166 15026
Total: 5 561 1306.8 173 18058
Percentage of the requests served within a certain time (ms)
50% 173
66% 190
75% 210
80% 229
90% 3041
95% 3188
98% 3479
99% 5420
100% 18058 (longest request)
Directly hitting mochiweb:
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking ec2-107-22-69-62.compute-1.amazonaws.com (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests
Server Software: MochiWeb/1.0
Server Hostname: ec2-107-22-69-62.compute-1.amazonaws.com
Server Port: 8000
Document Path: /perf
Document Length: 35 bytes
Concurrency Level: 500
Time taken for tests: 33.848 seconds
Complete requests: 100000
Failed requests: 0
Write errors: 0
Total transferred: 46600000 bytes
HTML transferred: 3500000 bytes
Requests per second: 2954.42 [#/sec] (mean)
Time per request: 169.238 [ms] (mean)
Time per request: 0.338 [ms] (mean, across all concurrent requests)
Transfer rate: 1344.49 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 73 488.7 2 9085
Processing: 3 92 58.2 72 1770
Waiting: 2 92 58.1 71 1769
Total: 4 166 493.7 82 9345
Percentage of the requests served within a certain time (ms)
50% 82
66% 123
75% 128
80% 132
90% 147
95% 190
98% 3023
99% 3109
100% 9345 (longest request)
> Is your benchmark using HTTP Keep-Alive? Nginx does not support it for
> talking to backends
It's ab but without the -k option so Keep-Alive is not active.
I've tested a self-compiled version of nginx which didn't seem to show the slowness and managed to not slow down mochiweb 3000+ req/sec
Maybe Ubuntu 11.10 ships with a version of nginx compiled with too many plugins that slow down the server itself or maybe it's compiled with very conservative settings… Unfortunately while I was about to test this more deeply I got side tracked.
I still get a high 100% response time (timeout) basically in the last request, not sure what it could be but if it reaches at least 3000 req/sec on a large instance then it's good enough, it's at least as fast as the current system.
On 4 Nov 2011, at 20:05, Valentino Volonghi wrote:
> On 4 Nov 2011, at 20:03, Bob Ippolito wrote:
>
>> Is your benchmark using HTTP Keep-Alive? Nginx does not support it for
>> talking to backends
--
Valentino Volonghi
http://www.adroll.com
Digging a bit deeper I've tried to run mochiweb and my app syntetically with this function:
bench(Url, Reps) ->
MochiReq = mochiweb_request:new(nil, 'GET', Url, {1, 1},
mochiweb_headers:make([{"Accept", "text/html"}])),
dru:benchcall(fun() -> dyno_web:loop(MochiReq) end, Reps).
I've also modified mochiweb a little bit to avoid sending on a nil socket or getpeer on a nil socket (The only 2 changes).
Without any other modification I could get 10000 calls per second on a very simple url.
Then I've run the function above under fprof:apply() with 500 reps.
In the results most of the functionality took around 100 ms cumulative within the profiler, except for one function:
string:to_lower/0 called by mochiweb_headers:normalize/1
Of course I don't know if this is related to the performance differences I'm seeing (it could be that in my new code I'm stressing this area more by fetching more headers and even though the code is not different it's affected) however this change alone affected about 60% of the performance.
Changing mochiweb_headers:normalize/1 to return just K instead of string:to_lower(K) I could run 16000 calls per second.
Another guilty part is lists:flatten although I doubt anything can be done about this.
This is the full result of the profile call.
https://gist.github.com/1366277
It's sad to see that what is 10000 calls/sec goes down to less than half that once sockets and such come into play :).
> Header parsing code is a "bottleneck" in these kinds of microbenchmarks but that part of mochiweb has not changed at all so can't possibly explain any of your problem.
That's true and I still think my main bottleneck is between nginx and erlang, although I can't quite figure out why nginx goes from handling 20000 req/sec of the simple transparent gif that it serves to handle 1800 req/sec in front of a server that can handle 3100 req/sec by itself. It uses a lot of CPU to handle those 1800, the same it uses for the transparent gif (or even more).
However my older code did GeoIP within erlang while the new code lets nginx do GeoIP and pass everything as headers. This means that for each request my new code is doing 11 more header lookups for each request compared to just 3 or 4 total before, since this is a bottleneck I could speed up my code a little bit by changing this part.
I'll try some more tests tomorrow.
It is an HTTP/1.0 proxy without the ability for keep-alive requests
yet. (As a result, backend connections are created and destroyed on
every request.) Nginx talks HTTP/1.1 to the browser and HTTP/1.0 to
the backend server. As such it handles keep-alive to the browser.
Ngoc
> I think it is because Nginx disconnect and reconnect all the time with
> the backend server.
> From Nginx doc (http://wiki.nginx.org/HttpProxyModule):
>
> It is an HTTP/1.0 proxy without the ability for keep-alive requests
> yet. (As a result, backend connections are created and destroyed on
> every request.) Nginx talks HTTP/1.1 to the browser and HTTP/1.0 to
> the backend server. As such it handles keep-alive to the browser.
I've run more performance tests between different versions of:
Ubuntu 11.04 and 11.10
Nginx 0.8.54 and 1.0.5
Erlang VM R14B03 and R14B04 (compiled both with same options)
This mixed in with my new code and my old code, just using ab, not a great tool to know actual performance but good to value differences.
I've also tried connecting directly to the mochiweb server and through nginx separately. I haven't run all of them yet but I have some numbers at the bottom.
With those numbers I don't think it's possible for nginx to be responsible for the slowdown, if at all considering that the new nginx has slightly smaller impact on performance when used even though it also does GeoIP. So I'd say nginx got slightly faster.
At this point it's either erlang that got slower between R14B03 and R14B04 or Ubuntu 11.10 that got slower than 11.04. The new mochiweb and my new code are considerably faster than my old code with old mochiweb even though the new code does more things by default, such as generating cookies, fetching more headers and so on. I'm gonna first try with a new erlang and if that proves still slow then I'll either stick with the old (but soon unsupported version of Ubuntu or change distribution entirely).
Given that Ubuntu 11.10 comes with Linux 3.0 which has several big changes among which Xen support, new process scheduler, big changes in memory management that maybe cause these slow downs.
All ran 100000 requests at 500 concurrency.
Erlang R14B03:
Ubuntu 11.04 - New Code - Direct Connection:
- Requests per second: 4271.01 [#/sec] (mean)
- Latency: 95% 112, 98% 262ms, 99% 3059, 100% 9286
Ubuntu 11.04 - New Code - Through Nginx 0.8.54:
- Requests per second: 2460.42 [#/sec] (mean)
- Latency: 95% 212, 98% 3091, 99% 3135, 100% 9214
Ubuntu 11.04 - Old Code - Direct Connection:
[Required one retry to finish due to timeouts]
- Requests per second: 3318.86 [#/sec] (mean)
- LatencyL 95% 133, 98% 1063, 99% 3077, 100% 17182
Ubuntu 11.04 - Old Code - Through Nginx:
- Requests per second: 1897.05 [#/sec] (mean)
- Timeouts: 21
- Latency: 95% 191, 98% 3118, 99% 3163, 100% 15006
Erlang R14B04:
Ubuntu 11.10 - New Code - Direct Connection:
- Requests per second: 3306.48 [#/sec] (mean)
- Latency: 95% 158, 98% 404, 99% 3086, 100% 9875
Ubuntu 11.10 - New Code - Through Nginx 1.0.5:
- Requests per second: 1977.81 [#/sec] (mean)
- Latency: 95% 244, 98% 3100, 99% 3168, 100% 9317
Ubuntu 11.10 - Old Code - Direct Connection:
[Required several retries to finish due to timeouts]
- Requests per second: 2966.25 [#/sec] (mean)
- Latency: 95% 123, 98% 783, 99% 3058, 100% 23976
Ubuntu 11.10 - Old Code - Through Nginx 1.0.5:
- Requests per second: 1880.54 [#/sec] (mean)
- Timeouts: 34
- Latency: 95% 215, 98% 3100, 99% 3288, 100% 15870
> --
> You received this message because you are subscribed to the Google Groups "MochiWeb" group.
> To post to this group, send email to moch...@googlegroups.com.
> To unsubscribe from this group, send email to mochiweb+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mochiweb?hl=en.
>
>
--
BR,
\|/ Kunthar
One thing that I re-checked is the machine that is clearly faster than the others was using a different CPU.
Re-running tests with a new instance and same CPU showed roughly the same results as the others, slightly better than old code overall, not 30% more, more like a few percentage points only and this sent me back to square one or rather -1 since now it's becoming a general performance issue...
> - Did you check file descriptor size and soft limits of processes by
> any chance?-
Yes.
New instances of Ubuntu have:
fs.file-max = 209708
net.ipv4.netfilter.ip_conntrack_max = 300000
net.ipv4.ip_local_port_range = 1025 65535
net.core.somaxconn = 32768
net.core.netdev_max_backlog = 262144
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_fin_timeout = 10
while older instances have a similar:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 30000
net.ipv4.ip_local_port_range = 1025 65535
net.core.somaxconn = 262144
fs.file-max = 173273
nginx is configured with
worker_processes 3;
worker_rlimit_nofile 100000;
events {
worker_connections 100000;
use epoll;
}
limits.conf contains:
* - nofile 999999
* - memlock unlimited
and the erlang VM is started with:
+K true +A 64 +W w -smp enable -env ERL_MAX_PORTS 65536
through a script that does:
#!/bin/sh
ulimit -n 1000000
COMMAND="erl -args_file vm.args $@"
exec ${COMMAND}
> - Also is this possible to dump tcp traffic with tcpdump to see latency values?
I guess I can give this a try.
> - Do you use alogger/lager etc. type disk based loggers?
nginx has logging disabled (except for timeouts, exceptions, error responses, and so on, if everything goes well it doesn't log anything).
erlang for the url tested doesn't log anything at all, otherwise for actual production code it does have disk based logging, this is practically the same code as the old server and this is the code path for a logging call:
internal_write(State=#log{handle=H}, Data) ->
ok = file:write(H, Data),
State.
where H is opened with:
{ok, H} = file:open(Fullpath, [read, write, binary, delayed_write]),
Data is usually around 600 bytes per call and by default delayed_write waits for 64K before flushing.
I've thought of other potential issues like wrong balancing across the cores but I've started a new instance with 8 CPUs and the load spreads evenly across all of them and can handle more data than apparently its network lets through, running even 2 ab locally on this machines yields more than 12000 req/sec (and 2 cores are busy with ab). Non locally instead I get high latency on the 100% and 8000 req/sec or so max even with 5 other instances stressing the machine.
On one physical Intel(R) Core(TM)2 CPU 6600 @2.40GHz that I have here with 4GB of RAM running Ubuntu 10.10 and erlang R14B03 I get the new code to do about 6000 req/sec with 99% latency at 14 and 100% latency at 3010.
If that's the case then I'm simply reaching the limit of the underlying network capacity for the instance and I need to increase the number of instances to fix all these issues.