[erlang-questions] is anyone else experiencing reliability issues with R15?


Rapsey

Sep 19, 2012, 4:11:51 PM
to erlang-q...@erlang.org
We run a network of custom-built streaming servers doing video streaming and transcoding of IPTV channels.
On R14 everything runs great. But after switching to R15, gen_servers inexplicably block and don't respond to messages, even the console blocks and does not respond to input for 30s or so, and processes balloon, taking up large amounts of memory for no reason. This happens at random times, but gets much worse once more users are connected to the server, doing a lot of req/s or receiving a lot of data.
We're running Ubuntu Server and start Erlang with these switches:
erl +Bd +S 4 +P 1000000 -env ERL_MAX_PORTS 100000 +K true +A 32

Are we alone having problems with R15? We tried R15B01 and R15B02.



Sergej

Zabrane Mickael

Sep 19, 2012, 4:18:39 PM
to Rapsey, erlang-q...@erlang.org
Hi Rapsey,

I can confirm lower performance using R15B01/R15B02 compared to R14, but no reliability issues.

Regards,
Zabrane
> _______________________________________________
> erlang-questions mailing list
> erlang-q...@erlang.org
> http://erlang.org/mailman/listinfo/erlang-questions




Geoff Cant

Sep 19, 2012, 6:40:23 PM
to Rapsey, erlang-q...@erlang.org
Gah, I'm just in the middle of upgrading production from R14B3 to R15B02.

Do you have process_info(Pid, backtrace) or similar info for these blocked processes?
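
For reference, a minimal sketch of how that information could be gathered from a shell on the affected node. The module name stuck_info and its two functions are made up for illustration; only erlang:process_info/2 and erlang:processes/0 are real BIFs.

```erlang
-module(stuck_info).
-export([snapshot/1, longest_queues/1]).

%% Collect the items most useful for diagnosing a process that appears
%% blocked. Returns undefined if the process is no longer alive.
snapshot(Pid) when is_pid(Pid) ->
    Items = [current_function, status, message_queue_len, memory, backtrace],
    erlang:process_info(Pid, Items).

%% Return up to N {QueueLen, Pid} pairs, longest message queue first.
%% Processes that die while we iterate yield undefined and are skipped
%% by the pattern in the second generator.
longest_queues(N) when is_integer(N), N >= 0 ->
    Lens = [{Len, Pid} || Pid <- erlang:processes(),
                          {message_queue_len, Len} <-
                              [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).
```

For example, stuck_info:snapshot(whereis(my_server)) for a registered gen_server (my_server being a placeholder name), or stuck_info:longest_queues(10) to see which processes have the longest mailboxes.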


One of the changes from R14 to R15 that I hazily recall was the 'don't scan the message queue if you're waiting on a reply with a ref that was just generated' thing that avoids examining long message queues.

> • receive statements that can only read out a newly created reference are now specially optimized so that it will execute in constant time regardless of the number of messages in the receive queue for the process. That optimization will benefit calls to gen_server:call(). (See gen:do_call/4 for an example of a receive statement that will be optimized.)
>
> Own Id: OTP-8623
>
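
To illustrate the kind of receive that note describes, here is a sketch of a hand-rolled call in the style of gen:do_call/4. The module name and the {request, ...} / {reply, ...} message shapes are assumptions for the example, not what gen_server uses internally. As far as I understand it, the reference has to be created in the same function and every clause has to match on it for the optimization to have a chance of applying.

```erlang
-module(ref_call).
-export([call/3]).

%% Send Request to Pid and wait for a reply tagged with a fresh reference.
%% Every receive clause matches on Mref, which did not exist before this
%% function ran, so no older message in the queue can possibly match.
call(Pid, Request, Timeout) ->
    Mref = erlang:monitor(process, Pid),
    Pid ! {request, self(), Mref, Request},
    receive
        {reply, Mref, Reply} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Reply};
        {'DOWN', Mref, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        erlang:demonitor(Mref, [flush]),
        {error, timeout}
    end.
```

The server side of this sketch would answer with From ! {reply, Mref, Result}.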

-Geoff

Fred Hebert

Sep 19, 2012, 9:02:04 PM
to Rapsey, erlang-q...@erlang.org
We've had problems in R15B01 with particular statistics functions related to schedulers, as described in http://erlang.org/pipermail/erlang-bugs/2012-July/002964.html

To date there is no solution and we just stopped using these functions, going back to run queues.

We also have seen a non-negligible increase in CPU usage from R14B* versions, easily around 20% or so during regular workload, although it didn't seem to affect heavy overload situations too negatively for us (no precise measurements were made for this, just casual observations). It remained high no matter what arguments we gave to the VM.

We have noticed nodes getting locked up in R15B01 from time to time when memory on the server is getting scarce, taken by other applications -- it seemed we had a lot of contention on proc_tab mutexes, but nothing came out of it. We eventually reduced memory usage in other applications and things have been rather stable since then.

Other than that, everything appeared normal, and none of the blocking incidents could be directly attributed to issues you appear to have. We haven't seen memory ballooning except in occasional error logger cases, but most of our processes are extremely short-lived (well under 150ms).



Ali Sabil

Sep 20, 2012, 2:43:31 AM
to Fred Hebert, erlang-q...@erlang.org
We have experienced similar issues with R15B01 where I/O gets
completely blocked, but we haven't really been able to track it down.
Our main suspect was the use of sendfile.


Lukas Larsson

Sep 20, 2012, 6:34:06 AM
to Ali Sabil, erlang-q...@erlang.org
Hello,

If you ever suspect that the Erlang VM is blocking for some reason,
first make sure that it is not something in Erlang space which is
wrong, e.g. a process waiting for a message which never arrives or
something like that.

When you are sure it is the VM that is the problem, the most
informative thing to do (IMO) is to either attach with gdb to that
process or dump a core using kill -ABRT.

Once you have gdb attached, or a core loaded in gdb, do:

(gdb) info threads

and then for each thread do:

(gdb) thread ${ThreadId}
(gdb) bt

This will give you a bunch of information about what the emulator is doing.

There are also a couple of tools which can help you debug specific
things within the emulator. For instance if you do

(gdb) source $ERL_TOP/erts/etc/unix/etp-commands
(gdb) etp-help

you get a list of helpful commands which can print all sorts of
interesting data. One example is etp-stacktrace, which given a Process
* will print the stacktrace of that process, e.g.:

(gdb) bt
#0 0x00007ffff6aa19a8 in __GI___poll (fds=0x7ffff67bac08, nfds=2,
timeout=<optimised out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x000000000062c9b0 in check_fd_events (tv=0x7fffffffbf00,
ps=0x7ffff67ba100, max_res=<optimised out>) at
sys/common/erl_poll.c:1974
#2 erts_poll_wait_nkp (ps=0x7ffff67ba100, pr=0x7fffffffb700,
len=0x7fffffffbf10, utvp=<optimised out>) at
sys/common/erl_poll.c:2087
#3 0x000000000062f528 in erts_check_io_nkp (do_wait=<optimised out>)
at sys/common/erl_check_io.c:1173
#4 0x00000000006259de in erl_sys_schedule (runnable=<optimised out>)
at sys/unix/sys.c:2734
#5 0x0000000000551de5 in scheduler_wait (rq=0x7ffff687b080,
esdp=0x7ffff687b2c0, fcalls=<synthetic pointer>) at
beam/erl_process.c:2195
#6 schedule (p=<optimised out>, calls=<optimised out>) at
beam/erl_process.c:6377
#7 0x00000000005c8867 in process_main () at beam/beam_emu.c:1229
#8 0x000000000051334c in erl_start (argc=10, argv=<optimised out>) at
beam/erl_init.c:1493
#9 0x00000000004f55f9 in main (argc=<optimised out>, argv=<optimised
out>) at sys/unix/erl_main.c:29
(gdb) f 7
#7 0x00000000005c8867 in process_main () at beam/beam_emu.c:1229
(gdb) etp-stacktrace c_p
% Stacktrace (22): <the non-value>.
#Cp<user:do_io_request/5+0x78>.
#Cp<user:server_loop/2+0x5a0>.
#Cp<user:catch_loop/3+0x90>.
#Cp<terminate process normally>.
(gdb) p c_p
$1 = (Process *) 0x7ffff7e87348
(gdb)

If the VM seems to block for a while and then continues to run it
could be because all schedulers are hitting the same lock at the same
time. Use an emulator with --enable-lock-counter[1] to figure out
which lock it is that is causing the issue. Also gprof and oprofile
can be very useful when used correctly, though their output is at
times quite hard to interpret.

If when investigating you find something that seems fishy, try to
limit the scope of the potential bug as much as you can. The more
specific you are in the description of the (mis)behaviour you are
experiencing, the more likely it will be that we can help you.

Lukas

[1]: http://www.erlang.org/doc/apps/tools/lcnt_chapter.html

Rickard Green

Sep 21, 2012, 5:41:47 AM
to Erlang Questions
This mail isn't aimed at anyone specific, but more of a general statement.

Loss of performance or bad behavior of an application when going up to a newer release does not imply that there is a problem with the new release. The only conclusion that can be drawn is that this combination doesn't work well. It might be a problem with the new release, but it might also be a problem with the application. As an example, any arbitrary optimization improving performance may trigger a logical bug in the application due to different timing, or overwhelm a consumer due to faster producers in the absence of flow control, etc. As another example, if performance-critical parts of the application make use of functionality that we at OTP do not see as performance critical, you may also end up in trouble. I don't know how big of an issue this latter example is in reality, but I think I have to say some words about it.

When optimizing the system we sometimes choose to degrade performance of functionality that is not performance critical in order to gain performance of performance-critical functionality, or to gain overall performance. If you've made use of such functionality in the critical path, you'll lose overall performance of your application. It may also cause memory issues. When working with scalability improvements these seem to be choices that we have to make more and more often. I suspect we need to be better at informing about such changes.

One such change that I realized isn't mentioned anywhere is the implementation of erlang:memory(). In R15 this call is much more heavyweight from the caller's perspective, but from an overall performance perspective much more lightweight.

Preferably these kinds of changes won't come as surprises. It is, however, hard to give an exhaustive answer to the question of what we do and do not consider as performance-critical functionality, and I will not try to do that. I think common sense will get you far. We do consider core functionality of the language, such as message passing, as performance critical. Functionality that pulls out miscellaneous information about the internal state of something is typically not considered as performance critical. Apart from erlang:memory(), process_info() is a good example of such functionality.
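
As a rough sketch of keeping such calls out of the critical path: sample erlang:memory(total) from one periodic process and let everything else read the cached value. The module name mem_sampler, the ETS table name and the stop message are hypothetical choices for this example, and this is not something OTP ships.

```erlang
-module(mem_sampler).
-export([start/1, stop/1, total/0]).

-define(TAB, mem_sampler_cache).

%% Start a process that refreshes the cached value every IntervalMs.
%% Returns once the first sample has been stored.
start(IntervalMs) when is_integer(IntervalMs), IntervalMs > 0 ->
    Parent = self(),
    Pid = spawn(fun() -> init(Parent, IntervalMs) end),
    receive {Pid, ready} -> {ok, Pid} end.

%% Ask the sampler to exit; its ETS table goes away with it.
stop(Pid) ->
    Pid ! stop,
    ok.

%% Cheap read for hot code paths: one ETS lookup, no call to erlang:memory().
total() ->
    [{total, Bytes}] = ets:lookup(?TAB, total),
    Bytes.

init(Parent, IntervalMs) ->
    ?TAB = ets:new(?TAB, [named_table, protected, set]),
    sample(),
    Parent ! {self(), ready},
    loop(IntervalMs).

loop(IntervalMs) ->
    receive
        stop -> ok
    after IntervalMs ->
        sample(),
        loop(IntervalMs)
    end.

%% The only place the heavyweight call is made.
sample() ->
    ets:insert(?TAB, {total, erlang:memory(total)}).
```

The value returned by mem_sampler:total() can be up to IntervalMs stale, which is usually acceptable for load shedding or monitoring decisions.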

Anyhow, if you get into trouble when you upgrade to a new release, you need to find out why. It might be an issue with the release, but it might also very well be an issue with your application. In some cases we might be able to help, but in some cases not. We are not that many people working here at OTP. The more relevant information you provide, the better chance of getting help. We are also very interested in finding potential issues with OTP, but as I already said we have limited resources.

Regarding increased CPU utilization in R15: when schedulers run out of work, they busy wait for a while before going to sleep. Waking up a busy waiting thread is much faster than waking up a sleeping thread. Due to the rewrites of memory allocation in R15, schedulers are woken more frequently, which causes more busy waiting, which in turn causes an increase in CPU utilization when schedulers frequently run out of work (you will at least see some decrease of CPU utilization due to this in R16). When not running out of work there will be no busy waiting at all. That is, the increase in CPU utilization does not translate into loss of performance. The busy waiting is there since it shortens the average time to wake up a scheduler, and thereby reduces average communication latency between processes. Depending on the application, the reduced latency might also translate into improved throughput. If the increase in CPU utilization is unwanted, one can as of R15B02 shorten the busy wait threshold (+sbwt command line argument). Note that shortening the busy wait threshold increases average latency.

Regards,
Rickard Green, Erlang/OTP, Ericsson AB