Groups

[erlang-questions] How to avoid long_schedule issue ?

25 views

Skip to first unread message

Alex Feng

unread,

Dec 27, 2016, 4:45:18 AM12/27/16

to erlang-q...@erlang.org

Hi,

I was reading some documents about scheduler, to my understanding there will be no long_schedule issue since BIF has trap mechanism. Then I found this, I couldn't understand, the reduction is keeping increasing, why the BIF(lists:seq) is not interrupted for more than 10ms ?

Could someone please explain this ?

1> erlang:system_monitor(self(),[{long_schedule, 10}]).

undefined

2> Pid = spawn(fun()->lists:seq(1,2000000000) end).

<0.65.0>

3>flush().

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,21},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,38},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,17},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,21},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,23},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,31},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,30},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,43},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,53},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,56},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,51},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,81},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,17},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,18},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,110},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,74},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,24},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,76},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,112},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,43},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,130},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,205},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

Shell got {monitor,<0.65.0>,long_schedule,

[{timeout,77},

{in,{lists,seq_loop,3}},

{out,{lists,seq_loop,3}}]}

20> process_info(Pid, reductions).

{reductions,4166000}

21> process_info(Pid, reductions).

{reductions,7150000}

22> process_info(Pid, reductions).

{reductions,8328000}

23> process_info(Pid, reductions).

{reductions,12330000}

24> process_info(Pid, reductions).

{reductions,14354000}

25> process_info(Pid, reductions).

{reductions,24774000}

Br,

Alex

Max Lapshin

unread,

Dec 28, 2016, 10:23:26 AM12/28/16

to Alex Feng, Erlang-Questions Questions

I'm also very interested in how to properly interpret these warnings =(

Dániel Szoboszlay

unread,

Dec 28, 2016, 6:20:08 PM12/28/16

to Max Lapshin, Alex Feng, Erlang-Questions Questions

Hi,

First of all, lists:seq/2 is not a BIF, it is a pure Erlang function. And the 20-40 ms "long schedule" events are absolutely normal, you should use a much larger threshold (I would recommend at least 500 ms) to filter for the real outliers only.

Now let's see what more can I tell you about long schedules!

When you monitor long schedules, the schedulers will simply note the wall clock time when they schedule in a process and compare it with the wall clock time when they schedule it out. If more time has passed than the threshold you set, you get a message. This unfortunately can be quite a bit misleading, as it may include times you wouldn't expect. Just like:

Other OS processes running on the CPU. The OS can preempt the scheduler thread and give the CPU to some other process for 10-20 ms or more. And the scheduler will not know about this interruption. I'm pretty sure this is the reason for your long schedules, Alex.
And this is fine as long as the host does not actually become CPU limited. If there are many OS processes fighting for the CPU, you will see horrible long schedules all over the place.
The OS performing some time consuming task for you. Like a page fault that requires reading from the disk swap. Whoops, your scheduler is suspended for tens of milliseconds without noticing it!
The OS performing some time consuming interrupt handling while your scheduler has the CPU.
This is my personal favourite, because this means some totally unrelated OS code (like crappy NIC drivers) runs in your process' context, and can log scary looking messages that all seem to come from the pure beam process. Not to mention long schedules, of course.
The time it takes to grab some internal locks in the VM. If the lock is held by a long scheduling process, every other process waiting for the same lock will also long schedule.
In our production system for example ~50% of long schedules come from a single monitoring process that periodically collects process info from other processes. Of course if a process long schedules with e.g. 1200 ms, the monitoring process will have to wait up to 1200 ms as well to grab the lock on it required for fetching process info.

It is also good to know that not all BIF-s are preemptable, and those that are, will calculate their reduction cost in very ad-hoc looking ways. For example, it looks like that the lists:reverse/2 BIF can process 40 list elements per reduction, while lists:keyfind/3 can search 10 list elements per reduction. Do you think that reverting 40 list elements and looking through 10 list elements would take exactly the same wall clock time? Probably not. And they probably won't take exactly as much time either as an average reduction when executing your application's Erlang code. These reduction cost estimates work fine in most cases, but can be inaccurate when you give huge inputs to these functions. If they happen to be too low estimates on your system, you may still see long schedules when all your BIF-s and NIF-s are nice and preemptable.

Now if you still have interesting long schedules that you want to debug, you need to keep in mind that the schedule in and schedule out functions are not necessarily the point where the time is wasted. For example if you have a gen_server that - when handling one particular message - calls an unfriendly BIF/NIF which doesn't update the reduction count, you will typically see that both the schedule in and schedule out points are gen_server:loop/6. Nothing will point to the BIF/NIF in the event, so good luck finding the offender! You have to consider all execution pathes that may lead from the schedule in point to the schedule out point. The offender can be any of the functions used on any of these pathes.

Finally, a bit about finding the ideal long schedule threshold. 10 ms is typically too low: it basically means every time the OS schedules out your VM thread you will get a long schedule. But you need to consider the latency requirements of your application: if you're doing high frequency trading or ad bidding or whatever, maybe a 10 ms pause would be too much for you. In this case you can use such a low threshold, but be sure to turn off swapping and pin your schedulers to cores that are exclusively used by the VM, and where you have disabled interrupts, tick handling etc. In general, for a system where you need to keep latency under T, it makes sense to monitor long schedules with a threshold of ~0.5 T - 0.8 T or so.

Both heart and the net ticktime of the distribution protocol give you such latency requirements: heart needs a heart beat message every 60 seconds and the distribution protocol sends one hear beat message every 15 seconds. So long schedules in the 15,000 ms range start to interfere with the distribution protocol, and above 60,000 ms can kill your node. (These limits may sound crazy, but I regularly see ~20,000 ms long schedules in our systems. Unfortunately.)

Hope this helps!

Daniel

On Wed, 28 Dec 2016 at 16:23 Max Lapshin <max.l...@gmail.com> wrote:

I'm also very interested in how to properly interpret these warnings =(

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Alex Feng

unread,

Dec 29, 2016, 4:23:03 AM12/29/16

to Dániel Szoboszlay, Erlang-Questions Questions

Hi Daniel,

The answer is very helpful, I have learnt a lot from your answer.

Thank you very much and Happy New Year.

Br,

Alex

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu