Re: [erlang-questions] +swt very_low doesn't seem to avoid schedulers getting

Rickard Green

unread,

Oct 10, 2012, 3:12:23 PM10/10/12

to Scott Lystig Fritchie, Erlang Questions

> Hi, all. According to my private mailing list archive, there hasn't
> been much mention of erl's "+swt" flag since about April 2012.
>
> I just witnessed a case of where using "+swt very_low" with Riak
> 1.2.1rc2, using Erlang/OTP R15B01, get "stuck" where CPU consumption was
> only 200% on an 8 core AWS instance. The other nodes in that Riak
> cluster were running at over 600% CPU utilization (on average).
>
> When I ran this:
>
> {io:format("before..."), erlang:system_flag(schedulers_online, 1),
> timer:sleep(1000), erlang:system_flag(schedulers_online, 8),
> io:format("after\n")}.
>
> ... then average CPU utilization on that node immediately shot up from
> 200% to about 760%.

This is very much expected. Since you have work that can load 2
schedulers full time and shuts down all but one, the run-queue will
grow. When you later release all schedulers, there will be lots of work
to pick from.

>
> I'd heard a rumor that "+swt very_low" was supposed to avoid whatever
> weird scheduler problem/bug that caused some schedulers to appear as if
> they weren't active. But I was on site today and witnessed this
> first-hand and verified that the "+swt very_low" flag was indeed being
> used.
>

The runtime system tries to compact the load on as few schedulers as
possible without getting run-queues that build up. The runtime system
wont wake up new schedulers unless some overload has accumulated. This
overload either show up as a quickly growing run-queue or a small
run-queue over a longer time. The +swt flags sets the threshold that is
used for determining when enough overload has accumulated to wake up
another scheduler.

This compaction of load onto fewer schedulers is there in order to
reduce communication overhead when there aren't enough work to fully
utilize all schedulers. The performance gain of this compaction depends
on the hardware.

We have gotten reports about problems with this functionality, but we
have not found any bugs in this functionality. We have only found that
it behaves as expected. That is, if more schedulers aren't woken this is
due to not enough accumulated overload. The +swt switch was introduced
in order to give the user the possibility do define what is enough
overload for his or her taste.

The currently used wakeup strategy is very quick to forget about
previously accumulated overload that has disappeared. Maybe even too
quick for my taste when "+swt very_low" is used. I've therefore
implemented an alternative strategy that most likely will be the default
in R16. As of R15B02 you can try this strategy out by passing "+sws
proposal" as a command line argument. In combination with "+swt
very_low", the runtime system should be even more eager to wake up
schedulers than when only using "+swt very_low".

> I'm not certain what exact Linux distribution and kernel was used. I'll
> ask the customer to send me that info so I can forward it to the list.
>
> Has anyone else seen this behavior? Unlike Knut Nesheim's report on
> this list back in February 2012, Riak does not use the halfword
> emulator. We are using some NIFs, but this customer isn't using the
> most evil one, the Riak eleveldb NIF library. Instead, they're using
> the Bitcask backend (which has a NIF component but isn't as evil as
> eleveldb's NIF) and the merge_index backend (which is pure Erlang).
>
> -Scott

Regards,
Rickard
--
Rickard Green, Erlang/OTP, Ericsson AB.
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Scott Lystig Fritchie

unread,

Oct 16, 2012, 12:20:51 AM10/16/12

to Rickard Green, Erlang Questions

Rickard Green <ric...@erlang.org> wrote:

rg> This is very much expected. Since you have work that can load 2
rg> schedulers full time and shuts down all but one, the run-queue will
rg> grow. When you later release all schedulers, there will be lots of
rg> work to pick from.

Hi, Rickard. Sorry I didn't reply earlier ... Basho kept me busy with
an all-hands meeting and conference out in San Francisco.

Perhaps I wasn't all that clear about the problem that I saw and that
several other customers have witnessed.

1. One node in a Riak cluster is consuming significantly lower CPU than
the other nodes in the cluster. The imbalance is not due to
application layer workload imbalance, as far as we can tell. (In the
case that I personally witnessed, it was a lab environment with an
artificial & deterministic load generator talking to all Riak nodes
equally (or trying very hard to)).

2. As soon as we do one of two things, the CPU imbalance disappears:
a. Restart the Riak app on the slow node.
b. Use the erlang:system_flag(schedulers_online, 1) hack and then
back to 8 using the same BIF.

In situations described by customers, this seems to happen after a day
or more of load, where the peak workload is substantially higher than
off-peak workload. In the lab environment that I witnessed, the load
generators were cycling through 100%-off and 100%-on states.

rg> This compaction of load onto fewer schedulers is there in order to
rg> reduce communication overhead when there aren't enough work to fully
rg> utilize all schedulers. The performance gain of this compaction
rg> depends on the hardware.

What you describe seeems to be exactly what's happening ... except that
when input workload rises again, the idled schedulers aren't waking up,
ever. Or we force them to wake up with the system_flag() BIF.

rg> We have gotten reports about problems with this functionality, but
rg> we have not found any bugs in this functionality. We have only found
rg> that it behaves as expected. That is, if more schedulers aren't
rg> woken this is due to not enough accumulated overload. The +swt
rg> switch was introduced in order to give the user the possibility do
rg> define what is enough overload for his or her taste.

Hrm, well, we've seen it both with "+swt very_low" and without any +swt
flag at all. And it's extremely irritating. :-)

What info would you need gathered from the field when this bugaboo
strikes next time?

-Scott

Vance Shipley

unread,

Oct 16, 2012, 7:08:12 AM10/16/12

to Rickard Green, Erlang Questions

Rickard,

Maybe you can offer some insight into a problem I'm having. I have
a linked in driver which uses port level blocking with many ports.
The driver uses driver_set_timer() to wake up in so many milliseconds
(e.g. 10ms) to do a small amount of work in it's timeout() callback.
What we see is that despite a low CPU usage and multiple schedulers
often the timeout will be called quite late (e.g. 10-20ms late).

I have spent quite a bit of time looking at the implementation of the
schedulers but am not sure what's going on. I am beginning to think
the shceduler is going off to do aux work.

We've tried changing +swt without effect. What do you think?

On Wed, Oct 10, 2012 at 09:12:23PM +0200, Rickard Green wrote:
} This compaction of load onto fewer schedulers is there in order to
} reduce communication overhead when there aren't enough work to
} fully utilize all schedulers. The performance gain of this
} compaction depends on the hardware.

--
-Vance

Patrik Nyblom

unread,

Oct 16, 2012, 9:44:50 AM10/16/12

to erlang-q...@erlang.org

Hi!

On 10/16/2012 01:08 PM, Vance Shipley wrote:
> Rickard,
>
> Maybe you can offer some insight into a problem I'm having. I have
> a linked in driver which uses port level blocking with many ports.
> The driver uses driver_set_timer() to wake up in so many milliseconds
> (e.g. 10ms) to do a small amount of work in it's timeout() callback.
> What we see is that despite a low CPU usage and multiple schedulers
> often the timeout will be called quite late (e.g. 10-20ms late).

What does the other callbacks in your driver do? Anything that could
take 10-20 ms? In that case it's normal, your port is locked for what
would be considered a very long time in a driver.

> I have spent quite a bit of time looking at the implementation of the
> schedulers but am not sure what's going on. I am beginning to think
> the shceduler is going off to do aux work.
>
> We've tried changing +swt without effect. What do you think?
>
> On Wed, Oct 10, 2012 at 09:12:23PM +0200, Rickard Green wrote:
> } This compaction of load onto fewer schedulers is there in order to
> } reduce communication overhead when there aren't enough work to
> } fully utilize all schedulers. The performance gain of this
> } compaction depends on the hardware.

Cheers,
/Patrik

Vance Shipley

unread,

Oct 16, 2012, 11:42:15 PM10/16/12

to Patrik Nyblom, erlang-q...@erlang.org

On Tue, Oct 16, 2012 at 03:44:50PM +0200, Patrik Nyblom wrote:
} What does the other callbacks in your driver do? Anything that
} could take 10-20 ms? In that case it's normal, your port is locked
} for what would be considered a very long time in a driver.

On 10/16/2012 01:08 PM, Vance Shipley wrote:
} The driver uses driver_set_timer() to wake up in so many milliseconds
} (e.g. 10ms) to do a small amount of work in it's timeout() callback.

^^^^^^^^^^^^^^^^^^^^^^
Much less than a millisecond of work.

--
-Vance

Patrik Nyblom

unread,

Oct 18, 2012, 10:26:50 AM10/18/12

to erlang-q...@erlang.org

On 10/17/2012 05:42 AM, Vance Shipley wrote:
> On Tue, Oct 16, 2012 at 03:44:50PM +0200, Patrik Nyblom wrote:
> } What does the other callbacks in your driver do? Anything that
> } could take 10-20 ms? In that case it's normal, your port is locked
> } for what would be considered a very long time in a driver.
>
> On 10/16/2012 01:08 PM, Vance Shipley wrote:
> } The driver uses driver_set_timer() to wake up in so many milliseconds
> } (e.g. 10ms) to do a small amount of work in it's timeout() callback.
> ^^^^^^^^^^^^^^^^^^^^^^
> Much less than a millisecond of work.

And the driver has no other callbacks?

Rickard Green

unread,

Oct 18, 2012, 2:13:00 PM10/18/12

to Scott Lystig Fritchie, Erlang Questions

I can of course not guarantee that there isn't a hard to find bug here,
but based on previous experience of similar situations and actually
having tried hard to find a bug here, my guess is still that there
actually are not appearing much more work for schedulers to do, and that
this is the reason to why no more schedulers are woken up. The reason
for this is more or less impossible for me to guess, since I don't know
the implementation of your application.

My recommendation is to try out the "+sws proposal +swt very_low" combo
in R15B02.

> What info would you need gathered from the field when this bugaboo
> strikes next time?
>

Currently, one needs to hack the emulator in order to pull out the most
important information, and, unfortunately, I don't have the time to do this.

However, you can call "statistics(run_queues)" (note that the argument
should be 'run_queues', and not 'run_queue') repeatedly, say once every
100 ms (perhaps even more frequent than this) for say 10 seconds when
the system is in this state . That information will at least give us a
good hunch of what is going on. statistics(run_queues) returns a tuple
containing the run queue length of each run queue as elements.

Regards,
Rickard
--
Rickard Green, Erlang/OTP, Ericsson AB.

Rickard Green

unread,

Oct 18, 2012, 2:14:42 PM10/18/12

to Vance Shipley, Erlang Questions

On 10/16/2012 01:08 PM, Vance Shipley wrote:

> Rickard,
>
> Maybe you can offer some insight into a problem I'm having. I have
> a linked in driver which uses port level blocking with many ports.
> The driver uses driver_set_timer() to wake up in so many milliseconds
> (e.g. 10ms) to do a small amount of work in it's timeout() callback.
> What we see is that despite a low CPU usage and multiple schedulers
> often the timeout will be called quite late (e.g. 10-20ms late).
>

Note that the system doesn't give you any guarantees about how late the
timeout will be triggered. It should however not be triggered before the
timeout you have set.

I don't want to give you a figure of how late I think is reasonable and
not reasonable. This both since it depends on a lot of things out of the
runtime systems control, such as OS, user implemented drivers, etc, and
also since there still exist operations in the runtime system that when
used incautiously can monopolize a scheduler thread for a long time
(such as a large gc).

> I have spent quite a bit of time looking at the implementation of the
> schedulers but am not sure what's going on. I am beginning to think
> the shceduler is going off to do aux work.
>

Even if there are a substantial amount of aux work to do, schedulers
will repeatedly break out from execution of aux work and check if other
work (such as triggering timers) also need to be done.

> We've tried changing +swt without effect. What do you think?
>
> On Wed, Oct 10, 2012 at 09:12:23PM +0200, Rickard Green wrote:
> } This compaction of load onto fewer schedulers is there in order to
> } reduce communication overhead when there aren't enough work to
> } fully utilize all schedulers. The performance gain of this
> } compaction depends on the hardware.
>

In order to know exactly what is happening you need to inspect the
execution somehow. For example, stracing or hacking the emulator.

Regards,
Rickard
--
Rickard Green, Erlang/OTP, Ericsson AB.

Scott Lystig Fritchie

unread,

Nov 3, 2012, 6:36:19 PM11/3/12

to Rickard Green, erlang-q...@erlang.org

A few weeks ago, Rickard Green <ric...@erlang.org> wrote:

> However, you can call "statistics(run_queues)" (note that the argument
> should be 'run_queues', and not 'run_queue') repeatedly, say once every
> 100 ms (perhaps even more frequent than this) for say 10 seconds when
> the system is in this state . That information will at least give us a
> good hunch of what is going on. statistics(run_queues) returns a tuple
> containing the run queue length of each run queue as elements.

Hiya. We have some new data from three customer machines running
Riak 1.2.1 with R15B01 that all hit what appears to be this same
"schedulers getting stuck" problem.

The machines were fixed before I was aware of them, so I didn't
get a chance to rummage around. We do not have the output that
you suggested, statistics(run_queues). However, we do have
samples of the traces that are generated by:

erlang:trace(all, true, [running,scheduler_id])

When stuck, the output looks like this, with tuples of
{scheduler #, # of samples}

(riak@node1)1> schedstat:run().
<0.16760.459>
=== in scheduler count===
[{1,264},
{2,257},
{3,0},
{4,0},
... and repeating zero samples all the way to
to scheduler 64.

When unstuck, the output looks like this:

(riak@node1)1> schedstat:run().
<0.3422.460>
=== in scheduler count===
[{1,65},
{2,5},
{3,0},
{4,0},
{5,14},
{6,73},
{7,0},
{8,0},
{9,0},
{10,0},
{11,0},
{12,0},
{13,159},
{14,182},
{15,6},
{16,0},
... and repeating zero samples all the way to
to scheduler 64.

I do not know if the +swt flag was used on these machines,
sorry.

Raw output, courtesy of Kelly McLaughlin, is available at
https://gist.github.com/4009035. The generating script is
by Jon Meredith at https://gist.github.com/a460a9dbb11698cf01a6.

The make-it-unstuck method is this:

%% Get current number of online schedulers
Schedulers = erlang:system_info(schedulers_online).

%% Reduce number online to 1
erlang:system_flag(schedulers_online, 1).

%% Restore to original number of online schedulers
erlang:system_flag(schedulers_online, Schedulers).

It isn't clear yet if the next release of Riak will use R15B02
or remain with R15B01. We were bitten by performance regressions
(not caught during our pre-release testing) when releasing the
packages that moved from R14B04 to R15B01. There's the devil
we're getting to know versus the devil that takes a heck of a lot
more time to get to know....

-Scott

Rickard Green

unread,

Nov 3, 2012, 9:05:32 PM11/3/12

to Scott Lystig Fritchie, erlang-q...@erlang.org

This does, however, not show that anything is wrong. The statistics only show that a couple of hundred processes were selected for execution on the same scheduler during some timeframe. If there are no buildup in run-queue length, everything is as it should.

Regards,
Rickard

Reply all

Reply to author

Forward