[erlang-questions] Some facts about Erlang and SMP

Kenneth Lundin

Sep 16, 2008, 5:34:03 AM

to erlang-questions Questions

Here are some short facts about how the Erlang SMP implementation
works and how it
relates to performance and scalability.

A more detailed description of how multi-core support works, and of
our future plans, will be available in a couple of weeks. I plan to
include some of this in my presentation at the ICFP 2008 Erlang
Workshop in Victoria, BC, on September 27.

The Erlang VM without SMP support has 1 scheduler which runs in the
main process thread. The scheduler
picks runnable Erlang processes and IO-jobs from the run-queue and
there is no need to lock data structures since
there is only one thread accessing them.

The Erlang VM with SMP support can have from 1 to many schedulers, each
running in its own thread. The schedulers pick runnable Erlang processes
and IO-jobs from one common run-queue. In the SMP VM all shared data
structures are protected with locks; the run-queue is one example of
such a data structure.

From OTP R12B, the SMP version of the VM is started by default if the
OS reports more than 1 CPU (or core), with the same number of
schedulers as CPUs or cores.

You can see what was chosen at the first line of printout from the
"erl" command. E.g.
Erlang (BEAM) emulator version 5.6.4 [source] [smp:4] [asynch-threads:0] .....

The "[smp:4]" above tells that the SMP VM is running with 4 schedulers.

The default behaviour can be overridden with
"-smp [enable|disable|auto]", where auto is the default.
If smp is set to enable or auto, the number of schedulers can be set with
"+S Number", where Number is the number of schedulers (1..1024).

Note 1: There is normally nothing to gain from running with more
schedulers than the number of CPUs or cores.
Note 2: On some operating systems the number of CPUs or cores that a
process may use can be restricted with commands. On Linux, for example,
the command "taskset" can be used for this. The Erlang VM currently
detects only the number of available CPUs or cores and does not take
the mask set by "taskset" into account. Because of this it can happen,
and has happened, that e.g. only 2 cores are used even though the
Erlang VM runs with 4 schedulers. It is the OS that limits this,
because it takes the mask from "taskset" into account.
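
For illustration only (this is not part of the original post): the
startup choice described above can also be inspected from inside a
running node. The sketch below assumes the standard
erlang:system_info/1 items smp_support and schedulers; the module name
smp_info and the function report/0 are made up for the example.

```erlang
-module(smp_info).
-export([report/0]).

%% Prints and returns whether the emulator was built with SMP support
%% and how many scheduler threads it is running.
report() ->
    Smp = erlang:system_info(smp_support),
    Schedulers = erlang:system_info(schedulers),
    io:format("smp_support=~p schedulers=~p~n", [Smp, Schedulers]),
    {Smp, Schedulers}.
```

On a node started with "+S 2", report/0 would be expected to show 2
schedulers. As noted above, this says nothing about how many cores the
OS actually lets the VM use.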

The schedulers in the Erlang VM are run on one OS-thread each and it
is the OS that decides if the threads are
executed on different Cores. Normally the OS will do this just fine
and will also keep the thread on the same Core throughout the
execution.

The Erlang processes will be run by different schedulers because they
are picked from a common run-queue by
the first scheduler that becomes available.

Performance and scalability
------------------------------------

- The SMP VM with only one scheduler is slightly slower than the non
SMP VM. The SMP VM needs to use all the locks inside, but as long as
there are no lock conflicts the overhead caused by locking is not
significant (it is the lock conflicts that take time). This explains
why in some cases it can be more efficient to run several SMP VMs
with one scheduler each
instead of one SMP VM with several schedulers. Of course, running
several VMs requires that the application can run
as many parallel tasks which have no or very little communication with
each other.

- Whether a program scales well with the SMP VM over many cores depends
very much on the characteristics of the program. Some programs scale
linearly up to 8 and even 16 cores, while other programs barely scale
at all even on 2 cores.
This might sound bad, but in practice many real programs scale well on
the number of cores that are common on the
market today, see below.

- Real telecoms products supporting a massive number of simultaneously
ongoing "calls" represented as one or several
Erlang processes per core have shown very good scalability on dual and
quad core processors. Note that these products
were written in the normal Erlang style long before the SMP VM and
multi-core processors were available, and they
could benefit from the Erlang SMP VM without changes and even without
any need to recompile the code.
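
As a rough illustration of the "normal Erlang style" mentioned above,
here is a minimal sketch of spreading independent work over many small
processes, which the schedulers can then run on whichever cores are
free. It is not taken from any of the products mentioned; the module
name pmap_sketch and the function pmap/2 are hypothetical.

```erlang
-module(pmap_sketch).
-export([pmap/2]).

%% Applies Fun to every item in its own process and returns the
%% results in the original order.
pmap(Fun, Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                %% One short-lived process per item; the VM decides
                %% which scheduler runs it.
                spawn_link(fun() -> Parent ! {Ref, Fun(Item)} end),
                Ref
            end || Item <- Items],
    %% Collect replies in the order the work was handed out.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

The code contains no locking and no reference to the number of cores,
which is why the same program can run unchanged on the non SMP VM and
on the SMP VM.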

SMP performance is continually improved
------------------------------------------------------

The SMP implementation is continually improved in order to get better
performance and scalability. In each service release
R12B-1, 2, 3, 4, 5 , ..., R13B etc. you will find new optimizations.

Some known bottlenecks
---------------------------------

- The single common run-queue will become a dominant bottleneck when
the number of CPUs or cores increases.
This will be visible from 4 cores and upwards, but 4 cores will probably
still give OK performance for many applications.
We are working on a solution with one run-queue per scheduler as the
most important improvement right now.

- Ets tables involve locking. Before R12B-4 there were 2 locks
involved in every access to an ets-table, but
in R12B-4 the locking of the meta-table is optimized to reduce the
conflicts significantly (as mentioned earlier, it is the conflicts that
are expensive).
If many Erlang processes access the same table there will be a lot of
lock conflicts, causing bad performance, especially if these processes
spend the majority of their time accessing ets-tables.
The locking is on table level, not on record level.
Note: this will have an impact on Mnesia as well, since Mnesia is a
heavy user of ets-tables.
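
Because the locking is per table, one possible way to reduce conflicts
is to spread keys over several tables so that concurrent accesses tend
to take different locks. This is only a sketch of that idea and is not
something the post itself proposes; the module name ets_shards and its
functions are hypothetical, and whether it helps has to be measured.

```erlang
-module(ets_shards).
-export([new/1, insert/3, lookup/2]).

%% Creates N unnamed public tables and returns them as a tuple.
new(N) when is_integer(N), N > 0 ->
    list_to_tuple([ets:new(shard, [set, public]) || _ <- lists:seq(1, N)]).

%% Picks the table for a key; phash2/2 returns 0..Range-1.
shard(Shards, Key) ->
    element(erlang:phash2(Key, tuple_size(Shards)) + 1, Shards).

insert(Shards, Key, Value) ->
    ets:insert(shard(Shards, Key), {Key, Value}).

lookup(Shards, Key) ->
    case ets:lookup(shard(Shards, Key), Key) of
        [{Key, Value}] -> {ok, Value};
        [] -> not_found
    end.
```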

...

Our strategy with SMP
-----------------------------

From the very beginning, when we started implementing the SMP
VM, we decided on the strategy:
"First make it work, then measure, then optimize".
We have followed this strategy consistently since the first
stable working SMP VM, which we released in May 2006 (R11B).

There are more known things to improve and we address them one by one,
taking first the one we think gives the most
performance per implementation effort, and so on.

We are putting most focus on getting consistently better scaling on many
cores (more than 4).

Best in class
-----------------

Even if there are a number of known bottlenecks,
the SMP system already has good overall performance and scalability
and I believe we are best in class
when it comes to letting the programmer utilize multi-core machines
in an easy, productive way.

/Kenneth Erlang/OTP team, Ericsson
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions

Vlad Dumitrescu

Sep 16, 2008, 6:05:10 AM

to erlang-questions

Sorry, this should have gone to the list...

Thanks, Kenneth!
/Vlad

On Tue, Sep 16, 2008 at 11:55 AM, Vlad Dumitrescu <vlad...@gmail.com> wrote:

> On Tue, Sep 16, 2008 at 11:34, Kenneth Lundin <kenneth...@gmail.com> wrote:
>> Here are some short facts about how the Erlang SMP implementation
>> works and how it
>> relates to performance and scalability.
>

> Hi and thank you for the detailed explanations!
>
> This may be a silly question, but does SMP interact with the async thread pool?
>
> best regards,
> Vlad

Hi,

The async thread pool works exactly the same in both the SMP VM and
the non SMP VM, i.e. I suppose
you can say that they don't interact.

The async thread pool is only used by the file driver in the code we
deliver, but there is a documented
interface where you as a user can implement your own driver which also
uses the async thread pool.

/Kenneth

Kevin Scaldeferri

Sep 16, 2008, 1:04:41 PM

to Kenneth Lundin, erlang-questions Questions

On Sep 16, 2008, at 5:34 AM, Kenneth Lundin wrote:

> Here are some short facts about how the Erlang SMP implementation
> works and how it
> relates to performance and scalability.

...

Kenneth,

Do you have any insight into why I might be seeing much higher CPU
cost for SMP on linux vs. Mac? Is there some difference in the lock
implementations that might be relevant?

-kevin

Dave Smith

Sep 16, 2008, 1:56:13 PM

to erlang-questions Questions

On Tue, Sep 16, 2008 at 3:34 AM, Kenneth Lundin
<kenneth...@gmail.com> wrote:
> Even if there are a number of known bottlenecks
> the SMP system already has good overall performance and scalability
> and I believe we are best in class
> when it comes to letting the programmer utilize multi -core machines
> in an easy productive way.

+1000. Erlang (and the SMP VM in particular) really shines when it
comes to building systems that naturally scale out. It makes me so
happy to see apps distribute reasonably evenly across 8 cores, without
me having to spend a lot of time focused on locking/sharing/etc. It
Just Works (TM).

Thanks, Kenneth and the OTP team. Your hard work is much appreciated. :)

D.

Edwin Fine

Sep 16, 2008, 4:59:10 PM

to Kenneth Lundin, erlang-questions Questions

Kenneth,

Thank you very much for answering many, if not all, of my questions. It is extremely useful to get an authoritative answer like yours, upon which one can base objective performance-related decisions.

For example, as you stated,

>> Of course the running
>> of several VM's require that the application can run
>> in many parallel tasks which has no or very little communication with
>> each other.

If using -smp disable (not +S 1, which I mistakenly thought was equivalent to -smp disable), there are applications that could benefit considerably, such as those that have many thousands of concurrent TCP/IP-initiated processes (what Joe Armstrong calls "naturally concurrent") that are not related to each other and therefore have no significant cross-VM IPC. A non-database backed content-serving web site is one example. Partitioning what would amount to numerous HTTP GET requests across multiple Erlang -smp disabled VMs, one per core, using a hardware load balancer, is likely to show performance benefits over the homogeneous SMP approach. Note that this scenario is consistent with your requirements of (a) little or no cross-VM IPC and (b) little or no heavy Mnesia or ETS table sharing.

One could potentially make good use of a CPU affinity command such as taskset (as you mentioned). This would have applicability where, for example, one wanted to "reserve" some CPUs for non-Erlang usage, such as to run some special-purpose high-priority application that would suffer performance-wise if it were to be preempted by the OS. If using -smp disable, then the single thread of a VM can safely be affinitied to a specific CPU or core and will not "migrate" to other processors. In addition, it is arguably possible that in this and similar scenarios, keeping a VM on a single core will benefit from improved processor and data cache hit rates and reduced or eliminated cache coherency storms. The part that is arguable is that if the VM itself is preempted, the contents of the processor cache may become invalidated anyway. This suggests that the VM should be run at an elevated priority and be kept busy so that it is always the top contender for the OS run queue and seldom gets rescheduled.

Of course, this is advanced performance tweaking and will probably be unnecessary for the majority of Erlang applications. However, for those who are trying to wring the last iota of performance out of their applications, this information can be invaluable.

In more common situations, this information can help Erlang-based system designers avoid lock contention by taking it into account when designing around database and ETS tables. In some cases it may make sense to have a process that owns a private table, and have all other processes interact with the table via this "gatekeeper". Careful measurement and thought will be needed to establish whether the bottleneck created by serializing access through an Erlang gen_server-style process is better or worse than the bottleneck created by serializing access through an ETS table lock, for example.
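
A minimal sketch of the "gatekeeper" arrangement described above,
assuming a gen_server that owns a private ETS table so that no other
process touches the table directly. The module name table_gatekeeper
and the functions store/2 and fetch/1 are hypothetical. Whether this
beats a shared public table still has to be measured, since every
request is now serialized through one process.

```erlang
-module(table_gatekeeper).
-behaviour(gen_server).

-export([start_link/0, store/2, fetch/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Client API: all access goes through the owning process.
store(Key, Value) ->
    gen_server:call(?MODULE, {store, Key, Value}).

fetch(Key) ->
    gen_server:call(?MODULE, {fetch, Key}).

%% The table is private, so only this process can read or write it.
init([]) ->
    Tab = ets:new(gatekeeper_tab, [set, private]),
    {ok, Tab}.

handle_call({store, Key, Value}, _From, Tab) ->
    true = ets:insert(Tab, {Key, Value}),
    {reply, ok, Tab};
handle_call({fetch, Key}, _From, Tab) ->
    Reply = case ets:lookup(Tab, Key) of
                [{Key, Value}] -> {ok, Value};
                [] -> not_found
            end,
    {reply, Reply, Tab}.

handle_cast(_Msg, Tab) ->
    {noreply, Tab}.
```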

One final thought is that whenever some product is invented for a specific purpose, and excels at that purpose, as it becomes more popular it begins to get used in ways that its original designers never anticipated. Some of these uses are in line with the original design intent, but push the envelope and need improvements to the product implementation, and others are simply stretching the capabilities of the product in inappropriate directions. This has been seen in numerous cases, and seems to stem from the desire to have one tool that can do everything, and not have to know and support multiple tools. Then we have a hammer, and the world starts to look like a collection of nails.

As far as I can see, the answer is always the same: use the tool that is right for the job. Don't fall into the trap of "100% Pure Erlang" everywhere. It's less convenient, but if you want performance, offload the heavy crunching stuff to 'C' drivers. For example, instead of using xmerl for heavy XML processing, do what ejabberd does: use the 'C' expat parser via a linked-in driver. The Ruby community had exactly the same issue with XML parsing. Need heavy use of regular expressions? Use re, not regexp. Need heavy database processing? Maybe Mnesia is not the right tool for this particular job, although it may excel in the dimension for which it was designed: soft real-time telecomms applications. And so on. True, it is less convenient to have to step outside the environment in which one has become comfortable, and have architecture-specific applications (and risks to the robustness of the VM if the LID has any crashworthy bugs), but sometimes if you need the performance, and your algorithms are already appropriate, you have to use the right tool for the job and take the risks.

As support for this argument, the AXD301 project has "a couple of million" lines of Erlang code, and at one point had over 1 million lines of C/C++ code. The exact numbers seem to be elusive or outdated (the Erlang FAQ is way out of date, and other documents I found are vague as to how much C/C++ is currently used); if someone (e.g. Ulf Wiger) could provide more accurate information I'd appreciate it. How much of the C/C++ code is used for device interfaces and is hence unavoidable, how much for efficiency considerations, and how much for other purposes is not known to me. Still, the point is that this is not a "100% pure Erlang" implementation, and I would be very surprised if at least some of that C/C++ code was not there for efficiency.

Thank you again.

Regards,
Edwin Fine

Kenneth Lundin

Sep 17, 2008, 3:38:40 PM

to Kevin Scaldeferri, erlang-questions Questions

> ...
>
> Kenneth,
>
> Do you have any insight into why I might be seeing much higher CPU cost for
> SMP on linux vs. Mac? Is there some difference in the lock implementations
> that might be relevant?

In general I don't think the measured CPU cost on different operating
systems is a reliable way of measuring how well a program performs. I
think it is more reliable to create measurements where the wall clock
is used, e.g. how long a certain operation takes or how many
calls per second a system can handle.
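
As a sketch of the kind of wall-clock measurement suggested above, the
following uses timer:tc/1 from the standard library to time a batch of
operations and derive operations per second. The module name wallclock
and the function ops_per_second/2 are hypothetical, and a real
benchmark would need warm-up runs and repeated measurements.

```erlang
-module(wallclock).
-export([ops_per_second/2]).

%% Runs Fun N times and returns {ElapsedMicroseconds, OpsPerSecond}.
ops_per_second(Fun, N) when is_function(Fun, 0), is_integer(N), N > 0 ->
    {Micros, ok} = timer:tc(fun() -> repeat(Fun, N) end),
    %% Guard against a zero reading on very fast runs.
    Elapsed = max(Micros, 1),
    {Elapsed, N * 1000000 / Elapsed}.

repeat(_Fun, 0) -> ok;
repeat(Fun, N) ->
    _ = Fun(),
    repeat(Fun, N - 1).
```

Running the same call on two operating systems, or with different
"+S" settings, gives numbers that can be compared directly, unlike the
CPU cost reported by each OS.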

Anyway, the lock mechanisms used in the VM are:

- pthread mutex, for the run-queue
- pthread rwlock, for ets-tables
- inline spinlock and inline atomics; these are part of the VM source
code, for x86 32 and 64 bit, ppc 32 bit, and sparc 32 and 64 bit. For
other architectures the build will fall back to pthread spinlock, and
if that does not exist there is a further fallback to pthread mutex.

Linux and Mac OS X for x86 will use exactly the same spinlock and
atomics, but the pthread mutex and rwlock will be OS specific, as will
the thread scheduling in general, and this can of course give
differences in performance even on the same HW.

/Kenneth Erlang/OTP, Ericsson

Matthew Sackman

Sep 20, 2008, 2:08:00 PM

to erlang-q...@erlang.org

On Tue, Sep 16, 2008 at 11:34:03AM +0200, Kenneth Lundin wrote:
> - The SMP VM with only one scheduler is slightly slower than the non
> SMP VM. The SMP VM needs to use all the locks inside, but as long as
> there are no lock conflicts the overhead caused by locking is not
> significant (it is the lock conflicts that take time). This explains
> why in some cases it can be more efficient to run several SMP VMs
> with one scheduler each
> instead of one SMP VM with several schedulers. Of course, running
> several VMs requires that the application can run
> as many parallel tasks which have no or very little communication with
> each other.

I'm having some difficulties with the "it's only when locks conflict
that locking time is significant" idea. My understanding with the
pthreads locks is that you start with a CAS, and if that fails then you
spin for a while (with CAS), and if you spin for too long then you go to
a kernel lock. If that's wrong then I'm sorry, as most of the following
will also be wrong!

Now assuming the above is correct (or at least close enough for jazz),
in the conflict case, sure, you're going to block, which is the point,
and then some time in the future either the kernel will wake you up
(pretty slow), or the spinning finally succeeds in which case the
wakeup should be really fast.

If there's not a conflict then the first CAS will succeed. But a CAS is
many hundreds of cycles as you've got to do cache-coherency and then
really go all the way out to RAM and load the values in fresh,
especially in a multisocket box. So hundreds of cycles does not seem
like "not significant", at least to me. Or have I missed something
obvious here?

I rather expect that you'll end up with lock-free per-cpu task-queues
with work-stealing algorithms, which, as a colleague has just pointed
out, have been used for these purposes for some time (at least 10
years); though I do appreciate the "get it out working and then tune it"
approach.

Matthew

Bjorn Gustavsson

Sep 24, 2008, 4:07:10 AM

to erlang-q...@erlang.org

On Sat, Sep 20, 2008 at 8:08 PM, Matthew Sackman <mat...@wellquite.org> wrote:

> I'm having some difficulties with the "it's only when locks conflict
> that locking time is significant" idea. My understanding with the
> pthreads locks is that you start with a CAS, and if that fails then you
> spin for a while (with CAS), and if you spin for too long then you go to
> a kernel lock. If that's wrong then I'm sorry, as most of the following
> will also be wrong!
>
> Now assuming the above is correct (or at least close enough for jazz),
> in the conflict case, sure, you're going to block, which is the point,
> and then some time in the future either the kernel will wake you up
> (pretty slow), or the spinning finally succeeds in which case the
> wakeup should be really fast.
>
> If there's not a conflict then the first CAS will succeed. But a CAS is
> many hundreds of cycles as you've got to do cache-coherency and then
> really go all the way out to RAM and load the values in fresh,
> especially in a multisocket box. So hundreds of cycles does not seem
> like "not significant", at least to me. Or have I missed something
> obvious here?

It depends on the context.

Compared to not having any locks at all (as in the non-SMP emulator), the cost for taking
a lock without any conflict is significant. (That's why the non-SMP emulator is faster
than the SMP emulator running with only one scheduler.)

Compared to a lock conflict when multiple scheduler threads cannot do any useful work
while waiting for a lock, the cost for taking a lock directly without any conflict is not
significant.

/Bjorn
--
Björn Gustavsson, Erlang/OTP, Ericsson AB
