From my understanding, there are four main ways to interface Erlang
with C:
* C Node
* Port
* Linked-In Driver
* NIF (Native Implemented Function)
My problem is if I have, for example, spawned 20,000 Erlang processes
and I want them all to execute concurrently but they need to call C
code, how can I have that C code run concurrently without having to
spawn 20,000 threads in C (which would probably crash the OS) or using
obscene amounts of memory?
I've been reading over the examples of a C node, and it seems if
20,000 processes all send a message to the C node, the node will
process them one-by-one and not concurrently, so it becomes a
serialized bottleneck. Spawning 20,000 C nodes on a single machine
isn't feasible, because of the amount of memory that would require.
A port suffers from the same problem, since the Erlang processes would
be communicating with a single external program, and again, I can't
create 20,000 instances of that program.
Reading the documentation for a linked-in driver, it says:
http://www.erlang.org/doc/tutorial/c_portdriver.html
"Just as with a port program, the port communicates with a Erlang
process. All communication goes through one Erlang process that is the
connected process of the port driver. Terminating this process closes
the port driver."
But on the driver documentation page:
http://www.erlang.org/doc/man/erl_driver.html
"A driver is a library with a set of function that the emulator calls,
in response to Erlang functions and message sending. There may be
multiple instances of a driver, each instance is connected to an
Erlang port. Every port has a port owner process. Communication with
the port is normally done through the port owner process."
So this also seems to have the same problem as C nodes and ports,
since in order to maintain concurrency I would need 20,000 instances
of the same driver.
Finally, we have NIFs. These have potential, but when I read the
documentation:
http://www.erlang.org/doc/man/erl_nif.html
"Avoid doing lengthy work in NIF calls as that may degrade the
responsiveness of the VM. NIFs are called directly by the same
scheduler thread that executed the calling Erlang code. The calling
scheduler will thus be blocked from doing any other work until the NIF
returns."
So if one Erlang process calls a NIF, does this mean the other 19,999
processes are blocked until the NIF returns (or the subset of
processes a scheduler manages)? If so, this won't work either.
Does anyone have a solution to this that still allows you to use C
(I'm using C for the parts that are intensive number crunching)? Or
will I have to implement everything in Erlang?
Thanks!
You'd most likely want to implement anything that blocks in Erlang. Typically, Erlang runs as many threads as there are cores (can be controlled with the +S erl argument).
The question is, what is it that you want to do in that C code? Because the hardware is not able to do more things concurrently than it has cores..
If you want to do I/O in the C code, then you're probably better off writing it in Erlang, or using one of the existing linked-in drivers designed for doing just that.
NIFs are best suited to do stuff that requires heavy computation, or something that needs access to a shared data structure, or leveraging some existing code. ...but NIFs should not run for too long since they'll hold up all the other Erlang processes on that scheduler, and NIFs should generally not do I/O.
So if you have a large body of existing C code which does I/O, then there's no easy way to combine that with the benefits of Erlang's inexpensive processes.
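(For reference, you can check and override the scheduler count from the command line and the shell; the exact default depends on your machine and OTP release.)

$ erl +S 4
1> erlang:system_info(schedulers_online).
4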
I definitely would not be doing any disk I/O in the C code. It would
be intense number crunching, so it would be CPU (and perhaps memory)
bound. Everything I've read states Erlang is not good at number
crunching (Cesarini mentions this in his "Erlang Programming" book) so
I'm considering writing the code to do that in C.
If I call a NIF, only the particular scheduler that manages that
Erlang process would be blocked and no other scheduler, right? So for
example, if I have a CPU with eight cores, and an Erlang scheduler
thread is running on each core, and say the third scheduler is
executing an Erlang process that calls a NIF (and so blocks), only
that scheduler would be blocked until the NIF finishes executing,
correct?
I'm debating which solution would be better. Erlang would be slower at
number crunching, but is extremely efficient at managing concurrent
executing processes, meaning each would gradually make progress every
X units of time since they'll all get a turn to execute. But I wonder
if having a single process execute NIF code until it finishes (and so
all the processes managed by a single scheduler execute serially)
would be faster than implementing it all in Erlang and having
processes execute concurrently within a single scheduler (albeit the
code would be slower to execute). There would be less overhead of
Erlang process context switching (although admittedly that isn't much
to begin with) and the C code would be faster at number crunching. I
suppose there's only one way to find out! :)
I was also thinking about writing the number crunching code in some
other language than C, such as OCaml. OCaml has a reputation for being
as fast as C, yet not nearly as low-level. Maybe that would be a good
fit with Erlang.
Example benchmarks:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=hipe&lang2=gpp
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=ocaml
The Erlang benchmark was using HiPE as well.
Thanks for the suggestion!
There are a few things I may add here.
As Kresten said, the number of real parallel threads you can run on a
computing element depends on the number of cores. Nevertheless, making
your code run in threads may bring an advantage in certain cases.
Now, what I missed from your e-mails is how you would like the
information from Erlang to be processed. That is, is the information
processed by one Erlang process linked to that of another running process,
or does each process have its own distinct information? These questions
need to be answered before you proceed further in designing your application.
In the case of linked information, serializing the data does not seem like
such a bad idea (depending on how closely the processes' data are related).
Otherwise, in the case of independent data per process, you don't need to
worry about creating 20k threads in C (using NIFs); just create a
dynamic library, a .so (shared object), which you load before
starting your Erlang processes. Linux will take care of the rest (it will
create as many data instances within your library as required). You just
need to take care that your library is thread safe (mainly, no memory
leaks, and no trying to use more memory in the buffer than you have
physically).
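To make this concrete, here is a minimal sketch of the Erlang side of
such a NIF library (the module name crunch, the library name crunch_nif
and the function shock/2 are made up for illustration; the matching C
code would implement shock/2):

-module(crunch).
-export([shock/2]).
-on_load(init/0).

%% Load the shared object once, when the module is first loaded.
%% All 20k Erlang processes then share the same library code; only
%% the arguments of each call are per-process data.
init() ->
    erlang:load_nif("./crunch_nif", 0).

%% Stub used only if the NIF library failed to load; otherwise it is
%% replaced by the C implementation at load time.
shock(_Curve, _BasisPoints) ->
    erlang:nif_error(nif_not_loaded).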
If you are wondering whether to use Erlang or plain C (or any programming
language for that matter), think first about what you need in the end. All of
us would like to have super-speedy applications that squeeze the maximum
computational power from our hardware, but we all need to make some
compromises. What I perceived from using Erlang is that it is not
suitable for regular desktop applications, but it is a very
handy tool when developing applications such as non-blocking complex
data processing and fast network applications (and maybe more, but
that is all I have used Erlang for so far). It's not that you cannot
obtain all of that by writing your applications in C, but why reinvent
the wheel when you can just have it? Erlang is robust enough to give you
a nice environment for these kinds of applications.
In conclusion, using Erlang is just a matter of taste and of how comfortable
you feel with such a programming language. Searching for
benchmarks of a programming language doesn't help you much because
they are usually made for certain conditions which, in 90% of cases,
do not fit your needs. Since in this case you need high concurrency, I
suggest you consider more cores at a lower frequency rather than fewer
cores at a higher frequency (or, if you can afford it, a GPU instead of a CPU).
Keep in mind that whatever you choose, you will always be
restricted by your hardware, and for the few milliseconds you may gain
per process you may need to work hours if not days.
Good luck!
Cheers,
CGS
When I see a number like 20k processes my mind automatically skips to what is the load per process?
Do I need a full CPU for each, or a fractional CPU, or maybe I need 20k cores at peak?
If the issue is that you have 20k of anything that needs to be scheduled, the total cost is 20k x (processing time + switching cost per process). Hardware architecture determines a lot about both. On Linux you can get away with hundreds of native processes on Intel but you may not be able to eke out enough processing time per core to do useful work.
If you are running on multicore ARM you can eke out maybe 10 processes per core before just the switching cost alone kills your CPU. This is why we are seeing a move towards 128 core and 256 core ARM processors. If you need 20k cores and can afford around $32-64k in hardware, there are a couple companies that will have products shipping next year.
Generally my preferred solution to this problem is event driven C talking over a socket connection to Erlang. Using kqueue or epoll you can easily handle a few thousand socket connections per core on the C side, and Erlang can easily scale out as a command and control infrastructure.
If you are clever, using consistent hashing to manage system memory across your nodes, plus careful job scheduling, can make a handful of cores (24ish) perform like a 20k-node cluster.
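As a rough illustration of the idea (this is plain modulo hashing rather than a real consistent-hash ring, and the node list, pricing module, and function names are all made up):

-module(job_router).
-export([run_scenario/3]).

%% Pick the node responsible for a given security id.
%% Nodes is a fixed list such as ['crunch1@host1', 'crunch2@host2'].
node_for(SecurityId, Nodes) ->
    lists:nth(erlang:phash2(SecurityId, length(Nodes)) + 1, Nodes).

%% Run one scenario on whichever node "owns" the security, so the
%% data for that security stays in one node's memory.
run_scenario(SecurityId, Scenario, Nodes) ->
    Node = node_for(SecurityId, Nodes),
    rpc:call(Node, pricing, run, [SecurityId, Scenario]).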
But the specifics of your project and budget will determine if it is even possible :)
Dave
-=-=- da...@nexttolast.com -=-=-
This may not be true - I wrote some crypto stuff in Erlang with bignums
and it turned out to be faster than some C I had. I guess this was because
I could write a more advanced algorithm than in C - but I never
investigated why. I would expect small fixed types and array type algorithms
to be faster in C, but not necessarily bignum computations.
Also bear in mind that the C and Erlang will not be solving the same problem.
In C you might have to protect the code from buffer overflow attacks,
but in Erlang this would not be necessary. Also Erlang is slower *by design*
to allow for code-changes on-the-fly which C cannot do.
So saying Erlang is "not good at number crunching" is only a first
approximation to the truth ... true for most things, but not a universal truth
for which you have to read the small print ...
> I'm considering writing the code to do that in C.
Just curious - what type of "intense number crunching?" - there are
different types of number crunching - things like digital image
processing involve identical computations on a grid - so could be done
on a GPU - other operations may or may not be suitable to a GPU. If
the CPU demands are very variable, upping the number of cores and
changing to a Tilera might help.
The time and memory properties of the C are also interesting - do the
C tasks always take the same time/memory or are they highly variable?
This can affect the scheduling strategy - you might get CPU or memory
starvation.
Although number crunching might be faster in C than Erlang the round-trip
times become important if you do relatively little work in C. You might spend
more time in communication than the time you save in being faster in C.
I'd start by making a pure Erlang solution and then measuring to see where
the problems are - Guessing where the time goes is notoriously difficult - even
if the pure Erlang solution is not fast enough the code can provide
a useful reference implementation to start with and should be up-and-running
quicker than if you start coding NIFS etc.
Virtually every time I've had a program that was too slow, and I've guessed
where the problem was I've been wrong - so I'd build a reference
implementation first then measure - then optimize.
Cheers
/Joe
In a nutshell, our goal is to take a portfolio of securities (namely
bonds and derivatives), and calculate a risk/return analysis for each
security. For risk, interest rate shock, and for return, future cash
flows. There are different kinds of analyses you could perform.
Here's a more concrete example. Pretend you're an insurance company.
You have to pay out benefits to your customers, so you take their
money and make investments with it, hoping for a (positive) return, of
course. Quite often insurance companies will buy bonds, especially if
there are restrictions on what they can invest in (e.g., AAA only).
You need to have an idea of what your risk and return are. What's
going to happen to the value of your portfolio if yields rise or fall?
Ideally you want to know what your cash flows will look like in the
future, so you can have a reasonable idea of what shape you'll be in
depending on the outcome.
One such calculation would involve shocking the yield curve (yields
plotted against maturity). If yields rise 100 basis points, what
happens to your portfolio? If they fall, how far would yields need to
fall before any of your callable bonds started being redeemed?
Part of the reason why I think Erlang would work out well is the
calculations for each security are independent of each other -- it's
an embarrassingly parallel problem. My goal was to spawn a process for
each scenario of a security. Depending on how many securities and
scenarios you want to calculate, there could be tens or hundreds of
thousands, hence why I would be spawning so many processes (I would
distribute these across multiple machines of course, but we would have
only a few servers at most to start off with).
Because Erlang is so efficient at creating and executing thousands of
processes, I thought it would be feasible to create that many to do
real work, but the impression I get is maybe it's not such a great
idea when you have only a few dozen cores available to you.
CGS, could you explain how the dynamic library would work in more
detail? I was thinking it could work like that, but I wasn't actually
sure how it would be implemented. For example, if two Erlang processes
invoke the same shared library, does the OS simply copy each function
call to its own stack frame so the data is kept separate, and only one
copy of the code is used? I could see in that case then how 20,000
Erlang processes could all share the same library, since it minimizes
the amount of memory used.
David, the solution you described is new to me. Are there any
resources I can read to learn more?
Joe (your book is sitting on my desk as well =]), that's rather
interesting Erlang was purposely slowed down to allow for on-the-fly
code changes. Could you explain why? I'm curious.
We are still in the R&D phase (you could say), so I'm not quite sure
yet which specific category the number crunching will fall into (I
wouldn't be surprised if there are matrices, however). I think what
I'll do is write the most intensive parts in both Erlang and C, and
compare the two. I'd prefer to stick purely with Erlang though!
We have neither purchased any equipment yet nor written the final
code, so I'm pretty flexible to whatever the best solution would be
using Erlang. Maybe next year I can pick up one of those 20K core
machines =)
Given your description above, I'd probably just write the first
version of your application in Erlang. Normally I'm all for the NIF's
but your scenario doesn't strike me as the best fit (without first
measuring the native Erlang which will be easier to code and maintain
initially).
The reason here is that there's a noticeable cost to passing data
across the Erlang/(Driver|NIF|CNode) boundary so anything you're doing
on the C side should be fast enough to more than make up for this. A
good example here is from Kevin Smith's talk at the last Erlang
Factory SF on using CUDA cards for numerical computations (he's
illustrating the CUDA memory transfer overhead, but the same basic
idea applies to passing data from Erlang to C).
Given that your examples sound (to my non-financially-familiar brain)
like small calculations on lots of data, you might be pleasantly
surprised by the performance you'll get just from using Erlang across
a large number of cores. And even if you find out in the future that
you can write a small NIF that does your calculation in C using a
request queue, that's just as well because you'll have tested that you
need it and will know exactly how much you're saving by using C and so
on.
Paul
I said "slow by design" - perhaps an unfortunately choice of words -
What I meant was that there was design decision to allow code changes
on the fly and that a consequence of this design decision
means that all intermodule calls have one extra level of indirection
which makes them slightly slower to implement then calls to code which
cannot be changed on the fly.
Suppose you have some module x executing some long-lived code
(typically a telephony transaction) - you discover a bug in x. So you
fix the bug. Now you have two versions of x. The x that is still
currently executing, and the modified x that you will use when you
start new
transactions.
We want to allow all the old processes running the old version of x to
"run to completion" - new processes will get the next version of x.
This is achieved as follows: if you call x:foo/2 you always call the
latest version of the code, but local calls (made without the module
name) stay in the currently running version of the code.
Let me give an example:
Imagine the following:
-module(foo).
-export([fix_loop/1, dynamic_loop/1]).

fix_loop(N) ->
    ...
    fix_loop(N+1).

dynamic_loop(N) ->
    ...
    foo:dynamic_loop(N+1).
In the above, fix_loop and dynamic_loop have *entirely different behaviors*:
if we compile and reload a new version of foo, then any existing processes
running fix_loop/1 inside foo will continue running the old code.
Any old processes running dynamic_loop/1 will jump into the new
version of the code when they make the (tail) call to
foo:dynamic_loop/1.
Implementing this requires one level of indirection in making the subroutine
call. We can't just jump to the address of the code for the loop; we have to
call the function via a pointer. The ability to change code on the fly
introduces a slight overhead in all function calls where you call the
function with an explicit module name - if you omit the module name then
the call will be slightly faster, since the address cannot be changed later.
So calling fix_loop/1 in the above is slightly faster than calling
dynamic_loop/1.
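A quick shell session makes the difference visible (assuming the foo
module above is in foo.erl):

1> c(foo).
{ok,foo}
2> Pid = spawn(fun() -> foo:dynamic_loop(0) end).
<0.40.0>
%% ... edit foo.erl, then recompile and reload:
3> c(foo).
{ok,foo}
%% Pid picks up the new code at its next foo:dynamic_loop/1 call.
%% A process sitting in fix_loop/1 keeps running the old version
%% (and would be killed once that old version is eventually purged).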
Why do we want to do all this anyway?
We designed Erlang for telecomms applications - we deploy applications that
run for years and want to upgrade the software without disrupting services.
If a user runs some code in a transaction that takes a few minutes and
we change the code we don't want to kill ongoing transactions using
the old code - nor can we wait until all transactions are over before
introducing new code (this will never happen).
Banks turn off their transaction systems while upgrading the software
(apart from Klarna :-) - aircraft upgrade the software while the
planes are on the ground (I hope) - but we do it as we run the system
(we don't want to lose calls just because we are upgrading the
software).
Now suppose you discover a fault in your software that causes you to
buy or sell shares at a catastrophically bad rate - what do you do -
wait for everything to stop before changing the code? - or pump in new
code to fix the bug in mid-session? Just killing everything might
leave (say) a database in an inconsistent state and make restarting
time-consuming.
Dynamic code change is useful to have under your feet just in case you need
it one day - in the case of online banking, companies like Klarna use
this for commercial advantage :-)
/Joe
On Mon, Oct 3, 2011 at 10:05 PM, John Smith <> wrote:
> We are still in the R&D phase (you could say), so I'm not quite sure
> yet which specific category the number crunching will fall into (I
> wouldn't be surprised if there are matrices, however). I think what
> I'll do is write the most intensive parts in both Erlang and C, and
> compare the two. I'd prefer to stick purely with Erlang though!
From time to time the "Erlang is poor at number crunching" can be heard.
Mainly this revolves around Erlang being bad at the kind of number crunching needed for linear algebra/image processing etc.
Having a similar requirement to John's for a current project, I have been thinking a lot recently about how to use Erlang, perhaps together with another language or system, for this (other requirements are quite in favor of Erlang for my project).
When pondering this I noticed that if Erlang had a flexible n-dim array type with well-performing matrix/vector manipulation functions, I would not need to integrate some external system with all the complexity required to make the concurrency impedances match.
There are several systems where efficient matrix manipulation is added to languages that would not be considered for numerical calculations without them. Examples are NumPy and pdl.perl.org.
Usually these revolve around a basic matrix "buffer" which is reference counted. These buffers are referred to by structures of metadata containing dimensionality, matrix slice info (they support the concept of having slices of matrices which share the buffers without copying).
The buffers are quite similar to Erlang's binaries, and the no-copy slicing looks quite functional to me.
If we built a matrix library on top of binaries, with only a few basic operations as NIFs, quite powerful things could be built on top of it using pure Erlang.
I would invest some of my time in a library like this, probably not enough to make it general purpose. But if others want to participate, maybe we won't have to repeat the "Erlang is not for (all) number crunching" sentiment too often.
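To make the idea concrete, a minimal sketch of what I mean (module, record and function names are only illustrative; 64-bit floats assumed): the matrix data lives in a flat row-major binary, and a row slice only takes a sub-binary, so nothing is copied.

-module(ndarray).
-export([new/3, at/3, row/2]).

%% Shape is {Rows, Cols}; data is a flat row-major binary of 64-bit floats.
-record(nd, {shape, data}).

new(Rows, Cols, Floats) when length(Floats) =:= Rows * Cols ->
    #nd{shape = {Rows, Cols},
        data  = << <<F/float>> || F <- Floats >>}.

%% Element access: pure offset arithmetic on the binary.
at(#nd{shape = {_Rows, Cols}, data = Data}, I, J) ->
    Skip = (I * Cols + J) * 8,
    <<_:Skip/binary, X/float, _/binary>> = Data,
    X.

%% A "row slice": the sub-binary shares the underlying buffer,
%% so no matrix data is copied.
row(#nd{shape = {_Rows, Cols}, data = Data}, I) ->
    #nd{shape = {1, Cols},
        data  = binary:part(Data, I * Cols * 8, Cols * 8)}.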
Cheers
-- Peer Stritzinger
Hasn't something like this already been done? I'm sure I remember reading
about it.
Yeah, I remember reading the paper with keen interest, but not sure the
code was ever published:
"High-Performance Technical Computing with Erlang"
http://www.erlang.org/workshop/2008/Sess23.pdf
Personally I'd consider OCaml/MLton (running as a port program over stdio)
for that kind of task, but then I may be missing the point of this thread
(sorry, didn't follow closely).
BR,
-- Jachym
Whatever the current Erlang system actually does, the overhead of remote
calls need in principle be no more than the overhead of dynamic dispatch
in a language like C++.
That overhead is actually surprisingly high (and yet people *willingly*
write Java, go figure). There is an indirect cost to the indirection,
namely that dynamic calls can't be inlined. For C++ there is an answer:
link-time analysis can find calls (often lots and lots of them) that
don't actually need to be polymorphic (e.g., because the declared class
turns out not to have any subclasses that override the method in question)
and those calls can be inlined after all. In languages which allow new
code to be added at run time (like Java and Erlang) it's not that easy.
Some years ago I proposed that Erlang could distinguish between
"detachable" and "non-detachable" parts, so that a group of modules could
be bound together in such a way that they would have to be replaced _as a
unit_. The idea has not been taken up because it's very far from being
Erlang's most pressing problem.
To John Smith, what on earth does "shocking the yield curve" mean?
One thing about architecture. Joe raised an interesting question.
"Now suppose you discover a fault in your software that causes to you
buy or sell shares at a catastrophically bad rate - what do you do -
wait for everything to stop before changing the code?"
My question is, "how could you structure your system so that if it
TRIES to buy or sell at a catastrophically bad rate it CAN'T?" A couple
of years ago I came up with an idea for a potential PhD candidate who
ended up going somewhere else. That was inspired by a true event here,
where an electricity company cut off supply to a house where there was
an extremely sick woman who depended on some machine to keep her alive
(I forget what kind). Needless to say, she died. And of course it was
one of those stories where the computer noticed the bill hadn't been
paid recently and sent out a notice to a technician who dutifully went
out and turned the power off without asking any awkward questions. So
what can we do to stop that? (The customer had informed the company of
their special needs.) The answer I came up with turns out to be
quite similar in spirit to Joe's UBF.
You have a GENERATOR of actions,
a CRITIC of actions, and
an EFFECTOR of actions.
(Come to think of it, there's a link here to Dorothy L. Sayers' "The
Mind of the Maker.") The generator of actions receives inputs and
decides on things to do, but doesn't actually do them. It passes
its proposals on to the critic, which watches out for bad stuff.
Things that the critic is happy with are passed on to the effector to
be carried out.
In the electricity case, the critic would use rules like
"If the proposal is to disconnect supply
and the customer has registered a special need
and there is no record of a court order
REJECT"
In the trading case, the critic's rules would say something about the
amount of money.
The generator should not rely on the critic; if everything is working
well you won't be able to tell if the critic is there or not.
A rejection by the critic indicates an error in the generator
requiring corrective programming. This is where it gets similar
to UBF: UBF contract checking isn't there to make good things
happen normally, it's there to stop bad things happening and make
sure they're noticed.
This is one way to use multicore: spend some of the extra cores doing
more checking.
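A minimal sketch of that generator/critic/effector split as three Erlang
processes (the module name, message shapes, and the rule shown are all
made up for illustration):

-module(critic_demo).
-export([start/0]).

start() ->
    Effector = spawn(fun effector/0),
    Critic   = spawn(fun() -> critic(Effector) end),
    %% The generator only proposes actions; it never performs them.
    Critic ! {proposal, self(), {disconnect, customer_42}},
    ok.

critic(Effector) ->
    receive
        {proposal, From, {disconnect, Customer} = Action} ->
            case has_special_need(Customer) of
                true  -> From ! {rejected, Action};  %% signals a generator bug
                false -> Effector ! {do, Action}
            end,
            critic(Effector);
        {proposal, _From, Action} ->
            Effector ! {do, Action},
            critic(Effector)
    end.

effector() ->
    receive
        {do, Action} ->
            io:format("carrying out ~p~n", [Action]),
            effector()
    end.

%% Made-up check; a real critic would consult customer records and
%% court-order records here.
has_special_need(customer_42) -> true;
has_special_need(_Customer)   -> false.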
Here's an example: imagine you've plotted a curve of US Treasury
yields and their maturities:
http://en.wikipedia.org/wiki/File:USD_yield_curve_09_02_2005.JPG
You do this for 360 months (30 years) and have a yield for every
month. Now obviously there aren't data points for every month (there
are no 12.5-year Treasuries) so you have to come up with data points
for those months (but we can ignore that detail).
Now you've constructed your yield curve and you want to shock it. What
that means is you either shift the curve up or down by a fixed amount
of basis points for every yield point. If you shock the curve 100
basis points up (100 basis points equals 1 percent), you move every
yield point up by 100 basis points, and now you have your shocked
yield curve (shocking normally occurs at major intervals, e.g., 25,
50, 100). You can then evaluate how your portfolio would fare in this
environment.
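In Erlang terms the shock itself is trivial and the work is in the
revaluation, so something along these lines (a toy sketch; the module
name and the ValueFun pricing callback are placeholders):

-module(risk_sketch).
-export([shock_curve/2, value_portfolio/3]).

%% Curve is a list of {MonthsToMaturity, Yield} pairs, yields as
%% decimals (0.045 = 4.5%); Shock is in basis points (100 bp = 1%).
shock_curve(Curve, ShockBp) ->
    [{M, Y + ShockBp / 10000} || {M, Y} <- Curve].

%% Revalue a portfolio under one shocked curve, one process per
%% security; ValueFun is whatever actually prices a security.
value_portfolio(Securities, ShockedCurve, ValueFun) ->
    Parent = self(),
    Pids = [spawn(fun() ->
                      Parent ! {self(), ValueFun(Sec, ShockedCurve)}
                  end) || Sec <- Securities],
    [receive {Pid, Value} -> Value end || Pid <- Pids].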
I have to admit I was not aware of this. OTOH it seems not to be
available; I can't find anything except the paper and EEP 7, which is
the foreign function interface to the external number-crunching
libraries they invented.
However, I was thinking along different lines. The approach of "HPTC
with Erlang" (and also NumPy) is to slap on a big chunk of proven
numerical routines as some external library, which is the way to go
if you want to do serious number crunching, since it's quite hard to
develop trusted and efficient numerical routines.
The price you have to pay for the slapped-on heavyweight library is
that it usually doesn't scale up to the number of processes Erlang
can handle. Hence the need for the impedance adaptation I mentioned.
Keep a pool of numerical processes to keep the cores busy, but not so
many of them that the OS gets upset, and have work queues that adapt
these to the 20k processes. BTW @John: this would be one solution for
your problem.
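A rough sketch of that impedance adaptation (names are placeholders;
WorkerFun would be the NIF or port call doing the real work): a fixed
pool of workers, roughly one per core, fed from a queue by a dispatcher,
with the 20k lightweight processes simply waiting on it.

-module(crunch_pool).
-export([start/1, run/1]).

%% Start roughly one worker per core.
start(WorkerFun) ->
    N = erlang:system_info(schedulers_online),
    Workers = [spawn(fun() -> worker(WorkerFun) end) || _ <- lists:seq(1, N)],
    register(crunch_pool, spawn(fun() -> dispatch(Workers, queue:new()) end)),
    ok.

%% Called from any of the 20k lightweight processes; they simply
%% queue up behind the handful of real workers.
run(Job) ->
    crunch_pool ! {job, self(), Job},
    receive {result, R} -> R end.

%% Dispatcher: hand jobs to free workers, queue the rest.
dispatch(Free, Pending) ->
    receive
        {job, From, Job} when Free =/= [] ->
            [W | Rest] = Free,
            W ! {run, From, Job},
            dispatch(Rest, Pending);
        {job, From, Job} ->
            dispatch(Free, queue:in({From, Job}, Pending));
        {idle, W} ->
            case queue:out(Pending) of
                {{value, {From, Job}}, Rest} ->
                    W ! {run, From, Job},
                    dispatch(Free, Rest);
                {empty, _} ->
                    dispatch([W | Free], Pending)
            end
    end.

worker(F) ->
    receive
        {run, From, Job} ->
            From ! {result, F(Job)},
            crunch_pool ! {idle, self()},
            worker(F)
    end.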
What I was suggesting is a more integrated and lightweight way to make
some number crunching available. The suggested n-dim matrix type
(e.g. a record containing the metadata and a binary for the data)
combined with some NIFs on these that speed up the parts where Erlang
is not so fast, keeping in mind not to do too much work in the NIFs
at one time so as not to block the scheduler.
This is for use cases where some numerical stuff is needed
while keeping real-time responsiveness and Erlang process counts in mind.
The use case I have, e.g., is some neural network stuff combined
with a lot of symbolic computing to prepare the input. And it's
embarrassingly parallel and needs only some simple vector-times-matrix
operations and n-dim array slicing.
For really heavy numerical stuff I think the best way is to do it in
systems that are built for this and interface them somehow to Erlang
with ports or sockets. Or try to get the code from the HPTC paper
Jachym mentioned released.
For interfacing with BLAS and its ilk, some more native Erlang
numerical capabilities would also be nice to have. Since they also
use a kind of binary-buffer-plus-metadata approach, it would not
be too hard to interface efficiently.
Absolutely - move compliance into the software - if traders expose a
bank to too much risk the trades should be stopped automatically.
This might have prevented (say) the collapse of Barings Bank.
There is a trade-off here between security (we have to spend a few
CPU cycles checking for illegality) and speed.
If a bank collapses due to trading violations that could have been
detected by software and no
such software was in place I imagine legal action could be taken. So
there should be a strong argument for such software.
/Joe
http://www.calxeda.com/products.php
Rumor has it they have it working as a SoC.
Has a functional SoC that is MIPS based
While not suitable for Erlang,
has the most impressive offering I've played with. But this is an embedded Forth programmer's chip.
If you want to get into parallel number crunching applications, chips like:
http://www.clearspeed.com/products/csx700.php
almost take the fun out of it.
Even Intel demoed an 80 core chip a year ago.
We are going to see an explosion of small SoC fabless integrators pumping out massively parallel architectures, all with a green energy pitch. It all comes down to simple physics butting up against economics.
Too much die is left idle in most modern deep-pipeline superscalar designs, single-thread performance is I/O bound for most real-world problems, and thermal effects have made low power desirable on a macro scale.
The cheapest solution: shove a bunch of smaller cores on a die with an on-chip bus, and shift the problem to software :)
-=-=- da...@nexttolast.com -=-=-
Hi, I'm one of the authors of the "HPTC with Erlang" work.
You're right, nothing was publicly released except for the FFI
implementation described in EEP 7 --- and since the project ended
last year, I believe that nothing else will be released in the
future. The last bits were a (prototype) NIF-based FFI
implementation [1], together with a request to withdraw EEP 7
(since it was clearly superseded by NIFs) [2].
[1] http://muvara.org/hg/erlang-ffi/
[2] http://erlang.org/pipermail/eeps/2010-July/000292.html
> The price you have to pay for the slapped on heavyweight
> library is that these usually don't scale up to the number of
> processes Erlang can handle.
IMHO it mostly depends on:
1. the size of the operands you're working on;
2. the complexity of the foreign functions you're going to
call.
Our project was primarily focused on real-time numerical
computing, and thus we needed a method for quickly calling
"simple" numerical foreign functions (such as multiplications of
relatively small (portions of) matrices). Those functions, taken
alone, would usually return almost immediately: in other words,
their execution time was similar to that of regular BIFs. We
used BLAS because its optimized implementations are usually
"fast enough", but (if necessary) we could have developed
our own optimized C code.
When more complicated formulas are assembled with repeated FFI
calls to those simple functions, then the Erlang scheduler can
kick in several times before the final result is obtained, thus
guaranteeing VM responsiveness (albeit reducing the general
numerical throughput).
> Keeping a pool of numerical processes to keep the cores busy
> but not too many of them that the OS is upset. Having work
> queues that adapt these to the 20k processes.
If the native calls performed by those 20k Erlang processes are
not "heavy" enough, then introducing work queues may actually
increase the Erlang VM load and internal lock contention, thus
decreasing responsiveness (wrt plain NIF calls). I suspect that
some comparative benchmarking could be useful.
> The suggested n-dim matrix type (e.g. a record containing the
> metadata and a binary for the data) combined with some NIFs on
> these that speed up the parts where Erlang is not so fast.
> Keeping in mind not to do too much work in the NIFs at one time
> not to block the scheduler.
This is exactly what we did for interfacing BLAS and other
numerical routines (except that we used our FFI, since NIFs were
not yet available).
Maybe a next-generation, general-purpose numerical computing
module for Erlang could adopt different strategies depending on
the size of the operands passed to its functions:
1. if the vectors/matrices are "small enough", then the native
code could be called directly using NIFs;
2. otherwise, the operands could be passed to a separate worker
thread, which will later send back its result to the waiting
Erlang process (using enif_send()).
In the second case, the future NIF extensions planned by OTP
folks may be very useful --- see Rickard Green's talk at the SF
Bay Area Erlang Factory 2011: http://bit.ly/eH61tX
> For real heavy numerical stuff I think the best way is to do
> this in the systems are built for this and interface them
> somehow to erlang with ports or sockets.
Sure, but the problem with this approach is that you may need to
constantly (de)serialize and transfer large numerical arrays
between the Erlang VM and the external number crunching systems,
thus wasting processor cycles, and memory/network bandwidth.
Regards,
--
Alceste Scalas <alc...@muvara.org>
It's a pity we have to start over with this. I fear the number of
people interested in numerical (real-time) capabilities is limited, so
we will reinvent parts of the wheel again and again.
>> The price you have to pay for the slapped on heavyweight
>> library is that these usually don't scale up to the number of
>> processes Erlang can handle.
>
> IMHO it mostly depends on:
>
> 1. the size of the operands you're working on;
>
> 2. the complexity of the foreign functions you're going to
> call.
I agree. That's one thing that makes it hard to find a generic
solution to numerical needs in Erlang.
> Our project was primarily focused on real-time numerical
> computing, and thus we needed a method for quickly calling
> "simple" numerical foreign functions (such as multiplications of
> relatively small (portions of) matrices). Those functions, taken
> alone, would usually return almost immediately: in other words,
> their execution time was similar to that of regular BIFs. We
> used BLAS because its optimized implementations are usually
> "fast enough", but (if necessary) we could have developed
> our own optimized C code.
I was looking at more heavyweight BLAS implementations which do
internal thread management to use the cores. I should have looked at
simpler BLAS implementations which are thread safe and single
threaded.
> When more complicated formulas are assembled with repeated FFI
> calls to those simple functions, then the Erlang scheduler can
> kick in several times before the final result is obtained, thus
> guaranteeing VM responsiveness (albeit reducing the general
> numerical throughput).
The unavoidable trade-off between real-time and throughput optimization.
> If the native calls performed by those 20k Erlang processes are
> not "heavy" enough, then introducing work queues may actually
> increase the Erlang VM load and internal lock contention, thus
> decreasing responsiveness (wrt plain NIF calls). I suspect that
> some comparative benchmarking could be useful.
I'm currently experimenting with an n-dim array module in Erlang that
uses the metadata + binary buffer approach, building all the operations
I need in pure Erlang first and then finding places to optimize in NIFs.
I probably won't use an external library since my numerical needs are
pretty specialized (e.g. lots of multiplying bit vectors by float
matrices).
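For example, the bit-vector-times-float-matrix case can be written
directly on binaries in pure Erlang first (a naive sketch with made-up
names; the inner loop is exactly the part a NIF could later replace):

%% Bits is a bitstring of length Rows; Matrix is a flat row-major
%% binary of Rows x Cols 64-bit floats. The result is the sum of the
%% rows whose bit is set, as a list of Cols floats.
bitvec_times_matrix(Bits, Matrix, Cols) ->
    Zero = lists:duplicate(Cols, 0.0),
    bvm(Bits, Matrix, Cols, Zero).

bvm(<<>>, <<>>, _Cols, Acc) ->
    Acc;
bvm(<<B:1, Bits/bitstring>>, Matrix, Cols, Acc) ->
    RowBytes = Cols * 8,
    <<Row:RowBytes/binary, Rest/binary>> = Matrix,
    Acc1 = case B of
               0 -> Acc;
               1 -> add_row(Row, Acc)
           end,
    bvm(Bits, Rest, Cols, Acc1).

%% Add one matrix row (a binary of floats) to the accumulator list.
add_row(<<>>, []) -> [];
add_row(<<X/float, Rest/binary>>, [A | Acc]) ->
    [A + X | add_row(Rest, Acc)].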
> Maybe a next-generation, general-pourpose numerical computing
> module for Erlang could adopt different strategies depending on
> the size of the operands passed to its functions:
>
> 1. if the vectors/matrices are "small enough", then the native
> code could be called directly using NIFs;
This automation would probably be machine dependent. I can imagine
that the basic matrix operations could be handled like this, probably
auto-split into sub-matrix operations. It would probably need some
learning phase to find the characteristics of the machine; basically
the ratio of NIF overhead vs. BLAS speed has to be measured.
> 2. otherwise, the operands could be passed to a separate worker
> thread, which will later send back its result to the waiting
> Erlang process (using enif_send()).
>
> In the second case, the future NIF extensions planned by OTP
> folks may be very useful --- see Rickard Green's talk at the SF
> Bay Area Erlang Factory 2011: http://bit.ly/eH61tX
This would be useful for routines with intermediate runtimes.
>> For real heavy numerical stuff I think the best way is to do
>> this in the systems are built for this and interface them
>> somehow to erlang with ports or sockets.
>
> Sure, but the problem with this approach is that you may need to
> constantly (de)serialize and transfer large numerical arrays
> among the Erlang VM and the external number crunching systems,
> thus wasting processor cycles, and memory/network bandwidth.
For runtimes in the minutes-to-hours range and very complicated code
this is probably still the way to go. There is always the question of
Erlang VM stability with the heavy numerical stuff. Ports are very
nice from the dependability standpoint - which is probably an issue
for the trading system example that initiated the thread.
For my application I'll start from the Erlang side trying to define a
nice API for n-dimensional fixed element size (sub byte sizes allowed)
matrices with some basic operations defined for them. Then I'll look
at the minimum amount of NIF support needed to make this run at least
at modest speeds. I'll publish my code early hoping others might join
in.
Regards
Peer Stritzinger
Just watched the video and read the slides of this talk.
This would be perfect to handle more heavyweight numerical library
support. Especially the dirty schedulers look very promising for
this.
Also the native processes which make it easy to run untrusted large
libraries in their own node.
> 1. if the vectors/matrices are "small enough", then the native
> code could be called directly using NIFs;
> 2. otherwise, the operands could be passed to a separate worker
> thread, which will later send back its result to the waiting
> Erlang process (using enif_send()).
This won't be necessary if dirty schedulers are used. If you look at slide 25 of
http://www.erlang-factory.com/upload/presentations/377/RickardGreen-NativeInterface.pdf
("Example of dirty scheduler use"), you can see this two-step solution
working with dirty schedulers, without sending messages and other
complications.
This is perfect for the numerical libraries.
Cheers
-- Peer Stritzinger